Copyright | (c) 2021 Andrew Lelechenko |
---|---|
License | MIT |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Test.Tasty.Bench
Description
Featherlight benchmark framework (only one file!) for performance
measurement with API mimicking
criterion
and
gauge
. A prominent feature is
built-in comparison against previous runs and between benchmarks.
How lightweight is it?
There is only one source file Test.Tasty.Bench and no non-boot
dependencies except tasty
. So
if you already depend on tasty
for a test suite, there is nothing else
to install.
Compare this to criterion
(10+ modules, 50+ dependencies) and gauge
(40+ modules, depends on basement
and vector
). A build on a clean
machine is up to 16x faster than criterion
and up to 4x faster than
gauge
. A build without dependencies is up to 6x faster than
criterion
and up to 8x faster than gauge
.
tasty-bench
is a native Haskell library and works everywhere, where
GHC does, including WASM. We support a full range of architectures
(i386
, amd64
, armhf
, arm64
, ppc64le
, s390x
) and operating
systems (Linux, Windows, macOS, FreeBSD, OpenBSD, NetBSD), plus any GHC
from 8.0 to 9.10 (and earlier releases stretch back to GHC 7.0).
How is it possible?
Our benchmarks are literally regular tasty
tests, so we can leverage
all existing machinery for command-line options, resource management,
structuring, listing and filtering benchmarks, running and reporting
results. It also means that tasty-bench
can be used in conjunction
with other tasty
ingredients.
Unlike criterion
and gauge
we use a very simple statistical model
described below. This is arguably a questionable choice, but it works
pretty well in practice. A rare developer is sufficiently well-versed in
probability theory to make sense and use of all numbers generated by
criterion
.
How to switch?
Cabal mixins
allow to taste tasty-bench
instead of criterion
or gauge
without
changing a single line of code:
cabal-version: 2.0 benchmark foo ... build-depends: tasty-bench mixins: tasty-bench (Test.Tasty.Bench as Criterion, Test.Tasty.Bench as Criterion.Main, Test.Tasty.Bench as Gauge, Test.Tasty.Bench as Gauge.Main)
This works vice versa as well: if you use tasty-bench
, but at some
point need a more comprehensive statistical analysis, it is easy to
switch temporarily back to criterion
.
How to write a benchmark?
Benchmarks are declared in a separate section of cabal
file:
cabal-version: 2.0 name: bench-fibo version: 0.0 build-type: Simple synopsis: Example of a benchmark benchmark bench-fibo main-is: BenchFibo.hs type: exitcode-stdio-1.0 build-depends: base, tasty-bench ghc-options: "-with-rtsopts=-A32m" if impl(ghc >= 8.6) ghc-options: -fproc-alignment=64
And here is BenchFibo.hs
:
import Test.Tasty.Bench fibo :: Int -> Integer fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2) main :: IO () main = defaultMain [ bgroup "Fibonacci numbers" [ bench "fifth" $ nf fibo 5 , bench "tenth" $ nf fibo 10 , bench "twentieth" $ nf fibo 20 ] ]
Since tasty-bench
provides an API compatible with criterion
, one can
refer to
its documentation
for more examples.
How to read results?
Running the example above (cabal
bench
or stack
bench
) results in
the following output:
All Fibonacci numbers fifth: OK (2.13s) 63 ns ± 3.4 ns tenth: OK (1.71s) 809 ns ± 73 ns twentieth: OK (3.39s) 104 μs ± 4.9 μs All 3 tests passed (7.25s)
The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall-clock time), its predicted mean CPU time was 63 nanoseconds and means of individual samples do not often diverge from it further than ±3.4 nanoseconds (double standard deviation). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.
Wall-clock time vs. CPU time
What time are we talking about? Both criterion
and gauge
by default
report wall-clock time, which is affected by any other application which
runs concurrently. Ideally benchmarks are executed on a dedicated server
without any other load, but — let’s face the truth — most of developers
run benchmarks on a laptop with a hundred other services and a window
manager, and watch videos while waiting for benchmarks to finish. That’s
the cause of a notorious “variance introduced by outliers: 88% (severely
inflated)” warning.
To alleviate this issue tasty-bench
measures CPU time by getCPUTime
instead of wall-clock time by default. It does not provide a perfect
isolation from other processes (e. g., if CPU cache is spoiled by others,
populating data back from RAM is your burden), but is a bit more stable.
Caveat: this means that for multithreaded algorithms tasty-bench
reports total elapsed CPU time across all cores, while criterion
and
gauge
print maximum of core’s wall-clock time. It also means that by
default tasty-bench
does not measure time spent out of process, e. g.,
calls to other executables. To work around this limitation use
--time-mode
command-line option or set it locally via TimeMode
option.
Statistical model
Here is a procedure used by tasty-bench
to measure execution time:
- Set \(n \leftarrow 1\).
- Measure execution time \(t_n\) of \(n\) iterations and execution time \(t_{2n}\) of \(2n\) iterations.
- Find \(t\) which minimizes deviation of \((nt,2nt)\) from \((t_n,t_{2n})\), namely \(t \leftarrow (t_n + 2t_{2n}) / 5n\).
- If deviation is small enough (see
--stdev
below) or time is running out soon (see--timeout
below), return \(t\) as a mean execution time. - Otherwise set \(n \leftarrow 2n\) and jump back to Step 2.
This is roughly similar to the linear regression approach which
criterion
takes, but we fit only two last points. This allows us to
simplify away all heavy-weight statistical analysis. More importantly,
earlier measurements, which are presumably shorter and noisier, do not
affect overall result. This is in contrast to criterion
, which fits
all measurements and is biased to use more data points corresponding to
shorter runs (it employs \(n \leftarrow 1.05n\) progression).
Mean time and its deviation does not say much about the distribution of individual timings. E. g., imagine a computation which (according to a coarse system timer) takes either 0 ms or 1 ms with equal probability. While one would be able to establish that its mean time is 0.5 ms with a very small deviation, this does not imply that individual measurements are anywhere near 0.5 ms. Even assuming an infinite precision of a system timer, the distribution of individual times is not known to be normal.
Obligatory disclaimer: statistics is a tricky matter, there is no
one-size-fits-all approach. In the absence of a good theory simplistic
approaches are as (un)sound as obscure ones. Those who seek statistical
soundness should rather collect raw data and process it themselves using
a proper statistical toolbox. Data reported by tasty-bench
is only of
indicative and comparative significance.
Memory usage
Configuring RTS to collect GC statistics (e. g., via
cabal
bench
--benchmark-options
'+RTS
-T'
or
stack
bench
--ba
'+RTS
-T'
) enables tasty-bench
to estimate and
report memory usage:
All Fibonacci numbers fifth: OK (2.13s) 63 ns ± 3.4 ns, 223 B allocated, 0 B copied, 2.0 MB peak memory tenth: OK (1.71s) 809 ns ± 73 ns, 2.3 KB allocated, 0 B copied, 4.0 MB peak memory twentieth: OK (3.39s) 104 μs ± 4.9 μs, 277 KB allocated, 59 B copied, 5.0 MB peak memory All 3 tests passed (7.25s)
This data is reported as per GHC.Stats.RTSStats fields:
Total size of data ever allocated since the start of the benchmark iteration. Even if data was immediately garbage collected and freed, it still counts.
Total size of data ever copied by GC (because it was alive and kicking) since the start of the benchmark iteration. Note that zero bytes often mean that the benchmark was too short to trigger GC at all.
Peak size of live data since the very start of the process. This is a global metric, it cumulatively grows and does not say much about individual benchmarks, but rather characterizes heap environment in which they are executed.
Combining tests and benchmarks
When optimizing an existing function, it is important to check that its
observable behavior remains unchanged. One can rebuild both tests and
benchmarks after each change, but it would be more convenient to run
sanity checks within benchmark itself. Since our benchmarks are
compatible with tasty
tests, we can easily do so.
Imagine you come up with a faster function myFibo
to generate
Fibonacci numbers:
import Test.Tasty.Bench import Test.Tasty.QuickCheck -- from tasty-quickcheck package fibo :: Int -> Integer fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2) myFibo :: Int -> Integer myFibo n = if n < 3 then toInteger n else myFibo (n - 1) + myFibo (n - 2) main :: IO () main = Test.Tasty.Bench.defaultMain -- not Test.Tasty.defaultMain [ bench "fibo 20" $ nf fibo 20 , bench "myFibo 20" $ nf myFibo 20 , testProperty "myFibo = fibo" $ \n -> fibo n === myFibo n ]
This outputs:
All fibo 20: OK (3.02s) 104 μs ± 4.9 μs myFibo 20: OK (1.99s) 71 μs ± 5.3 μs myFibo = fibo: FAIL *** Failed! Falsified (after 5 tests and 1 shrink): 2 1 /= 2 Use --quickcheck-replay=927711 to reproduce. 1 out of 3 tests failed (5.03s)
We see that myFibo
is indeed significantly faster than fibo
, but
unfortunately does not do the same thing. One should probably look for
another way to speed up generation of Fibonacci numbers.
Troubleshooting
If benchmarks take too long, set
--timeout
to limit execution time of individual benchmarks, andtasty-bench
will do its best to fit into a given time frame. Without--timeout
we rerun benchmarks until achieving a target precision set by--stdev
, which in a noisy environment of a modern laptop with GUI may take a lot of time.While
criterion
runs each benchmark at least for 5 seconds,tasty-bench
is happy to conclude earlier, if it does not compromise the quality of results. In our experimentstasty-bench
suites tend to finish earlier, even if some individual benchmarks take longer than withcriterion
.A common source of noisiness is garbage collection. Setting a larger allocation area (nursery) is often a good idea, either via
cabal
bench
--benchmark-options
'+RTS
-A32m'
orstack
bench
--ba
'+RTS
-A32m'
. Alternatively bake it intocabal
file asghc-options:
"-with-rtsopts=-A32m"
.
- Never compile benchmarks with
-fstatic-argument-transformation
, because it breaks a trick we use to force GHC into reevaluation of the same function application over and over again. If benchmark results look malformed like below, make sure that you are invoking
Test.Tasty.Bench.
defaultMain
and notTest.Tasty.
defaultMain
(the difference isconsoleBenchReporter
vs.consoleTestReporter
):All fibo 20: OK (1.46s) WithLoHi (Estimate {estMean = Measurement {measTime = 41529118775, measAllocs = 0, measCopied = 0, measMaxMem = 0}, estStdev = 1595055320}) (-Infinity) Infinity
If benchmarks fail with an error message
Unhandled resource. Probably a bug in the runner you're using.
or
Unexpected state of the resource (NotCreated) in getResource. Report as a tasty bug.
this is likely caused by
env
orenvWithCleanup
affecting benchmarks structure. You can useenv
to read test data fromIO
, but not to read benchmark names or affect their hierarchy in other way. This is a fundamental restriction oftasty
to list and filter benchmarks without launching missiles.Strict pattern-matching on resource is also prohibited. For instance, if it is a tuple, the second argument of
env
should use a lazy pattern match\~(a, b) -> ...
- If benchmarks fail with
Test dependencies form a loop
orTest dependencies have cycles
, this is likely because ofbcompare
, which compares a benchmark with itself. Locating a benchmark in a global environment may be tricky, please refer totasty
documentation for details and consider usinglocateBenchmark
. When seeing
This benchmark takes more than 100 seconds. Consider setting --timeout, if this is unexpected (or to silence this warning).
do follow the advice: abort benchmarks and pass
-t100
or similar. Unless you are benchmarking a very computationally expensive function, a single benchmark should stabilize after a couple of seconds. This warning is a sign that your environment is too noisy, in which casetasty-bench
will continue trying with exponentially longer intervals, often unproductively.The following error can be thrown when benchmarks are built with
ghc-options: -threaded
:Benchmarks must not be run concurrently. Please pass -j1 and/or avoid +RTS -N.
The underlying cause is that
tasty
runs tests concurrently, which is harmful for reliable performance measurements. Make sure to usetasty-bench >= 0.3.4
and invokeTest.Tasty.Bench.
defaultMain
and notTest.Tasty.
defaultMain
. Note thatlocalOption
(NumThreads
1) quashes the warning, but does not eliminate the cause.- If benchmarks using GHC 9.4.4+ segfault on Windows, check that you
are not using non-moving garbage collector
--nonmoving-gc
. This is likely caused by GHC issue. Previous releases oftasty-bench
recommended enabling--nonmoving-gc
to stabilise benchmarks, but it’s discouraged now. If you see
<stdout>: commitBuffer: invalid argument (cannot encode character '\177')
or
Uncaught exception ghc-internal:GHC.Internal.IO.Exception.IOException: <stdout>: commitBuffer: invalid argument (cannot encode character '\956')
it means that your locale does not support UTF-8.
tasty-bench
makes an effort to force locale to UTF-8, but sometimes, when benchmarks are a part of a larger application, it’s impossible to do so. In such case runlocale
-a
to list available locales and set a UTF-8-capable one (e. g.,export
LANG=C.UTF-8
) before starting benchmarks.
Isolating interfering benchmarks
One difficulty of benchmarking in Haskell is that it is hard to isolate
benchmarks so that they do not interfere. Changing the order of
benchmarks or skipping some of them has an effect on heap’s layout and
thus affects garbage collection. This issue is well attested in
both
criterion
and
gauge
.
Usually (but not always) skipping some benchmarks speeds up remaining ones. That’s because once a benchmark allocated heap which for some reason was not promptly released afterwards (e. g., it forced a top-level thunk in an underlying library), all further benchmarks are slowed down by garbage collector processing this additional amount of live data over and over again.
There are several mitigation strategies. First of all, giving garbage
collector more breathing space by +RTS
-A32m
(or more) is often good
enough.
Further, avoid using top-level bindings to store large test data. Once
such thunks are forced, they remain allocated forever, which affects
detrimentally subsequent unrelated benchmarks. Treat them as external
data, supplied via env
: instead of
largeData :: String largeData = replicate 1000000 'a' main :: IO () main = defaultMain [ bench "large" $ nf length largeData, ... ]
use
import Control.DeepSeq (force) import Control.Exception (evaluate) main :: IO () main = defaultMain [ env (evaluate (force (replicate 1000000 'a'))) $ \largeData -> bench "large" $ nf length largeData, ... ]
Finally, as an ultimate measure to reduce interference between benchmarks, one can run each of them in a separate process. We do not quite recommend this approach, but if you are desperate, here is how:
cabal run -v0 all:benches -- -l | sed -e 's/[\"]/\\\\\\&/g' | while read -r name; do cabal run -v0 all:benches -- -p '$0 == "'"$name"'"'; done
This assumes that there is a single benchmark suite in the project and that benchmark names do not contain newlines.
Comparison against baseline
One can compare benchmark results against an earlier run in an automatic way.
When using this feature, it’s especially important to compile benchmarks
with
ghc-options:
-fproc-alignment
=64
,
otherwise results could be skewed by intermittent changes in cache-line
alignment.
Firstly, run tasty-bench
with --csv
FILE
key to dump results to
FILE
in CSV format (it could be a good idea to set smaller --stdev
,
if possible):
Name,Mean (ps),2*Stdev (ps) All.Fibonacci numbers.fifth,48453,4060 All.Fibonacci numbers.tenth,637152,46744 All.Fibonacci numbers.twentieth,81369531,3342646
Now modify implementation and rerun benchmarks with --baseline
FILE
key. This produces a report as follows:
All Fibonacci numbers fifth: OK (0.44s) 53 ns ± 2.7 ns, 8% more than baseline tenth: OK (0.33s) 641 ns ± 59 ns, same as baseline twentieth: OK (0.36s) 77 μs ± 6.4 μs, 5% less than baseline All 3 tests passed (1.50s)
You can also fail benchmarks, which deviate too far from baseline, using
--fail-if-slower
and --fail-if-faster
options. For example, setting
both of them to 6 will fail the first benchmark above (because it is
more than 6% slower), but the last one still succeeds (even while it is
measurably faster than baseline, deviation is less than 6%). Consider
also using --hide-successes
to show only problematic benchmarks, or
even tasty-rerun
package to focus on rerunning failing items only.
If you wish to compare two CSV reports non-interactively, here is a
handy awk
incantation:
awk 'BEGIN{FS=",";OFS=",";print "Name,Old,New,Ratio"}FNR==1{trueNF=NF;next}NF<trueNF{print "Benchmark names should not contain newlines";exit 1}FNR==NR{oldTime=$(NF-trueNF+2);NF-=trueNF-1;a[$0]=oldTime;next}{newTime=$(NF-trueNF+2);NF-=trueNF-1;if(a[$0]){print $0,a[$0],newTime,newTime/a[$0];gs+=log(newTime/a[$0]);gc++}}END{if(gc>0)print "Geometric mean,,",exp(gs/gc)}' old.csv new.csv
A larger shell snippet to compare two git
commits can be found in
compare_benches.sh
.
Note that columns in CSV report are different from what criterion
or
gauge
would produce. If names do not contain commas, missing columns
can be faked this way:
awk 'BEGIN{FS=",";OFS=",";print "Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);stddev=$(NF-trueNF+3);NF-=trueNF-1;print $0,mean/1e12,mean/1e12,mean/1e12,stddev/2e12,stddev/2e12,stddev/2e12}'
To fake gauge
in --csvraw
mode use
awk 'BEGIN{FS=",";OFS=",";print "name,iters,time,cycles,cpuTime,utime,stime,maxrss,minflt,majflt,nvcsw,nivcsw,allocated,numGcs,bytesCopied,mutatorWallSeconds,mutatorCpuSeconds,gcWallSeconds,gcCpuSeconds"}NR==1{trueNF=NF;next}NF<trueNF{print $0;next}{mean=$(NF-trueNF+2);fourth=$(NF-trueNF+4);fifth=$(NF-trueNF+5);sixth=$(NF-trueNF+6);NF-=trueNF-1;print $0,1,mean/1e12,0,mean/1e12,mean/1e12,0,sixth+0,0,0,0,0,fourth+0,0,fifth+0,0,0,0,0}'
Comparison between benchmarks
You can also compare benchmarks to each other without any external tools, all in the comfort of your terminal.
import Test.Tasty.Bench fibo :: Int -> Integer fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2) main :: IO () main = defaultMain [ bgroup "Fibonacci numbers" [ bcompare "tenth" $ bench "fifth" $ nf fibo 5 , bench "tenth" $ nf fibo 10 , bcompare "tenth" $ bench "twentieth" $ nf fibo 20 ] ]
This produces a report, comparing mean times of fifth
and twentieth
to tenth
:
All Fibonacci numbers fifth: OK (16.56s) 121 ns ± 2.6 ns, 0.08x tenth: OK (6.84s) 1.6 μs ± 31 ns twentieth: OK (6.96s) 203 μs ± 4.1 μs, 128.36x
To locate a baseline benchmark in a larger suite use locateBenchmark
.
One can leverage comparisons between benchmarks to implement portable
performance tests, expressing properties like “this algorithm must be at
least twice faster than that one” or “this operation should not be more
than thrice slower than that”. This can be achieved with
bcompareWithin
, which takes an acceptable interval of performance as
an argument.
Plotting results
Users can dump results into CSV with --csv
FILE
and plot them using
gnuplot
or other software. But for convenience there is also a
built-in quick-and-dirty SVG plotting feature, which can be invoked by
passing --svg
FILE
. Here is a sample of its output:
Build flags
Build flags are a brittle subject and users do not normally need to touch them.
- If you find yourself in an environment, where
tasty
is not available and you have access to boot packages only, you can still usetasty-bench
! Just copyTest/Tasty/Bench.hs
to your project (imagine it like a header-only C library). It will provide you with functions to buildBenchmarkable
and run them manually viameasureCpuTime
. This mode of operation can be also configured by disabling Cabal flagtasty
.
Command-line options
Use --help
to list all command-line options.
-p
,--pattern
This is a standard
tasty
option, which allows filtering benchmarks by a pattern orawk
expression. Please refer totasty
documentation for details.-t
,--timeout
This is a standard
tasty
option, setting timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long:tasty-bench
will make an effort to report results (even if of subpar quality) before timeout. Setting timeout too tight (insufficient for at least three iterations) will result in a benchmark failure. One can adjust it locally for a group of benchmarks, e. g.,localOption
(mkTimeout
100000000) for 100 seconds.--stdev
Target relative standard deviation of measurements in percents (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise. It can also be adjusted locally for a group of benchmarks, e. g.,
localOption
(RelStDev
0.02). If benchmarking takes far too long, consider setting--timeout
, which will interrupt benchmarks, potentially before reaching the target deviation.--csv
File to write results in CSV format.
--baseline
File to read baseline results in CSV format (as produced by
--csv
).--fail-if-slower
,--fail-if-faster
Upper bounds of acceptable slow down / speed up in percents. If a benchmark is unacceptably slower / faster than baseline (see
--baseline
), it will be reported as failed. Can be used in conjunction with a standardtasty
option--hide-successes
to show only problematic benchmarks. Both options can be adjusted locally for a group of benchmarks, e. g.,localOption
(FailIfSlower
0.10).--svg
File to plot results in SVG format.
--time-mode
Whether to measure CPU time (
cpu
, default) or wall-clock time (wall
).+RTS
-T
Estimate and report memory usage.
Custom command-line options
As usual with tasty
, it is easy to extend benchmarks with custom
command-line options. Here is an example:
import Data.Proxy import Test.Tasty.Bench import Test.Tasty.Ingredients.Basic import Test.Tasty.Options import Test.Tasty.Runners newtype RandomSeed = RandomSeed Int instance IsOption RandomSeed where defaultValue = RandomSeed 42 parseValue = fmap RandomSeed . safeRead optionName = pure "seed" optionHelp = pure "Random seed used in benchmarks" main :: IO () main = do let customOpts = [Option (Proxy :: Proxy RandomSeed)] ingredients = includingOptions customOpts : benchIngredients opts <- parseOptions ingredients benchmarks let RandomSeed seed = lookupOption opts defaultMainWithIngredients ingredients benchmarks benchmarks :: Benchmark benchmarks = bgroup "All" []
Synopsis
- defaultMain :: [Benchmark] -> IO ()
- type Benchmark = TestTree
- bench :: String -> Benchmarkable -> Benchmark
- bgroup :: String -> [Benchmark] -> Benchmark
- bcompare :: String -> Benchmark -> Benchmark
- bcompareWithin :: Double -> Double -> String -> Benchmark -> Benchmark
- env :: NFData env => IO env -> (env -> Benchmark) -> Benchmark
- envWithCleanup :: NFData env => IO env -> (env -> IO a) -> (env -> Benchmark) -> Benchmark
- newtype Benchmarkable = Benchmarkable {
- unBenchmarkable :: Word64 -> IO ()
- nf :: NFData b => (a -> b) -> a -> Benchmarkable
- whnf :: (a -> b) -> a -> Benchmarkable
- nfIO :: NFData a => IO a -> Benchmarkable
- whnfIO :: IO a -> Benchmarkable
- nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable
- whnfAppIO :: (a -> IO b) -> a -> Benchmarkable
- measureCpuTime :: Timeout -> RelStDev -> Benchmarkable -> IO Double
- measureCpuTimeAndStDev :: Timeout -> RelStDev -> Benchmarkable -> IO (Double, Double)
- benchIngredients :: [Ingredient]
- consoleBenchReporter :: Ingredient
- csvReporter :: Ingredient
- svgReporter :: Ingredient
- newtype RelStDev = RelStDev Double
- newtype FailIfSlower = FailIfSlower Double
- newtype FailIfFaster = FailIfFaster Double
- newtype CsvPath = CsvPath FilePath
- newtype BaselinePath = BaselinePath FilePath
- newtype SvgPath = SvgPath FilePath
- data TimeMode
- locateBenchmark :: [String] -> Expr
- mapLeafBenchmarks :: ([String] -> Benchmark -> Benchmark) -> Benchmark -> Benchmark
Running Benchmark
defaultMain :: [Benchmark] -> IO () Source #
Run benchmarks and report results, providing an interface
compatible with Criterion.
defaultMain
and
Gauge.
defaultMain
.
Since: 0.1
bench :: String -> Benchmarkable -> Benchmark Source #
Attach a name to Benchmarkable
.
This is actually a synonym of singleTest
to
provide an interface compatible with Criterion.
bench
and Gauge.
bench
.
Since: 0.1
Arguments
:: String |
|
-> Benchmark | Benchmark (or a group of benchmarks)
to be compared against the baseline benchmark by dividing measured mean times.
The result is reported by |
-> Benchmark |
Compare benchmarks, reporting relative speed up or slow down.
This function is a vague reminiscence of bcompare
, which existed in pre-1.0
versions of criterion
, but their types are incompatible. Under the hood
bcompare
is a thin wrapper over after
.
Here is a basic example:
import Test.Tasty.Bench fibo :: Int -> Integer fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2) main :: IO () main = defaultMain [ bgroup "Fibonacci numbers" [ bcompare "tenth" $ bench "fifth" $ nf fibo 5 , bench "tenth" $ nf fibo 10 , bcompare "tenth" $ bench "twentieth" $ nf fibo 20 ] ]
More complex examples:
- https://hackage.haskell.org/package/chimera-0.3.3.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/fast-digits-0.3.1.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/unicode-data-0.4.0.1/src/bench/Main.hs
Since: 0.2.4
Arguments
:: Double | Lower bound of relative speed up. |
-> Double | Upper bound of relative speed up. |
-> String |
|
-> Benchmark | Benchmark to compare against baseline. |
-> Benchmark |
Same as bcompare
, but takes expected lower and upper bounds of
comparison. If the result is not within provided bounds, benchmark fails.
This allows to create portable performance tests: instead of comparing
to an absolute timeout or to previous runs, you can state that one implementation
of an algorithm must be faster than another.
E. g., bcompareWithin
2.0 3.0 passes only if a benchmark is at least 2x
and at most 3x slower than a baseline.
Since: 0.3.1
env :: NFData env => IO env -> (env -> Benchmark) -> Benchmark Source #
Run benchmarks in the given environment, usually reading large input data from file.
One might wonder why env
is needed,
when we can simply read all input data
before calling defaultMain
. The reason is that large data
dangling in the heap causes longer garbage collection
and slows down all benchmarks, even those which do not use it at all.
It is instrumental not only for proper IO
actions,
but also for a large statically-known data as well. Instead of a top-level
definition, which once evaluated will slow down garbage collection
during all subsequent benchmarks,
largeData :: String largeData = replicate 1000000 'a' main :: IO () main = defaultMain [ bench "large" $ nf length largeData, ... ]
use
main :: IO () main = defaultMain [ env (pure (replicate 1000000 'a')) $ \largeData -> bench "large" $ nf length largeData, ... ]
Even with env
, it's advisable to store input data in as few heap objects
as possible. ByteArray
(ideally pinned)
or unboxed Vector
are good, boxed arrays are worse, lists and trees are bad.
Test.Tasty.Bench.
env
is provided only for the sake of
compatibility with Criterion.
env
and
Gauge.
env
, and involves unsafePerformIO
. Consider using
withResource
instead.
When working with a mutable environment, bear in mind that it is threaded
through all iterations of a benchmark. tasty-bench
does not roll it back
or reset, it's user's resposibility. You might have better luck
with Criterion.
perBatchEnv
or Criterion.
perRunEnv
.
defaultMain
requires that the hierarchy of benchmarks and their names is
independent of underlying IO
actions. While executing IO
inside bench
via nfIO
is fine, and reading test data from files via env
is also fine,
using env
to choose benchmarks or their names depending on IO
side effects
will throw a rather cryptic error message:
Unhandled resource. Probably a bug in the runner you're using.
Strict pattern-matching on resource is also prohibited. For
instance, if it is a tuple, the second argument of env
should use
a lazy pattern match \~(a, b) -> ...
Since: 0.2
envWithCleanup :: NFData env => IO env -> (env -> IO a) -> (env -> Benchmark) -> Benchmark Source #
Similar to env
, but includes an additional argument
to clean up created environment.
Provided only for the sake of compatibility with
Criterion.
envWithCleanup
and
Gauge.
envWithCleanup
, and involves
unsafePerformIO
. Consider using withResource
instead.
Since: 0.2
Creating Benchmarkable
newtype Benchmarkable Source #
Something that can be benchmarked, produced by nf
, whnf
, nfIO
, whnfIO
,
nfAppIO
, whnfAppIO
below.
Drop-in replacement for Criterion.
Benchmarkable
and
Gauge.
Benchmarkable
.
Since: 0.1
Constructors
Benchmarkable | Since: 0.3 |
Fields
|
Instances
IsTest Benchmarkable Source # | |
Defined in Test.Tasty.Bench |
nf :: NFData b => (a -> b) -> a -> Benchmarkable Source #
nf
f
x
measures time to compute
a normal form (by means of rnf
, not force
)
of an application of f
to x
.
This does not include time to evaluate f
or x
themselves.
Ideally x
should be a primitive data type like Int
.
The same thunk of x
is shared by multiple calls of f
. We cannot evaluate
x
beforehand: there is no NFData
a
constraint, and potentially x
may
be an infinite structure. Thus x
will be evaluated in course of the first
application of f
. This noisy measurement is to be discarded soon,
but if x
is not a primitive data type, consider forcing its evaluation
separately, e. g., via env
or withResource
.
Here is a textbook anti-pattern: nf
sum
[1..1000000]
.
Since an input list is shared by multiple invocations of sum
,
it will be allocated in memory in full, putting immense pressure
on garbage collector. Also no list fusion will happen.
A better approach is nf
(\n
->
sum
[1..n]
) 1000000
.
It is preferable that the return type of the function under measurement
is inhabited enough to depend genuinely on all computations and is not simply b ~ ()
.
Otherwise GHC might get aggressive and optimise the payload away.
If you are measuring an inlinable function,
it is prudent to ensure that its invocation is fully saturated,
otherwise inlining will not happen. That's why one can often
see nf
(\n
->
f
n
) x
instead of nf
f
x
.
Same applies to rewrite rules.
While tasty-bench
is capable to perform micro- and even nanobenchmarks,
such measurements are noisy and involve an overhead. Results are more reliable
when f
x
takes at least several milliseconds.
Remember that forcing a normal form requires an additional
traverse of the structure. In certain scenarios (imagine benchmarking tail
),
especially when NFData
instance is badly written,
this traversal may take non-negligible time and affect results.
nf
f
is equivalent to whnf
(rnf
.
f
), but not to
whnf
(force
.
f
). The latter retains the result
in memory until it is fully evaluated, while the former allows
evaluated parts of the result to be garbage-collected immediately.
For users of {-# LANGUAGE LinearTypes #-}
: if f
is a linear function,
then nf
f
x
is ill-typed, but you can use nf
(\y -> f y)
x
instead.
Drop-in replacement for Criterion.
nf
and
Gauge.
nf
.
Since: 0.1
whnf :: (a -> b) -> a -> Benchmarkable Source #
whnf
f
x
measures time to compute
a weak head normal form of an application of f
to x
.
This does not include time to evaluate f
or x
themselves.
Ideally x
should be a primitive data type like Int
.
The same thunk of x
is shared by multiple calls of f
. We cannot evaluate
x
beforehand: there is no NFData
a
constraint, and potentially x
may
be an infinite structure. Thus x
will be evaluated in course of the first
application of f
. This noisy measurement is to be discarded soon,
but if x
is not a primitive data type, consider forcing its evaluation
separately, e. g., via env
or withResource
.
Computing only a weak head normal form is
rarely what intuitively is meant by "evaluation".
Beware that many educational materials contain examples with whnf
:
this is a wrong default.
Unless you understand precisely, what is measured,
it is recommended to use nf
instead.
Here is a textbook anti-pattern: whnf
(replicate
1000000
) 1
.
This will succeed in a matter of nanoseconds, because weak head
normal form forces only the first element of the list.
Drop-in replacement for Criterion.
whnf
and Gauge.
whnf
.
Since: 0.1
nfIO :: NFData a => IO a -> Benchmarkable Source #
nfIO
x
measures time to evaluate side-effects of x
and compute its normal form
(by means of rnf
, not force
).
Pure subexpression of an effectful computation x
may be evaluated only once and get cached. For example,
GHC is likely to float x
out of nfIO
(pure
x
) and
evaluate in only once, which leaves nfIO
to measure pure
only
with results in nanosecond range.
To avoid surprising results it is usually preferable
to use nfAppIO
instead.
Remember that forcing a normal form requires an additional
traverse of the structure. In certain scenarios,
especially when NFData
instance is badly written,
this traversal may take non-negligible time and affect results.
A typical use case is nfIO
(readFile
"foo.txt"
).
However, if your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env
or withResource
.
Drop-in replacement for Criterion.
nfIO
and Gauge.
nfIO
.
Since: 0.1
whnfIO :: IO a -> Benchmarkable Source #
whnfIO
x
measures time to evaluate side-effects of x
and compute its weak head normal form.
Pure subexpression of an effectful computation x
may be evaluated only once and get cached. For example,
GHC is likely to float x
out of whnfIO
(pure
x
) and
evaluate in only once, which leaves whnfIO
to measure pure
only
with results in nanosecond range.
To avoid surprising results it is usually preferable
to use whnfAppIO
instead.
Computing only a weak head normal form is
rarely what intuitively is meant by "evaluation".
Unless you understand precisely, what is measured,
it is recommended to use nfIO
instead.
Lazy I/O is treacherous.
If your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env
or withResource
.
Drop-in replacement for Criterion.
whnfIO
and Gauge.
whnfIO
.
Since: 0.1
nfAppIO :: NFData b => (a -> IO b) -> a -> Benchmarkable Source #
nfAppIO
f
x
measures time to evaluate side-effects of
an application of f
to x
and compute its normal form
(by means of rnf
, not force
).
This does not include time to evaluate f
or x
themselves.
Ideally x
should be a primitive data type like Int
.
The same thunk of x
is shared by multiple calls of f
. We cannot evaluate
x
beforehand: there is no NFData
a
constraint, and potentially x
may
be an infinite structure. Thus x
will be evaluated in course of the first
application of f
. This noisy measurement is to be discarded soon,
but if x
is not a primitive data type, consider forcing its evaluation
separately, e. g., via env
or withResource
.
Remember that forcing a normal form requires an additional
traverse of the structure. In certain scenarios,
especially when NFData
instance is badly written,
this traversal may take non-negligible time and affect results.
A typical use case is nfAppIO
readFile
"foo.txt"
.
However, if your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env
or withResource
.
Drop-in replacement for Criterion.
nfAppIO
and Gauge.
nfAppIO
.
Since: 0.1
whnfAppIO :: (a -> IO b) -> a -> Benchmarkable Source #
whnfAppIO
f
x
measures time to evaluate side-effects of
an application of f
to x
and compute its weak head normal form.
This does not include time to evaluate f
or x
themselves.
Ideally x
should be a primitive data type like Int
.
The same thunk of x
is shared by multiple calls of f
. We cannot evaluate
x
beforehand: there is no NFData
a
constraint, and potentially x
may
be an infinite structure. Thus x
will be evaluated in course of the first
application of f
. This noisy measurement is to be discarded soon,
but if x
is not a primitive data type, consider forcing its evaluation
separately, e. g., via env
or withResource
.
Computing only a weak head normal form is
rarely what intuitively is meant by "evaluation".
Unless you understand precisely, what is measured,
it is recommended to use nfAppIO
instead.
Lazy I/O is treacherous.
If your goal is not to benchmark I/O per se,
but just read input data from a file, it is cleaner to
use env
or withResource
.
Drop-in replacement for Criterion.
whnfAppIO
and Gauge.
whnfAppIO
.
Since: 0.1
measureCpuTime :: Timeout -> RelStDev -> Benchmarkable -> IO Double Source #
An internal routine to measure CPU execution time in seconds
for a given timeout (put NoTimeout
, or mkTimeout
100000000 for 100 seconds)
and a target relative standard deviation
(put RelStDev
0.05 for 5% or RelStDev
(1/0) to run only one iteration).
Timeout
takes soft priority over RelStDev
: this function prefers
to finish in time even if at cost of precision. However, timeout is guidance
not guarantee: measureCpuTime
can take longer, if there is not enough time
to run at least thrice or an iteration takes unusually long.
Since: 0.3
measureCpuTimeAndStDev :: Timeout -> RelStDev -> Benchmarkable -> IO (Double, Double) Source #
Same as measureCpuTime
, but returns both CPU execution time
and its standard deviation.
Since: 0.3.4
Ingredients
benchIngredients :: [Ingredient] Source #
List of default benchmark ingredients. This is what defaultMain
runs.
Since: 0.2
consoleBenchReporter :: Ingredient Source #
Run benchmarks and report results
in a manner similar to consoleTestReporter
.
If --baseline
FILE
command line option is specified,
compare results against an earlier run and mark
too slow / too fast benchmarks as failed in accordance to
bounds specified by --fail-if-slower
PERCENT
and --fail-if-faster
PERCENT
.
Since: 0.2
csvReporter :: Ingredient Source #
Run benchmarks and save results in CSV format.
It activates when --csv
FILE
command line option is specified.
Since: 0.1
svgReporter :: Ingredient Source #
Run benchmarks and plot results in SVG format.
It activates when --svg
FILE
command line option is specified.
Since: 0.2.4
In addition to --stdev
command-line option,
one can adjust target relative standard deviation
for individual benchmarks and groups of benchmarks
using adjustOption
and localOption
.
E. g., set target relative standard deviation to 2% as follows:
import Test.Tasty (localOption) localOption (RelStDev 0.02) (bgroup [...])
If you set RelStDev
to infinity,
a benchmark will be executed
only once and its standard deviation will be recorded as zero.
This is rather a blunt approach, but it might be a necessary evil
for extremely long benchmarks. If you wish to run all benchmarks
only once, use command-line option --stdev
Infinity
.
Since: 0.2
Instances
Num RelStDev Source # | Since: 0.4 |
Read RelStDev Source # | |
Fractional RelStDev Source # | Since: 0.4 |
Show RelStDev Source # | |
Eq RelStDev Source # | Since: 0.4 |
Ord RelStDev Source # | Since: 0.4 |
Defined in Test.Tasty.Bench | |
IsOption RelStDev Source # | |
Defined in Test.Tasty.Bench Methods parseValue :: String -> Maybe RelStDev # optionName :: Tagged RelStDev String # optionHelp :: Tagged RelStDev String # showDefaultValue :: RelStDev -> Maybe String # |
newtype FailIfSlower Source #
In addition to --fail-if-slower
command-line option,
one can adjust an upper bound of acceptable slow down
in comparison to baseline for
individual benchmarks and groups of benchmarks
using adjustOption
and localOption
.
E. g., set upper bound of acceptable slow down to 10% as follows:
import Test.Tasty (localOption) localOption (FailIfSlower 0.10) (bgroup [...])
Since: 0.2
Constructors
FailIfSlower Double |
Instances
newtype FailIfFaster Source #
In addition to --fail-if-faster
command-line option,
one can adjust an upper bound of acceptable speed up
in comparison to baseline for
individual benchmarks and groups of benchmarks
using adjustOption
and localOption
.
E. g., set upper bound of acceptable speed up to 10% as follows:
import Test.Tasty (localOption) localOption (FailIfFaster 0.10) (bgroup [...])
Since: 0.2
Constructors
FailIfFaster Double |
Instances
A path to write results in CSV format, populated by --csv
.
This is an option of csvReporter
and can be set only globally.
Modifying it via adjustOption
or localOption
does not have any effect.
One can however pass it to tryIngredients
benchIngredients
. For example,
here is how to set a default CSV location:
import Data.Maybe import System.Exit import Test.Tasty.Bench import Test.Tasty.Ingredients import Test.Tasty.Options import Test.Tasty.Runners main :: IO () main = do let benchmarks = bgroup "All" ... opts <- parseOptions benchIngredients benchmarks let opts' = changeOption (Just . fromMaybe (CsvPath "foo.csv")) opts case tryIngredients benchIngredients opts' benchmarks of Nothing -> exitFailure Just mb -> mb >>= \b -> if b then exitSuccess else exitFailure
Since: 0.3
newtype BaselinePath Source #
A path to read baseline results in CSV format, populated by --baseline
.
This is an option of csvReporter
and can be set only globally.
Modifying it via adjustOption
or localOption
does not have any effect.
One can however pass it to tryIngredients
benchIngredients
.
Since: 0.3
Constructors
BaselinePath FilePath |
Instances
Eq BaselinePath Source # | Since: 0.4 |
Defined in Test.Tasty.Bench | |
Ord BaselinePath Source # | Since: 0.4 |
Defined in Test.Tasty.Bench Methods compare :: BaselinePath -> BaselinePath -> Ordering # (<) :: BaselinePath -> BaselinePath -> Bool # (<=) :: BaselinePath -> BaselinePath -> Bool # (>) :: BaselinePath -> BaselinePath -> Bool # (>=) :: BaselinePath -> BaselinePath -> Bool # max :: BaselinePath -> BaselinePath -> BaselinePath # min :: BaselinePath -> BaselinePath -> BaselinePath # | |
IsOption (Maybe BaselinePath) Source # | |
Defined in Test.Tasty.Bench Methods defaultValue :: Maybe BaselinePath # parseValue :: String -> Maybe (Maybe BaselinePath) # optionName :: Tagged (Maybe BaselinePath) String # optionHelp :: Tagged (Maybe BaselinePath) String # |
A path to plot results in SVG format, populated by --svg
.
This is an option of svgReporter
and can be set only globally.
Modifying it via adjustOption
or localOption
does not have any effect.
One can however pass it to tryIngredients
benchIngredients
.
Since: 0.3
Whether to measure CPU time or wall-clock time.
Normally CpuTime
is a better option (and default),
but consider switching to WallTime
to measure multithreaded algorithms or time spent in external processes.
One can switch the default measurement mode globally
using --time-mode
command-line option,
but it is usually better to adjust the mode locally:
import Test.Tasty (localOption) localOption WallTime (bgroup [...])
You can measure both times and report their ratio with the following gadget:
bgroup "Foo" [ localOption WallTime $ bench "WallTime" foo , bcompare "Foo.WallTime" $ localOption CpuTime $ bench "CPUTime" foo ]
Since: 0.3.2
Instances
IsOption TimeMode Source # | |
Defined in Test.Tasty.Bench Methods parseValue :: String -> Maybe TimeMode # optionName :: Tagged TimeMode String # optionHelp :: Tagged TimeMode String # showDefaultValue :: TimeMode -> Maybe String # |
Utilities
locateBenchmark :: [String] -> Expr Source #
Construct an AWK expression to locate an individual element or elements in Benchmark
by the suffix of the path. Names are listed in reverse order:
from bench
's own name to a name of the outermost bgroup
.
This function is meant to be used in conjunction with bcompare
, e. g.,
bcompare
(printAwkExpr
(locateBenchmark
path
)).
See also mapLeafBenchmarks
.
Real world examples:
- https://hackage.haskell.org/package/chimera-0.3.3.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/text-builder-linear-0.1.1/src/bench/Main.hs
Since: 0.3.2
mapLeafBenchmarks :: ([String] -> Benchmark -> Benchmark) -> Benchmark -> Benchmark Source #
Map leaf benchmarks (bench
, not bgroup
) with a provided function,
which has an access to leaf's reversed path.
This helper is useful for bulk application of bcompare
.
See also locateBenchmark
.
Real world examples:
- https://hackage.haskell.org/package/chimera-0.3.3.0/src/bench/Bench.hs
- https://hackage.haskell.org/package/text-builder-linear-0.1.1/src/bench/Main.hs
Since: 0.3.2