BSD-3-Clause licensed by Bryan O'Sullivan
Maintained by Ryan Scott
This version can be pinned in stack with: criterion-1.6.4.1@sha256:e17b280bf4e461a4a14bc1c0a8fa3df60c8e2ca313cb3f2f12d4475d19d10120,5104

Criterion: robust, reliable performance measurement

criterion is a library that makes accurate microbenchmarking in Haskell easy.

Features

  • The simple API hides a lot of automation and details that you shouldn’t need to worry about.

  • Sophisticated, high-resolution analysis which can accurately measure operations that run in as little as a few hundred picoseconds.

  • Output to active HTML (with JavaScript charts), CSV, and JSON. Write your own report templates to customize exactly how your results are presented.

  • Linear regression model that allows measuring the effects of garbage collection and other factors.

  • Measurements are cross-validated to ensure that sources of significant noise (usually other activity on the system) can be identified.

To get started, read the tutorial below, and take a look at the programs in the examples directory.

Credits and contacts

This library is written by Bryan O’Sullivan and maintained by Ryan Scott. Please report bugs via the GitHub issue tracker.

Tutorial

Getting started

Here’s Fibber.hs: a simple and complete benchmark, measuring the performance of the ever-ridiculous fib function.

{- cabal:
build-depends: base, criterion
-}

import Criterion.Main

-- The function we're benchmarking.
fib :: Int -> Int
fib m | m < 0     = error "negative!"
      | otherwise = go m
  where
    go 0 = 0
    go 1 = 1
    go n = go (n - 1) + go (n - 2)

-- Our benchmark harness.
main = defaultMain [
  bgroup "fib" [ bench "1"  $ whnf fib 1
               , bench "5"  $ whnf fib 5
               , bench "9"  $ whnf fib 9
               , bench "11" $ whnf fib 11
               ]
  ]

(examples/Fibber.hs)

The defaultMain function takes a list of Benchmark values, each of which describes a function to benchmark. (We’ll come back to bench and whnf shortly, don’t worry.)

To maximise our convenience, defaultMain will parse command line arguments and then run whichever benchmarks we ask for. Let’s run our benchmark program (this might take some time if you have never used criterion before, since the library has to be downloaded and compiled).

$ cabal run Fibber.hs
benchmarking fib/1
time                 13.77 ns   (13.49 ns .. 14.07 ns)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 13.56 ns   (13.49 ns .. 13.70 ns)
std dev              305.1 ps   (64.14 ps .. 532.5 ps)
variance introduced by outliers: 36% (moderately inflated)

benchmarking fib/5
time                 173.9 ns   (172.8 ns .. 175.6 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 173.8 ns   (173.1 ns .. 175.4 ns)
std dev              3.149 ns   (1.842 ns .. 5.954 ns)
variance introduced by outliers: 23% (moderately inflated)

benchmarking fib/9
time                 1.219 μs   (1.214 μs .. 1.228 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.219 μs   (1.216 μs .. 1.223 μs)
std dev              12.43 ns   (9.907 ns .. 17.29 ns)

benchmarking fib/11
time                 3.253 μs   (3.246 μs .. 3.260 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.248 μs   (3.243 μs .. 3.254 μs)
std dev              18.94 ns   (16.57 ns .. 21.95 ns)

Even better, the --output option directs our program to write a report to the file fibber.html.

$ cabal run Fibber.hs -- --output fibber.html
...similar output as before...

Open fibber.html in your browser to see the complete report. If you mouse over the data points in the charts, you’ll see that they are live, giving additional information about what’s being displayed.

Understanding charts

A report begins with a summary of all the numbers measured. Underneath is a breakdown of every benchmark, each with two charts and some explanation.

The chart on the left is a kernel density estimate (also known as a KDE) of time measurements. This graphs the probability of any given time measurement occurring. A spike indicates that a measurement of a particular time occurred; its height indicates how often that measurement was repeated.

[!NOTE] Why not use a histogram?

A more popular alternative to the KDE for this kind of display is the histogram. Why do we use a KDE instead? In order to get good information out of a histogram, you have to choose a suitable bin size. This is a fiddly manual task. In contrast, a KDE is likely to be informative immediately, with no configuration required.

The chart on the right contains the raw measurements from which the kernel density estimate was built. The $x$ axis indicates the number of loop iterations, while the $y$ axis shows measured execution time for the given number of iterations. The line “behind” the values is a linear regression generated from this data. Ideally, all measurements will be on (or very near) this line.

Understanding the data under a chart

Underneath the chart for each benchmark is a small table of information that looks like this.

                     lower bound   estimate   upper bound
OLS regression       31.0 ms       37.4 ms    42.9 ms
R² goodness-of-fit   0.887         0.942      0.994
Mean execution time  34.8 ms       37.0 ms    43.1 ms
Standard deviation   2.11 ms       6.49 ms    11.0 ms

The first two rows are the results of a linear regression run on the measurements displayed in the right-hand chart.

  • “OLS regression” estimates the time needed for a single execution of the activity being benchmarked, using an ordinary least-squares regression model. This number should be similar to the “mean execution time” row a couple of rows beneath. The OLS estimate is usually more accurate than the mean, as it more effectively eliminates measurement overhead and other constant factors.

  • “R² goodness-of-fit” is a measure of how accurately the linear regression model fits the observed measurements. If the measurements are not too noisy, R² should lie between 0.99 and 1, indicating an excellent fit. If the number is below 0.99, something is confounding the accuracy of the linear model. A value below 0.9 is outright worrisome.

  • “Mean execution time” and “Standard deviation” are statistics calculated (more or less) from execution time divided by number of iterations.

On either side of the main column of values are greyed-out lower and upper bounds. These measure the accuracy of the main estimate using a statistical technique called bootstrapping. This tells us that when randomly resampling the data, 95% of estimates fell between the lower and upper bounds. When the main estimate is of good quality, the lower and upper bounds will be close to its value.

Reading command line output

Before you look at HTML reports, you’ll probably start by inspecting the report that criterion prints in your terminal window.

benchmarking ByteString/HashMap/random
time                 4.046 ms   (4.020 ms .. 4.072 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.017 ms   (4.010 ms .. 4.027 ms)
std dev              27.12 μs   (20.45 μs .. 38.17 μs)

The first column is a name; the second is an estimate. The third and fourth, in parentheses, are the 95% lower and upper bounds on the estimate.

  • time corresponds to the “OLS regression” field in the HTML table above.

  • R² is the goodness-of-fit metric for time.

  • mean and std dev have the same meanings as “Mean execution time” and “Standard deviation” in the HTML table.

How to write a benchmark suite

A criterion benchmark suite consists of a series of Benchmark values.

main = defaultMain [
  bgroup "fib" [ bench "1"  $ whnf fib 1
               , bench "5"  $ whnf fib 5
               , bench "9"  $ whnf fib 9
               , bench "11" $ whnf fib 11
               ]
  ]

We group related benchmarks together using the bgroup function. Its first argument is a name for the group of benchmarks.

bgroup :: String -> [Benchmark] -> Benchmark

All the magic happens with the bench function. The first argument to bench is a name that describes the activity we’re benchmarking.

bench :: String -> Benchmarkable -> Benchmark
bench = Benchmark

The Benchmarkable type is a container for code that can be benchmarked.

By default, criterion allows two kinds of code to be benchmarked.

  • Any IO action can be benchmarked directly.

  • With a little trickery, we can benchmark pure functions.

Benchmarking an IO action

This function shows how we can benchmark an IO action.

import Criterion.Main

main = defaultMain [
    bench "readFile" $ nfIO (readFile "GoodReadFile.hs")
  ]

(examples/GoodReadFile.hs)

We use nfIO to specify that after we run the IO action, its result must be evaluated to normal form, i.e. so that all of its internal constructors are fully evaluated, and it contains no thunks.

nfIO :: NFData a => IO a -> Benchmarkable

Rules of thumb for when to use nfIO:

  • Any time that lazy I/O is involved, use nfIO to avoid resource leaks.

  • If you’re not sure how much evaluation will have been performed on the result of an action, use nfIO to be certain that it’s fully evaluated.

IO and seq

In addition to nfIO, criterion provides a whnfIO function that evaluates the result of an action only deep enough for the outermost constructor to be known (using seq). This is known as weak head normal form (WHNF).

whnfIO :: IO a -> Benchmarkable

This function is useful if your IO action returns a simple value like an Int, or something more complex like a Map where evaluating the outermost constructor will do “enough work”.
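
For example, here is a minimal sketch (not one of the bundled examples; the file name is only illustrative) that uses whnfIO on an action returning an Int:

import Criterion.Main

main :: IO ()
main = defaultMain [
    -- The action returns an Int, so forcing the outermost constructor with
    -- whnfIO already evaluates the entire result.
    bench "length of file" $ whnfIO (length <$> readFile "GoodReadFile.hs")
  ]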

Be careful with lazy I/O!

Experienced Haskell programmers don’t use lazy I/O very often, and here’s an example of why: if you try to run the benchmark below, it will probably crash.

import Criterion.Main

main = defaultMain [
    bench "whnfIO readFile" $ whnfIO (readFile "BadReadFile.hs")
  ]

(examples/BadReadFile.hs)

The reason for the crash is that readFile reads the contents of a file lazily: it can’t close the file handle until whoever opened the file reads the whole thing. Since whnfIO only evaluates the very first constructor after the file is opened, the benchmarking loop causes a large number of open files to accumulate, until the inevitable occurs:

$ ./BadReadFile
benchmarking whnfIO readFile
openFile: resource exhausted (Too many open files)

Beware “pretend” I/O!

GHC is an aggressive compiler. If you have an IO action that doesn’t really interact with the outside world, and it has just the right structure, GHC may notice that a substantial amount of its computation can be memoised via “let-floating”.

There exists a somewhat contrived example of this problem, where the first two benchmarks run between 40 and 40,000 times faster than they “should”.
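
As an illustration only (this is a sketch, not the contrived example referred to above), an action shaped like the one below performs no real I/O, so GHC is free to compute the pure result once and reuse it:

import Criterion.Main

-- A "pretend" IO action: it does no real I/O, so GHC may compute the pure
-- result once and share it across every iteration of the benchmarking loop,
-- making the benchmark look far faster than the work it claims to measure.
pretendIO :: Int -> IO Int
pretendIO n = return (sum [1 .. n])

main :: IO ()
main = defaultMain [
    bench "pretend IO" $ whnfIO (pretendIO 100000)
  ]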

As always, if you see numbers that look wildly out of whack, you shouldn’t rejoice that you have magically achieved fast performance—be skeptical and investigate!

[!TIP] Defeating let-floating

Fortunately for this particular misbehaving benchmark suite, GHC has an option named -fno-full-laziness that will turn off let-floating and restore the first two benchmarks to performing in line with the second two.

You should not react by simply throwing -fno-full-laziness into every GHC-and-criterion command line, as let-floating helps with performance more often than it hurts with benchmarking.
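
One low-impact way to apply the flag (a sketch, reusing the GoodReadFile benchmark from earlier) is a per-module OPTIONS_GHC pragma, so that only the module containing the misbehaving benchmarks gives up let-floating:

-- Placed at the top of the file, this pragma disables let-floating for this
-- module only, rather than for the whole build.
{-# OPTIONS_GHC -fno-full-laziness #-}

import Criterion.Main

main :: IO ()
main = defaultMain [
    bench "readFile" $ nfIO (readFile "GoodReadFile.hs")
  ]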

Benchmarking pure functions

Lazy evaluation makes it tricky to benchmark pure code. If we tried to saturate a function with all of its arguments and evaluate it repeatedly, laziness would ensure that we’d only do “real work” the first time through our benchmarking loop. The expression would be overwritten with that result, and no further work would happen on subsequent loops through our benchmarking harness.

We can defeat laziness by benchmarking an unsaturated function—one that has been given all but one of its arguments.

This is why the nf function accepts two arguments: the first is the almost-saturated function we want to benchmark, and the second is the final argument to give it.

nf :: NFData b => (a -> b) -> a -> Benchmarkable

As the NFData constraint suggests, nf applies the argument to the function, then evaluates the result to normal form.

The whnf function evaluates the result of a function only to weak head normal form (WHNF).

whnf :: (a -> b) -> a -> Benchmarkable

If we go back to our first example, we can now fully understand what’s going on.

main = defaultMain [
  bgroup "fib" [ bench "1"  $ whnf fib 1
               , bench "5"  $ whnf fib 5
               , bench "9"  $ whnf fib 9
               , bench "11" $ whnf fib 11
               ]
  ]

(examples/Fibber.hs)

We can get away with using whnf here because we know that an Int has only one constructor, so there’s no deeper buried structure that we’d have to reach using nf.

As with benchmarking IO actions, there’s no clear-cut case for when to use whnf versus nf, especially when a result may be lazily generated.

Guidelines for thinking about when to use nf or whnf:

  • If a result is a lazy structure (or a mix of strict and lazy, such as a balanced tree with lazy leaves), how much of it would a real-world caller use? You should be trying to evaluate as much of the result as a realistic consumer would. Blindly using nf could cause way too much unnecessary computation.

  • If a result is something simple like an Int, you’re probably safe using whnf—but then again, there should be no additional cost to using nf in these cases.
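
As a small illustration of the difference (again a sketch, not one of the bundled examples): when a function returns a lazy list, whnf pays only for producing the first cons cell, while nf pays for building the entire list.

import Criterion.Main

main :: IO ()
main = defaultMain [
    -- Forces only the outermost (:) constructor, so almost none of the
    -- mapping work is measured.
    bench "map whnf" $ whnf (map (+ 1)) [1 .. 10000 :: Int]
    -- Forces every element, so the full cost of the mapping is measured.
  , bench "map nf"   $ nf   (map (+ 1)) [1 .. 10000 :: Int]
  ]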

Using the criterion command line

By default, a criterion benchmark suite simply runs all of its benchmarks. However, criterion accepts a number of arguments to control its behaviour. Run your program with --help for a complete list.

Specifying benchmarks to run

The most common thing you’ll want to do is specify which benchmarks you want to run. You can do this by simply enumerating each benchmark.

$ ./Fibber 'fib/1'

By default, any names you specify are treated as prefixes to match, so you can specify an entire group of benchmarks via a name like "fib/". Use the --match option to control this behaviour. There are currently four ways to configure --match:

  • --match prefix: Check if the given string is a prefix of a benchmark path. For instance, "foo" will match "foobar".

  • --match glob: Use the given string as a Unix-style glob pattern. Bear in mind that performing a glob match on benchmark names is done as if they were file paths, so for instance both "*/ba*" and "*/*" will match "foo/bar", but neither "*" nor "*bar" will match "foo/bar".

  • --match pattern: Check if the given string is a substring (not necessarily just a prefix) of a benchmark path. For instance "ooba" will match "foobar".

  • --match ipattern: Check if the given string is a substring (not necessarily just a prefix) of a benchmark path, but in a case-insensitive fashion. For instance, "oObA" will match "foobar".
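
For example, assuming the Fibber benchmark from earlier is built as ./Fibber, the following invocations all select benchmarks from the fib group:

$ ./Fibber --match glob 'fib/*'
$ ./Fibber --match pattern '/1'
$ ./Fibber --match ipattern 'FIB/1'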

Listing benchmarks

If you’ve forgotten the names of your benchmarks, run your program with --list and it will print them all.

How long to spend measuring data

By default, each benchmark runs for 5 seconds.

You can control this using the --time-limit option, which specifies the minimum number of seconds (decimal fractions are acceptable) that a benchmark will spend gathering data. The actual amount of time spent may be longer, if more data is needed.
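
For example, to spend at least ten seconds gathering data for each benchmark (again assuming the Fibber executable):

$ ./Fibber --time-limit 10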

Writing out data

Criterion provides several ways to save data.

The friendliest is as HTML, using --output. Files written using --output are actually generated from Mustache-style templates. The only other template provided by default is json, so if you run with --template json --output mydata.json, you’ll get a big JSON dump of your data.

You can also write out a basic CSV file using --csv, a JSON file using --json, and a JUnit-compatible XML file using --junit. (The contents of these files are likely to change in the not-too-distant future.)

Linear regression

If you want to perform linear regressions on metrics other than elapsed time, use the --regress option. This can be tricky to use if you are not familiar with linear regression, but here’s a thumbnail sketch.

The purpose of linear regression is to predict how much one variable (the responder) will change in response to a change in one or more others (the predictors).

On each step through a benchmark loop, criterion changes the number of iterations. This is the most obvious choice for a predictor variable. This variable is named iters.

If we want to regress CPU time (cpuTime) against iterations, we can use cpuTime:iters as the argument to --regress. This generates some additional output on the command line:

time                 31.31 ms   (30.44 ms .. 32.22 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 30.56 ms   (30.01 ms .. 30.99 ms)
std dev              1.029 ms   (754.3 μs .. 1.503 ms)

cpuTime:             0.997 R²   (0.994 R² .. 0.999 R²)
  iters              3.129e-2   (3.039e-2 .. 3.221e-2)
  y                  -4.698e-3  (-1.194e-2 .. 1.329e-3)

After the block of normal data, we see a series of new rows.

On the first line of the new block is an R² goodness-of-fit measure, so we can see how well our choice of regression fits the data.

On the second line, we get the slope of the cpuTime/iters curve, or (stated another way) how much cpuTime each iteration costs.

The last entry is the $y$-axis intercept.
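
The extra rows above come from an invocation shaped like the following, where the executable name is only illustrative:

$ ./MyBench --regress cpuTime:iters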

Measuring garbage collector statistics

By default, GHC does not collect statistics about the operation of its garbage collector. If you want to measure and regress against GC statistics, you must explicitly enable statistics collection at runtime using +RTS -T.

Useful regressions

regression                      --regress          notes
CPU cycles                      cycles:iters
Bytes allocated                 allocated:iters    +RTS -T
Number of garbage collections   numGcs:iters       +RTS -T
CPU frequency                   cycles:time
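
For instance, to regress bytes allocated against iterations, enable runtime statistics collection and pass the corresponding --regress argument (the executable name is only illustrative):

$ ./MyBench --regress allocated:iters +RTS -T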

Tips, tricks, and pitfalls

While criterion tries hard to automate as much of the benchmarking process as possible, there are some things you will want to pay attention to.

  • Measurements are only as good as the environment in which they’re gathered. Try to make sure your computer is quiet when measuring data.

  • Be judicious in when you choose nf and whnf. Always think about what the result of a function is, and how much of it you want to evaluate.

  • Simply rerunning a benchmark can lead to variations of a few percent in numbers. This variation can have many causes, including address space layout randomization, recompilation between runs, cache effects, CPU thermal throttling, and the phase of the moon. Don’t treat your first measurement as golden!

  • Keep an eye out for completely bogus numbers, as in the case of -fno-full-laziness above.

  • When you need trustworthy results from a benchmark suite, run each measurement as a separate invocation of your program. When you run a number of benchmarks during a single program invocation, you will sometimes see them interfere with each other.

How to sniff out bogus results

If some external factors are making your measurements noisy, criterion tries to make it easy to tell. At the level of raw data, noisy measurements will show up as “outliers”, but you shouldn’t need to inspect the raw data directly.

The easiest yellow flag to spot is the R² goodness-of-fit measure dropping below 0.9. If this happens, scrutinise your data carefully.

Another easy pattern to look for is severe outliers in the raw measurement chart when you’re using --output. These should be easy to spot: they’ll be points sitting far from the linear regression line (usually above it).

If the lower and upper bounds on an estimate aren’t “tight” (close to the estimate), this suggests that noise might be having some kind of negative effect.

A warning about “variance introduced by outliers” may be printed. This indicates the degree to which the standard deviation is inflated by outlying measurements, as in the following snippet (notice that the lower and upper bounds aren’t particularly tight, either).

std dev              652.0 ps   (507.7 ps .. 942.1 ps)
variance introduced by outliers: 91% (severely inflated)

Generating (HTML) reports from previous benchmarks with criterion-report

If you want to post-process benchmark data before generating an HTML report, you can use the criterion-report executable to build HTML reports from criterion-generated JSON. First run criterion with the --json flag to specify where to store the results. You can then run criterion-report data.json report.html to generate an HTML report of the data. criterion-report also accepts the --template flag accepted by criterion.
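
Putting the two steps together, using the Fibber example from earlier and the file names mentioned above:

$ cabal run Fibber.hs -- --json data.json
$ criterion-report data.json report.html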

Changes

1.6.4.1

  • Merge tutorial into README.

1.6.4.0

  • Drop support for pre-8.0 versions of GHC.

1.6.3.0

  • Remove a use of the partial head function within criterion.

1.6.2.0

  • Require optparse-applicative-0.18.* as the minimum and add an explicit dependency on prettyprinter and prettyprinter-ansi-terminal.

1.6.1.0

  • Support building with optparse-applicative-0.18.*.

1.6.0.0

  • criterion-measurement-0.2.0.0 adds the measPeakMbAllocated field to Measured for reporting maximum megabytes allocated. Since criterion re-exports Measured from Criterion.Types, this change affects criterion as well. Naturally, this affects the behavior of Measured’s {To,From}JSON and Binary instances.
  • Fix a bug in which the --help text for the --match option was printed twice in criterion applications.

1.5.13.0

  • Allow building with optparse-applicative-0.17.*.

1.5.12.0

  • Fix a bug introduced in version 1.5.9.0 in which benchmark names that include double quotes would produce broken HTML reports.

1.5.11.0

  • Allow building with aeson-2.0.0.0.

1.5.10.0

  • Fix a bug in which the defaultMainWith function would not use the regressions values specified in the Config argument. This bug only affected criterion the library—uses of the --regress flag from criterion executables themselves were unaffected.

1.5.9.0

  • Fix a bug where HTML reports failed to escape JSON properly.

1.5.8.0

  • The HTML reports have been reworked.

    • The flot plotting library (js-flot on Hackage) has been replaced by Chart.js (js-chart).
    • Most practical changes focus on improving the functionality of the overview chart:
      • It now supports logarithmic scale (#213). The scale can be toggled by clicking the x-axis.
      • Manual zooming has been replaced by clicking to focus a single bar.
      • It now supports a variety of sort orders.
      • The legend can now be toggled on/off and is hidden by default.
      • Clicking the name of a group in the legend shows/hides all bars in that group.
    • The regression line on the scatter plot now shows its confidence interval.
    • Better support for mobile and print.
    • JSON escaping has been made more robust by no longer directly injecting reports as JavaScript code.

1.5.7.0

  • Warn if an HTML report name contains newlines, and replace newlines with whitespace to avoid syntax errors in the report itself.

1.5.6.2

  • Use unescaped HTML in the json.tpl template.

1.5.6.1

  • Bundle the criterion-examples LICENSE file.

1.5.6.0

  • Allow building with base-compat-batteries-0.11.

1.5.5.0

  • Fix the build on old GHCs with the embed-data-files flag.
  • Require transformers-compat-0.6.4 or later.

1.5.4.0

  • Add parserWith, which allows creating a criterion command-line interface using a custom optparse-applicative Parser. This is useful for situations where one wants to add additional command-line arguments to the default ones that criterion provides.

    For an example of how to use parserWith, refer to examples/ExtensibleCLI.hs.

  • Tweak the way the graph in the HTML overview zooms:

    • Zooming all the way out resets to the default view (instead of continuing to zoom out towards empty space).
    • Panning all the way to the right resets to the default view in which zero is left-aligned (instead of continuing to pan off the edge of the graph).
    • Panning and zooming only affects the x-axis, so all results remain in-frame.

1.5.3.0

  • Make more functions (e.g., runMode) able to print the µ character on non-UTF-8 encodings.

1.5.2.0

  • Fix a bug in which HTML reports would render incorrectly when including benchmark names containing apostrophes.

  • Only incur a dependency on fail on old GHCs.

1.5.1.0

  • Add a MonadFail Criterion instance.

  • Add some documentation in Criterion.Main about criterion-measurement’s new nfAppIO and whnfAppIO functions, which criterion reexports.

1.5.0.0

  • Move the measurement functionality of criterion into a standalone package, criterion-measurement. In particular, cbits/ and Criterion.Measurement are now in criterion-measurement, along with the relevant definitions of Criterion.Types and Criterion.Types.Internal (both of which are now under the Criterion.Measurement.* namespace). Consequently, criterion now depends on criterion-measurement.

    This lets other libraries (e.g., alternative statistical analysis front-ends) import the measurement functionality alone as a lightweight dependency.

  • Fix a bug on macOS and Windows where using runAndAnalyse and other lower-level benchmarking functions would result in an infinite loop.

1.4.1.0

  • Use base-compat-batteries.

1.4.0.0

  • We now do three samples for statistics:

    • performMinorGC before the first sample, to ensure it’s up to date.
    • Take another sample after the action, without a garbage collection, so we can gather legitimate readings on GC-related statistics.
    • Then performMinorGC and sample once more, so we can get up-to-date readings on other metrics.

    The type of applyGCStatistics has changed accordingly. Before, it was:

       Maybe GCStatistics -- ^ Statistics gathered at the end of a run.
    -> Maybe GCStatistics -- ^ Statistics gathered at the beginning of a run.
    -> Measured -> Measured
    

    Now, it is:

       Maybe GCStatistics -- ^ Statistics gathered at the end of a run, post-GC.
    -> Maybe GCStatistics -- ^ Statistics gathered at the end of a run, pre-GC.
    -> Maybe GCStatistics -- ^ Statistics gathered at the beginning of a run.
    -> Measured -> Measured
    

    When diffing GCStatistics in applyGCStatistics, we carefully choose whether to diff against the end stats pre- or post-GC.

  • Use performMinorGC rather than performGC to update garbage collection statistics. This improves the benchmark performance of fast functions on large objects.

  • Fix a bug in the ToJSON Measured instance which duplicated the mutator CPU seconds where GC CPU seconds should go.

  • Fix a bug in sample analysis which incorrectly accounted for overhead causing runtime errors and invalid results. Accordingly, the buggy getOverhead function has been removed.

  • Fix a bug in Measurement.measure which inflated the reported time taken for perRun benchmarks.

  • Reduce overhead of nf, whnf, nfIO, and whnfIO by removing allocation from the central loops.

1.3.0.0

  • criterion was previously reporting the following statistics incorrectly on GHC 8.2 and later:

    • gcStatsBytesAllocated
    • gcStatsBytesCopied
    • gcStatsGcCpuSeconds
    • gcStatsGcWallSeconds

    This has been fixed.

  • The type signature of runBenchmarkable has changed from:

    Benchmarkable -> Int64 -> (a -> a -> a) -> (IO () -> IO a) -> IO a
    

    to:

    Benchmarkable -> Int64 -> (a -> a -> a) -> (Int64 -> IO () -> IO a) -> IO a
    

    The extra Int64 argument represents how many iterations are being timed.

  • Remove the deprecated getGCStats and applyGCStats functions (which have been replaced by getGCStatistics and applyGCStatistics).

  • Remove the deprecated forceGC field of Config, as well as the corresponding --no-gc command-line option.

  • The header in generated JSON output mistakenly used the string "criterio". This has been corrected to "criterion".

1.2.6.0

  • Add error bars and zoomable navigation to generated HTML report graphs.

    (Note that there have been reports that this feature can be somewhat unruly when using macOS and Firefox simultaneously. See https://github.com/flot/flot/issues/1554 for more details.)

  • Use a predetermined set of cycling colors for benchmark groups in HTML reports. This avoids a bug in earlier versions of criterion where benchmark group colors could be chosen that were almost completely white, which made them impossible to distinguish from the background.

1.2.5.0

  • Add an -fembed-data-files flag. Enabling this option will embed the data-files from criterion.cabal directly into the binary, producing a relocatable executable. (This has the downside of increasing the binary size significantly, so be warned.)

1.2.4.0

  • Fix issue where --help would display duplicate options.

1.2.3.0

  • Add a Semigroup instance for Outliers.

  • Improve the error messages that are thrown when forcing nonexistent benchmark environments.

  • Explicitly mark forceGC as deprecated. forceGC has not had any effect for several releases, and it will be removed in the next major criterion release.

1.2.2.0

  • Important bugfix: versions 1.2.0.0 and 1.2.1.0 were incorrectly displaying the lower and upper bounds for measured values on HTML reports.

  • Have criterion emit warnings if suspicious things happen during mustache template substitution when creating HTML reports. This can be useful when using custom templates with the --template flag.

1.2.1.0

  • Add GCStatistics, getGCStatistics, and applyGCStatistics to Criterion.Measurement. These are intended to replace GCStats (which has been deprecated in base and will be removed in GHC 8.4), as well as getGCStats and applyGCStats, which have also been deprecated and will be removed in the next major criterion release.

  • Add new matchers for the --match flag:

    • --match pattern, which matches by searching for a given substring in benchmark paths.
    • --match ipattern, which is like --match pattern but case-insensitive.
  • Export Criterion.Main.Options.config.

  • Export Criterion.toBenchmarkable, which behaves like the Benchmarkable constructor did prior to criterion-1.2.0.0.

1.2.0.0

  • Use statistics-0.14.

  • Replace the hastache dependency with microstache.

  • Add support for per-run allocation/cleanup of the environment with perRunEnv and perRunEnvWithCleanup.

  • Add support for per-batch allocation/cleanup with perBatchEnv and perBatchEnvWithCleanup.

  • Add envWithCleanup, a variant of env with cleanup support.

  • Add the criterion-report executable, which creates reports from previously created JSON files.

1.1.4.0

  • Unicode output is now correctly printed on Windows.

  • Add Safe Haskell annotations.

  • Add --json option for writing reports in JSON rather than binary format. Also: various bugfixes related to this.

  • Use the js-jquery and js-flot libraries to substitute in JavaScript code into the default HTML report template.

  • Use the code-page library to ensure that criterion prints out Unicode characters (like ², which criterion uses in reports) in a UTF-8-compatible code page on Windows.

  • Give an explicit implementation for get in the Binary Regression instance. This should fix sporadic criterion failures with older versions of binary.

  • Use tasty instead of test-framework in the test suites.

  • Restore support for 32-bit Intel CPUs.

  • Restore build compatibility with GHC 7.4.

1.1.1.0

  • If a benchmark uses Criterion.env in a non-lazy way, and you try to use --list to list benchmark names, you’ll now get an understandable error message instead of something cryptic.

  • We now flush stdout and stderr after printing messages, so that output is printed promptly even when piped (e.g. into a pager).

  • A new function runMode allows custom benchmarking applications to run benchmarks with control over the Mode used.

  • Added support for Linux on non-Intel CPUs.

  • This version supports GHC 8.

  • The --only-run option for benchmarks is renamed to --iters.

1.1.0.0

  • The dependency on the either package has been dropped in favour of a dependency on transformers-compat. This greatly reduces the number of packages criterion depends on. This shouldn’t affect the user-visible API.

  • The documentation claimed that environments were created only when needed, but this wasn’t implemented. (gh-76)

  • The package now compiles with GHC 7.10.

  • On Windows with a non-Unicode code page, printing results used to cause a crash. (gh-55)

1.0.2.0

  • Bump lower bound on optparse-applicative to 0.11 to handle yet more annoying API churn.

1.0.1.0

  • Added a lower bound of 0.10 on the optparse-applicative dependency, as there were major API changes between 0.9 and 0.10.