Create tests and benchmarks together

Latest on Hackage:

This package is not currently in any snapshots. If you're interested in using it, we recommend adding it to Stackage Nightly. Doing so will make builds more reliable, and allow stackage.org to host generated Haddocks.

MIT licensed by Ivan Lazar Miljenovic


Hackage Build Status

Test your benchmarks!

Benchmark your tests!

It’s too easy to accidentally try and benchmark apples and oranges together. Wouldn’t it be nice if you could somehow guarantee that your benchmarks satisfy some simple tests (e.g. a group of comparisons all return the same value)?

Furthermore, trying to compare multiple inputs/functions against each other requires a lot of boilerplate, making it even easier to accidentally compare the wrong things (e.g. using whnf instead of nf).

testbench aims to help solve these problems and more by making it easier to write unit tests and benchmarks together by stating up-front what requirements are needed and then using simple functions to state the next parameter to be tested/benchmarked.

This uses HUnit and criterion to create the tests and benchmarks respectively, and it’s possible to obtain these explicitly to embed them within existing test- or benchmark-suites. Alternatively, you can use the provided testBench function directly to first run the tests and then – if the tests all succeeded – run the benchmarks.


Please see the provided examples/ directory.


  • No availability of specifying an environment to run benchmarks in.

  • To be able to display the tree-like structure more readily for comparisons, the following limitations (currently) have to be made:

    • No detailed output, including no reports. In practice however, the detailed outputs produced by criterion don’t lend themselves well to comparisons.

Fortuitously Anticipated Queries

Why write this library?

The idea behind testbench came about because of two related dissatisfactions with criterion that I found:

  1. Even when the bcompare function was still available, it still seemed very difficult/clumsy to write comparison benchmarks since so much needed to be duplicated for each comparison.

  2. When trying to find examples of benchmarks that performed comparisons between different implementations, I came across some that seemingly did the same calculation on different inputs/implementations, but upon closer analysis the implementation that “won” was actually doing less work than the others (not by a large amount, but the difference was non-negligible in my opinion). This would have been easy to pick up if even a simple test was performed (e.g. using == would have led rise to a type mis-match, making it obvious they did different things).

testbench aims to solve these problems by making it easier to write comparisons up-front: by using the compareFunc function to specify what you are benchmarking and how, then using comp just to specify the input (without needing to also re-specify the function, evaluationg type, etc.).

Do I need to know HUnit or criterion to be able to use this?

No, for basic/default usage this library handles all that for you.

There are two overall hints for good benchmarks though:

  • Use the NFData variants (e.g. normalForm) where possible: this ensures the calculation is actually completed rather than laziness biting you.

  • If the variance is high, make the benchmark do more work to decrease it.

Why not use hspec/tasty/some-other-testing-framework?

Hopefully by the nature of this question it is obvious why I did not pick one over the others. HUnit is low-level enough that it can be utilised by any of the others if so required whilst keeping the dependencies required minimal.

Not to mention that these tests are more aimed at checking that the benchmarks are valid and are thus typically equality/predicate-based tests on the result from a simple function; as such it is more intended that they are quickly run as a verification stage rather than the basis for a large test-suite.

Why not use criterion directly for running benchmarks?

criterion currently does not lend itself well to visualising the results from comparison-style benchmarks:

  • A very limited internal tree-like structure which is not really apparent when results are displayed.

  • No easy way to actually compare benchmark values: there used to be a bcompare function but it hasn’t been available since version came out in August 2014. As such, comparisons must be done by hand by comparing the results visually.

  • Having more than a few benchmarks together produces a lot of output (either to the terminal or a resulting report): combined with the previous point, having more than a few benchmarks is discouraged.

Note that if however you wish to use criterion more directly (either for configurability or to be able to have reports), a combination of getTestBenches and flattenBenchForest will provide you with a Benchmark value that is accepted by criterion.

Changes (23 May, 2018)

  • Allow building with criterion-0.13.* and 0.14.*

  • Allow building with temporary-1.3.* (6 February, 2018)

  • Allow building with GHC 8.2.*

  • Now requires streaming-0.2.* (13 July, 2017)

  • Add compareFuncAllWith

  • Documentation improvements (especially on baselineWith). (12 July, 2017)

  • Accidentally deleted the type signatures of normalForm and normalFormIO, breaking compilation. (12 July, 2017)

  • Benchmarking results are now printed out incrementally rather than requiring all of them to be completed first.

  • Support for using the weigh library to measure memory usage.

  • Can now optionally provide a list of CompParam values to compareFunc and related functions rather than needing to use mappend or <> to manually combine them all.

  • Some of Criterion’s command-line options are now available, including the ability to output a CSV file (albeit with a different format).

  • Can now evaluate IO-based benchmarks.

  • Add compareFuncList, compareFuncAll (as well as primed and IO-based variants) and normalForm (and normalFormIO).

  • Remove compareFuncConstraint as it was found to not be very helpful and complicated the code base. The same functionality can be achieved using compareFuncList and related functions.

    • This includes removing SameAs and CUnion as they are now no longer needed.

    • The type of compareFunc is now different, but existing code shouldn’t needed to change.

  • The LabelTree data structure now includes the depth of each node. (22 May, 2016)

  • Initial release
comments powered byDisqus