BSD-3-Clause licensed by Tony Day, Marco Zocca
Maintained by [email protected]
This version can be pinned in stack with:perf-0.12.0.0@sha256:c95979d863055d193b975cb387f5d759dab6621011bc7c5a5a1aeade768c144d,4410
#+TITLE: perf
#+PROPERTY: header-args :exports both
#+PROPERTY: header-args :eval no-export

[[https://hackage.haskell.org/package/perf][file:https://img.shields.io/hackage/v/perf.svg]] [[https://github.com/tonyday567/perf/actions?query=workflow%3Ahaskell-ci][file:https://github.com/tonyday567/perf/workflows/haskell-ci/badge.svg]]

* Introduction

~perf~ provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

- provides a monad transformer, ~PerfT~, as a light-weight wrapper for use on existing code. ~PerfT~ modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time with ~evalPerfT~.

- focuses on using the [[https://en.wikipedia.org/wiki/Time_Stamp_Counter][rdtsc]] instruction set that comes with the x86 chip to measure accurate time performance as number of clock cycles.

- abstracts the concept of what is being measured, so that concepts such as counters, debug checks, time and space performance can be treated equivalently.

* Setup

Note that running this readme.org is very slow compared with an external process which accesses the compiled version of the library.

#+begin_src haskell :results output
:r
:set prompt "> "
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
putStrLn "ok"
#+end_src

#+RESULTS:
#+begin_example
Preprocessing library for perf-0.12.0.0..
Building library for perf-0.12.0.0..
[ 7 of 10] Compiling Perf.Report ( src/Perf/Report.hs, /Users/tonyday/haskell/perf/dist-newstyle/build/x86_64-osx/ghc-9.6.2/perf-0.12.0.0/build/Perf/Report.o, /Users/tonyday/haskell/perf/dist-newstyle/build/x86_64-osx/ghc-9.6.2/perf-0.12.0.0/build/Perf/Report.dyn_o ) [Source file changed]
[10 of 10] Compiling Perf ( src/Perf.hs, /Users/tonyday/haskell/perf/dist-newstyle/build/x86_64-osx/ghc-9.6.2/perf-0.12.0.0/build/Perf.o, /Users/tonyday/haskell/perf/dist-newstyle/build/x86_64-osx/ghc-9.6.2/perf-0.12.0.0/build/Perf.dyn_o ) [Source file changed]
Preprocessing benchmark 'perf-bench' for perf-0.12.0.0..
GHCi, version 9.6.2: https://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /Users/tonyday/haskell/perf/.ghci
[1 of 2] Compiling Main ( app/bench.hs, interpreted )
Ok, one module loaded.
> [1 of 2] Compiling Main ( app/bench.hs, interpreted )
Ok, one module loaded.
>
>
ok
#+end_example

* Time
** What is a tick?

The fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of the rdstc chip.

#+begin_src haskell :results output :exports both
:t tick
#+end_src

#+RESULTS:
: tick :: (a -> b) -> a -> IO (Cycles, b)

tick returns in the IO monad, because reading a cycle counter is an IO effect. The trivial but fundamental point is that performance measurement effects the computation being measured.

Unlike benchmarking-style libraries, such as criterion, the result of the computation is passed through along with the cycle count. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and perhaps unrepresentative process.

** tick_

tick_ measures the number of cycles between two clock reads.

#+begin_src haskell :results output :exports both
:t tick_
#+end_src

#+RESULTS:
: tick_ :: IO Cycles

#+begin_src haskell :results output :exports both
fmap word <$> replicateM 10 tick_
#+end_src

#+RESULTS:
: [1722,788,630,606,608,620,596,638,596,598]

** multiple ticks

#+begin_src haskell :results output :exports both
fmap (fmap (word . fst)) . replicateM 10 $ tick (const ()) ()
#+end_src

#+RESULTS:
: [6180,3282,2736,2712,2676,2670,2660,2634,2926,2656]

Here, ~const () ()~ was evaluated and took 6.1k cycles for the first effect, reducing down to 2.6k after 10 effects. What it did in that time we will never know, really, at least from usage of ~perf~. 2.2k cycles is, on my 2.5GHz machine equal to 2200 / 2.5e9 = 8.8e-7 of a second or 880 nanoseconds.

** tickIO

~tickIO~ measures the evaluation of an IO value.

#+begin_src haskell :results output :exports both
:t tickIO
#+end_src

#+RESULTS:
: tickIO :: IO a -> IO (Cycles, a)

#+begin_src haskell :results output :exports both
fmap (fmap (word . fst)) . replicateM 10 $ tickIO (pure ())
#+end_src

#+RESULTS:
: [5508,1632,1378,1210,1576,1214,1160,1174,1176,1176]

** sum example

#+begin_src haskell :exports both
fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])
#+end_src

#+RESULTS:
| 2.0e6 | 1.3e6 | 1.2e6 | 6.1e6 | 1.5e6 | 1.7e6 | 3.0e6 | 6.4e5 | 6.2e5 | 1.7e6 |


#+begin_src haskell :results output :exports both
ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)
#+end_src

#+RESULTS:
:
: 54619.4812

* PerfT

~PerfT~ allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

#+begin_src haskell :results output :exports both
first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]
#+end_src

#+RESULTS:
: (73752.368,500500)

... with PerfT usage

#+begin_src haskell :results output :exports both
second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])
#+end_src

#+RESULTS:
: (500500,fromList [("",62706.844)])

Comparing performance of sum versus a list fusion approach:

#+begin_src haskell :results output :exports both
fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))
#+end_src

#+RESULTS:
: fromList [("fusion",114013.79),("sum",63906.848)]

An IO example

#+begin_src haskell
exampleIO' :: IO ()
exampleIO' = do
txt <- Text.readFile "src/Perf.hs"
let n = Text.length txt
Text.putStrLn $ "length of file is: " <> Text.pack (show n)
#+end_src

#+begin_src haskell :results output :exports both
exampleIO = execPerfT time (do
txt <- fam "file_read" (Text.readFile "src/Perf.hs")
n <- fap "length" Text.length txt
fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))
#+end_src

#+begin_src sh :results output :exports both
perf-explore --exampleIO --record --check
#+end_src

#+RESULTS:
#+begin_example
length of file is: 1751
length of file is: 1751

label1 label2 label3 old_result new_result status

normal file-read time 4.28e5 4.19e5
normal length time 5.61e3 5.64e3
normal print-result time 5.47e4 5.38e4
outer file-read time 9.37e4 9.46e4
outer length time 5.38e3 5.40e3
outer outer-total time 1.11e5 1.12e5
outer print-result time 5.46e3 5.55e3
#+end_example

* perf-explore

~perf-explore~ contains some exploratory routines used to develop =perf=

#+begin_src sh :results output :exports both
perf-explore --help
#+end_src

#+RESULTS:
#+begin_example
examples of perf usage

Usage: perf-explore [-n|--runs ARG]
[--best | --median | --average | --averagesecs]
[--time | --space | --spacetime | --allocation | --count]
[-g|--golden ARG] [--nocheck] [-r|--record]
[--header | --noheader] [--error ARG] [--warning ARG]
[--improved ARG]
[--sums | --lengths | --nub | --examples | --example |
--exampleIO | --noops | --ticks | --gauge]
[-l|--length ARG]
[--sumFuse | --sum | --lengthF | --constFuse | --mapInc |
--noOp]

perf exploration

Available options:
-n,--runs ARG number of runs to perform
--best report upper decile
--median report median
--average report average
--averagesecs report average in seconds
--time measure time performance
--space measure space performance
--spacetime measure both space and time performance
--allocation measure bytes allocated
--count measure count
-g,--golden ARG golden file name
--nocheck do not check versus the golden file
-r,--record record the result to the golden file
--header include headers
--noheader dont include headers
--error ARG error level
--warning ARG warning level
--improved ARG improved level
--sums run on sum algorithms
--lengths run on length algorithms
--nub nub test
--examples run on example algorithms
--example run on the example algorithm
--exampleIO exampleIO test
--noops noops test
--ticks tick test
--gauge gauge runs on exmaple for comparison
-l,--length ARG length of list
--sumFuse fused sum pipeline
--sum sum
--lengthF foldr id length
--constFuse fused const pipeline
--mapInc fmap (+1)
--noOp const ()
-h,--help Show this help text
#+end_example

#+begin_src haskell :results output :exports both
fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])
#+end_src

#+RESULTS:
: fromList [("",54874.5718)]

The equivalent to the above code is:

#+begin_src sh :results output :exports both
perf-explore -n 10000 -l 1000 --sum
#+end_src

#+RESULTS:
: label1 label2 results
:
: sum time 1.46e4

** noops

This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:

#+begin_src sh :results output :exports both
perf-explore --noops
#+end_src

#+RESULTS:
#+begin_example
label1 label2 label3 results

const 1st time 4.46e2
const 2nd time 2.20e1
const 3rd time 2.20e1
const 4th time 2.20e1
const average time 2.22e1
const best time 2.10e1
const median time 2.19e1
pure 1st time 1.46e2
pure 2nd time 3.40e1
pure 3rd time 2.40e1
pure 4th time 2.40e1
pure average time 2.41e1
pure best time 1.86e1
pure median time 2.42e1
#+end_example

** measurement context

Exploration of how the code surrounding measurement effects performance.

#+begin_src sh :results drawer :exports both
perf-explore -n 1000 -l 1000 --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 1.49e4 | 1.46e4 | 1.46e4 | 1.98e4 | 1.99e1 | 1.45e4 | 1.44e4 |
| sumCata | 1.45e4 | 1.44e4 | 1.46e4 | 1.46e4 | 1.99e1 | 1.44e4 | 1.44e4 |
| sumCo | 1.64e4 | 2.06e4 | 1.46e4 | 1.45e4 | 2.31e1 | 1.85e4 | 2.10e4 |
| sumCoCase | 1.30e4 | 2.18e4 | 1.01e4 | 1.39e4 | 1.99e1 | 1.02e4 | 1.08e4 |
| sumCoGo | 1.92e4 | 1.68e4 | 1.45e4 | 1.71e4 | 1.98e1 | 1.45e4 | 1.54e4 |
| sumF | 1.33e4 | 1.34e4 | 1.34e4 | 1.34e4 | 1.97e1 | 1.34e4 | 1.34e4 |
| sumFlip | 1.34e4 | 1.33e4 | 1.66e4 | 1.55e4 | 1.98e1 | 1.33e4 | 1.34e4 |
| sumFlipLazy | 1.33e4 | 1.37e4 | 1.88e4 | 1.40e4 | 1.97e1 | 1.34e4 | 1.35e4 |
| sumFoldr | 1.45e4 | 1.59e4 | 1.58e4 | 1.48e4 | 2.12e1 | 1.52e4 | 1.44e4 |
| sumFuse | 2.01e3 | 1.80e3 | 1.81e3 | 1.54e3 | 1.97e1 | 2.04e3 | 1.65e3 |
| sumFuseFoldl' | 2.14e3 | 1.80e3 | 1.92e3 | 1.71e3 | 1.97e1 | 1.98e3 | 1.66e3 |
| sumFuseFoldr | 8.58e3 | 5.43e3 | 5.13e3 | 5.84e3 | 1.97e1 | 5.26e3 | 7.13e3 |
| sumFusePoly | 1.92e3 | 1.93e3 | 1.96e3 | 1.58e3 | 2.00e1 | 1.60e3 | 1.94e3 |
| sumLambda | 1.34e4 | 1.34e4 | 1.35e4 | 1.33e4 | 2.05e1 | 1.34e4 | 1.34e4 |
| sumMono | 1.35e4 | 1.34e4 | 1.34e4 | 1.34e4 | 1.98e1 | 1.33e4 | 1.38e4 |
| sumPoly | 2.90e4 | 1.34e4 | 5.43e4 | 2.90e4 | 2.01e1 | 1.34e4 | 2.90e4 |
| sumSum | 1.33e4 | 1.34e4 | 1.33e4 | 1.34e4 | 1.96e1 | 1.34e4 | 1.34e4 |
| sumTail | 1.47e4 | 1.59e4 | 1.34e4 | 1.81e4 | 2.02e1 | 1.34e4 | 1.33e4 |
| sumTailLazy | 1.35e4 | 1.39e4 | 1.37e4 | 1.34e4 | 2.18e1 | 1.38e4 | 1.36e4 |
:end:


#+begin_src sh :results drawer :exports both
perf-explore -n 100000 -l 1000 --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 1.83e4 | 1.83e4 | 1.81e4 | 1.81e4 | 1.75e1 | 1.81e4 | 1.86e4 |
| sumCata | 2.16e4 | 2.16e4 | 2.08e4 | 2.09e4 | 1.75e1 | 2.11e4 | 2.09e4 |
| sumCo | 1.83e4 | 1.82e4 | 1.82e4 | 1.88e4 | 1.74e1 | 1.81e4 | 1.81e4 |
| sumCoCase | 1.59e4 | 1.87e4 | 1.54e4 | 1.57e4 | 1.76e1 | 1.83e4 | 1.54e4 |
| sumCoGo | 2.06e4 | 2.07e4 | 2.07e4 | 2.11e4 | 1.74e1 | 2.07e4 | 2.12e4 |
| sumF | 8.87e3 | 9.05e3 | 8.61e3 | 8.73e3 | 1.73e1 | 8.66e3 | 8.84e3 |
| sumFlip | 8.57e3 | 8.66e3 | 8.49e3 | 8.65e3 | 1.74e1 | 8.70e3 | 8.80e3 |
| sumFlipLazy | 5.60e3 | 8.68e3 | 8.74e3 | 5.91e3 | 1.75e1 | 8.91e3 | 5.53e3 |
| sumFoldr | 2.14e4 | 2.10e4 | 2.08e4 | 2.08e4 | 1.75e1 | 2.09e4 | 2.12e4 |
| sumFuse | 2.23e3 | 2.13e3 | 2.11e3 | 2.02e3 | 1.69e1 | 1.98e3 | 2.15e3 |
| sumFuseFoldl' | 2.14e3 | 2.03e3 | 2.20e3 | 1.97e3 | 1.76e1 | 2.22e3 | 2.14e3 |
| sumFuseFoldr | 1.22e4 | 1.28e4 | 1.23e4 | 1.28e4 | 1.69e1 | 1.26e4 | 1.23e4 |
| sumFusePoly | 1.67e3 | 1.47e3 | 1.46e3 | 1.47e3 | 1.71e1 | 1.42e3 | 1.65e3 |
| sumLambda | 9.04e3 | 8.75e3 | 8.70e3 | 8.80e3 | 1.77e1 | 8.70e3 | 9.09e3 |
| sumMono | 5.97e3 | 5.96e3 | 5.82e3 | 6.26e3 | 1.85e1 | 5.69e3 | 5.91e3 |
| sumPoly | 8.74e3 | 8.85e3 | 8.70e3 | 8.73e3 | 1.76e1 | 8.64e3 | 8.76e3 |
| sumSum | 5.90e3 | 6.36e3 | 5.78e3 | 5.78e3 | 1.81e1 | 6.05e3 | 5.88e3 |
| sumTail | 1.21e4 | 6.48e3 | 5.85e3 | 1.19e4 | 1.76e1 | 5.86e3 | 1.22e4 |
| sumTailLazy | 6.05e3 | 8.40e3 | 6.02e3 | 5.74e3 | 1.75e1 | 5.91e3 | 5.94e3 |
:end:

*** short list
#+begin_src sh :results drawer :exports both
perf-explore -n 10000 -l 10 --best --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 8.94e1 | 9.30e1 | 9.15e1 | 8.97e1 | 1.54e1 | 9.14e1 | 8.96e1 |
| sumCata | 9.74e1 | 9.79e1 | 9.78e1 | 9.81e1 | 1.55e1 | 9.77e1 | 9.62e1 |
| sumCo | 9.11e1 | 1.16e2 | 9.15e1 | 9.13e1 | 1.51e1 | 9.11e1 | 8.92e1 |
| sumCoCase | 9.05e1 | 9.21e1 | 9.18e1 | 9.18e1 | 1.53e1 | 8.99e1 | 8.90e1 |
| sumCoGo | 9.76e1 | 9.90e1 | 9.76e1 | 9.84e1 | 1.54e1 | 9.76e1 | 9.70e1 |
| sumF | 7.54e1 | 7.94e1 | 7.75e1 | 7.78e1 | 1.53e1 | 7.73e1 | 7.56e1 |
| sumFlip | 7.72e1 | 8.07e1 | 7.75e1 | 7.75e1 | 1.51e1 | 7.78e1 | 7.55e1 |
| sumFlipLazy | 7.74e1 | 7.93e1 | 7.72e1 | 7.78e1 | 1.55e1 | 7.62e1 | 7.55e1 |
| sumFoldr | 9.75e1 | 9.88e1 | 9.70e1 | 9.77e1 | 1.52e1 | 9.78e1 | 9.71e1 |
| sumFuse | 3.59e1 | 3.53e1 | 3.52e1 | 3.22e1 | 1.49e1 | 3.33e1 | 3.22e1 |
| sumFuseFoldl' | 3.52e1 | 3.50e1 | 3.52e1 | 3.30e1 | 1.50e1 | 3.33e1 | 3.27e1 |
| sumFuseFoldr | 5.00e1 | 4.99e1 | 5.06e1 | 4.91e1 | 1.50e1 | 4.91e1 | 4.94e1 |
| sumFusePoly | 4.42e1 | 4.39e1 | 4.36e1 | 4.19e1 | 1.51e1 | 4.15e1 | 4.31e1 |
| sumLambda | 7.93e1 | 8.10e1 | 7.94e1 | 7.93e1 | 1.54e1 | 7.92e1 | 7.73e1 |
| sumMono | 6.33e1 | 6.56e1 | 6.36e1 | 6.36e1 | 1.53e1 | 6.35e1 | 6.31e1 |
| sumPoly | 7.93e1 | 8.01e1 | 7.92e1 | 7.93e1 | 1.53e1 | 7.93e1 | 7.73e1 |
| sumSum | 7.93e1 | 7.98e1 | 7.95e1 | 7.95e1 | 1.52e1 | 7.93e1 | 7.73e1 |
| sumTail | 6.75e1 | 6.99e1 | 6.80e1 | 6.78e1 | 1.56e1 | 6.70e1 | 6.72e1 |
| sumTailLazy | 6.80e1 | 7.18e1 | 6.80e1 | 6.89e1 | 1.52e1 | 6.85e1 | 6.72e1 |
:end:

*** long list
#+begin_src sh :results drawer :exports both
perf-explore -n 100 -l 100000 --best --ticks
#+end_src

#+RESULTS:
:results:
| | stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times |
| sumAux | 2.26e6 | 2.28e6 | 2.28e6 | 2.25e6 | 1.68e1 | 2.31e6 | 2.26e6 |
| sumCata | 2.84e6 | 2.82e6 | 2.81e6 | 3.21e6 | 1.56e1 | 2.82e6 | 2.81e6 |
| sumCo | 2.26e6 | 2.27e6 | 5.02e6 | 2.27e6 | 1.66e1 | 2.28e6 | 2.28e6 |
| sumCoCase | 2.28e6 | 2.30e6 | 2.32e6 | 2.28e6 | 1.48e1 | 2.97e6 | 2.30e6 |
| sumCoGo | 2.84e6 | 2.80e6 | 2.84e6 | 2.85e6 | 1.51e1 | 2.81e6 | 2.81e6 |
| sumF | 7.76e5 | 7.78e5 | 7.80e5 | 7.81e5 | 1.51e1 | 9.76e5 | 7.05e5 |
| sumFlip | 7.48e5 | 9.04e5 | 7.46e5 | 7.43e5 | 1.68e1 | 7.48e5 | 7.44e5 |
| sumFlipLazy | 1.21e6 | 1.22e6 | 1.19e6 | 1.19e6 | 1.61e1 | 1.20e6 | 1.20e6 |
| sumFoldr | 3.63e6 | 2.87e6 | 3.25e6 | 2.81e6 | 1.45e1 | 5.89e6 | 2.81e6 |
| sumFuse | 1.92e5 | 1.92e5 | 1.92e5 | 1.92e5 | 1.47e1 | 1.92e5 | 1.92e5 |
| sumFuseFoldl' | 1.92e5 | 1.92e5 | 1.92e5 | 1.92e5 | 1.52e1 | 1.92e5 | 1.92e5 |
| sumFuseFoldr | 1.68e6 | 1.64e6 | 1.65e6 | 1.65e6 | 1.50e1 | 1.65e6 | 1.65e6 |
| sumFusePoly | 1.32e5 | 1.28e5 | 1.28e5 | 1.28e5 | 1.48e1 | 1.32e5 | 1.28e5 |
| sumLambda | 7.90e5 | 8.01e5 | 7.93e5 | 7.92e5 | 1.57e1 | 7.93e5 | 7.92e5 |
| sumMono | 7.86e5 | 8.04e5 | 7.87e5 | 7.83e5 | 1.48e1 | 7.99e5 | 7.88e5 |
| sumPoly | 7.99e5 | 8.18e5 | 8.06e5 | 8.00e5 | 1.64e1 | 8.10e5 | 8.00e5 |
| sumSum | 7.82e5 | 7.92e5 | 7.85e5 | 7.82e5 | 1.66e1 | 7.88e5 | 7.87e5 |
| sumTail | 7.43e5 | 7.84e5 | 8.23e5 | 8.18e5 | 1.61e1 | 7.89e5 | 7.38e5 |
| sumTailLazy | 1.35e6 | 1.37e6 | 1.34e6 | 1.33e6 | 1.58e1 | 1.35e6 | 1.34e6 |
:end:

** sums

#+begin_src sh :output drawer :exports both
perf-explore -n 1000 -l 1000 --sums
#+end_src

#+RESULTS:
| label1 | label2 | results |
| | | |
| sumAux | time | 16200.0 |
| sumCata | time | 15600.0 |
| sumCo | time | 19100.0 |
| sumCoCase | time | 16300.0 |
| sumCoGo | time | 15700.0 |
| sumF | time | 10500.0 |
| sumFlip | time | 10500.0 |
| sumFlipLazy | time | 6130.0 |
| sumFoldr | time | 16900.0 |
| sumFuse | time | 1950.0 |
| sumFuseFoldl' | time | 1950.0 |
| sumFuseFoldr | time | 14600.0 |
| sumFusePoly | time | 1400.0 |
| sumLambda | time | 9780.0 |
| sumMono | time | 11200.0 |
| sumPoly | time | 11500.0 |
| sumSum | time | 11100.0 |
| sumTail | time | 11000.0 |
| sumTailLazy | time | 11100.0 |

** lengths

#+begin_src sh :exports both
perf-explore -n 1000 -l 1000 --lengths
#+end_src

#+RESULTS:
| label1 | label2 | results |
| | | |
| lengthAux | time | 11000.0 |
| lengthCo | time | 11100.0 |
| lengthCoCase | time | 10500.0 |
| lengthF | time | 9540.0 |
| lengthFMono | time | 9190.0 |
| lengthFlip | time | 9770.0 |
| lengthFlipLazy | time | 4830.0 |
| lengthFoldr | time | 10100.0 |
| lengthFoldrConst | time | 11100.0 |
| lengthTail | time | 9830.0 |
| lengthTailLazy | time | 9540.0 |

** Gauge comparison

#+begin_src sh :results output :exports both
perf-explore -n 1000 -l 1000 --average --gauge
#+end_src

#+RESULTS:
#+begin_example
sumFuse
benchmarking function ... function time 849.4 ns

benchmarking function ... function time 848.6 ns

sum
benchmarking function ... function time 3.786 μs

benchmarking function ... function time 3.791 μs

lengthF
benchmarking function ... function time 1.863 μs

benchmarking function ... function time 1.872 μs

constFuse
benchmarking function ... function time 571.0 ns

benchmarking function ... function time 1.080 μs

mapInc
benchmarking function ... function time 9.203 ns

benchmarking function ... function time 15.99 μs

noop
benchmarking function ... function time 5.770 ns

benchmarking function ... function time 4.740 ns

#+end_example

** Space

Data is collected from GHCStats

- allocated_bytes
- gcs
- gcdetails_live_bytes
- max_live_bytes
- max_mem_in_use_bytes

#+begin_src sh :results ouput :exports both
perf-explore -n 10 -l 100000 --space +RTS -T -RTS
#+end_src

#+RESULTS:
| label1 | label2 | results |
| | | |
| sum | MaxMem | 839000.0 |
| sum | allocated | 416000.0 |
| sum | gcLiveBytes | 220000.0 |
| sum | gcollects | 0.1 |
| sum | maxLiveBytes | 0.0 |

* Perf.BigO

Perf.BigO represents functionality to determine the complexity order for a computation.

We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.

Instead, we:

- estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,
- pick the order based on lowest absolute error result summed across all the runs,

#+begin_src haskell :results output :exports both
import qualified Prelude as P
import Data.List (nub)
estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]
#+end_src

#+RESULTS:
: BigOrder {bigOrder = N2, bigFactor = 13.545801, bigConstant = 0.0}

* References

[[https://github.com/haskell-perf/checklist][The Haskell performance checklist]]

[[https://github.com/ndmitchell/spaceleak][ndmitchell/spaceleak: Notes on space leaks]]

** Core

[[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/debugging.html#options-debugging][5.13. Debugging the compiler]]

#+begin_src sh
ghc app/speed.hs -ddump-simpl -ddump-to-file -fforce-recomp -dlint -O
#+end_src

[[https://wiki.haskell.org/Performance/GHC#Looking_at_the_Core][haskell wiki: Looking at the Core]]

[[https://godbolt.org/][godbolt]]

[[https://gitlab.haskell.org/ghc/ghc/-/issues/15185][ghc issue 15185: Enum instance for IntX / WordX are inefficient]]

[[https://fixpt.de/blog/2017-12-04-strictness-analysis-part-1.html][fixpt - All About Strictness Analysis (part 1)]]

** Profiling
*** setup
[[https://ghc.gitlab.haskell.org/ghc/doc/users_guide/profiling.html#prof-heap][8. Profiling]]

A typical configuration step for profiling:

#+begin_src sh :results output
cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof -write-ghc-environment-files=always
#+end_src

A cabal.project.local with profiling enabled:

#+begin_quote
write-ghc-environment-files: always
ignore-project: False
flags: +prof +prof-auto
library-profiling: True
executable-profiling: True
#+end_quote

Examples from markup-parse R&D:

Executable compilation:

#+begin_src sh :results output
ghc -prof -fprof-auto -rtsopts app/speed0.hs -threaded -fforce-recomp
#+end_src

Executable run:

#+begin_src sh :results output
app/speed0 +RTS -s -p -hc -l -RTS
#+end_src

*** Space usage output (-s)

#+begin_example
885,263,472 bytes allocated in the heap
8,507,448 bytes copied during GC
163,200 bytes maximum residency (4 sample(s))
27,752 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)

Tot time (elapsed) Avg pause Max pause
Gen 0 207 colls, 0 par 0.009s 0.010s 0.0001s 0.0002s
Gen 1 4 colls, 0 par 0.001s 0.001s 0.0004s 0.0005s

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT time 0.006s ( 0.006s elapsed)
MUT time 0.367s ( 0.360s elapsed)
GC time 0.010s ( 0.011s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.001s ( 0.001s elapsed)
Total time 0.384s ( 0.380s elapsed)
#+end_example

*** Cost center profile (-p)

Dumped to speed0.prof

#+begin_example
COST CENTRE MODULE SRC %time %alloc

token MarkupParse src/MarkupParse.hs:(259,1)-(260,20) 50.2 50.4
wrappedQ' MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(215,1)-(217,78) 20.8 23.1
ws_ MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(135,1)-(146,4) 14.3 5.5
eq MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:243:1-30 10.6 11.1
gather MarkupParse src/MarkupParse.hs:(420,1)-(428,100) 2.4 3.7
runParser FlatParse.Basic src/FlatParse/Basic.hs:(217,1)-(225,24) 1.0 6.0
#+end_example

*** heap analysis (-hc -l)

#+begin_src sh :results output
eventlog2html speed0.eventlog
#+end_src

Produces speed0.eventlog.html which contains heap charts.

** Cache speed

The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.

| Cache | Cycles |
|-------------------|----------------|
| register | 4 per cycle |
| L1 Cache access | 3-4 cycles |
| L2 Cache access | 11-12 cycles |
| L3 unified access | 30 - 40 |
| DRAM hit | 195 cycles |
| L1 miss | 40 cycles |
| L2 miss | >600 cycles |


Changes

0.12.0.0

  • added reportMain and support for executable development.
  • refactored app/perf-explore
  • created app/perf-bench and aded as a cabal bench thing.

0.11.0.0

  • added Perf.Count
  • GHC 9.6.2 support

0.10.0

  • GHC 9.2.1 support
  • inclusion of BigO

0.9.0

  • Removed deepseq dependency

0.8.0

  • GHC 9.0.1 support
  • internal fixes to remove numhask dependency