Microbenchmarking Implementations of Map in OCaml
`map` is one of the first higher-order functions I remember
encountering when I learned some rudimentary functional programming
topics as an undergraduate. More recently I began learning OCaml. The
`map` implementation I found in the standard library was the textbook
definition I was expecting:

```ocaml
let rec map f = function
    [] -> []
  | a :: l -> let r = f a in r :: map f l
```

but I was surprised to learn there are actually several
implementations of `map` in the OCaml ecosystem. I wanted to know more
about these different implementations and their performance
characteristics. This article summarizes what I learned.
Part of my motivation to write this post arose from
this discussion of a pull request that proposes the default `map`
implementation in `lwt` should be tail recursive. After reading all of
the comments, I wanted to develop experiments that would convince me
whether this was a good idea.
Introduction
Since I'll introduce several versions of `map`, I'll refer to the
implementation above as `stdlib`. An important aspect of this version
is that it isn't tail recursive. A tail-recursive function uses
recursion in a specific way: it only calls itself as the final
operation in the function. In OCaml, tail-recursive functions get the
benefit of tail-call optimization, so they use only one frame on the
stack. In contrast, the `stdlib` implementation takes space on the
stack proportional to the length of the input list. For long enough
lists, this causes a stack overflow.
We can make a basic tail-recursive version of `map` by composing two
standard library functions that are tail recursive: `List.rev_map`
(which creates the output of `map`, but in reversed order) and
`List.rev`, which reverses lists. I'll call this naive implementation
`ntr` (naive tail recursive). Compared to `stdlib`, `ntr` allocates
twice as much space on the heap (one list for the output of
`List.rev_map`, and a second equally long list for `List.rev`) and
spends additional time performing a second list traversal. The benefit
is that with this implementation, calling `map` on a long list won't
result in a stack overflow.
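In code, the naive tail-recursive version is a one-liner. This is a sketch of the idea described above, not the exact text of any library:

```ocaml
(* Naive tail-recursive map: List.rev_map builds the mapped list in
   reverse order (tail recursively), and List.rev restores the
   original order, at the cost of a second traversal and a second,
   temporary list. *)
let ntr f l = List.rev (List.rev_map f l)
```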
As we'll see later in the Results section, the naive tail-recursive
implementation is, overall, slower than the `stdlib`
implementation. However, there are some optimizations we can apply to
improve the performance of both versions.
The first trick I call "unrolling". Similar to
loop unrolling in procedural languages, the idea is to add match cases
that handle several list elements at a time, so fewer recursive calls
are needed to complete the work. For example, an unrolled version of
`stdlib` looks like:
```ocaml
let rec map f l =
  match l with
  | [] -> []
  | [x1] ->
    let f1 = f x1 in
    [f1]
  | [x1; x2] ->
    let f1 = f x1 in
    let f2 = f x2 in
    [f1; f2]
  | x1 :: x2 :: x3 :: tl ->
    let f1 = f x1 in
    let f2 = f x2 in
    let f3 = f x3 in
    f1 :: f2 :: f3 :: map f tl
```
We get to choose how many elements to unroll (here I picked 3); some profiling is needed to decide the most appropriate number.
The second trick I call "hybridizing". The `stdlib` version is faster,
but isn't safe for long lists. The tail-recursive implementation is
slow, but safe for all lists. When we "hybridize" the two, we use the
faster version up to some fixed number of elements (e.g. 1000), and
then switch to the safe version for the remainder of the list.
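A sketch of the hybrid idea might look like the following. This is my own illustration; the cutoff of 1000 and the helper names are arbitrary, not taken from any library:

```ocaml
(* Hybrid map: run the fast, non-tail-recursive version for at most
   [limit] elements, then hand the rest of the list to the safe
   tail-recursive version so that deep lists cannot overflow the
   stack. *)
let map f l =
  let limit = 1000 in
  (* The safe, tail-recursive fallback (the ntr implementation). *)
  let safe f l = List.rev (List.rev_map f l) in
  let rec fast count = function
    | [] -> []
    | x :: tl ->
      if count >= limit then safe f (x :: tl)
      else
        (* The let binding forces f to be applied left to right. *)
        let r = f x in
        r :: fast (count + 1) tl
  in
  fast 0 l
```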
These two optimizations are useful for understanding the
implementations of `map` we find in other libraries. For example, both
`base` and `containers` use an unrolled `stdlib`-style `map`,
hybridized with an `ntr`-style `map`.
Looking at `ntr`, one might ask: is it strictly necessary to use
additional heap space to get the implementation to be tail recursive?
The answer is no. The `batteries` package implements `map` so that the
work is done by a single tail-recursive function, creating one new
list on the heap. To achieve this, the implementation uses mutable
state and casting to avoid a second list traversal (specifically,
here).
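To make the trick concrete, here is a hedged sketch in the spirit of the `batteries` technique; it is my own simplification, and the real code differs in its details. It builds the result front to back in a single tail-recursive pass by mutating the tail of the most recent cons cell, relying on the fact that a two-field record has the same runtime representation as a cons cell:

```ocaml
(* A cons cell with a mutable tail. Its memory representation matches
   that of the list (::) constructor (a two-field block with tag 0),
   so the unsafe %identity cast below reinterprets it as an ordinary
   list. *)
type 'a mut_cons = { hd : 'a; mutable tl : 'a list }

external to_list : 'a mut_cons -> 'a list = "%identity"

let map f = function
  | [] -> []
  | x :: rest ->
    let first = { hd = f x; tl = [] } in
    (* Tail-recursive loop: append each mapped element by mutating the
       previous cell's tail, so only one list is allocated and no
       stack frames accumulate. *)
    let rec loop last = function
      | [] -> ()
      | x :: rest ->
        let cell = { hd = f x; tl = [] } in
        last.tl <- to_list cell;
        loop cell rest
    in
    loop first rest;
    to_list first
```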
As with any optimization, one makes tradeoffs between elegance,
performance, and the language features one considers acceptable to
use. I wanted to know:

- Which version is the fastest in practice?
- How much speed does one lose by using the safer tail-recursive version?
Both `base` and `containers` decided to hybridize with an `ntr`-style
`map` for safety. But the `batteries` implementation is also robust to
stack overflow. That led to one final question:

- How fast is an unrolled `stdlib`-style `map` hybridized with a `batteries`-style `map`?
Experimental Setup
In 2014, Jane Street introduced a library for microbenchmarking
called `core_bench`. As they explain on their blog, `core_bench` is
intended to measure the performance of small pieces of OCaml
code. This library helps developers better understand the cost of
individual operations, like `map`.
There are a few benefits to measuring `map` implementations with
`core_bench`:

- The library makes it easy to track both time and use of the heap.
- Once you specify your test, `core_bench` automatically provides many command line options to help you present your test data in different ways.
- The library uses statistical techniques (bootstrapping and linear regression) to reduce many runs' worth of data to a small number of meaningful performance metrics that account for the amortized cost of garbage collection and error introduced by system activity.
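For reference, a minimal `core_bench` harness looks something like the following. This is a hedged sketch based on the library's documented `Bench.Test.create` and `Bench.make_command` API, not the actual test program used for the results below:

```ocaml
(* Sketch of a core_bench comparison between two map implementations.
   Requires the core and core_bench packages. *)
open Core
open Core_bench

let () =
  let l = List.init 1000 ~f:(fun _ -> 0) in
  (* Each Test.create thunk is run many times; core_bench fits a
     linear regression to estimate per-run time and allocation. *)
  Command.run
    (Bench.make_command
       [ Bench.Test.create ~name:"stdlib" (fun () ->
             Stdlib.List.map succ l)
       ; Bench.Test.create ~name:"ntr" (fun () ->
             List.rev (List.rev_map l ~f:succ)) ])
```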
I wrote this program to compare performance between the six
implementations described above, including the batteries-hybrid,
which I'll refer to as `batt-hybr`. Other test details:

- I tested each algorithm against lists of lengths \(N=10^2, 10^3, 10^4\), and \(10^5\). On my system, \(N=10^5\) was the highest power of 10 for which the `stdlib` implementation did not fail due to a stack overflow.
- Each list consisted of integer elements (all 0), and the function mapped onto the list simply added one to each element.
- I ran these tests on a Lenovo X220, Intel Core i5-2520M CPU (2.50GHz × 4 cores), with 4GB RAM.
- I used the OCaml 4.03.0 compiler, and compiled to native code without using flambda. (This makefile gives the full specifications.)
Results
For each list size, `core_bench` produces a table with several
metrics besides time:

- `mWd`: Words allocated on the minor heap.
- `mjWd`: Words allocated on the major heap.
- `Prom`: Words promoted from the minor to the major heap.

The library also allows us to produce 95% confidence intervals and \(R^2\) values for the time estimates.
Map Benchmark, N = 100
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time R^2 │ Time/Run │ 95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr │ 1.00 │ 741.20ns │ -0.23% +0.22% │ 609.01w │ 0.53w │ 0.53w │ 100.00% │
│ containers │ 1.00 │ 439.55ns │ -0.64% +0.67% │ 304.03w │ 0.17w │ 0.17w │ 59.30% │
│ batteries │ 1.00 │ 581.65ns │ -0.14% +0.16% │ 309.01w │ 0.35w │ 0.35w │ 78.47% │
│ base │ 1.00 │ 438.63ns │ -0.19% +0.20% │ 304.03w │ 0.16w │ 0.16w │ 59.18% │
│ stdlib │ 1.00 │ 610.45ns │ -0.10% +0.11% │ 304.01w │ 0.17w │ 0.17w │ 82.36% │
│ batt-hybr │ 1.00 │ 426.76ns │ -0.15% +0.17% │ 304.03w │ 0.17w │ 0.17w │ 57.58% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
Map Benchmark, N = 1000
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time R^2 │ Time/Run │ 95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr │ 1.00 │ 8.84us │ -0.14% +0.15% │ 6.01kw │ 52.08w │ 52.08w │ 100.00% │
│ containers │ 1.00 │ 5.07us │ -0.13% +0.15% │ 3.00kw │ 17.23w │ 17.23w │ 57.34% │
│ batteries │ 1.00 │ 6.57us │ -0.14% +0.14% │ 3.01kw │ 34.71w │ 34.71w │ 74.32% │
│ base │ 1.00 │ 4.99us │ -0.20% +0.24% │ 3.00kw │ 17.08w │ 17.08w │ 56.44% │
│ stdlib │ 1.00 │ 6.84us │ -0.10% +0.10% │ 3.00kw │ 17.22w │ 17.22w │ 77.34% │
│ batt-hybr │ 1.00 │ 4.96us │ -0.13% +0.13% │ 3.00kw │ 17.07w │ 17.07w │ 56.10% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
Map Benchmark, N = 10,000
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time R^2 │ Time/Run │ 95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr │ 0.99 │ 203.11us │ -0.45% +0.53% │ 60.01kw │ 5.40kw │ 5.40kw │ 100.00% │
│ containers │ 1.00 │ 144.95us │ -0.50% +0.60% │ 48.01kw │ 3.04kw │ 3.04kw │ 71.37% │
│ batteries │ 1.00 │ 136.87us │ -0.51% +0.63% │ 30.01kw │ 3.64kw │ 3.64kw │ 67.39% │
│ base │ 1.00 │ 130.85us │ -0.34% +0.39% │ 44.98kw │ 2.66kw │ 2.66kw │ 64.42% │
│ stdlib │ 1.00 │ 118.70us │ -0.30% +0.29% │ 30.00kw │ 1.80kw │ 1.80kw │ 58.44% │
│ batt-hybr │ 1.00 │ 106.91us │ -0.42% +0.52% │ 30.01kw │ 2.22kw │ 2.22kw │ 52.64% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
Map Benchmark, N = 100,000
┌────────────┬──────────┬──────────┬───────────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time R^2 │ Time/Run │ 95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼──────────┼──────────┼──────────┼────────────┤
│ ntr │ 0.98 │ 10.27ms │ -2.39% +2.32% │ 600.02kw │ 411.83kw │ 411.83kw │ 94.33% │
│ containers │ 0.99 │ 10.62ms │ -2.19% +1.83% │ 588.02kw │ 404.91kw │ 404.91kw │ 97.61% │
│ batteries │ 1.00 │ 7.19ms │ -0.78% +0.79% │ 300.02kw │ 300.35kw │ 300.35kw │ 66.02% │
│ base │ 0.99 │ 10.88ms │ -1.54% +1.41% │ 584.99kw │ 397.60kw │ 397.60kw │ 100.00% │
│ stdlib │ 0.99 │ 6.33ms │ -1.08% +1.07% │ 300.01kw │ 171.73kw │ 171.73kw │ 58.18% │
│ batt-hybr │ 0.93 │ 6.09ms │ -4.30% +4.63% │ 300.02kw │ 285.46kw │ 285.46kw │ 55.93% │
└────────────┴──────────┴──────────┴───────────────┴──────────┴──────────┴──────────┴────────────┘
A few things I noticed about the data:

- The tail-recursive implementation is slow, but not always the slowest. In the \(N=10^5\) case, `ntr`, `base`, and `containers` show similar performance; this makes sense given they have the same behavior after a prefix of the list.
- For short lists, `base`, `containers`, and `batt-hybr` are fastest, likely because of unrolling.
- We can see from the `mWd` and `mjWd` columns that `ntr` uses the most heap space, as we would expect.
- As we increase the size of the list, `stdlib` gets faster relative to `ntr`; I suspect this can be attributed to the cost of garbage collection.
- Overall, `batt-hybr` appears to have the best performance (though in the \(N=10^5\) case, the confidence interval is large enough that we can't conclude `batt-hybr` is faster than `stdlib`, and the \(R^2\) value is a bit low).
Conclusions
The data above suggest three main findings:

- The stack-based standard library implementation of `map` is faster than the naive tail-recursive implementation, taking about 60-80% of the time to do the same work, depending on the size of the list.
- There are clear benefits to both unrolling and hybridizing. Hybridizing lets us take an implementation that performs well on long lists (`batteries`) and make it even better by combining it with one that's fast on short lists, producing `batt-hybr`.
- Overall, the data show we don't have to give up safety (resistance to stack overflow) to get better performance. A tail-recursive implementation can be quite competitive, albeit more complex.
Discussion
A follow-up question to this analysis, in the context of the
discussion of `Lwt_list.map_p`, is: "How significant is the cost of
`map` compared to other operations?" To partially address this, I
wrote another program that measures how long it takes to write 4096
bytes to a Unix pipe using `Lwt_unix.write`, and then read it back
using `Lwt_unix.read`. Using `core_bench`, I found:
Unix Pipe Read/Write Benchmark
┌─────────┬──────────┬───────────────┬─────────┬────────────┐
│Time R^2 │ Time/Run │ 95ci │ mWd/Run │ Percentage │
├─────────┼──────────┼───────────────┼─────────┼────────────┤
│ 1.00 │ 893.33ns │ -0.16% +0.18% │ 56.00w │ 100.00% │
└─────────┴──────────┴───────────────┴─────────┴────────────┘
It seems reasonable to estimate that `Lwt_list.map_p` would primarily
be applied to IO operations like the ones in this experiment. In that
case, the data suggest the per-element cost of applying the slowest
`map` implementation (`ntr`) is about 0.82% of the cost of one 4KB
read/write operation on a Unix pipe. In the context of `Lwt`, it seems
reasonable to conclude the implementation of `map` isn't a significant
concern; the real cost is performing IO. In light of that, it seems
preferable to use an implementation that cannot cause a stack
overflow.
A second question that might be asked about this analysis is: "Was it
necessary to introduce the added complexity of `core_bench` to measure
the performance of `map`?" For example, one might set up a performance
test with just successive calls to `Time.now` or
`Unix.gettimeofday`. In an earlier draft of this article, I tried such
an approach, producing some misleading data. In that test, I did a
full garbage collection between runs (applications of `map`), which
suppressed that (nontrivial) cost and made the tail-recursive
implementation appear quite fast. `core_bench` does a much better job
of incorporating the cost of garbage collection by varying the number
of test runs performed in a row and then establishing a trend (with
linear regression) that reflects the amortized cost of garbage
collection per run.
One interesting finding did arise from my earlier tests: for short lists, allocating memory on the heap actually appears to be faster than creating frames on the stack. Since the tests perform a full garbage collection between test runs, the timings show only the cost of allocating space on the heap. This produced data like the following:
Summary Statistics, \(N=10^3\) Garbage Collected Map Benchmark Times (μs)

|      | Std. Lib. | NTR    | Base   | Batteries | Containers |
|------|-----------|--------|--------|-----------|------------|
| Mean | 27.18     | 27.21  | 11.23  | 19.03     | 13.06      |
| Min  | 15.97     | 14.07  | 6.91   | 9.06      | 7.87       |
| 25%  | 18.84     | 17.88  | 7.15   | 10.97     | 10.01      |
| 50%  | 20.03     | 19.07  | 8.11   | 15.50     | 10.01      |
| 75%  | 23.13     | 22.17  | 12.87  | 18.84     | 10.97      |
| Max  | 849.01    | 576.02 | 379.80 | 622.03    | 434.88     |
Summary Statistics, \(N=10^4\) Garbage Collected Map Benchmark Times (μs)

|      | Std. Lib. | NTR     | Base    | Batteries | Containers |
|------|-----------|---------|---------|-----------|------------|
| Mean | 230.23    | 216.60  | 141.39  | 144.59    | 161.56     |
| Min  | 77.01     | 73.91   | 61.04   | 54.84     | 63.18      |
| 25%  | 97.04     | 86.07   | 67.95   | 58.89     | 70.81      |
| 50%  | 121.95    | 99.90   | 75.10   | 61.99     | 77.01      |
| 75%  | 193.83    | 284.91  | 148.59  | 118.26    | 175.24     |
| Max  | 12719.87  | 2753.97 | 2960.21 | 3616.09   | 13603.93   |
When we ignore the time spent on garbage collection, we see that `ntr`
is surprisingly competitive, especially in comparison with `stdlib`;
the difference between the two implementations, of course, is that
`ntr` allocates more memory on the heap while `stdlib` creates more
stack frames.
Notes
Thanks to Anton Bachin and Simon Cruanes for reading drafts of this post.