Map is one of the first higher-order functions I remember
encountering when I learned some rudimentary functional programming
topics as an undergraduate. More recently I began learning OCaml. The
`map` I found in the standard library was the textbook definition I was
expecting:

```ocaml
let rec map f = function
  | [] -> []
  | a :: l -> let r = f a in r :: map f l
```

but I was surprised to learn there are actually several implementations
of `map` in the OCaml ecosystem. I wanted to know more
about these different implementations and their performance
characteristics. This article summarizes what I learned.
Part of my motivation to write this post arose from
this discussion of a pull
request that proposes that the default
`map` implementation in Lwt
be tail recursive. After reading all of the comments, I wanted to
develop experiments that would convince me whether this was a good
idea.
Since I'll introduce several versions of `map`, I'll refer to the
implementation above as `stdlib`. An important aspect of this version
is that it isn't tail recursive. A tail recursive function uses
recursion in a specific way: it only calls itself as the final
operation in the function. In OCaml, tail recursive functions get the
benefit of tail-call optimization, so they only use one frame on
the stack. In contrast, the `stdlib` implementation takes space on the
stack proportional to the length of the input list. For long enough
lists, this causes a stack overflow.
We can make a basic tail-recursive version of `map` by composing two
standard library functions that are tail recursive: `List.rev_map`
(which creates the output of `map`, but in reversed order) and
`List.rev`, which reverses lists. I'll call this naive implementation
`ntr` (naive tail recursive). Compared to `stdlib`, `ntr` uses
twice as much space on the heap (one list for the output of
`List.rev_map`, and a second equally long list for the output of
`List.rev`), and spends additional time performing a list traversal.
The benefit is that with this implementation, calling `map` on a long
list won't result in a stack overflow.
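Concretely, the composition just described fits on one line (a sketch of `ntr` built exactly as the text says, from the two tail-recursive standard library functions):

```ocaml
(* Naive tail-recursive map: List.rev_map builds the mapped list in
   reverse without growing the stack, then List.rev performs a second
   (also tail-recursive) traversal to restore the original order. *)
let ntr f l = List.rev (List.rev_map f l)
```

Both intermediate lists mentioned above are visible here: the one allocated by `List.rev_map` becomes garbage as soon as `List.rev` finishes.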
As we'll see later in the Results section, the naive tail recursive
implementation is, overall, slower than the `stdlib`
implementation. However, there are some optimizations we can apply to
improve the performance of both versions.
The first trick I call "unrolling". Similar to
loop unrolling in procedural languages,
the idea is to add pattern-match cases that handle several list
elements at once, so fewer recursive calls are needed to complete the
work. For example, an unrolled version of `stdlib` looks like:

```ocaml
let rec map f l =
  match l with
  | [] -> []
  | [x1] ->
    let f1 = f x1 in
    [f1]
  | [x1; x2] ->
    let f1 = f x1 in
    let f2 = f x2 in
    [f1; f2]
  | x1 :: x2 :: x3 :: tl ->
    let f1 = f x1 in
    let f2 = f x2 in
    let f3 = f x3 in
    f1 :: f2 :: f3 :: map f tl
```
We get to choose how many elements to unroll (here I picked 3); some profiling is needed to decide the most appropriate number.
The second trick I call "hybridizing". The
stdlib version is faster,
but isn't safe for long lists. The tail recursive implementation is
slow, but safe for all lists. When we "hybridize" the two, we use the
faster version up to some fixed number of elements (e.g. 1000), and
then switch to the safe version for the remainder of the list.
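As a sketch of the idea (not any particular library's code; the name `map_hybrid` and the cutoff of 1000 are illustrative), hybridizing `stdlib` with `ntr` might look like this:

```ocaml
(* Hybrid map: use the fast stack-based recursion for up to [limit]
   elements, then switch to the safe tail-recursive rev_map/rev
   combination for whatever remains of the list. *)
let map_hybrid f l =
  let limit = 1000 in
  let rec direct i l =
    match l with
    | [] -> []
    | x :: _ when i >= limit ->
      ignore x;
      List.rev (List.rev_map f l)
    | x :: tl ->
      let r = f x in
      r :: direct (i + 1) tl
  in
  direct 0 l
```

The stack depth is bounded by `limit` regardless of the input length, so the function is safe for arbitrarily long lists while paying the `ntr` overhead only on the suffix.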
These two optimizations are useful for understanding the versions of
`map` we find in other libraries. For example, both `base` and
`containers` use an unrolled `map`, hybridized with an `ntr`-style
fallback. Returning to `ntr`, one might ask: is it strictly necessary
to use additional heap space to get the implementation to be tail
recursive? The answer is no. The `batteries` implementation is
organized so that the work is done by a single tail recursive
function, creating one new list on the heap. To achieve this, the
implementation uses mutable state and casting to avoid a second list
traversal.
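The general shape of that trick can be sketched as follows. To be clear, this is an illustration of the technique, not Batteries' actual code, and it uses `Obj` to mutate cons cells that the type system considers immutable:

```ocaml
(* Sketch of a single-pass tail-recursive map (NOT Batteries' real
   implementation).  The list is built front-to-back: each new cons
   cell is spliced in by destructively overwriting the tail (field 1)
   of the previous cell via Obj.set_field, which deliberately bypasses
   the immutability of lists. *)
let map_mutating f l =
  match l with
  | [] -> []
  | x :: rest ->
    let head = [f x] in
    let rec aux prev = function
      | [] -> head
      | x :: tl ->
        let cell = [f x] in
        (* field 1 of a cons block is its tail pointer *)
        Obj.set_field (Obj.repr prev) 1 (Obj.repr cell);
        aux cell tl
    in
    aux head rest
```

Only one list is allocated and only one traversal is performed, at the cost of depending on the runtime representation of lists rather than the language's safe subset.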
As with any optimization, one makes tradeoffs between elegance, performance, and the language features one considers acceptable to use. I wanted to know:

- Which version is the fastest in practice?
- How much speed does one lose using the safer tail recursive version?

Both `base` and `containers` decided to hybridize with an `ntr`-style
`map` for safety. But the `batteries` implementation is also robust to
stack overflow. That led to one final question:

- How fast is an unrolled `map` hybridized with a `batteries`-style tail recursive `map`?
In 2014, Jane Street introduced a benchmarking library called
`core_bench`. As they explain on their blog,
`core_bench` is intended to measure the performance of small pieces of
OCaml code, helping developers to better understand the
cost of individual operations.
There are a couple of benefits to measuring `map` implementations with
`core_bench`:

- The library makes it easy to track both time and use of the heap.
- Once you specify your test, `core_bench` automatically provides many command line options to help you present your test data in different ways.
- The library uses statistical techniques (bootstrapping and linear regression) to reduce many runs' worth of data to a small number of meaningful performance metrics that account for the amortized cost of garbage collection and error introduced by system activity.
I wrote a benchmark program that uses `core_bench` to compare
performance between the six implementations I described
above. The program includes the batteries-style hybrid I described
above, which I'll refer to as `batt-hybr`. Other test details:
- I tested each algorithm against lists of lengths \(N=10^2, 10^3, 10^4\), and \(10^5\). On my system, \(N=10^5\) was the highest power of 10 for which the `stdlib` implementation did not fail due to a stack overflow.
- Each list consisted of integer elements (all 0), and the function mapped onto the list simply added one to each element.
- I ran these tests on a Lenovo X220, Intel Core i5-2520M CPU (2.50GHz × 4 cores), with 4GB RAM.
- I used the OCaml 4.03.0 compiler, and compiled to native code with flambda. (This makefile gives the full specifications.)
For each list size, `core_bench` produces a table with several metrics besides time:

- `mWd`: Words allocated on the minor heap.
- `mjWd`: Words allocated on the major heap.
- `Prom`: Words promoted from the minor to the major heap.

The library also allows us to produce 95% confidence intervals and \(R^2\) values for the time estimates.
Map Benchmark, N = 100
```
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name       │ Time R^2 │ Time/Run │ 95ci          │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr        │     1.00 │ 741.20ns │ -0.23% +0.22% │ 609.01w │    0.53w │    0.53w │    100.00% │
│ containers │     1.00 │ 439.55ns │ -0.64% +0.67% │ 304.03w │    0.17w │    0.17w │     59.30% │
│ batteries  │     1.00 │ 581.65ns │ -0.14% +0.16% │ 309.01w │    0.35w │    0.35w │     78.47% │
│ base       │     1.00 │ 438.63ns │ -0.19% +0.20% │ 304.03w │    0.16w │    0.16w │     59.18% │
│ stdlib     │     1.00 │ 610.45ns │ -0.10% +0.11% │ 304.01w │    0.17w │    0.17w │     82.36% │
│ batt-hybr  │     1.00 │ 426.76ns │ -0.15% +0.17% │ 304.03w │    0.17w │    0.17w │     57.58% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
```
Map Benchmark, N = 1000
```
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name       │ Time R^2 │ Time/Run │ 95ci          │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr        │     1.00 │   8.84us │ -0.14% +0.15% │  6.01kw │   52.08w │   52.08w │    100.00% │
│ containers │     1.00 │   5.07us │ -0.13% +0.15% │  3.00kw │   17.23w │   17.23w │     57.34% │
│ batteries  │     1.00 │   6.57us │ -0.14% +0.14% │  3.01kw │   34.71w │   34.71w │     74.32% │
│ base       │     1.00 │   4.99us │ -0.20% +0.24% │  3.00kw │   17.08w │   17.08w │     56.44% │
│ stdlib     │     1.00 │   6.84us │ -0.10% +0.10% │  3.00kw │   17.22w │   17.22w │     77.34% │
│ batt-hybr  │     1.00 │   4.96us │ -0.13% +0.13% │  3.00kw │   17.07w │   17.07w │     56.10% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
```
Map Benchmark, N = 10,000
```
┌────────────┬──────────┬──────────┬───────────────┬─────────┬──────────┬──────────┬────────────┐
│ Name       │ Time R^2 │ Time/Run │ 95ci          │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼─────────┼──────────┼──────────┼────────────┤
│ ntr        │     0.99 │ 203.11us │ -0.45% +0.53% │ 60.01kw │   5.40kw │   5.40kw │    100.00% │
│ containers │     1.00 │ 144.95us │ -0.50% +0.60% │ 48.01kw │   3.04kw │   3.04kw │     71.37% │
│ batteries  │     1.00 │ 136.87us │ -0.51% +0.63% │ 30.01kw │   3.64kw │   3.64kw │     67.39% │
│ base       │     1.00 │ 130.85us │ -0.34% +0.39% │ 44.98kw │   2.66kw │   2.66kw │     64.42% │
│ stdlib     │     1.00 │ 118.70us │ -0.30% +0.29% │ 30.00kw │   1.80kw │   1.80kw │     58.44% │
│ batt-hybr  │     1.00 │ 106.91us │ -0.42% +0.52% │ 30.01kw │   2.22kw │   2.22kw │     52.64% │
└────────────┴──────────┴──────────┴───────────────┴─────────┴──────────┴──────────┴────────────┘
```
Map Benchmark, N = 100,000
```
┌────────────┬──────────┬──────────┬───────────────┬──────────┬──────────┬──────────┬────────────┐
│ Name       │ Time R^2 │ Time/Run │ 95ci          │ mWd/Run  │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼──────────┼───────────────┼──────────┼──────────┼──────────┼────────────┤
│ ntr        │     0.98 │  10.27ms │ -2.39% +2.32% │ 600.02kw │ 411.83kw │ 411.83kw │     94.33% │
│ containers │     0.99 │  10.62ms │ -2.19% +1.83% │ 588.02kw │ 404.91kw │ 404.91kw │     97.61% │
│ batteries  │     1.00 │   7.19ms │ -0.78% +0.79% │ 300.02kw │ 300.35kw │ 300.35kw │     66.02% │
│ base       │     0.99 │  10.88ms │ -1.54% +1.41% │ 584.99kw │ 397.60kw │ 397.60kw │    100.00% │
│ stdlib     │     0.99 │   6.33ms │ -1.08% +1.07% │ 300.01kw │ 171.73kw │ 171.73kw │     58.18% │
│ batt-hybr  │     0.93 │   6.09ms │ -4.30% +4.63% │ 300.02kw │ 285.46kw │ 285.46kw │     55.93% │
└────────────┴──────────┴──────────┴───────────────┴──────────┴──────────┴──────────┴────────────┘
```
A few things I noticed about the data:

- The tail recursive implementation is slow, but not always the slowest. In the \(N=10^5\) case, `ntr`, `base`, and `containers` show similar performance; this makes sense given they have the same behavior after a prefix of the list.
- For short lists, `base`, `containers`, and `batt-hybr` are fastest, likely because of unrolling.
- We can see from the `mWd` and `mjWd` columns that `ntr` uses the most heap space, as we would expect.
- As we increase the size of the list, `stdlib` gets faster relative to `ntr`; I suspect this can be attributed to the cost of garbage collection.
- Overall, `batt-hybr` appears to have the best performance (though in the \(N=10^5\) case, the confidence interval is large enough that we can't conclude `batt-hybr` is faster than `stdlib`, and the \(R^2\) value is a bit low).
The data above suggest three main findings:

- The stack-based standard library implementation of `map` is faster than the naive tail recursive implementation, taking about 60-80% of the time to do the same work, depending on the size of the list.
- There are clear benefits to both unrolling and hybridizing. Hybridizing lets us take an implementation that performs well on long lists (`batteries`), and make it even better by combining it with one that's fast on short lists, to produce `batt-hybr`.
- Overall, the data show we don't have to give up safety (resistance to stack overflow) to get better performance. A tail recursive implementation can be quite competitive, albeit more complex.
A follow-up question to this analysis, in the context of the pull
request concerning `Lwt_list.map_p`, is: "How significant is the cost
of `map` compared to other operations?" To partially address this, I
set up a test that measures how long it takes to write 4096 bytes to a
Unix pipe with `Lwt_unix.write`, and then read it back with
`Lwt_unix.read`. Using `core_bench`, I found:

Unix Pipe Read/Write Benchmark

```
┌─────────┬──────────┬───────────────┬─────────┬────────────┐
│Time R^2 │ Time/Run │ 95ci          │ mWd/Run │ Percentage │
├─────────┼──────────┼───────────────┼─────────┼────────────┤
│    1.00 │ 893.33ns │ -0.16% +0.18% │  56.00w │    100.00% │
└─────────┴──────────┴───────────────┴─────────┴────────────┘
```
It seems reasonable to estimate that `Lwt_list.map_p` would primarily
be applied to IO operations like the ones in this experiment. In that
case, the data suggest the cost per element of applying the slowest
`map` implementation (`ntr`) is about 0.82% of the cost of one 4KB
read/write operation on a Unix pipe. In the context of `Lwt`, it seems
reasonable to conclude the implementation of `map` isn't a significant
concern; the real cost is performing IO. In light of that, it seems
preferable to use an implementation that cannot cause a stack
overflow.
A second question that might be asked about this analysis is: "Was it
necessary to introduce the added complexity of `core_bench` to measure
the performance of `map`?" For example, one might set up a performance
test with just successive calls to `Unix.gettimeofday`. In an earlier
draft of this article, I tried such an approach and ended up
producing some misleading data. In that test, I did a full garbage
collection in between runs (applications of `map`), which suppressed
that (nontrivial) cost and made the tail recursive implementation
appear quite fast.
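The naive approach described above can be sketched like this (`time_map` is an illustrative reconstruction, not the original draft's code):

```ocaml
(* Naive timing harness: force a full collection, then time a single
   application of map with Unix.gettimeofday.  Because garbage is
   collected between runs, the measurement hides the GC cost that
   map's heap allocations eventually incur in a real program. *)
let time_map f l =
  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  ignore (List.map f l);
  Unix.gettimeofday () -. t0
```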
`core_bench` does a much better job of incorporating
the cost of garbage collection: it varies the number of test runs
performed in a row and then establishes a trend (with linear
regression) that reflects the amortized cost of garbage collection per
run.
One interesting finding did arise from my earlier tests: for short lists, allocating memory on the heap actually appears to be faster than creating frames on the stack. Since the tests perform a full garbage collection between test runs, the timings show only the cost of allocating space on the heap. This produced data like the following:
Summary Statistics, \(N=10^3\) Garbage Collected Map Benchmark Times (μs)
Summary Statistics, \(N=10^4\) Garbage Collected Map Benchmark Times (μs)
When we ignore the time spent on garbage collection, we see that `ntr`
is surprisingly competitive, especially in comparison with `stdlib`.
The difference between the two implementations, of course, is that
`ntr` allocates more memory on the heap while `stdlib` creates more
frames on the stack.