Joe Thomas

Jul 23, 2021

How to send emails from Dream

I recently learned the OCaml ecosystem has most of the libraries needed to build email features, but I wasn't able to find an example that brings all of those pieces together. I wanted a simple example I could easily adapt to build common account verification and password recovery workflows in the Dream web framework. This post is my attempt to solve that problem and explain what I learned in the process. The accompanying code can be found here.

Defining the problem

As an application developer, my main concerns when sending mail are:

  1. Trying to ensure the address I'm sending to is valid.
  2. Handling the email synthesis process asynchronously, so it doesn't block the web server from handling requests quickly.
  3. Transmitting the email content (SMTP or a REST API call)
  4. Monitoring that the applications messages aren't getting blocked/dropped for some reason.

To test how easily I could satisfy these requirements, I built a web app that lets the user send an arbitary email.

A screenshot of the email form.

The app has two parts:

  • A server process responsible for showing the form above and validating user inputs.
  • A "worker" process responsible for sending mail. The worker gets its tasks from a queue populated by the server.

Validating Email Addresses

I decided to utilize the library emile for address validation. An email address might be invalid for reasons outside of the application's control (like the domain going way), but emile at least provides a way to check a given string conforms to 5(!) RFCs relevant to parsing email addresses. Check out the tests to see the impressive diversity of valid email addresses on the web.

The library provides several different predicates for checking addresses. I utilized Emile.of_string, which generates a simple result type, like this:

let form_post request =
  match%lwt Dream.form request with
    | `Ok ["address", address; "message", message; "subject", subject ] ->
      let parse = Emile.of_string address in
      let alert = match parse with
        | Ok _ -> Printf.sprintf "Sent an email to %s." address
        | Error _ -> Printf.sprintf "%s is not a valid email address." address
      let queue_req = match parse with
        | Ok _ -> queue_email address subject message
        | Error _ -> Lwt.return_unit
    queue_req >>= (fun _ ->
        Dream.html (Template.show_form ~alert request))
    | _ -> Dream.empty `Bad_Request

This way, if a user provides an invalid address, the form displays a simple warning message and won't queue any email tasks.

Queuing Tasks with RabbitMQ

The second issue I needed to address is handling the email send process with a background worker (queue_email in the code excerpt above). Sending an email can be a slow process, especially if you need to make additional database queries or render images to produce a customized email. That shouldn't block the web server from quickly responding to the user's requests. Instead, the server should record the email task in a queue, to be handled by a background worker that isn't as sensitive to latency.

There are several different technologies that can be used to solve this problem, for example RabbitMQ, Redis and SQS. I decided to use RabbitMQ because I didn't want to tie my implementation to one cloud provider and RabbitMQ is supported by amqp-client on OPAM. This isn't the only way to go: for Redis and SQS, you might have success with the packages redis or aws-sqs, respectively.

The worker and server process share a common function for establishing a connection to RabbitMQ:

let rabbit_connection () =
  let%lwt connection = Connection.connect ~id:"dream" "localhost" in
  let%lwt channel = Connection.open_channel ~id:"email" Channel.no_confirm connection in
  let%lwt queue = Queue.declare channel ~arguments:[Rpc.Server.queue_argument] "email" in
  Lwt.return (channel, queue)

I was initially little confused about why I needed a "channel" when I already had a connection to the broker. Both the worker and the server connect to the RabbitMQ Broker via TCP. It's common for clients to need several connections to the broker, but it's costly to have many TCP connections. RabbitMQ lets us avoid this cost with channels, which are like "lightweight connections that share a single TCP connection", according to these docs.

I created this function to submit messages to the queue:

type email = {
  address: string;
  subject: string;
  text: string;
} [@@deriving yojson]

let queue_email address subject text =
  let email = {address; subject; text} in
  let text = email |> yojson_of_email |> Yojson.Safe.to_string in
  let%lwt channel, queue = rabbit_connection () in
  let%lwt _ = Queue.publish channel queue (Message.make text) in

Here the server writes the address, subject, and text into a record type that can be serialized to JSON before being pushed into the queue.

The worker process, queue_worker, is fairly simple to define:

let handle_message (m: Message.t) =
  let text = snd m.message in
  let email =
      Some (text |> Yojson.Safe.from_string |> email_of_yojson)
    with _ ->
      Printf.printf "Received invalid email record from RabbitMQ: %s" text;
  match email with
  | Some e -> send_email e
  | None -> Lwt.return_unit

let queue_worker () =
  Printf.printf "Starting Queue Worker\n";
  let%lwt channel, queue = rabbit_connection () in
  let rec handle () =
    Queue.get ~no_ack:true channel queue >>= fun message ->
    let task = match message with
      | Some m -> flush stdout; handle_message m
      | _ -> Printf.printf "No new tasks, waiting.\n"; flush stdout; Lwt_unix.sleep 5.
    task >>= handle
  handle ()

The worker consists of an infinite loop. When the worker wakes and finds tasks in the queue, it uses yojson to deserialize the messages and passes them to send_email. If no work is available, the worker just sleeps for 5 seconds.

Sending Email with cohttp and Mailgun

In order to send mail, I set up a trial Mailgun account. I selected this provider for several reasons:

  • Mailgun has a generous free tier (5000 emails).
  • A new account also comes with a "sandbox" domain, so I didn't have to deal with registering a domain.
  • Mailgun supports both a REST API and SMTP, allowing me to try different integrations.
  • Mailgun provides dashboards for tracking messages sent over time as well as emails that were rejected (bounced, marked as spam, etc.).

Mailgun's API allowed me to send an email by sending a POST with an API key and form data, like this:

let send_email (e: email) =
   let open Cohttp in
   let open Cohttp_lwt_unix in

   let api_key = Sys.getenv "MAILGUN_API_KEY" in
   let sender = Sys.getenv "MAILGUN_SEND_ADDRESS" in
   let api_base = Sys.getenv "MAILGUN_API_BASE" in

   let params = [
     ("from", [sender]);
     ("to", [e.address]);
     ("subject", [e.subject]);
     ("text", [e.text])
   ] in
   let cred = `Basic ("api", api_key) in
   let uri = api_base ^ "/messages" |> Uri.of_string in
   let headers = Header.add_authorization (Header.init ()) cred in
   let start =
     Printf.printf "Initiating email send to %s.\n" e.address;
     Unix.gettimeofday () in
   let%lwt resp, rbody = Client.post_form uri ~params ~headers in
   let%lwt finish = Unix.gettimeofday () |> Lwt.return in
   let%lwt rbody_str = (Cohttp_lwt.Body.to_string rbody) in
   Printf.printf "API Response Status: %d.\n" (Code.code_of_status resp.status);
   Printf.printf "API Response body %s.\n" rbody_str;
   Printf.printf "Time to send an email: %.2fs\n" (finish -. start);
   Lwt.return ()

The endpoint replies with either 200 and a JSON body like this:

  "id": "<some mailgun email address here>"
  "message": "Queued. Thank you."

or else a 400 and an explanation of why the request was rejected. In my tests, a successful API request typically took between 0.5 and 1 seconds. For comparison, the webserver can serve the email form in less than 200 microseconds, so the performance benefits to shifting email tasks to a background queue are nontrivial.

Sending email with SMTP

Mailgun supports SMTP but they recommend using their API, especially for sending large amounts of mail at once. I used letters to test their SMTP support with this script:

let body = Letters.Plain {|This is a test email body.|}

let send_email () =
  let config = (Letters.Config.make
      ~username: Sys.getenv "MAILGUN_SMTP_USER"
      ~password: Sys.getenv "MAILGUN_SMTP_PASSWORD"
  let sender = Sys.getenv "MAILGUN_SMTP_SENDER" in
  let recipients = [Letters.To (Sys.getenv "EMAIL_ADDRESS")] in
  let subject = "Test email from mailgun." in
  let mail = Letters.build_email ~from:sender ~recipients ~subject ~body in

  Printexc.record_backtrace true;
  match mail with
  | Ok message ->
       Letters.send ~config ~sender ~recipients ~message
      e ->
      Printf.printf "Error. %s\n" (Printexc.to_string e);
      Printexc.print_backtrace stdout;
  | Error _ -> Printf.printf "Message synthesis failed."; flush stdout; Lwt.return_unit

let () = @@ send_email ()

I ran into some difficulties because Mailgun does not consistently follow RFC 4954 (thanks to Calascibetta Romain for figuring out the problem and creating a patch). Once I applied the patch, I was able to send mail successfully.

I haven't fully explored the tradeoffs of using SMTP versus an API. The main goal of my experiements was to confirm that if I needed to integrate with a specific SMTP server, I knew how to do so.


This example focused on email, but task queues and background workers can be used to implement other important pieces of functionality for web applications. For example, I've utilized celery in the past to generate weekly reports and schedule recurring ETL jobs. This exercise was a nice opportunity to see how one would construct a system similar to celery in OCaml. It also provides a simple illustration of how we can utilize queues to facilitate horizontal scaling. If the volume of email requests grows too big for one machine, we can put the web server and the worker on separate instances or have multiple worker instances consuming from the email queue.

I didn't cover error handling in this example, but that's clearly an important piece of functionality to include when building a production email system. At a minimum, it would be useful to add a retry mechanism in the event that Mailgun is unavailable.


I found the resources below helpful for working on this project:

  • The amqp-client documentation, especially the examples were helpful for setting up with RabbitMQ.
  • The client tutorial was helpful for getting started with cohttp. Later, I utilized this section of the docs to understand that I needed to use post_form instead of post when interacting with Mailgun's API.


If you ended up looking through the source code for this example, let me know how what you thought! I'm interested in adding more tutorial resources to the OCaml ecosystem, so feel free to post a PR or issue to dream-email-example if you have ideas about how to make these resources better.

Jul 12, 2021

Writing a REST API with Dream

In the last few weeks I've been building a simple REST API using the Dream web framework. Writing APIs has been one major component of my paid programming work and I wanted to compare and contrast the experience of writing an API in a strongly typed functional programming language with the same workflow in a dynamically typed language like Python.

You can find the source code for this project here. Hopefully, this repo will be a useful resource for anyone who wants to look at a slightly larger example application that uses Dream. Though I'm still a beginner, I was surprised how easy it was to complete familiar tasks in this framework. I'm optimistic about the future of web programming in OCaml!

This post provides a sort of "experience report" for working with Dream, Yojson, and Caqti. Obviously, these types of reports are highly subjective and you should run your own experiments too! I'll be comparing with my experience in working with Pyramid and SQLAlchemy, since those are the Python projects I've used most.

Finally, it's important to point out that Dream is in alpha (version 1.0.0~alpha2 as of this writing). This is an early version! Comparisons with frameworks like Pyramid or Flask are in some sense apples-and-oranges, because those projects have had a long time (and many engineering hours) to mature. Ultimately, the point of this post (and the repo that goes with it) isn't to make make value judgements ("framework X is good, Y is bad"), but rather to understand how to translate concepts from one workflow to another.

Problem Definition

I decided to build a API for managing time series data. My requirements for the project were as follows:

  • The API will allow creating, reading, and deleting "sensors". A sensor is an abstract IoT device that measures a floating point quantity at a specific cadence (for example a weather station measuring average hourly windspeed).
  • Every sensor gets an API key that allows it to upload (POST) data to an endpoint.
  • A GET endpoint will allow users to retrieve the data a sensor generated between two dates.
  • The API sends and receives data formatted as JSON and must be backed by a PostgreSQL database.
  • The API should have login/logout endpoints for a (not yet built) frontend to use.
  • Finally, a user should only be allowed to access sensor data that belongs to them.

I chose these requirements because they cover a number of different read/write operations that require data formatting/parsing and tracking relationships between four different entities (users, keys, sensors, and readings).

Dependencies and Setup

For this build, I needed three core dependencies:

  1. A server to handle web requests (Dream).
  2. Support for parsing and synthesizing JSON records.
  3. A library for integrating with PostgreSQL.

After a bit of research, I settled on Yojson and Caqti for (2) and (3) respectively. Both of these are fairly standard and appear in the excellent corpus of examples that accompany Dream. Since I had previously written an example of how to use Dream with a dockerized PostrgreSQL container, I used that as a starting point.

It's useful to have a way to run (and re-run) ad-hoc requests against the API during development. I used Insomnia for this, but other tools like curl would work too.

Adding an endpoint

To give a sense of my development workflow, I want to walk through my process for adding a simple endpoint, /api/login. Inside the router, that endpoint looks like: "/api/login" login;

In this excerpt login is a "handler". Handlers are responsible for translating requests into responses; this is captured in their type signature.

Conceptually, the login handler needs to do three things:

  1. Receive a JSON body and confirm it's formatted as expected.
  2. Look up the relevant user/password combination in the database.
  3. Load the relevant session information if the credentials are valid.

For the first part, the endpoint needs to receive a JSON payload that looks like {"username": "...", "password": "..."}. To describe that payload, I introduced a simple record type:

type login_doc = {
  username : string;
  password : string;
} [@@deriving yojson]

Adding [@@deriving yojson] means the compiler can generate functions to convert these records to and from JSON. In this case those functions are called login_doc_of_yojson and yojson_of_login_doc.

My login handler looked roughly like this:

let json_receiver json_parser handler request =
  let%lwt body = Dream.body request in
  let parse =
      Some (body
      |> Yojson.Safe.from_string
      |> json_parser)
    with _ ->
  match parse with
  | Some doc -> handler doc request
  | None ->
    { error="Received invalid JSON input." }
    |> yojson_of_error_doc
    |> json_response ~status: `Bad_Request

let login =
  let login_base login_doc request =
    let%lwt user_id = Dream.sql request
        (Models.User.get login_doc.username login_doc.password) in
    match user_id with
    | Some id ->
      let%lwt () = Dream.invalidate_session request in
      let%lwt () = Dream.put_session "user" (Int.to_string id) request in
      Dream.empty `OK
    | None -> Dream.empty `Forbidden
  json_receiver login_doc_of_yojson login_base

If a user uploads an invalid JSON body, this will cause login_doc_of_yojson to throw an exception. By default, this produces a 500 server error response. To handle this situation more gracefully, I introduced json_receiver. If the parser passed to json_receiver succeeds on the request body, the results are passed to the inner handler. Otherwise, the server responds with a 400 Bad Request. Re-using json_receiver across my endpoints allows me to avoid introducing atry/with block any time time I need to handle JSON input.

I decided to manage models in a separate library, Models. The build tool, dune, made this easy to do; I just needed separate dune files for the server/model and server/bin folders. I like this design because it simplifies re-use of the database portion of my project. For example, if I later needed to build a CLI admin tool, that tool could utilize my existing queries without being exposed to API concerns.

Inside User, the get query function is defined as:

let get =
  let query =
    R.find_opt T.(tup2 T.string T.string)
      "SELECT id FROM app_user WHERE username = ? and password = ?" in
  fun username password (module Db : DB) ->
    let%lwt user_or_error = Db.find_opt query (username, password) in
    Caqti_lwt.or_fail user_or_error

Caqti allows us to define a function that consumes our query parameters (the username and password) along with a database connection, and produces a promise that will resolve with the results of the query. In this case, the types encode that the function produces an int option containing the user's primary key if the credentials are valid. (Note that it's not a good idea to store passwords in plain text like this; this is just for illustrative purposes.)

Session management, the final item that the endpoint needs to address, is handled by Dream. The framework allows sessions to be stored in cookies, memory, or the database; I opted for database sessions because I had used similar session backends in the past. Sessions allow us to store pairs key/value strings across requests. I used the session just to store the logged-in user's ID, so that the API has easy access in later database queries.

Thoughts on the Dream Router

The final router for my server looked like this:

let () = ~interface:""
  @@ Dream.logger
  @@ Dream.sql_pool "postgresql://sensors@"
  @@ Dream.sql_sessions
  @@ Dream.router [
    (* More endpoints here ... *)

    Dream.get "/version" version; "/api/login" login; "/api/logout" logout;

    Dream.scope "/api" [login_required] [
      Dream.get "/user/:user_id" user;
      (* More endpoints here... *)

    Dream.scope "/sensor" [] [ "/upload" @@ api_key_required sensor_upload;
  @@ Dream.not_found

I really appreciate how Dream makes it possible to get a concise overview of the API; in some Pyramid projects, I found this wasn't always possible. In fact, in Pyramid I sometimes struggled with subtle routing bugs that were not obvious until run time. Having the compiler validate the router in Dream was a welcome change.

At the moment, Pyramid makes it somewhat easier to handle access/permissions concerns compared to Dream. Working in Pyramid, it was relatively common for the routes to manage some amount of parameter validation and permissions. For example, given a path /user/123/article/abc456, the route (or "resources" in Pyramid terminology) would be responsible for:

  • Extracting the user and article IDs (123 and abc456).
  • Determining those records actually exist in the database, and passing them to the view/handler in a context record.
  • Validating that the requester has permissions to operate on those User and Article records.

This is discussed a bit more in the Pyramid Docs under URL traversal. Effectively, Pyramid resources mean that handler/view code downstream can focus on updating database records and/or rendering HTML/JSON without worrying about permissions concerns.

I didn't want to build a permissions system for my API so instead I incorporated User IDs into my function signatures in Model and used inner joins to model permissions. For example, here is an example of a query that fetches the metadata for all sensors belonging to a particular user:

SELECT, s.description, k.uuid
FROM sensor s
INNER JOIN user_sensor us
 ON us.sensor = AND us.app_user = $1 AND us.sensor = $2
INNER JOIN api_key k
 ON s.api_key =

By joining against user_sensor, I ensure that a user only gets data for the sensors they own.

I should point out that Dream allows us to use a custom router, however! The project has an issue for more elaborate routing and Ulrik Strid has introduced dream-routes, which allows additional type information to be encoded in routes. So, writing a more sophisticated router that behaves like Pyramid's resource system is certainly possible.

Comparing Caqti and SQLAlchemy

As a web programmer, especially one working on an analytics project, it's important to get comfortable working with the database library you've chosen for your project. Indeed, a significant portion of this project consisted of familiarizing myself with Caqti.

Caqti is a bit different from SQLAlchemy. With SQLAlchemy, you typically define one class for each table. Then, using SQLAlchemy's metaprogramming features, those classes are used to define queries. For comparison, here are equivalent queries against the user table:


let get =
  let query =
    R.find_opt T.(tup2 T.string T.string)
      "SELECT id FROM app_user WHERE username = ? and password = ?" in
  fun username password (module Db : DB) ->
    let%lwt user_or_error = Db.find_opt query (username, password) in
    Caqti_lwt.or_fail user_or_error


db.session.query(User) \
        User.username == username,
        User.password == password

Working with Caqti, a mistake in the syntax of a SQL query won't be discovered until runtime. I made fewer mistakes like this with SQLAlchemy because most queries are written in Python and can be linted by the IDE. Another recurring point of confusion for me with Caqti had to do with matching the function on the Caqti_request (R.find_opt above) with the function invoked in Db (Db.find_opt); this is how we indicate that the query produces zero, one, zero/one, or multiple rows. If the two don't agree, the program won't compile; initially I struggled to understand what this compiler error meant:

Error: The function applied to this argument has type
         ?env:(Caqti_driver_info.t -> string -> Caqti_query.t) ->
         ?oneshot:bool ->
         ('a, unit, [< `Many | `One | `Zero > `Zero ]) Caqti_request.t
This argument cannot be applied without label

Using Caqti became easier as I became accustomed to thinking in terms of prepared queries. About halfway through the project I realized I needed to switch from SQLite to PostgreSQL, and changing databases was fairly painless because Caqti supports both.

I also made use of Caqti's features for dealing with custom column types. Caqti does not have built-in support for JSON columns, but adding a custom column type to handle this was straightforward:

type readings = float option array [@@deriving yojson]

let sql_readings =
  let encode (a: readings) =
    Ok (a |> yojson_of_readings |> Yojson.Safe.to_string)
  let decode text =
    Ok (text |> Yojson.Safe.from_string |> readings_of_yojson)
  T.(custom ~encode ~decode string)

In my case, the JSON that I needed to store consisted of arrays of floating point numbers, so building the custom type was simply a matter of connecting Yojson and Caqti in the right way.


Ever since I started building web applications in Python, I've thought of an API server as a big function that transforms requests into responses, with any nontrivial state residing in databases or services like S3. Dream provides a modern web framework that matches my mental model for how a web server should work.

Web programming in OCaml "feels" a bit different than working in Python. In Python it's possible to get something running quickly, but it's also easy to forget corner cases and introduce defects that have to be addressed later in the application's lifecycle. An OCaml project requires some up-front investment, but this investment pays off later in several ways. First, fast feedback from the compiler (together with editor integrations like merlin and tuareg) helped me to identify issues earlier. Yojson made it simple to enforce that JSON requests and responses adhered to a fixed schema, and helped me process inputs more systematically. OCaml made refactoring safe and easy, so that I could confidently adapt my design as I introduced new requirements. In the right context, I think the advantages of using OCaml could significantly reduce the total lifecycle costs of maintaining many web applications without reducing the pace of new development.


I found the resources below helpful for working on this project:

  • The Dream API docs set a high standard for readability and helped me quickly get up to speed with the framework.
  • This section of the Dream repo has excellent examples, both for working with the framework and deploying it.
  • I referred to Bobby Priambodo's blog posts about interfacing Caqti and PostgreSQL as I wrote the models in my project.
  • The Caqti docs, especially for Caqti_type and Caqti_request were helpful as I was writing queries.
  • There is also an example in the Caqti repo that is worth looking through.
  • The README on ppx_yojson_conv had helpful examples that I referred to while writing my own JSON-related types.


If you ended up looking through the source code for my API, let me know how what you thought! I'm interested in adding more tutorial resources to the OCaml ecosystem, so feel free to post a PR or issue to sensors if you have ideas about how to make these resources better.

Jul 01, 2021

Let's write a shell in OCaml!

This Summer I'm studying OCaml and systems programming at the Recurse Center. Currently I'm reading the delightful book Operating Systems: Three Easy Pieces by Remzi Arpaci-Dusseau and Andrea Arpaci-Dusseau. Besides great exposition, the book includes several interesting projects, one of which is to write a shell.

The project materials, a set of shell scripts that define tests, are language agnostic. I believe the authors intended for students to use C, but I worked through the exercise in OCaml; you can find the final result in this repo. This blog post is a short experience report; if you're interested in working on a beginner OCaml project, hopefully it provides some useful scaffolding for your studies.

Shell features

The shell I built (osh) has a somewhat limited feature set. It:

  • Allows several commands to be run in parallel (e.g. program1 arg1 & program2 arg2 arg3)

  • Supports simple output redirection (e.g. echo "hello" > hello.txt). Both stdout and stderr go to the same file.

  • Has three builtin functions, cd, exit, and path. path allows you to define a list of directories to search for executables; we need this because environment variables aren't supported.

Other niceties, like tab completion and pipes, aren't supported. To facilitate testing, the shell can run in a batch mode (osh <my file of commands>) as well as an interactive mode. A final simplifying assumption is that osh only provides a basic error message when inputs are malformed.


There are three main topics I studied while working on this project:

  1. The Unix module, especially fork and execv. This page, from Xavier Leroy and Didier Rémy, was helpful for getting up to speed. The man-pages were also useful.

  2. The Stream module. Streams in OCaml are roughly analogous to generators in Python; they allowed me to use a single type (a string Stream.t) to represent commands coming from an interactive session or a batch file.

  3. The Angstrom parser-combinator library. I used this library to determine whether a given line of user input is well formed.

I suspect you could implement the parser without Angstrom, if you really wanted to. I wanted to learn more about this library, and this project seemed like a good place to try it out.

Although I previously worked through some simple monadic parser exercises in Haskell, I struggled the most with Angstrom. Reading one of the main mli files seemed to be the best way to get an overview of the combinators the library provides. I also spent a fair amount of time experimenting in utop as I put my parser together.


Here is a rough outline for how I built my shell.

First, I defined a main function that would fetch a line of text from stdin and echo it back to the user. Then I added the ability to represent either interactive or batch input with a Stream:

let prompt_stream =
  let f _ =
    Printf.printf "osh> ";
    Some (read_line ()) in
  Stream.from f

let file_stream filename =
  let in_channel = open_in filename in
  let f _ =
      Some (input_line in_channel)
    with End_of_file ->
      close_in in_channel;
  Stream.from f

It's useful to do this part as early as possible, since batch input allows you to run the tests.

Once I could get osh to fail all 22 test cases, I started adding the simpler built-in commands cd and exit. (Note ./ -c will run all 22 tests, without stopping at the first one that fails.)

My first version of main just treated user input as string list:

let rec main stream =
  let parse text = String.split_on_char ' ' text in
  let process text =
    match parse text with
    | [] -> main stream  (* Empty lines should just be treated as no-ops. *)
    | [""] -> main stream
    | ["exit"] -> exit 0
    | ["cd"; x] -> Unix.chdir x
    | "cd":: _ -> print_error ()
    | x::xs ->
      match Unix.fork () with
      | 0 -> (* Child *)
        Unix.execv x (Array.of_list xs)
      | child_pid -> (* Parent *)
        let _ = Unix.waitpid [] child_pid
        in ()
  Stream.iter process stream

This worked well enough for running simple commands like /usr/bin/ls -lah. In order to get output redirection and & to work, I needed to upgrade the program's parsing capabilities. I switched from representing user input with a string list to a variant type:

type exec = {
  executable: string;
  arguments: string list;
  output: string option;

type line =
  | NoOp
  | ChangeDirectory of string
  | PathChange of string list
  | Executable of exec list
  | Quit

I found the OCaml compiler (and its emacs integrations, merlin and tuareg) very helpful for refactoring. When I revised my types to make them better reflect the data I wanted to model, type errors identified all of the places I needed to update. At this point, I introduced an Angstrom parser and re-implemented support for cd, exit, and path. That allowed me to get familiar with the parser combinators *>, many, and choice, before attempting the more complicated parsers needed to support & and >.

The type system protected me from a number of silly mistakes, but it wasn't a silver bullet. I spent an hour confused about why one of my parsers would go into an infinite loop before realizing I needed to use many1 instead of many. I think this is the sort of mistake I would learn to avoid if I spent more time working with Angstrom.


As I worked through this project, I started to appreciate how OCaml chooses "good defaults" for my code (especially immutable data) without forbidding me from working in an imperative mode when I need to.

For example, I found that the function responsible for running a child process was most natural to express with imperative language features:

let run_executable path {executable; arguments; output} =
  let full_paths = (fun p -> p ^ "/" ^ executable) path in
  let executables = List.append full_paths [executable] in
  let present = List.filter executable_present executables in

  match present with
  | [] -> print_error (); None
  | x :: _ ->
    match Unix.fork () with
    | 0 -> (* Child *)
       let out = match output with
         | Some filename -> Unix.openfile filename [Unix.O_TRUNC; Unix.O_WRONLY; Unix.O_CREAT ] 0o640
         | None -> Unix.stdout
       let err = match output with
         | None -> Unix.stderr
         | _ -> out
       Unix.dup2 out Unix.stdout;
       Unix.dup2 err Unix.stderr;
       Unix.execv x @@ Array.of_list @@ executable::arguments;
    | child_pid -> (* Parent *)
      Some child_pid

Writing this way made it easy for me to switch between C and OCaml; in both languages, we follow a pattern of calling fork, then execv in the child process. The fact that I can easily re-use my knowledge of the UNIX APIs across different programming languages is one of the main reasons I'm excited to learn more systems programming concepts!

Variants and pattern matching are features I miss when working in languages like Python. These features make it so easy to look at a function like:

let main stream =
  let path = ref ["/bin"] in
  let process text =
    match Parser.parse_line text with
    | Error _ -> print_error ()
    | Ok (NoOp) -> () (* Empty lines should just be treated as no-ops. *)
    | Ok (Quit) -> exit 0
    | Ok (ChangeDirectory x) -> change_directory x
    | Ok (PathChange paths) ->
      let cwd = Unix.getcwd () in
      path := (amend_path cwd) paths
    | Ok (Executable execs) -> run_executables !path execs
  Stream.iter process stream

and quickly understand all of the different commands osh supports.

Although it took me some time to get up and running with Angstrom, I strongly prefer working with parser combinators to writing an imperative parser (as I've had to do in Python and Java). Angstrom makes it easy to build up small parsers and test them in isolation before composing them into the larger parser you really need. If I planned to develop a shell with more features, I would probably break Parser out into its own file with separate unit tests.


If you give this project a try, let me know how it turned out! I'm interested in adding more tutorial resources to the OCaml ecosystem, so feel free to post a PR or issue to simple-ocaml-shell if you have ideas about how to make these resources better.

Jun 06, 2017

Microbenchmarking Implementations of Map in OCaml

Map is one of the first higher-order functions I remember encountering when I learned some rudimentary functional programming topics as an undergraduate. More recently I began learning OCaml. The map implementation I found in the standard library was the textbook definition I was expecting

let rec map f = function
   [] -> []
  | a::l -> let r = f a in r :: map f l

but I was surprised to learn there are actually several implementations of map in the OCaml ecosystem. I wanted to know more about these different implementations and their performance characteristics. This article summarizes what I learned.

Part of my motivation to write this post arose from this discussion of a pull request that proposes the default map implementation in lwt should be tail recursive. After reading all of the comments, I wanted to develop experiments that would convince me whether this was a good idea.


Since I'll introduce several versions of map, I'll refer to the implementation above as stdlib. An important aspect of this version is that it isn't tail recursive. A tail recursive function uses recursion in a specific way; it only calls itself as the final operation in the function. In OCaml, tail recursive functions get the benefit of tail-call optimization so that they only use one frame on the stack. In contrast, the stdlib implementation takes space on the stack proportional to the length of the input list. For long enough lists, this causes a stack overflow.

We can make a basic tail-recursive version of map by composing two standard library functions that are tail recursive: List.rev_map (which creates the output of map, but in reversed order) and List.rev which reverses lists. I'll call this naive implementation ntr (naive tail recursive). Compared to stdlib, ntr allocates twice as much space on the heap (one list for the output of List.rev_map, and a second equally long list for List.rev) and spends additional time performing a list traversal. The benefit is that with this implementation, calling map on a long list won't result in a stack overflow.

As we'll see later in the Results section, the naive tail recursive implementation is, overall, slower than the stdlib implementation. However, there are some optimizations we can apply to improve the performance of both versions.

The first trick I call "unrolling". Similar to loop unrolling in procedural languages, the idea is to use cases to map on groups of list elements so fewer function calls are needed to complete the work. For example, an unrolled version of stdlib looks like:

let rec map f l =
  match l with
  | [] -> []
  | [x1] ->
    let f1 = f x1 in
  | [x1; x2] ->
    let f1 = f x1 in
    let f2 = f x2 in
    [f1; f2]
  | x1 :: x2 :: x3 :: tl ->
    let f1 = f x1 in
    let f2 = f x2 in
    let f3 = f x3 in
    f1 :: f2 :: f3 :: map tl

We get to choose how many elements to unroll (here I picked 3); some profiling is needed to decide the most appropriate number.

The second trick I call "hybridizing". The stdlib version is faster, but isn't safe for long lists. The tail recursive implementation is slow, but safe for all lists. When we "hybridize" the two, we use the faster version up to some fixed number of elements (e.g. 1000), and then switch to the safe version for the remainder of the list.

These two optimizations are useful for understanding the implementations of map we find in other libraries. For example, both base and containers use an unrolled stdlib-style map, hybridized with an ntr-style map.

Looking at ntr, one might ask: Is it strictly necessary to use additional heap space to get the implementation to be tail recursive? The answer is no. The batteries package implements map so that the work is done by a single tail recursive function, creating one new list on the heap. To achieve this, the implementation uses mutable state and casting to avoid a second list traversal (specifically, here).

As with any optimization, one makes tradeoffs between elegance, performance, and the language features one considers acceptable to use. I wanted to know:

  • Which version is the fastest in practice?

  • How much speed does one lose using the safer tail recursive version?

Both base and containers decided to hybridize with an ntr-style map for safety. But the batteries implementation is also robust to stack overflow. That led to one final question:

  • How fast is an unrolled stdlib-style map hybridized with a batteries-style map?

Experimental Setup

In 2014, Jane Street introduced a library for microbenchmarking called core_bench. As they explain on their blog, core_bench is intended to measure the performance of small pieces of OCaml code. This library helps developers to better understand the cost of individual operations, like map.

There are a couple benefits to measuring map implementations with core_bench.

  1. The library makes it easy to track both time and use of the heap.

  2. Once you specify your test, core_bench automatically provides many command line options to help you present your test data in different ways.

  3. The library uses statistical techniques (bootstrapping and linear regression) to reduce many runs worth of data to a small number of meaningful performance metrics that account for the amortized cost of garbage collection and error introduced by system activity.

I wrote this program to compare performance between the six implementations I described above. The program includes the batteries-hybrid I described above, which I'll refer to as batt-hybr. Other test details:

  • I tested each algorithm against lists of lengths \(N=10^2, 10^3, 10^4\) and \(10^5\). On my system, \(N=10^5\) was the highest power of 10 for which the stdlib implementation did not fail due to a stack overflow.
  • Each list consisted of integer elements (all 0), and the function mapped onto the list simply added one to each element.
  • I ran these tests on a Lenovo X220, Intel Core i5-2520M CPU (2.50GHz × 4 cores), with 4GB RAM.
  • I used the OCaml 4.03.0 compiler, and compiled to native code without using flambda. (This makefile gives the full specifications.)


For each list size, core_bench produces a table with several metrics besides time:

  • mWd : Words allocated on the minor heap.
  • mjWd : Words allocated on the major heap.
  • Prom : Words promoted from minor to major heap.

The library also allows us to produce 95% confidence intervals and \(R^2\) values for the time estimates.

Map Benchmark, N = 100

│ Name       │ Time R^2 │ Time/Run │          95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
│ ntr        │     1.00 │ 741.20ns │ -0.23% +0.22% │ 609.01w │    0.53w │    0.53w │    100.00% │
│ containers │     1.00 │ 439.55ns │ -0.64% +0.67% │ 304.03w │    0.17w │    0.17w │     59.30% │
│ batteries  │     1.00 │ 581.65ns │ -0.14% +0.16% │ 309.01w │    0.35w │    0.35w │     78.47% │
│ base       │     1.00 │ 438.63ns │ -0.19% +0.20% │ 304.03w │    0.16w │    0.16w │     59.18% │
│ stdlib     │     1.00 │ 610.45ns │ -0.10% +0.11% │ 304.01w │    0.17w │    0.17w │     82.36% │
│ batt-hybr  │     1.00 │ 426.76ns │ -0.15% +0.17% │ 304.03w │    0.17w │    0.17w │     57.58% │

Map Benchmark, N = 1000

│ Name       │ Time R^2 │ Time/Run │          95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
│ ntr        │     1.00 │   8.84us │ -0.14% +0.15% │  6.01kw │   52.08w │   52.08w │    100.00% │
│ containers │     1.00 │   5.07us │ -0.13% +0.15% │  3.00kw │   17.23w │   17.23w │     57.34% │
│ batteries  │     1.00 │   6.57us │ -0.14% +0.14% │  3.01kw │   34.71w │   34.71w │     74.32% │
│ base       │     1.00 │   4.99us │ -0.20% +0.24% │  3.00kw │   17.08w │   17.08w │     56.44% │
│ stdlib     │     1.00 │   6.84us │ -0.10% +0.10% │  3.00kw │   17.22w │   17.22w │     77.34% │
│ batt-hybr  │     1.00 │   4.96us │ -0.13% +0.13% │  3.00kw │   17.07w │   17.07w │     56.10% │

Map Benchmark, N = 10,000

│ Name       │ Time R^2 │ Time/Run │          95ci │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
│ ntr        │     0.99 │ 203.11us │ -0.45% +0.53% │ 60.01kw │   5.40kw │   5.40kw │    100.00% │
│ containers │     1.00 │ 144.95us │ -0.50% +0.60% │ 48.01kw │   3.04kw │   3.04kw │     71.37% │
│ batteries  │     1.00 │ 136.87us │ -0.51% +0.63% │ 30.01kw │   3.64kw │   3.64kw │     67.39% │
│ base       │     1.00 │ 130.85us │ -0.34% +0.39% │ 44.98kw │   2.66kw │   2.66kw │     64.42% │
│ stdlib     │     1.00 │ 118.70us │ -0.30% +0.29% │ 30.00kw │   1.80kw │   1.80kw │     58.44% │
│ batt-hybr  │     1.00 │ 106.91us │ -0.42% +0.52% │ 30.01kw │   2.22kw │   2.22kw │     52.64% │

Map Benchmark, N = 100,000

│ Name       │ Time R^2 │ Time/Run │          95ci │  mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
│ ntr        │     0.98 │  10.27ms │ -2.39% +2.32% │ 600.02kw │ 411.83kw │ 411.83kw │     94.33% │
│ containers │     0.99 │  10.62ms │ -2.19% +1.83% │ 588.02kw │ 404.91kw │ 404.91kw │     97.61% │
│ batteries  │     1.00 │   7.19ms │ -0.78% +0.79% │ 300.02kw │ 300.35kw │ 300.35kw │     66.02% │
│ base       │     0.99 │  10.88ms │ -1.54% +1.41% │ 584.99kw │ 397.60kw │ 397.60kw │    100.00% │
│ stdlib     │     0.99 │   6.33ms │ -1.08% +1.07% │ 300.01kw │ 171.73kw │ 171.73kw │     58.18% │
│ batt-hybr  │     0.93 │   6.09ms │ -4.30% +4.63% │ 300.02kw │ 285.46kw │ 285.46kw │     55.93% │

A few things I noticed about the data:

  • The tail recursive implementation is slow, but not always the slowest. In the \(N=10^5\) case, ntr,base, and containers show similar performance; this makes sense given they have the same behavior after a prefix of the list.

  • For short lists, base, containers, and batt-hybr are fastest, likely because of unrolling.

  • We can see from the mWd and mjWd columns that ntr uses the most heap space, as we would expect.

  • As we increase the size of the list, stdlib gets faster relative to ntr; I suspect this can be attributed to the cost of garbage collection.

  • Overall, batt-hybr appears to have the best performance (though in the \(N=10^5\) case, the confidence interval is large enough that we can't conclude batt-hybr is faster than stdlib, and the \(R^2\) value is a bit low).


The data above suggest three main findings:

  1. The stack-based standard library implementation of map is faster than the naive tail recursive implementation, taking about 60-80% of the time to do the same work depending on the size of the list.

  2. There are clear benefits to both unrolling and hybridizing. Hybridizing lets us take an implementation that performs well on long lists (batteries), and make it even better by combining it with one that's fast on short lists, to produce batt-hybr.

  3. Overall, the data show we don't have to give up safety (resistance to stack overflow) to get better performance. A tail recursive implementation can be quite competitive, albeit more complex.


A follow up question to this analysis, in the context of the discussion of Lwt_list.map_p is: "How significant is the cost of map compared to other operations?" To partially address this, I wrote another program that measures how long it takes to write 4096 bytes to a Unix pipe using Lwt_unix.write, and then read it back using Using core_bench, I found:

Unix Pipe Read/Write Benchmark

│Time R^2 │ Time/Run │          95ci │ mWd/Run │ Percentage │
│    1.00 │ 893.33ns │ -0.16% +0.18% │  56.00w │    100.00% │

It seems reasonable to estimate that Lwt_list.map_p would primarily be applied to IO operations like the ones in this experiment. In that case, the data suggest the cost per element of applying the slowest map implementation (ntr) is about 0.82% of the cost of one 4KB read/write operation on a unix pipe. In the context of Lwt, it seems reasonable to conclude the implementation of map isn't a significant concern; the real cost is performing IO. In light of that, it seems preferable to use an implementation that cannot cause a stack overflow.

A second question that might be asked about this analysis is: "Was it necessary to introduce the added complexity of core_bench to measure the performance of map?" For example, one might set up a performance test with just successive calls to or Unix.gettimeofday. In an earlier draft of this article, I tried such an approach, producing some misleading data. In that test, I did a full garbage collection in between runs (applications of map), which suppressed that (nontrivial) cost and made the tail recursive implementation appear quite fast. core_bench does a much better job incorporating the cost of garbage collection by varying the number of test runs performed in a row and then establishing a trend (with linear regression) that reflects the amortized cost of garbage collection per run.

One interesting finding did arise from my earlier tests: for short lists, allocating memory on the heap actually appears to be faster than creating frames on the stack. Since the tests perform a full garbage collection between test runs, the timings show only the cost of allocating space on the heap. This produced data like the following:

Summary Statistics, \(N=10^3\) Garbage Collected Map Benchmark Times (μs)

Std. Lib. NTR Base Batteries Containers
Mean 27.18 27.21 11.23 19.03 13.06
Min 15.97 14.07 6.91 9.06 7.87
25% 18.84 17.88 7.15 10.97 10.01
50% 20.03 19.07 8.11 15.50 10.01
75% 23.13 22.17 12.87 18.84 10.97
Max 849.01 576.02 379.80 622.03 434.88

Summary Statistics, \(N=10^4\) Garbage Collected Map Benchmark Times (μs)

Std. Lib. NTR Base Batteries Containers
Mean 230.23 216.60 141.39 144.59 161.56
Min 77.01 73.91 61.04 54.84 63.18
25% 97.04 86.07 67.95 58.89 70.81
50% 121.95 99.90 75.10 61.99 77.01
75% 193.83 284.91 148.59 118.26 175.24
Max 12719.87 2753.97 2960.21 3616.09 13603.93

When we ignore the time spent on garbage collection, we see that ntr is surprisingly competitive, especially in comparison with stdlib; the difference between the two implementations, of course, is that ntr allocates more memory on the heap while stdlib creates more stack frames.


Thanks to Anton Bachin and Simon Cruanes for reading drafts of this post.