Joe Thomas | articles in the "OCaml, SMTP, Debugging" category

Debugging an SMTP Integration in OCaml

For the last few weeks I've been working on a side project to create a small library called tidy_email that makes it easier to send email in OCaml. The repository has some example programs illustrating how to connect to different email services. I wanted to make sure that these programs would fail gracefully if configured with the wrong credentials. But when I ran my example SMTP app with an invalid password, I saw this:

Send failed. Details: Error: Stack overflow

I expected to see some sort of "not authorized" message. Why was this happening? This post summarizes how I debugged this problem.

Context

Among email services (like Mailgun, Sendgrid, and Amazon SES) there are typically two ways for a web application to send email: make a request to a REST API or use SMTP. REST APIs are convenient because they expose more features. In particular, when you provide malformed data, the API can respond with a detailed explanation of what the application did incorrectly. On the other hand, SMTP is advantageous if you already have a mail server, want to integrate with a legacy system, or can't be bothered to write a custom REST API integration.

tidy_email provides a simple wrapper around other libraries that handle HTTP requests and/or SMTP. For SMTP, it uses letters, which in turn depends on a library called colombe. At a high level, what I wanted to understand was why letters was experiencing a stack overflow when provided with bad SMTP credentials.

The specific SMTP server I was testing against was provided by Mailgun. I had previously had some difficulties with this server; these came about because Mailgun's implementation doesn't strictly adhere to the relevant RFCs, and I thought something similar might be occurring here.

Gathering Data

I started by cloning the letters and colombe repositories from github. Fortunately, I already had merlin installed for Emacs. This gave me the ability to jump from a function call in my source code to the function's definition (C-c l). It also allowed me to quickly look up types (C-c t). These hotkeys were invaluable for efficiently working with unfamiliar code.

Before I started digging into the source code, I tried running strace on my test app. This tool produces a log of all system calls made by the app under test. This results in a lot of output; here is an excerpt of what I saw:

read(5, "220 Mailgun Influx ready\r\n", 4096) = 26
read(5, 0x56132a92d460, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
write(5, "EHLO glenmeretechnologies.com\r\n", 31) = 31
getpid()                                = 15565
epoll_wait(3, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 9879) = 1
read(5, "250-smtp-out-n02.prod.us-west-2."..., 4096) = 144
read(5, 0x56132a92d460, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
write(5, "STARTTLS\r\n", 10)            = 10
getpid()                                = 15565
epoll_wait(3, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 9840) = 1
read(5, "220 Go ahead\r\n", 4096)       = 14
brk(0x56132a954000)                     = 0x56132a954000
brk(0x56132a975000)                     = 0x56132a975000
brk(0x56132a996000)                     = 0x56132a996000
read(5, 0x56132a92d460, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
write(5, "\26\3\3\2\0\1\0\1\374\3\0033y\374\354\246\200\2749u\7\375P\226\273\37\342|\257\235\277h"..., 517) = 517
getpid()                                = 15565
epoll_wait(3, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 9796) = 1
read(5, "\26\3\3\0Z\2\0\0V\3\3\22c1v\234\367\0\322fR\230\264\236;\314\f\23\234$\245\33"..., 4096) = 3389
read(5, 0x56132a92d460, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
...
read(5, "", 4096)                       = 0
(last line repeats hundreds of times)

From this output, I was able to glean a couple of things:

My app is successfully connecting to smtp.mailgun.org, as evidenced by the 220 Go ahead message.
The hundreds of read calls seem related to the stack overflow issue. They suggest some sort of infinite loop or that the app is trying to read data that isn't available for some reason.

Getting more information about SMTP

At this point, I decided I needed to learn a little bit about how SMTP works. Fortunately, wikipedia provides a fairly concise overview with an example of a successful SMTP session (S = Server, C = Client):

S: 220 smtp.example.com ESMTP Postfix
C: HELO relay.example.com
S: 250 smtp.example.com, I am glad to meet you
C: MAIL FROM:<bob@example.com>
S: 250 Ok
C: RCPT TO:<alice@example.com>
S: 250 Ok
C: RCPT TO:<theboss@example.com>
S: 250 Ok
C: DATA
S: 354 End data with <CR><LF>.<CR><LF>
C: From: "Bob Example" <bob@example.com>
C: To: Alice Example <alice@example.com>
C: Cc: theboss@example.com
C: Date: Tue, 15 Jan 2008 16:02:43 -0500
C: Subject: Test message
C:
C: Hello Alice.
C: This is a test message with 5 header fields and 4 lines in the message body.
C: Your friend,
C: Bob
C: .
S: 250 Ok: queued as 12345
C: QUIT
S: 221 Bye

Unfortunately, this excerpt doesn't explain much about how a session using STARTTLS should look. Since I saw some successful read calls in my strace output, with arguments that weren't readable text, I inferred that TLS was probably interfering with my ability to easily see what was going on in the session after the first 220 response.

Getting more data from logs

I used merlin to jump to the relevant parts of the colombe source code and noticed that colombe uses the Logs module. Adding these two lines at the start of my test app allowed me to start collecting log output, and see what was going on in colombe:

  Logs.set_reporter (Logs.format_reporter ());
  Logs.set_level (Some Logs.Info);

When I took a look at the log output, I saw:

...
send: [INFO] Return
send: [INFO] READ : 220 Mailgun Influx ready
send: [INFO] READ : 250-smtp-out-n01.prod.us-west-2.postgun.com
250-AUTH PLAIN LOGIN
250-SIZE 52428800
250-8BITMIME
250-SMTPUTF8
250-PIPELINING
250 STARTTLS
...
send: [INFO] Return
send: [INFO] READ :
send: [INFO] Return
send: [WARNING] The server send an invalid base64 value: "Go ahead"
send: [INFO] READ :
send: [INFO] Return
send: [INFO] Response 535 Authentication failed
send: [INFO] Quitting connection...
send: [INFO] Before recv...
send: [INFO] READ :
(repeats, hundreds of times)
...

Similar to the strace results, I could see many failed reads. But this time I could see they occurred after a 535 Authentication failed message came back from the server.

During this phase of analysis it was helpful to be able to add my own log points to the source code. In order to do this, though, I needed to ensure that my test app would build against my modified version of colombe, not the published version on opam. Within the colombe repo, I ran

dune build src && dune install colombe && dune build sendmail && dune install sendmail

This pinned both colombe and sendmail to my local versions on disk. Because letters uses these two libraries, I also needed to go to my letters repo and run:

dune build . && dune install letters

With these two steps taken care of, I could use dune to execute my test app and see the custom instrumentation that I'd added to the underlying libraries.

Reproducing the problem

At this point, I was pretty confident that colombe was having problems immediately after trying to quit the SMTP session by sending a QUIT. After some google searching, I learned that openssl allows you to run an SMTP session interactively. To test, I ran:

openssl s_client -connect smtp.mailgun.org:587 -starttls smtp

The first part of the session looked like this:

250 STARTTLS
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_128_GCM_SHA256
    Session-ID: 7E83FE3CC7D65BE60AF56A2E049C09A27AB33DF834454617B81C0469A32EEC62
    Session-ID-ctx:
    Resumption PSK: 4A7C875CECCCA5633D3FF2107857EAF220FF561B782AAC3AC416385612E08AAB
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 604800 (seconds)
    TLS session ticket:
    0000 - 4a 79 8c 03 d5 1e 3f 5d-dc b2 2d b6 38 db 4e 10   Jy....?]..-.8.N.
    0010 - 91 b6 1d 44 d2 95 c9 50-f3 f9 a7 89 8d 0e b9 7e   ...D...P.......~
    0020 - f1 d6 52 da 3d 84 f5 ee-58 72 a8 81 1e 0f ad c7   ..R.=...Xr......
    0030 - c7 76 8f 8c d9 d9 56 8f-f3 72 de 0e bf 17 b0 2f   .v....V..r...../
    0040 - 72 a7 98 cd 92 46 da 45-fc b8 a9 d4 6e 73 71 f5   r....F.E....nsq.
    0050 - 74 f2 1d eb 83 b3 01 83-77 3f 7a e0 9a 36 4a fe   t.......w?z..6J.
    0060 - b3 89 3e c7 4c aa 87 eb-6d 46 42 46 73 8f 35 1f   ..>.L...mFBFs.5.
    0070 - 8c                                                .

    Start Time: 1630180065
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
EHLO
250-smtp-out-n02.prod.us-west-2.postgun.com
250-AUTH PLAIN LOGIN
250-SIZE 52428800
250-8BITMIME
250-SMTPUTF8
250 PIPELINING

By running EHLO, I was able to see the different operations this SMTP server supports. In the output above, we can see that one of those operations is AUTH PLAIN. I tried to authenticate with invalid credentials, and saw this:

AUTH PLAIN BADPASSWORD
501 Invalid response encoding
535 Authentication failed
closed

This means that after a client provides bad credentials, Mailgun's SMTP implementation immediately terminates the connection. But, when colombe encountered a failure to authenticate, it tried to clean up the session like this:

let properly_quit_and_fail ctx err =
  let open Monad in
  let* _txts = send ctx Value.Quit () >>= fun () -> recv ctx Value.PP_221 in
  Error err

So colombe tried to send a QUIT message and await a 221 Bye, but that message never arrived because the server had already closed the connection.

Fixing the problem

Once I understood the problem, I realized I could avoid the issue by revising properly_quit_and_fail to prevent waiting on a response from the server. Effectively, this meant changing a let* to a let:

let properly_quit_and_fail ctx err =
  let open Monad in
  let _txts = send ctx Value.Quit () >>= fun () -> recv ctx Value.PP_221 in
  Error err

I also filed this issue, which prompted one of the maintainers to refactor how connections are managed, with a much more thorough solution than what I had done. (Many thanks to Calascibetta Romain for fixing my issue so quickly!) Once merged, other colombe users shouldn't encounter the stack overflow issues I described here.

Final Thoughts

This was an interesting bug to investigate because it covered a number of unfamiliar topics. I didn't know much about SMTP when I started, and I learned a couple useful facts while debugging:

SMTP sessions consist of a handful of commands that are fairly easy to understand.
OpenSSL's tooling makes it easy to test SMTP by hand.
Even very commonly used mail services like Mailgun may not work exactly the way you expect from reading RFCs or other documentation!

I also learned some useful techniques I can reuse the next time I need to debug OCaml problems:

When writing test apps, enable logging (at info level) from the beginning!
Use dune install to add instrumentation and temporary bug fixes to libraries.
Developer tooling like merlin is very helpful for efficiently navigating unfamiliar code.

Feedback

If you found this post useful (or conversely, found an error), send me an email. I'm interested in adding more tutorial resources to the OCaml ecosystem and would happy to hear suggestions about how to make these resources better. My contact information is listed on the About page.

posted at 00:00 · OCaml, SMTP, Debugging

Aug 25, 2021