Debugging Elixir

I recently ran into a puzzle with one of my Phoenix apps. File uploads would sometimes time out, but only on the production server and not on my workstation.

Scene of the Crime

A tcpdump session revealed that the server's receive buffer was filling up: the Window Size field in successive TCP headers from the server shrank to zero and stayed there until the connection was closed.

Peeking at the source code for Plug and Cowboy led me to believe that neither of them should be creating backpressure to the receive buffer in this case. Plug was putting the request body into a temporary file and should keep asking Cowboy for body data until it had the whole thing. Even if Plug did get stuck, Cowboy looked like it would ask for data until its buffer was full and that buffer was big enough for the entire upload that was timing out.
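That drain loop is roughly the following shape. This is a minimal sketch, not Plug's actual code: the chunk list stands in for the socket, and my `read_body/1` stub only mimics the return shape of the real `Plug.Conn.read_body/2`, which yields `{:more, partial, conn}` while data remains and `{:ok, rest, conn}` at the end.

```elixir
# Sketch of how Plug drains a request body into a temporary file.
defmodule DrainSketch do
  # Simulated transport: a list of chunks plays the role of the socket.
  # Mimics the return shape of Plug.Conn.read_body/2.
  def read_body([chunk]), do: {:ok, chunk, []}
  def read_body([chunk | rest]), do: {:more, chunk, rest}

  # Keep asking for body data until we have the whole thing.
  def drain(conn, io) do
    case read_body(conn) do
      {:more, chunk, conn} ->
        IO.binwrite(io, chunk)
        drain(conn, io)

      {:ok, chunk, conn} ->
        IO.binwrite(io, chunk)
        conn
    end
  end
end

# A StringIO stands in for the temporary file.
{:ok, io} = StringIO.open("")
DrainSketch.drain(["part1-", "part2-", "part3"], io)
{_input, output} = StringIO.contents(io)
IO.puts(output)
```

As long as that loop keeps running, Plug should keep pulling chunks and the receive buffer should keep emptying.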


I figured a good spot to split the problem space would be in Cowboy. If it was asking for more data and not getting it, the trouble would likely be in Erlang or the OS. If it was not asking for more data, then the trouble would likely be in Cowboy itself.

It was time to break out dbg, which I had heard of but not used in anger before.

From looking at the Cowboy source code, I thought I just wanted to see the balance between calls to :cowboy_http.parse_body (which should be called as body data comes in) and :cowboy_http.setopts_active (which should be called to request more data). So I whipped up a little handler function that would print the function name for call events and ignore everything else, including returns:

handler = fn
  {:trace, _pid, :call, {_m, f, _a}}, _acc -> IO.inspect(f)
  _, _acc -> nil
end

From my interactive Elixir session on the server, I started up a process to receive trace messages and handle them with my function. I also requested tracing of function calls for all processes. Then I enabled tracing for calls to the two functions I was interested in.

:dbg.tracer(:process, {handler, nil})
:dbg.p(:all, :c)
:dbg.tpl(:cowboy_http, :parse_body, :x)
:dbg.tpl(:cowboy_http, :setopts_active, :x)

The Plot Thickens

I expected to see alternating calls to each function, and was just curious to see which would be the last call before the request hung. I was surprised instead to see many calls to parse_body for each call to setopts_active. Looking more closely at the code, I found that {:active, n} was being passed to :inet.setopts/2 to request that multiple chunks of data be delivered before the socket went back into passive mode. I had imagined from the documentation that Cowboy would pass {:active, true} to request a single chunk at a time. Digging deeper into the source code, I found an undocumented option I could set all the way back in my Phoenix endpoint configuration to have {:active, 1} passed to :inet.setopts/2 and hopefully make my debugging simpler.

config :skilman, Skilman.Endpoint, http: [protocol_options: [active_n: 1]]
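If you haven't run into {:active, n} before, here's what the delivery looks like from the receiving side. This is a self-contained loopback sketch using plain :gen_tcp, nothing Cowboy-specific: with active: 1, the runtime delivers one :tcp message, then flips the socket to passive and sends :tcp_passive, and it's up to the owner (Cowboy's setopts_active, in our story) to re-arm it.

```elixir
# Minimal loopback demo of {:active, n} delivery (here n = 1): the VM
# turns incoming data into :tcp messages until the count is used up,
# then sends :tcp_passive and waits for the owner to re-arm the socket.
{:ok, listen} = :gen_tcp.listen(0, [:binary, active: false, reuseaddr: true])
{:ok, port} = :inet.port(listen)
{:ok, client} = :gen_tcp.connect({127, 0, 0, 1}, port, [:binary, active: false])
{:ok, server} = :gen_tcp.accept(listen)

:ok = :inet.setopts(server, active: 1)
:ok = :gen_tcp.send(client, "hello")

receive do
  {:tcp, ^server, data} -> IO.puts("got #{data}")
after
  1000 -> raise "no data"
end

# The allotment is spent, so the next message is :tcp_passive; to get
# more data, the owner must call :inet.setopts(server, active: 1) again.
receive do
  {:tcp_passive, ^server} -> IO.puts("went passive")
after
  1000 -> raise "no passive notification"
end
```

With a larger n, many :tcp messages can pile up between re-arms, which is exactly why I saw many parse_body calls per setopts_active call.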

I was surprised to discover that setting this option made the trouble go away! This was sufficient to solve my problem, though probably at some small cost to performance. But my curiosity was ablaze: I still wanted to know where the trouble was.

Redoubled Efforts

Having looked at the source for Cowboy more carefully and having read the documentation for :inet, I thought maybe the function calls I was tracing weren't the most definitive way to see where the trouble was. I now thought I should be tracing the messages from :inet to Cowboy to see whether Cowboy was acting on them as expected. I removed the active_n configuration option to let it go back to its default. Then I set up a different tracing arrangement to see the messages from :inet as they're received:

handler = fn
  {:trace, _pid, :receive, {:tcp, _socket, _data}}, _acc -> IO.puts("ok")
  {:trace, _pid, :receive, {:tcp_closed, _socket}}, _acc -> IO.puts("closed")
  {:trace, _pid, :receive, {:tcp_error, _socket, _reason}}, _acc -> IO.puts("error")
  {:trace, _pid, :receive, {:tcp_passive, _socket}}, _acc -> IO.puts("passive")
  _, _acc -> nil
end
:dbg.tracer(:process, {handler, nil})
:dbg.p(:all, :m)
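The same message-tracing arrangement can be tried out on a toy process, no Cowboy required. In this self-contained sketch the handler forwards each trace record back to the shell process instead of printing, so we can bind and inspect it:

```elixir
# Self-contained sketch of :dbg message tracing on a toy process.
# The handler runs inside the tracer process, so it forwards each
# trace record back to us rather than printing.
parent = self()

handler = fn trace, acc ->
  send(parent, trace)
  acc
end

{:ok, _tracer} = :dbg.tracer(:process, {handler, nil})

pid =
  spawn(fn ->
    receive do
      :ping -> :ok
    end
  end)

:dbg.p(pid, :m)  # trace message sends and receives for this pid only
send(pid, :ping)

traced =
  receive do
    {:trace, ^pid, :receive, msg} -> msg
  after
    1000 -> raise "no trace message"
  end

IO.inspect(traced, label: "traced receive")
:dbg.stop()
```

In the article's session I traced :all rather than a single pid, which is heavier but catches the Cowboy connection processes without having to find them first.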

A Final Twist

Another surprise awaited: now the trouble was gone even with active_n back to its default! I restarted my app and the trouble returned. I set up tracing again and it disappeared. This led me to believe that I wouldn't find the trouble in Cowboy, and I wasn't prepared to dive any deeper than that at the time. I put the active_n configuration back, made a note of why it's there and how to test if it's still needed, and treated myself to some onion rings.

I hope you found this article helpful and interesting! If you did, you may enjoy my other work. If you have any questions or comments, please drop me an email.

Aaron D. Parks