I recently ran into a puzzle with one of my Phoenix apps. File uploads would sometimes time out, but only on the production server and not on my workstation.
A tcpdump session revealed that the server's receive buffer was filling up: the Window Size field in successive TCP headers from the server shrank to zero and stayed there until the connection was closed.
Peeking at the source code for Plug and Cowboy led me to believe that neither of them should be applying backpressure to the receive buffer in this case. Plug was putting the request body into a temporary file and should have kept asking Cowboy for body data until it had the whole thing. Even if Plug did get stuck, Cowboy looked like it would ask for data until its buffer was full, and that buffer was big enough for the entire upload that was timing out.
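For a sense of what that loop looks like on the Plug side, here is a rough sketch built on Plug.Conn.read_body/2; the module name, the chunk size, and the file handling are illustrative assumptions rather than code from Plug or from my app.

defmodule UploadSketch do
  # Read the request body in chunks and append each one to an open file.
  # {:more, ...} means there is still body left to fetch, so we loop;
  # {:ok, ...} carries the final chunk.
  def spool_body(conn, file) do
    case Plug.Conn.read_body(conn, length: 1_000_000) do
      {:more, chunk, conn} ->
        IO.binwrite(file, chunk)
        spool_body(conn, file)

      {:ok, chunk, conn} ->
        IO.binwrite(file, chunk)
        {:ok, conn}

      {:error, reason} ->
        {:error, reason}
    end
  end
end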
I figured a good spot to split the problem space would be in Cowboy. If it was asking for more data and not getting it, the trouble would likely be in Erlang or the OS. If it was not asking for more data, then the trouble would likely be in Cowboy itself.
It was time to break out dbg, which I had heard of but not used in anger before.
From looking at the Cowboy source code, I thought I just wanted to see the balance between calls to :cowboy_http.parse_body (which should be called as body data comes in) and :cowboy_http.setopts_active (which should be called to request more data). So I whipped up a little function to handle trace data that would just print out the function name on function calls and ignore returns:
handler = fn
  {:trace, _pid, :call, {_m, f, _a}}, _acc -> IO.inspect(f)
  _, _acc -> nil
end
From my interactive Elixir session on the server, I started up a process to receive trace messages and handle them with my function. I also requested tracing of function calls for all processes. Then I enabled tracing for calls to the two functions I was interested in.
:dbg.tracer(:process, {handler, nil})
:dbg.p(:all, :c)
:dbg.tpl(:cowboy_http, :parse_body, :x)
:dbg.tpl(:cowboy_http, :setopts_active, :x)
I expected to see alternating calls to each function, and was just curious to see which would be the last call before the request hung. I was surprised instead to see many calls to parse_body for each call to setopts_active. Looking more closely at the code, I found that {:active, n} was being passed to :inet.setopts/2 to request multiple chunks of data be delivered before going back into passive mode. I had imagined from the documentation that Cowboy would pass {:active, true} to request a single chunk at a time.
Digging deeper into the source code, I found an undocumented option I could set all the way back in my Phoenix endpoint configuration to have {:active, 1} passed to :inet.setopts/2 and hopefully make my debugging simpler.
config :skilman, Skilman.Endpoint, http: [protocol_options: [active_n: 1]]
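What that option changes is easier to see at the socket level. Here is a minimal sketch of how an {:active, n} socket behaves with plain gen_tcp; the ActiveNSketch module is a hypothetical stand-in for illustration, not anything from Cowboy's source.

defmodule ActiveNSketch do
  # With active: 1, the socket delivers one packet as a {:tcp, ...} message
  # and then drops back to passive mode, announced by {:tcp_passive, ...}.
  # With active: n, n packets are delivered before that notification.
  def wait_for_packet(socket) do
    :ok = :inet.setopts(socket, active: 1)

    receive do
      {:tcp, ^socket, data} ->
        {:data, data}

      {:tcp_passive, ^socket} ->
        # The active count hit zero; nothing more arrives until another
        # setopts call raises it again.
        :passive

      {:tcp_closed, ^socket} ->
        :closed
    end
  end
end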
I was surprised to discover that setting this option made the trouble go away! This was sufficient to solve my problem, though probably at some small cost to performance. But my curiosity was ablaze: I still wanted to know where the trouble was.
Having looked at the source for Cowboy more carefully and having read the documentation for :inet, I thought maybe the function calls I was tracing weren't the most definitive way to see where the trouble was. I now thought I should be tracing the messages from :inet to Cowboy to see whether Cowboy was acting on them as expected. I removed the active_n configuration option to let it go back to its default. Then I set up a different tracing arrangement to see the messages from :inet as they were received:
# Clear the earlier trace patterns and tracer before setting up the new ones.
:dbg.stop_clear()

handler = fn
  {:trace, _pid, :receive, {:tcp, _socket, _data}}, _acc -> IO.puts("ok")
  {:trace, _pid, :receive, {:tcp_closed, _socket}}, _acc -> IO.puts("closed")
  {:trace, _pid, :receive, {:tcp_error, _socket, _reason}}, _acc -> IO.puts("error")
  # :inet sends this when the active count runs out and the socket goes passive.
  {:trace, _pid, :receive, {:tcp_passive, _socket}}, _acc -> IO.puts("passive")
  _, _acc -> nil
end

:dbg.tracer(:process, {handler, nil})
# The :m flag traces message sends and receives for all processes.
:dbg.p(:all, :m)
Another surprise awaited: now the trouble was gone even with active_n back to its default! I restarted my app and the trouble returned. I set up tracing again and it disappeared. This led me to believe that I wouldn't find the trouble in Cowboy, and I wasn't prepared to dive any deeper than that at the time. I put the active_n configuration back, made a note of why it's there and how to test if it's still needed, and treated myself to some onion rings.
I hope you found this article helpful and interesting! If you did, you may enjoy my other work. If you have any questions or comments, please drop me an email.
Aaron D. Parks