This is a technical post. Regular Fastmail users subscribed to receive email updates from the Fastmail blog can just ignore this post.
The average user of the Fastmail website is probably a bit different to most websites. Webmail tends to be a "productivity application" that people use for an extended period of time. So for the number of web requests we get, we probably have less individual users than other similar sized sites, but the users we do have tend to stay for a while and do lots of actions/page views.
Because of that we like to have a long HTTP keep-alive timeout on our connections. This makes interactive response nicer for users as moving to the next message after spending 30 seconds reading a message is quick because we don't have to setup a new TCP connection or SSL session, we just send the request and get the response over the existing keep-alive one. Currently we set the keepalive timeout on our frontend nginx servers to 5 minutes.
I did some testing recently, and found that most clients didn't actually keep the connection open for 5 minutes. Here's the figures I measured based on Wireshark dumps.
I wondered why most clients used \<= 2 minutes, but Chrome was happy with much higher.
Interestingly one of the other things I noticed while doing this test with Wireshark is that after 45 seconds, Chrome would send a TCP keep-alive packet, and would keep doing that every 45 seconds until the 5 minute timeout. No other browser would do this.
After a bunch of searching, I think I found out what's going on.
It seems there's some users behind NAT gateways/stateful firewalls that have a 2 minute state timeout. So if you leave an HTTP connection idle for > 2 minutes, the NAT/firewall starts dropping any new packets on the connection and doesn't even RST the connection, so TCP goes into a long retry mode before finally returning that the connection timed out to the application.
To the user, the visible result is that after doing something with a site, if they wait > 2 minutes, and then click on another link/button, the action will just take ages to eventually timeout. There's a Chrome bug about this here:
So the Chrome solution was to enable SO_KEEPALIVE on sockets. On Windows 7 at least, this seems to cause TCP keep-alive pings to be sent after 45 seconds and every subsequent 45 seconds, which avoids the NAT/firewall timeout. On Linux/Mac I presume this is different because they're kernel tuneables that default to much higher. (Update: I didn’t realise you can set the idle and interval for keep-alive pings at the application level in Linux and Windows)
This allows Chrome to keep truly long lived HTTP keep-alive connections. Other browsers seem to have worked around this problem by just closing connections after \<= 2 minutes instead.
I've mentioned this to the Opera browser network team, so they can look at doing this in the future as well, to allow longer lived keep-alive connection.
I think it's going to be a particularly real problem with Server-Sent Event type connections that can be extremely long lived. We're either going to have to send application level server -> client pings over the channel every 45 seconds to make sure the connection is kept alive, or enable a very low keep-alive time on the server and enable SO_KEEPALIVE on each event source connected socket.