Copyright © 1999–2018 FastMail Pty Ltd
This blog post is part of the FastMail 2014 Advent Calendar.
Technical level: medium
Availability is the ability for authorised users to gain access to their data in a timely manner.
While both Confidentiality and Integrity are important, they are not noticed unless something goes wrong. Availability, on the other hand, is super visible. If you have an outage, it will be all through the media, as has happened to Yahoo, Microsoft, Google, and indeed FastMail.
Availability at FastMail
We achieve this by reducing single points of failure, and having all data replicated in close to real time.
I was in New York earlier this year consolidating our machines, removing some really old ones, and also moving everything to new cabinets which have a better cooling system and more reliable power. We had outgrown the capacity of our existing cabinets, and no longer had enough power to run completely on just half our circuits.
Our new cabinets have redundant power - a strip up each side of the rack - and every server is wired to both strips, and able to run from just one. Each strip has the capacity to run the entire rack by itself.
The servers are laid out in such a way that we can shut down any one cabinet. In fact, we can shut down half the cabinets at a time without impacting production users. In 2014 it's not such a big deal to be able to reinstall any one of your machines in just a few minutes - but in 2005 when we switched to fully automated installation of all our machines, only a few big sites were doing it. For the past few years, we've been at the point where we can shut down any machine with a couple of minutes' notice to move service off it, and users don't even notice that it's gone. We can then fully reinstall the operating system.
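The layout rule described above (any one cabinet, or half of the cabinets, can be switched off) is the kind of thing that can be checked mechanically. What follows is a minimal sketch, not FastMail's actual tooling: the service names, cabinet labels and the `services_lost` helper are all hypothetical, and it assumes a service stays up as long as at least one copy of it sits in a cabinet that is still running.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical placement: service name -> cabinets that hold a copy of it.
PLACEMENT: Dict[str, Set[str]] = {
    "imap-store1": {"A1", "B1"},
    "imap-store2": {"A2", "B2"},
    "mx": {"A1", "A2", "B1"},
}

# Assumed split of the cabinets into two halves.
HALF_A = ["A1", "A2"]
HALF_B = ["B1", "B2"]


def services_lost(placement: Dict[str, Set[str]], down: Iterable[str]) -> List[str]:
    """Return the services with no running copy if the `down` cabinets are off."""
    down_set = set(down)
    return sorted(
        name for name, cabinets in placement.items() if not (cabinets - down_set)
    )


if __name__ == "__main__":
    for half in (HALF_A, HALF_B):
        lost = services_lost(PLACEMENT, half)
        print(half, "->", lost if lost else "no services lost")
```

Running a check like this before moving hardware around would flag any service whose copies have all ended up in the same half.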
We have learned some hard lessons about availability over the years. The 2011 incident took a week to recover from because it hit every server at exactly the same time. We couldn't mitigate it by moving load to the replicas. We are careful not to upgrade everywhere at once any more, no matter how obvious and safe the change looks!
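One way to act on that lesson is to roll a change out in waves and stop as soon as a health check fails, so the untouched machines can still take the load. The sketch below is only an illustration under assumptions: `apply_change` and `is_healthy` are hypothetical callables supplied by the caller, and this is not a description of FastMail's deployment system.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    waves: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a change wave by wave and return the hosts that were changed.

    If any host in a wave is unhealthy after the change, later waves are
    skipped and stay on the old version.
    """
    changed: List[str] = []
    for wave in waves:
        for host in wave:
            apply_change(host)
            changed.append(host)
        if not all(is_healthy(host) for host in wave):
            break  # stop here; the remaining hosts are left untouched
    return changed
```

Putting replicas in the early waves and the machines currently serving users in the last one means a bad change is caught while service can still be moved elsewhere.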
Availability and Jurisdiction
People often ask why we're not running production out of our Iceland datacentre. We only host secondary MX and DNS, plus an offsite replica of all data there.
While we work hard on the reliability of our systems, a lot of the credit for our uptime has to go to our awesome hosting provider, NYI. They provide rock-solid power and network. To give you some examples:
- During Hurricane Sandy, when other datacentres were bucketing fuel up the staircases and having outages, we lost power on ONE circuit for 30 seconds. It took out two units which hadn't been cabled correctly, but they weren't user facing anyway.
- We had a massive DDoS attempted against us using the NTP flaw a while ago. NYI blocked just the NTP port to the one host being attacked, and informed us of the attack while they asked their upstream providers to push the block out onto the network to kill off the attack. Our customers didn't even notice.
- They provide 24/7 onsite technical staff. Once when they were busy with another emergency, I had to wait 30 minutes for a response on an issue. The CEO apologised to me personally for having to wait. Normal response times are within 2 minutes.
The only outage we've had this year that can be attributed to NYI at all is a 5-minute outage when they switched the network uplink from copper to fibre, and managed to set the wrong routing information on the new link. Five minutes in a year is pretty good.
The sad truth is, we just don't have the reliability from our Iceland datacentre to provide the uptime that our users expect of us.
- Network stats to New York: the only time it drops below 99.99% is July, when I moved all the servers, and there was the outage on the 26th (actually 5 minutes by my watch). As far as I can tell, the outages on the 31st were actually a Pingdom error rather than a problem at NYI.
- Network stats to Iceland: Ignore the 5 hour outage in August, because that was actually me in the datacentre. We don't have dual cabinet redundancy there, so I couldn't keep services up while I replaced parts. Even so, there are multiple outages longer than 10 minutes. These would have been very visible if users were being served from there. As it is, they just page the poor on-call engineer.
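To put those percentages in perspective, here is the arithmetic as a small sketch. The 5-minute outage and the 99.99% line come from the figures above; the function names are just for illustration, and the calculation only counts that one outage, not the server move.

```python
def uptime_percent(downtime_minutes: float, period_days: int) -> float:
    """Percentage of the period during which the service was reachable."""
    period_minutes = period_days * 24 * 60
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_percent: float, period_days: int) -> float:
    """Minutes of downtime allowed in the period while still meeting the target."""
    return period_days * 24 * 60 * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # A 31-day month at 99.99% allows a little under 4.5 minutes of downtime,
    # so a single 5-minute outage is enough to dip below the line.
    print(f"July budget at 99.99%: {downtime_budget_minutes(99.99, 31):.2f} min")
    print(f"5 min in July:   {uptime_percent(5, 31):.4f}%")
    # Over a whole year the same 5 minutes is still better than 99.999%.
    print(f"5 min in a year: {uptime_percent(5, 365):.4f}%")
```

By the same arithmetic, a single 10-minute outage uses up more than two months' worth of a 99.99% budget.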
If we were to run production traffic to another datacentre, we would have to be convinced that they provide a similar level of quality to that provided by NYI. Availability is the life-blood of our customers. They need email to be up, all the time.
Once you get the underlying hardware and infrastructure to the level of reliability we have, the normal cause of problems is human error.
We have put a lot of work this year into processes to help avoid human errors causing production outages. There will be more on the testing process and beta => qa => production rollout stages in a later blog post. We've also had to change our development style slightly to deal with the fact that we now have two fully separate instances of our platform running in production - we'll also blog about that, since it's been a major project this year.
General internet issues
Of course, the internet itself is never 100% reliable, as our customers on Optus and Vodafone in Australia saw recently. Optus were providing a route back from NYI which went through Singtel, and it wasn't passing packets. There was nothing we could do; we had to wait for Optus to figure out what was wrong and fix it at their end.
We had a similar situation with Virgin Media in the UK back in 2013, but then we managed to route traffic via a proxy in our Iceland datacentre. This wouldn't have worked for Australia, because traffic from Australia to Iceland travels through New York too.
We are looking at what is required to run up a proxy in Australia for Asia-Pacific region traffic if there are routing problems from this part of the world again. Of course, that depends on the traffic from our proxy being able to get through.
One of the nastiest network issues we've ever had was when traffic to/from Iceland was being sent through two different network switches in London, depending on the exact source/destination address pair - and one of the switches was faulty - so only half our traffic was getting through. That one took 6 hours to be resolved. Thankfully, there was no production traffic to Iceland, so users didn't notice.
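Why would a single faulty switch drop only some of the traffic? When a network picks a path by hashing the source/destination address pair, each pair always takes the same path, so pairs that map to the broken switch fail every time while the rest look perfectly healthy. The sketch below illustrates the idea only: CRC32 is a stand-in for whatever hash the real equipment used, and the addresses are from documentation ranges, not real hosts.

```python
import ipaddress
import zlib
from typing import Set


def pick_switch(src: str, dst: str, n_switches: int = 2) -> int:
    """Map an address pair to a switch index using a stand-in hash (CRC32)."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    return zlib.crc32(key) % n_switches


def gets_through(src: str, dst: str, faulty: Set[int]) -> bool:
    """A pair gets through only if its switch is not one of the faulty ones."""
    return pick_switch(src, dst) not in faulty


if __name__ == "__main__":
    dst = "198.51.100.10"  # hypothetical destination
    sources = [f"192.0.2.{i}" for i in range(1, 255)]  # hypothetical sources
    ok = sum(gets_through(src, dst, faulty={1}) for src in sources)
    print(f"{ok} of {len(sources)} address pairs still get through")
```

Because the result is stable per pair, retrying from the same host never helps, which is part of what makes this kind of fault so confusing to diagnose.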