On Friday we had one of the worst outages we’ve had in over 3 years. For at least 3 hours, all accounts were inaccessible (web, IMAP and POP), and for a few users, it was several hours longer than that. For some other users, there were additional mailbox and quota problems after that.
Obviously this is something we never want to happen, and over the years we’ve setup many systems to avoid outages like this occurring.
A small “trivial” change to a configuration program that was rolled out caused a cascading series of events that resulted in some important files being corrupted. We had to take down all services and rebuild the corrupted files from the last backup and add in any changes since the backup. Once rebuilt, we were able to bring back up all services. A separate corruption issue that affected a few users caused some longer outages and quota issues until we fixed those mailboxes.
We’ve identified the chain of events that caused the problems. Because it’s actually a chain of events, we’ve identified at least 5 separate issues to fix, so that this problem, or something similar to it won’t happen again, and we’ll be doing those over the next week.
Of course we can never be 100% sure that there aren't other cascade event paths that will cause outages, but by learning from past mistakes, fixing the known problems, continuously enhancing our test infrastructure, and being more aware of possible consequential errors in the future, we're always aiming to minimise the chance of them occurring and providing the highest reliability possible.
Although it's draining and frustrating to see a large problem like this occur that so badly affects our users, it's been fascinating to actually investigate this problem. What we end up seeing is how a set of little mistakes, bad timing, decisions made long ago, and human error (all of which are wonderfully obvious in hindsight) end up causing a much bigger problem than the initial trigger would ever suggest.
In this case, the problem stemmed from a cascading sequence of issues that started with a single misplaced comma.
Putting all the above together, creates the disaster. A small configuration change was rolled out. Every new incoming IMAP connection would cause a new imapd to be forked and immediately abort and dump core. Each core file would end up with a separate filename. This very quickly caused the cyrus meta partitions to fill up. They reached 100% full before we fully realised what was happening. This caused changes to the mailboxes database to only partially write, causing a lot of them to become corrupted.
When we realised this is what had happened, we quickly stopped all services, undid the change, and tried to recover the corrupted databases. Fortunately the databases are backed up each half hour, and there are replicas as well. Using some libraries, we were able to quickly put together code that pulled any still valid records, records from the backup, and records from the replicas and combined them, and rebuilt the mailboxes databases, and then started everything back up.
Fortunately for most people, the mess stopped there. Unfortunately for a few users, there were some additional problems as well.
As well as the mailboxes database corruptions, it was discovered that the code that maintains the cyrus.header and cyrus.index files also didn't like the partial writes that disk full conditions generate. This caused a small number of mailboxes to be completely inaccessible (Inbox select failed).
Fortunately cyrus has a simple utility to fix corruptions of this form called "reconstruct", so we ran that to fix up any broken mailboxes. Fixing up a mailbox with reconstruct however doesn't fix up quota calculations, and that has a separate utility "quota" that you can run with a –f flag to fix quotas. We ran that on users to make sure all quota calculations were correct.
Unfortunately there's a bug in the quota fix code that in some edge cases can double the apparent quota usage of users. This caused a number of accounts to have an incorrect quota usage to be set on their account, and in some cases caused them to go over their quota, causing new messages to be delayed or bounced.
However the story wouldn't be complete without the additional human errors that let this happen as well. Thanks to help from the My Opera developer Cosimo, we internally have a Jenkins continuous integration (CI) server setup. This means that on every code/configuration commit, the following tests occur:
This test should have picked up the problem. So what happened? Well a day or so before the problem commit, another change occurred that altered the structure of branches on the primary repository. This caused the git pull the CI server does to fail, and thus the CI tests to fail.
While we were in the process of working out what was wrong and fixing this on the CI server (only pulling the "master" branch turned out to be the easiest fix), the problematic commit went in. So once we fixed up the git pull issue, we then found that the very first test on the CI server was failing with some strange IMAP connection failure error. Rather than believing this was a real problem, we assumed it was due to something else with the CI tests being broken after the branch changes, and resolved to look at it on Monday. Of course the test really was broken due to a bad commit, and as Murphy's Law would dictate, someone else would do a rollout on Friday Norway time of the broken commit.
A combination of software, human and process errors all came together to create an outage that affected all Fastmail users for several hours at least, and some users even more.
Obviously we want to ensure this particular problem doesn't happen again, and more importantly, that processes are in place where possible to avoid other cascade type failures in the future. We already have tickets to fix the particular code and configuration related issues in this case:
We've also witnessed how important keeping the CI tests clean, and tracking down all failures are important. We've immediately added new tests to sanity check database, imap and smtp connections as a very first step before any other functional tests are run. If any of them fail, we tail the appropriate log files and list the contents of the core directories, so the CI failure emails that are sent to all developers will make it very clear that there's a serious problem that needs immediate investigation.