SSL certificates updated again

A few days ago we updated our SSL certificates. The algorithm used to sign these certificates (SHA256) presented problems with some older clients and operating systems, notably WebOS and Nokia devices. To fix this we got our CA (DigiCert) to re-sign the certificates using the older SHA1 algorithm, which should work pretty much everywhere. These certificates are now live on all of our domains.

Most users should not notice any change. If you are on a device or client where you’ve had to install the DigiCert root certificate in the last few days, you may need to do this again, as these certificates are signed from a different root certificate. If that affects you, the root certificate is available from DigiCert’s root certificate page and is called “DigiCert High Assurance EV Root CA”. If you also need the intermediate cert, it’s available from the same page with the name “DigiCert High Assurance CA-3”.

Posted in Technical. Comments Off

Please update your FastMail password

We’ve just sent the following announcement email to all FastMail users.

Dear FastMail User

You may have heard of a recent security bug in the OpenSSL library (that has been called ‘Heartbleed’) used by two-thirds of the Internet including ourselves and other major sites like Amazon, Google, Yahoo, etc. FastMail was quick to update its servers to fix this bug and issue new SSL certificates as soon as we were made aware of it.

We have no reason to believe any of our servers were targeted or exploited by this security flaw, but given the nature of the flaw it’s impossible to know if this bug was being exploited before it was announced.

Because of this, we are recommending that all FastMail users log out of all existing sessions and change their account passwords.

Again, there’s no evidence our servers or your password have been compromised, but we’re recommending this as a precautionary measure.

If you hate remembering passwords, we recommend you use a password manager program to remember them for you. Most modern browsers (e.g. Firefox, Chrome, etc) have a password manager built in and will offer to remember your passwords for you. LastPass and 1Password are also popular choices.

When you choose a new password, it’s important that you do not use the same password elsewhere and choose a password with reasonable complexity.

Your email is often the key to your online world. Many sites let you reset your password by sending a reset code to your email address. When you reuse your FastMail password at other sites, you’re making it much easier for attackers to potentially break in to your email account. Other sites often don’t have the same high security measures as FastMail (such as compulsory HTTPS, locked-down servers, etc.), which makes them much easier for criminals to break in to. If they hold your email address and the same password that you use for FastMail, the attacker can then access your email account and get into everything else you use online.

If you’re using alternative logins already, we recommend you delete and re-add them with any base password changed.

To change your password and log out of all existing sessions, you can use these steps.

Change password in current interface

  1. Log in to your FastMail account using the web interface
  2. From the menu at the top left, select ‘Password & Security’
  3. Enter your existing password where directed
  4. Enter your new password where directed. Re-enter it to make sure we got it right
  5. In the ‘Logged in Sessions’ section, click ‘Log out’ next to each existing session
  6. Click ‘Done’ to dismiss the panel
  7. From the menu at the top left, select ‘Log out’
  8. Now log in to your account again with your new password. This is often useful as a password manager will now prompt you to remember your password at this point.

Change password in ‘classic’ interface

  1. Log in to your FastMail account using the web interface
  2. Select the ‘Account’ item at the top right
  3. Select the ‘Password/Security Settings’ item
  4. Enter your new password where directed. Re-enter it to make sure we got it right
  5. Enter your existing password where directed
  6. Click ‘Update Password’
  7. Click ‘Logged In Sessions’ in the sidebar on the left
  8. Click ‘Delete’ next to each existing session
  9. Click ‘Log out’ at the top right
  10. Now log in to your account again with your new password. This is often useful as a password manager will now prompt you to remember your password at this point.

Again, this is a highly precautionary measure. FastMail is extremely concerned about security and has always tried to be highly proactive in keeping our customers’ accounts and data as secure as possible.


The FastMail Team

Posted in News. Comments Off

When two-factor authentication is not enough

TL;DR: This is the story of a failed attempt to steal FastMail’s domains.

We don’t publish all attempts on our security, but this one stands out for how much effort was put into the attack, and how far it went.

We’ve had a handful of minor attack attempts recently. Targeted phishing emails to staff trying to steal credentials. An NTP-based DDoS which was quickly mitigated by NYI, our excellent hosting service.

These sorts of attacks are the “background radiation” of the internet. Along with port scans and entries in the web server logs from malware trying us out to see if we’re vulnerable to old PHP bugs (hint, we’re not). It’s the reality of being on the internet.

This blog post was first drafted before the Heartbleed fiasco. Sometimes, no matter how careful you are, you get a nasty surprise. We responded very quickly, as always. Anyway, on with this story.

About a month ago, our account suddenly wound up subscribed to hundreds of mailing lists. All these mailing lists failed to use double or confirmed opt-in, so someone was simply able to enter the email address into a form and sign us up, no confirmation required. This really is poor practice, but it’s still pretty common out there. A special shout-out goes to government and emergency response agencies in the USA for their non-confirmation signup on mailing lists. Thanks guys.

The upshot was that the hostmaster address was receiving significant noise. Rob Mueller (one of our directors) wasted (so we thought) a bunch of his time removing us from those lists one by one, being very careful to check that none of the ‘opt-out’ links were actually phishing attempts. This turns out to have been time very well spent.

Internet identities

At FastMail, our security is central to the safety of our users. Given access to an email account, an attacker could reset passwords on other sites, including those which allow spending real money.

We take this responsibility very seriously, and we’re always looking for ways to improve our security.

Two factor authentication (2FA)

The Domain Name System is one thing that’s even lower layer and more central to identity security than the email server itself.

Based on recent articles in the tech press, we really wanted to have ALL our domains protected by two factor authentication.

Our domains were historically spread amongst multiple registrars. We chose to consolidate with Gandi because they have a great slogan (“no bullshit”), they support 2FA, and they support all the top level domains we require.

Robert Norris, one of our sysadmins, was in charge of the migration. He set up a corporate account with Gandi to get assistance in transferring the domains, and set up two-factor authentication at the same time.

Gandi uses the popular OATH TOTP (also known as “Google Authenticator”) mechanism. Rob wrote a small TOTP client and placed it with the key on our management servers in the secure storage area where we also keep our SSL certificates. The account password itself was encrypted in our password manager, which is stored separately.
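For readers unfamiliar with the mechanism, here is a minimal sketch of how a TOTP code is derived under RFC 6238 (HMAC-SHA1, 30-second time steps, 6 digits). It is not the client Rob wrote; the function name `totp` and its parameters are illustrative assumptions. Authenticator apps usually display the shared key in base32, so a real client would decode it (for example with `base64.b32decode`) before use.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, now: Optional[float] = None, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code with HMAC-SHA1 (illustrative sketch)."""
    if now is None:
        now = time.time()
    # Number of whole time steps since the Unix epoch, as an 8-byte big-endian counter.
    counter = int(now // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte selects 4 bytes.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 test vector: ASCII secret "12345678901234567890" at T = 59 seconds.
print(totp(b"12345678901234567890", now=59, digits=8))  # 94287082
```

Both the server and the client hold the same secret, which is why the key file needs the same level of protection as a password.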

Only a small number of trusted people have access to the credentials for our Gandi account. We are satisfied that this level of security is strong enough.

The attempt

On March 19th, this came to the hostmaster address:

From: Gandi Corporate Team
Subject: [RN1374-GANDI] Email address update request
Date: Wednesday, March 19, 2014 8:27 PM


We received an email update request for the account RN1374-GANDI.

Previous email address:
New email address:

If you are opposed to this modification, thank you for letting us know
only by replying to this email.

If you can read this message, then you can recover the password of your
account, and thus modify the email address of the handle. In that case,
we won't take care of your request.

Without any reply from your side, we will proceed within 24 hours.

Best regards,

Gandi Corporate

The hostmaster alias actually forwards to three of us, and we were all hyper-alert, so we thankfully noticed this email.

Within twenty four hours.

One day.

Gandi assure us that their fraud detection systems would have detected this, but for the 2 weeks it took from this email until we had full control over our account again, we were worried.

This request had completely bypassed our two-factor protection.

Forged source addresses

There is a well known problem in network security. You can’t trust the source address of an IP packet – they are trivially forged.

It’s the reason why we have source port randomisation, sequence number randomisation… all the things designed to stop an attacker being able to forge both an initial SYN packet and also the response to an ACK packet to bring up a TCP connection.

While they can falsify the source of a request, an attacker without full network control can not receive the response to their forged packet and continue the handshake.

This is why this email was such a surprise. Like the poor quality mailing lists mentioned above, it didn’t require a confirmed opt-in. We had to reply to say that we didn’t want the contact email address changed.

This means that a forged source address was sufficient. Even though the attacker couldn’t read email sent to our hostmaster address, they didn’t need to. All they needed was for us to not read it.
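The difference between "proceed unless someone objects" and a confirmed change can be sketched as follows. This is a hypothetical illustration, not Gandi's process: the class `EmailChangeRequests`, its methods and the 24-hour lifetime are all assumptions. The point is that a request does nothing unless a secret token, delivered only to the current address, is presented back.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class EmailChangeRequests:
    """Pending contact-address changes that apply only after explicit confirmation."""

    def __init__(self, ttl_seconds: int = 24 * 3600):
        self.ttl_seconds = ttl_seconds
        # token -> (account, new_email, expiry timestamp)
        self._pending: Dict[str, Tuple[str, str, float]] = {}

    def request(self, account: str, new_email: str, now: Optional[float] = None) -> str:
        """Record a request and return the token to mail to the CURRENT address."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._pending[token] = (account, new_email, now + self.ttl_seconds)
        return token

    def confirm(self, token: str, now: Optional[float] = None) -> Optional[Tuple[str, str]]:
        """Return (account, new_email) if the token is valid, else None."""
        now = time.time() if now is None else now
        entry = self._pending.pop(token, None)
        if entry is None:
            return None  # unknown or already used token
        account, new_email, expires = entry
        if now > expires:
            return None  # silence means the request lapses, not that it proceeds
        return account, new_email
```

With this shape, an unread email leaves the account unchanged, which is the opposite of the default in the message above.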

To Gandi’s credit, they responded very quickly to our “NO, DON’T CHANGE IT” email, and locked our account to stop any further shenanigans while they investigated and collected more documents from us.

Falsified documents

We discovered that Gandi received a paper email change form (pdf) claiming to be from a “Robert NORRIS” (the name which appears on our whois data), along with pictures of a passport of said “Robert NORRIS” and company registration documents also claiming to be for FastMail Pty Ltd.

At the time of writing, we are still in debate with the Gandi Legal Department about whether they can even show us these documents. They claim that French Law forbids them from showing us documents which purport to be from us. This is something to be aware of when choosing a vendor – different companies operate in different jurisdictions. There’s also a certain degree to which the conservatism of legal departments (protect the company as much as possible) conflicts with the corporate motto (“no bullshit”). The first response we got was certainly bullshit – “in order to meet a legal or regulatory obligation”. We challenged them to give an actual legal obligation and were given Article 226-15 of the French Criminal Code, along with a rough English translation as follows:

“The act, committed in bad faith, of opening, deleting, delaying or diverting correspondence, whether or not it arrived at its destination, and addressed to a third party, or to fraudulently gain knowledge thereof, is punishable by one year of imprisonment and a fine of 45,000 EUR.”

We don’t believe that law is relevant – it’s the “no interception” law that exists everywhere, and doesn’t forbid anyone from quoting documents in replies to the purported source of those documents. If the law really was as Gandi Legal seem to be interpreting it, it would be illegal to quote an email in your response unless you were certain that the source address hadn’t been faked.

Was this a “security flaw”?

Security is built in layers, and I would definitely say that the fact that we received that email means one of the layers was weaker than it should be. Partly it’s poor choice of wording (Gandi claim that they would not necessarily have changed the email within 24 hours, depending on other investigations).

It still would have been necessary to either disable or reset the two-factor authentication on our account as well for the attacker to get full control. That’s difficult, but not necessarily impossible. After the fact, there’s no way to know how it would have gone down. We certainly weren’t willing to take the risk of doing nothing and seeing what happened!

What we do know is that the attacker was very determined, and willing to go as far as forging documents while simultaneously generating noise to make us less likely to notice the attack. They must have figured they had a chance.

Improving security for the future

A disadvantage of adding something like two-factor authentication after the fact is that you may miss the interactions with your existing processes. Gandi’s paper “email reset” form makes a lot of sense in the world where most of their customers are individuals or small businesses with one or two domains, and using addresses that they may lose access to. With no other factors, if they lose access to the email address and forget their password, there needs to be a process to regain access.

It’s always great to have a consistent process. Having a consistent process means that attackers can’t just try their luck until they find someone who is more trusting than average. Australia has a fantastic system called the 100 point check for authenticating people. We like process, consistently applied.

The problem we have is that we didn’t expect that the account email address could be changed without any reference to our two factors at all. Maybe nobody at Gandi realised either. That’s a security flaw – even if it doesn’t mean everything is totally broken.

We have had some very frank discussions with Gandi over the past week, and they agreed to make all three of the improvements we proposed as a result of these events:

  • the setting “disable password resets via email” was not on the security settings page of their website. Because of this, we hadn’t discovered and enabled it. They are moving it to the security page.
  • if an account has 2FA enabled, a red flag will automatically be raised against the request, meaning significant extra investigation will be done.
  • if an account has 2FA enabled, then an active confirmation will be required from the owner of the account before changing the email address. This means it will be harder to regain access if you lose all your factors, but that’s a good thing! Turning on 2FA means you want it to be hard for anyone who doesn’t have those two factors to gain access.

These steps will make attacks against Gandi accounts even more difficult in future, and we applaud their efforts to improve security and willingness to listen to our concerns.

There is one other measure that we have suggested which is still under discussion: requiring the TOTP code to be entered on the password reset form, rather than using a secret question. We believe secret questions are bogus security, and we have an appeal to authority to back us up.

Gandi have blogged about this as well, and also given some general advice on keeping your account with them secure.


FastMail came out of this attack unscathed. Our domains are now even more secure, because Gandi has tons of proof on file about who we are and who our company is. Also Gandi’s processes have become more secure as a result of our experiences, so we are confident that we can safely keep our domains with them.

An important lesson learned is that just because a provider has a checkbox labelled “2 factor authentication” in their feature list, the two factors may not be protecting everything – and they may not even realise that fact themselves. Security risks always come on the unexpected paths – the “off label” uses that you didn’t think about, and the subtle interaction of multiple features which are useful and correct in isolation.

Posted in Technical. Comments Off

All SSL certificates updated

Based on a recent security issue in the OpenSSL library, we’ve updated all our server software and taken the precaution of replacing all of our SSL certificates. Most users shouldn’t notice any difference, but if your email client/xmpp client/ldap client/etc reports a certificate issue, this is probably the reason why.

Posted in News. Comments Off

FastMail housekeeping – removing little used features and simplifying others

In maintaining a large system like FastMail, we often come across code or configuration that is harder to modify and update than we expect because of the way it interacts with some particular feature or features. Normally this just means finding a different way of doing it. However, in some cases the feature is used by so few people, and the original reason it was useful is now so much less important, that we’ve decided to retire a few rarely used features or update them to work differently than they have traditionally.

Below we describe the features we’re removing or changing, the rationale, and an alternative if available.

We plan to roll out what changes we can on beta over the next month, and fully roll them out everywhere on April 30.

Update: These changes are taking longer than expected to complete, so only some have been done so far. Details inline below.

Removing WAP

We’ve had a WAP server for many years. WAP was a vastly reduced markup and display system designed for accessing internet content on early feature phones that didn’t have the power or bandwidth to render full HTML pages.

With the rise of smartphones and full HTML browsers, the use of WAP has dwindled to only a couple of users. Because of that, and because in most cases Opera Mini (which can access the HTML classic UI) will run on the phones people are using WAP on, we’ve decided to completely shut down WAP.

Update: May 8, the WAP server has now been removed.

Removing Email reflector

The email reflector was an interesting attempt to provide an alternative to email forwarding. Basically the idea arose because some work places didn’t like employees logging into webmail accounts at work. So what you could do was “reflect” your FastMail email to your work address, and when you replied to the email from your work account, it would reflect back via FastMail and all the email addresses and headers would be rewritten to make it look like it came from your FastMail account!

In theory this was a really neat idea, however in practice it never quite worked as reliably as we liked. It was always marked as “in early stages of development” and was prone to “leaking” strange email addresses or creating extremely bizarre results if reflected and non-reflected email addresses somehow ended up being used together (e.g. someone accidentally added a “reflected” address into their work address book and used that with a non-reflector email address). Internally the code and configuration to make it all work is complex and messy.

Additionally, these days many people whose workplaces have restrictive web policies have a personal smartphone they can easily configure to access any email account they want. That seems a much more natural solution to the problem.

On April 30, we’ll remove all reflector rules for anyone who still has them set up. We recommend you manually remove them before then so you’re more in control.

Removing SMS Sending via the web interface or SMTP

It’s currently possible to send SMS messages directly from the FastMail compose screen. You have to buy some SMS credits first, but after that, you just include a number@sms email address in your to/cc/bcc addresses and it’ll convert the first 160 chars into an SMS to that number. This also works via SMTP. You also have to set an originator phone number for the personality you are using to send from. In theory, this is the phone number the SMS will appear to come from.

When this was first implemented almost 10 years ago, it was a really useful feature. Most people had feature phones that were slow to type SMS’s on. Using this feature, you could quickly type a message in the web interface/email client, and to the recipient it would appear to have come from your phone, so if they replied to it, the reply would go to your phone.

Since then though, the usefulness of this has dropped significantly. Most people have smartphones where it’s now much easier to tap out a quick message. Also, mobile operators became much more strict about setting arbitrary originator numbers and now most block such messages. In its current state, most messages sent from FastMail now appear to come from a fixed number, not the originator number people have set on the personality, so if someone replies to the SMS, the reply disappears rather than going to the original sender.

On top of this, this feature has been an ongoing source of fraud issues for us. We still regularly see accounts signed up with stolen credit cards for the sole purpose of sending SMS spam, even with our heavy rate limiting.

Because of these flaws, we’re going to remove the ability to send arbitrary messages to SMS numbers altogether. Note that this will NOT affect SMS forwarding rules or SMS two factor authentication. Both of those are definitely being kept and will continue to work.

Update: May 8, the SMS sending via web interface/SMTP has been removed

Simplifying Pop Link retrieval

One of the features Pop Links have is the ability to set scheduled retrievals (every 1, 2, 3 or 12 hours, daily, or weekly). The minimum you can set is based on your current service level. In addition in the classic interface, you can perform additional manual retrievals from the action menu on demand. The fact the current interface doesn’t have this feature is regularly cited by a number of users as a reason not to switch interfaces.

The original reason for the different retrieval schedule limits was because we feared retrieval might be a resource intensive process with many users. What we’ve found is that almost all Pop Links are set to retrieve on the shortest interval possible for that service level and that they consume relatively negligible resources.

So what we’re going to do is remove the ability to schedule different retrieval periods, and instead just have a simple “manual” or “auto” mode switch. Respectively these will:

  • “manual” mode
    – No automatic checks
    – Classic UI: Can select from the action menu to manually retrieve
    – Current UI: (Update: This bit added based on user feedback, matches what the auto mode also does. If you don’t want this, you can still manually disable a pop link to stop this happening) If you click on a folder to go to that folder or ‘refresh’ the current folder, then any pop links that file into that folder get checked at that moment. Can go to the Advanced -> Pop Links screen to manually retrieve.
  • “auto” mode
    – Automatically checks every 1 hour
    – While logged into and active on either web interface, checks every 5 minutes (active is defined as performing actions which cause your browser to communicate with our server)
    – Classic UI: Can select from the action menu to manually retrieve.
    – Current UI: If you click on a folder to go to that folder or ‘refresh’ the current folder, then any pop links that file into that folder get checked at that moment. Can also go to the Advanced -> Pop Links screen to manually retrieve.

We believe this fits much better with what people actually want. Namely that emails are regularly retrieved from a remote service, that retrieves occur more frequently while you’re actually logged in and using the web interface, and that there’s an explicit way to perform a retrieve if you absolutely want to do one then and there.
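The scheduling rules for the two modes can be summarised in a few lines. This is only a sketch of the behaviour described above, not our actual scheduler; the function and constant names are made up for illustration.

```python
from typing import Optional

AUTO_IDLE_INTERVAL = 60 * 60   # seconds: "auto" mode checks every hour
AUTO_ACTIVE_INTERVAL = 5 * 60  # seconds: every 5 minutes while the user is active


def next_check_interval(mode: str, user_active: bool) -> Optional[int]:
    """Return seconds until the next scheduled retrieval, or None if there is none."""
    if mode == "manual":
        # No automatic checks; retrieval happens only when explicitly triggered.
        return None
    if mode == "auto":
        return AUTO_ACTIVE_INTERVAL if user_active else AUTO_IDLE_INTERVAL
    raise ValueError(f"unknown Pop Link mode: {mode!r}")
```

Explicit retrievals (the action menu, the Pop Links screen, or clicking a folder in the current interface) sit outside this schedule and fire immediately.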

Update: June 2, the “click on a folder to activate pop links filing into that folder” has been rolled out, the remaining changes will be rolled out soon.

Update: June 5, the remaining features have been rolled out and the Pop Links screen updated to allow “manual” or “auto” mode selection, as well as allowing an explicit “fetch” of a pop link along with the previous “test” of a pop link

Simplifying Personalities

SMTP FROM Envelope

Personalities have an option “SMTP FROM Envelope”. The fact that the option says “Advanced: SMTP MAIL FROM envelope address to use. Leave blank unless told to” gives you some idea that this is a very rarely used option. Basically the point of it was to avoid SPF failures when sending email made to look like it came from an external service.

Realistically the amount of email blocked due to SPF failures is extremely low these days. SPF never really fixed any particular email problems (and added some really nasty ones like breaking forwarding without using a horrible hack) and ended up just becoming another scoring marker of little value in the overall judgement of an email’s spamminess.

The correct solution to this issue is to use the actual external server for that domain to send as that domain. In that case, the From address (in the header) and the SMTP MAIL FROM address (in the SMTP protocol) are the same, and so only the Email address field of personalities is required and is what will also be used for the SMTP MAIL FROM envelope.
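To make the two addresses concrete, here is a small sketch using Python's standard `email` library. The helper `build_message` and the example addresses are hypothetical. It shows the header From living inside the message while the envelope sender is a separate value handed to the SMTP client, and how the simplified setup uses one address for both.

```python
from email.message import EmailMessage
from typing import Tuple


def build_message(personality_email: str, to_addr: str, body: str) -> Tuple[EmailMessage, str]:
    """Build a message and return it with the envelope sender to use."""
    msg = EmailMessage()
    msg["From"] = personality_email  # header From: what the recipient sees
    msg["To"] = to_addr
    msg["Subject"] = "Example"
    msg.set_content(body)
    # Envelope sender (SMTP MAIL FROM): taken from the same personality address.
    envelope_from = personality_email
    return msg, envelope_from


msg, envelope_from = build_message("me@example.org", "you@example.com", "Hello")
# Sending would pass the envelope sender separately, for example:
#   smtplib.SMTP("smtp.example.org").send_message(
#       msg, from_addr=envelope_from, to_addrs=["you@example.com"])
```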

Update: June 5, the SMTP FROM Envelope feature has been removed from the Personalities screen and disabled on sending


Currently there’s a separate signatures screen for setting your signatures. You then select which signature to associate with each personality.

Most people don’t work this way and find this extra level of indirection annoying or confusing. So we’re going to remove the signatures screen and just allow setting/editing of signatures on the personalities screen.

The one thing this will affect is the classic compose interface. In the advanced section you can choose which signature you want to use separate to the current personality. This option will be removed. If you want to use separate custom signatures, we recommend putting them in the Notepad and using the “Insert note” feature before sending.


Well that took longer than I expected. In some ways it’s sad to see some of these go (I wrote most of the code behind them!), but realistically the things being removed are little used and the proposed changes are small but neaten up some strange legacy edges and result in a better overall product for the majority of users.

Posted in News. Comments Off

Improved default search behaviour in classic interface

When we introduced the current interface, one of the features we were really happy with was our vastly improved searching. Basically we implemented a full text index that allowed you to search the headers and content of all your email in all folders for any words in a few seconds (in IMAP parlance, this uses the FUZZY SEARCH extension).

At the time, we decided to leave the search on the classic interface as it was (by default, search from/to/cc/subject headers but not the message content and search on substrings rather than whole terms).

However the general consensus from classic users is that they’d really like the improved search that the current interface comes with. So today we’ve rolled out a change that better unifies the search syntax on both the classic and current interfaces.

So now on both interfaces if you do a search:

dinner john

It will do a fast indexed search of the from, to, cc, bcc and subject headers as well as the message body content for messages that contain both “dinner” and “john”. This search is done on words/terms with stemming where possible, not sub-strings. This searches the current folder by default on classic, and across all folders by default on the current interface. On classic, you can check the “All” checkbox to search across all folders.

If you want to revert to the historical sub-string searching of headers, you can use the substr: modifier. Some more examples:

  • example – fuzzy search from/to/cc/bcc/subject headers and message bodies for the word “example”
  • body:example – fuzzy search message bodies for the word “example”
  • to:example – fuzzy search to/cc/bcc headers for the word “example”
  • onlycc:example – fuzzy search cc header for the word “example”
  • substr:example – search from/to/cc/subject headers (but not body content) for the substring “example”
  • substr:(dinner john) – search from/to/cc/subject headers for both substrings “dinner” and “john”
  • substr:(body:example) – search message body content for the substring “example” (warning: likely very slow!)
  • substr:(to:example) – search to/cc/bcc headers for the substring “example”
  • substr:(onlycc:example) – search cc header for the substring “example”
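The practical difference between term matching and sub-string matching can be shown with a toy example. This is not how our index works (it ignores stemming, field selection and indexing entirely); the two function names are invented purely to illustrate why a query like “din” finds “Dinner” only with substr:.

```python
import re


def term_match(query: str, text: str) -> bool:
    """True if every query word occurs in text as a whole word (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(q in words for q in query.lower().split())


def substr_match(query: str, text: str) -> bool:
    """True if every query word occurs anywhere in text as a substring."""
    lowered = text.lower()
    return all(q in lowered for q in query.lower().split())


subject = "Dinner with John on Friday"
print(term_match("dinner john", subject))  # True
print(term_match("din", subject))          # False
print(substr_match("din", subject))        # True
```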

A complete list of all the search options can be found on our mailbox searching help page.

Posted in News. Comments Off

Cleaning up from an IMAP server failure

This blog post is highly technical. I cover details about how our email storage system works and how it was impacted by a complex server corruption and failure. I explain why our normal procedures failed in this instance, the ways in which our system helped us to track down and restore almost all the impacted emails, and some improvements we are making for our system to be more resilient in future.


To show how things fit together, I’m going to start with an overview of our infrastructure and internal terminology. Some details have been shared before, but the exact setup changes over time. If you want to get straight on with the story, skip to the heading “The failure” below.

A slot

Each mail server is a standalone 2U or 4U storage server. Storage on each server is split into multiple slots (currently 15, 16 or 40 per machine depending on hardware model).

All current slots are called “teraslots”, consisting of:

  • A 1TB partition for long-term email storage.
  • Space on a shared high-speed SSD for indexes, caches, recently delivered email, etc.
  • Space on a separate partition for long-term search indexes.
  • Space on a RAM filesystem for search indexes of recently delivered email.

All partitions are encrypted, either with LUKS or directly on the hardware if supported.

Every slot runs an entirely separate instance of the Cyrus IMAP server, complete with its own configuration files and internal IP address. Instances can be started and stopped independently. Configuration files are generated by a template system driven from a master configuration file, for consistency and ease of management.

A store

A store is the logical “mail server” on which a user’s email is stored. A store is made up of a number of slots, using Cyrus’ built-in replication to keep them synchronised. At any time, one slot is the master for a store, and it replicates changes to all the other slots.

A normal store currently consists of three slots, two in New York and one in Iceland. To move a slot between machines, it’s easiest to configure a new replica slot, wait until all data is replicated and the new slot is up to date, and then remove the unwanted replica slot. In the case where we’re moving things around, a store may consist of more slots – there’s no real limit to how many replicas you can run at once.

We spread the related slots of different stores such that no one machine has too many “pairs” to another machine. This means that if a single server fails, the load spreads to many machines, rather than doubling the load on one other box.

For each store, one slot is marked in our database as the “master slot”. In the past we used to bind an IP address to the master slot and use magic IP failover, but no more. Instead, all access is either via a perl library (which knows which slot is the master) or via the nginx frontend proxy, which selects a backend using the same perl library during login.
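As a rough mental model, the relationship between stores, slots and the master marker looks something like the sketch below. The class and field names are illustrative assumptions, not our database schema or the perl library's interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    name: str      # hypothetical identifier for one Cyrus instance
    server: str    # physical machine hosting the slot
    location: str  # e.g. "New York" or "Iceland"


@dataclass
class Store:
    name: str
    slots: List[Slot]
    master_slot: str  # name of the slot currently marked as master

    def master(self) -> Slot:
        """Return the slot that all access should be directed to."""
        for slot in self.slots:
            if slot.name == self.master_slot:
                return slot
        raise LookupError(f"store {self.name} has no slot named {self.master_slot}")

    def replicas(self) -> List[Slot]:
        """Return every slot that receives replicated changes from the master."""
        return [s for s in self.slots if s.name != self.master_slot]


store = Store(
    name="store1",
    slots=[
        Slot("slot-a", "imap1", "New York"),
        Slot("slot-b", "imap2", "New York"),
        Slot("slot-c", "imap3", "Iceland"),
    ],
    master_slot="slot-a",
)
print(store.master().server)                # imap1
print([s.name for s in store.replicas()])   # ['slot-b', 'slot-c']
```

Because the master is just a marker looked up at connection time, changing it does not involve moving an IP address between machines.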


The cyrus replication system doesn’t record the actual changes to mailboxes: it just writes “something changed in mailbox X” to a log file (or in our multi-replica configuration, to a separate log file per replica).

A separate process, sync_client, runs on the master slot. It takes a working copy of the replication log files that other cyrus processes create. If the sync_client process fails or is interrupted, it always starts from the last log file it was processing and re-runs all the mailboxes. This means that a repeating failure stops all new replication for that replica, which becomes relevant later.

The sync_client process combines duplicate events, then connects to a sync_server on the other end and requests some information about each mailbox named in the log. It compares the state on the replica with the state on the master, and determines which commands need to be run to bring the replica into sync, including renaming mailboxes. Mailboxes have a unique identifier as well as their name, and that identifier persists through renames.
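As a rough illustration of the "combine duplicate events" step (not the actual Cyrus code), here is a minimal sketch. It assumes a simplified log in which each line just names a changed mailbox; the real sync log format has more record types, and `combine_events` is a hypothetical name.

```python
def combine_events(log_lines):
    """Collapse repeated 'mailbox changed' entries, keeping first-seen order."""
    seen = set()
    mailboxes = []
    for line in log_lines:
        name = line.strip()
        # Skip blank lines and mailboxes already queued for a sync.
        if not name or name in seen:
            continue
        seen.add(name)
        mailboxes.append(name)
    return mailboxes


log = ["user.brong", "user.brong.Trash", "user.brong", "", "user.brong.Trash"]
print(combine_events(log))
```

Since each mailbox is compared against the replica as a whole, visiting it once per log file is enough no matter how many times it changed.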

We have extensively tested this replication process. Over the years we have fixed many bugs and added several features, including the start of progress towards a true master-master replication system. We still run a task every week which compares the state of all folders on the master and replica, to ensure that replication is working correctly.


Our failover process is fairly seamless. If everything goes cleanly, the only thing users see is a disconnection of their IMAP/POP connection, followed by a slow reconnection. There’s no connection error and no visible downtime. Ideally it takes under 10 seconds. In practice it’s sometimes up to 30 seconds, because we give long-running commands 10 seconds to complete, and replication can take a moment to catch up as well.

Failover works like this (failover from slot A to slot B):

  1. Check the size of log files in replica-channel directories on slot A. If any are more than a couple of KB in size, abort. We know that applying a log file of a few KB usually only takes a few seconds. If they’re much bigger than that, replication has fallen behind and we should wait until it catches up, or work out what’s wrong. It’s possible to override this with a “FORCE” flag. We also check that there’s a valid target slot to move to, and a bunch of other sanity stuff.
  2. Mark the store as “moving” in the database.
  3. Wait 2 seconds. Every proxy re-reads the store status every second from a local status daemon which gets updates from the database, so within 1 second, all connection attempts to the store from our web UI or from an external IMAP/POP connection will be replaced by a sleep loop. The sleep loop just holds off responding until the moving flag gets cleared, and it can get a connection to the new backend.
  4. Shut down master slot A. At this point, all existing IMAP/POP connections are dropped. It will wait for up to 10 seconds to let them shut down cleanly before force closing the remaining processes.
  5. Inspect all channel directories for log files on slot A again. Run sync_client on each log file to ensure that they sync completely. If there are any failures, bail out by restarting with the master still on slot A.
  6. Shut down the target slot B.
  7. Change the database to label the new slot B as the master for this store.
  8. Start up slot A as a replica.
  9. Start up slot B as the master (this also marks the store as “up”).

Within a second, the proxies know that the store is available again and they continue the login process for waiting connections.
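The pre-flight check in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the 2048-byte threshold stands in for "a couple of KB", and `can_fail_over` and its arguments are hypothetical names, not part of our tooling.

```python
MAX_PENDING_BYTES = 2048  # assumed stand-in for "a couple of KB"


def can_fail_over(pending_log_sizes, force=False):
    """Decide whether a clean failover may start.

    pending_log_sizes maps replica channel name -> size in bytes of the
    unprocessed replication log for that channel.
    Returns (ok, lagging_channels).
    """
    lagging = sorted(
        channel
        for channel, size in pending_log_sizes.items()
        if size > MAX_PENDING_BYTES
    )
    # The FORCE flag overrides the check but still reports what is behind.
    return (force or not lagging), lagging


print(can_fail_over({"ny-replica": 512, "iceland-replica": 90000}))
print(can_fail_over({"ny-replica": 512, "iceland-replica": 90000}, force=True))
```

Returning the lagging channels along with the decision means that a forced failover still tells the operator which replicas may be missing changes.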

Unclean failover

In the case of a clean failover, all the log files are run, and the replica is 100% up-to-date.

If for some reason we need to do a forced failover (say, a machine has failed completely and we need to get new hardware in), then we can have what’s called a “split brain”, where changes were written to one machine, but have not been seen by the other machine, so it makes its own changes without knowledge of what has already happened.

Split-brain is a common problem in distributed systems. It’s particularly nasty with IMAP because there are many counters with very strict semantics. There’s the UID number, which must increase without gaps (at least in theory) and there’s the MODSEQ number, which similarly must increase without changes ever being made “in the past”, otherwise clients will be confused.

Recovering from split brain without breaking any of the guarantees to clients was a major goal of the rewrite of both the replication protocol and mailbox storage which I did in 2008-2009. These changes eventually led to Cyrus version 2.4.

Anti-corruption measures

We also want to be safe against random corruption. We have found that corruption occurs a couple of times per year across all our servers. The cause is hard to pin down, but from what we’ve seen it is most likely hard drive or RAID controller issues. We were also bitten hard in the past by a particularly nasty Linux kernel bug which wrote a few zeros into our metadata files during repack.

Since then we have added checksums everywhere: a separate crc32 for each record in a cyrus index, cache or db, and a sha1 of each message file. Cyrus also sends a crc32 of the “sync state” along with every mailbox, allowing it to determine if the set of changes did actually create the same mailbox state during a sync. A sync_crc mismatch triggers a full comparison of all data in that mailbox, allowing sync_client to repair the error and resync the mailbox.
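To make the two kinds of checksum concrete, here is a minimal sketch using Python's standard `zlib` and `hashlib` modules. The function names are hypothetical, and it does not reproduce how Cyrus lays out records or computes its sync_crc; it only shows the primitives.

```python
import hashlib
import zlib


def record_crc32(record: bytes) -> int:
    """CRC32 of one metadata record, as an unsigned 32-bit value."""
    return zlib.crc32(record) & 0xFFFFFFFF


def message_guid(message: bytes) -> str:
    """SHA-1 of the raw message file, as a 40-character hex string."""
    return hashlib.sha1(message).hexdigest()


def verify_message(message: bytes, stored_guid: str) -> bool:
    """True if the bytes on disk still match the checksum recorded earlier."""
    return message_guid(message) == stored_guid


original = b"Subject: hello\r\n\r\nbody\r\n"
guid = message_guid(original)
print(verify_message(original, guid))                  # intact copy
print(verify_message(original[:-3] + b"\0\0\0", guid))  # a few zeros written in
```

A cheap CRC is enough to notice accidental damage to small records, while the SHA-1 doubles as a content identifier for the message file.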

And now you know enough about our architecture to understand what happened!

The failure

On Thursday 27th February, at 4:30am Melbourne time, Rob N (the on-call admin for the night) was paged by our monitoring system because one of our servers (imap21) was not responding. You can read his initial blog post about the incident as well.

He initially thought it was just a crashed server – there are various reasons why complex systems crash, and it wasn’t obvious which one it was. Our normal procedure is to restart the server and restart all the IMAP slots, replicate any remaining changes off the machine, then fail over to replicas while the search indexes catch up.

Unfortunately, things were worse than that. It started up with significant corruption on the operating system partition, and one of the 15 IMAP partitions failed to mount. At this point he force-failed-over (meaning moving the IMAP server master to its New York replica without doing a full sync first) and went back to bed to get some sleep and deal with things in the morning.

Initial cleanup

I was away during the first couple of days of this, on school camp with my daughter in one of the few places in the world where you can’t even get phone reception, but I have notes from the others and log files to go on.

On Thursday morning, there were a number of support tickets from people who were missing a few days of recent email. A quick look at the logs showed that one of the 11 master slots (of the 15 on this server, two were empty and two were replicas) had not been replicating for a couple of days.

The cause of this was an edge case issue with our replication system. If there is a loop in mailbox renames (i.e. folder ‘A’ was renamed to ‘B’ at the same time that folder ‘B’ was renamed to ‘A’, or a more complex circle of names), then the replication system bails out and tells us it needs to be fixed.
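For illustration, a loop like this can be detected by following each rename until it either leaves the set of renamed folders or comes back to where it started. This is a sketch under assumptions, not how Cyrus does it: the pending renames are modelled as a plain old-name to new-name mapping, and `find_rename_loop` is a hypothetical name.

```python
def find_rename_loop(renames):
    """Return the folder names forming a rename cycle, or None.

    renames maps old folder name -> new folder name.
    """
    for start in renames:
        path = [start]
        current = renames[start]
        # Follow the chain until it leaves the mapping or revisits a name.
        while current in renames and current not in path:
            path.append(current)
            current = renames[current]
        if current == start:
            return path
    return None


print(find_rename_loop({"A": "B", "B": "A"}))
print(find_rename_loop({"A": "B", "B": "C"}))
```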

Normally I would notice this, and the others knew about the possibility as well, but over the previous few weeks I had been moving slots around in Iceland as part of our efforts to split entirely from the Opera network while retaining full offsite redundancy. There had been a lot of cyrus noise in the notification channel, and nobody picked up that this noise pointed to a more significant issue.

As mentioned above, this single issue in the working log file means that no new events can be processed. Interestingly, changes to other mailboxes in the same file might get replicated, because it will re-visit them over and over, but other newly-changed mailboxes will not be noticed.

To make things worse, the failed channel was also the “new user server”. Every week a task runs, calculates which store is least loaded, and sets a database field to direct newly created users to that store. So there were brand new users, and some of them had used our import facility to bring large amounts of email in.


The usual fix in this sort of case is to re-run the log files, even though the slot is now a replica. Because we keep a sha1 of the content of each message, the replication system can detect different files at the different ends. It can then fix up mailboxes by re-injecting BOTH messages with new UIDs, higher than any yet used, and then expunging the original UID at both ends. This means that no matter which one a client saw first, it now gets a repaired view of the mailbox with both messages in it. I wrote a long justification for this logic, and why we do it, which you can read at

We also use the same method in reconstruct to repair from a damaged mailbox on disk, but I’m about to re-visit that based on the below.

So, for messages within an existing mailbox, almost all were restored by this, except where things were broken for other reasons.
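The re-injection fix described above can be sketched like this. It is a simplified model, not the Cyrus implementation: each copy of a mailbox is a dict of UID to sha1 (GUID), the next UID is taken as one more than the highest UID present on either side (a real server tracks the last UID ever assigned), and `repair_uid_conflicts` is a hypothetical name.

```python
def repair_uid_conflicts(master, replica):
    """Resolve UIDs that point at different messages on the two copies.

    master and replica map uid -> guid and are modified in place.
    Returns the list of UIDs that were in conflict.
    """
    next_uid = max(list(master) + list(replica), default=0) + 1
    conflicts = sorted(
        uid for uid in master
        if uid in replica and master[uid] != replica[uid]
    )
    for uid in conflicts:
        # Expunge the original UID at both ends, then re-inject BOTH
        # messages at both ends under fresh, higher UIDs.
        for guid in (master.pop(uid), replica.pop(uid)):
            master[next_uid] = guid
            replica[next_uid] = guid
            next_uid += 1
    return conflicts


master = {1: "aaa", 2: "bbb"}
replica = {1: "aaa", 2: "ccc"}
print(repair_uid_conflicts(master, replica))
print(master == replica, master)
```

Because the conflicting UID is never reused for a different message, a client that cached either version just sees an expunge followed by two new arrivals.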

Bogus reconstructs

Because sync wasn’t working, the team in Melbourne started running reconstructs to repair broken folders. In theory, this was a great idea – it’s what I would have recommended.

In practice though, the data on the machine was badly broken in subtle ways, and this just made things worse. Looking through the log files, I see numerous cases where reconstruct determined that the stored sha1 of the message was not the same as the sha1 of the file on disk, and so injected the message again with a new UID. This should never happen unless the file is actually corrupted. Looking at the files on disk now, their sha1 is actually the old value, not the new one that it calculated during the reconstruct.

So the machine either has a faulty CPU, faulty memory, dodgy RAID card, intermittently faulty disk… whatever it is, it’s pretty horrible! We will be able to test it more once we’re happy that all the salvageable data is copied off.

Due to the faulty hardware, the attempts to make things better actually made things worse!

The worst cases were where reconstruct decided that a folder didn’t exist, and wiped it from the mailboxes database. The replication engine then wiped the copy on the replica. Likewise, where mailboxes didn’t exist at the other end due to not being replicated yet (see above), they got wiped from the original copy. These are the cases where, for 49 users, we can see that some emails were lost entirely.

Thankfully, repeated checksum mismatches caused replication to bail out in many places, saving us from a lot worse pain.

Time to cut our losses

On the morning of Friday 28th when I got back into phone range, pulled out my laptop on the bus home and inspected the damage, sync was still bailing out everywhere.

First thing I said was “turn off all the replication and reconstructs – let’s stop making things worse”.

I had an empty machine (imap14) which I had just finished clearing out in New York (see above about all the noise – lots of reconfiguring to get everything into consistent teraslots). I configured new replica pairs on imap14 for all the slots that used to be paired with imap21. It took about 4 days to get everything replicated (moving that much data around, including sanity checking and building new search databases, takes time), but we’re now fully replicated again, with no fewer than 3 slots for each store, one of them offsite.

I left imap21 up, but with nothing talking to it or trusting it any more.

Examining the mess

The first job was to determine the extent of the damage. Thankfully, the log partition on imap21 wasn’t damaged, so we had backup copies of all the log files from everywhere on our log server by the time I got home. These aren’t the ‘sync log’ files from above; they’re the general syslog used to monitor and audit the everyday operations of the system. We keep them around for a few months to help track down the history of problems.

The first step was to identify what had happened, and locate any messages that hadn’t been replicated. We syslog important actions within Cyrus, in a consistent format with the label ‘auditlog:’.

Here’s an example of a recent message appended to my mailbox and then moved to Trash:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

(The sessionid field can be used to track the exact login time, IP address, etc – and for deliveries it can link through the logs to find the server which sent it to us originally).

It was appended to my INBOX via LMTP, then appended to my Trash folder and expunged from my Inbox. The ‘guid’ field is the sha1 of the underlying message.
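Because the format is consistent, lines like these are easy to pick apart mechanically. The following is a minimal sketch of a parser for the lines shown above; the regular expressions and the `parse_auditlog` name are illustrative assumptions based only on those sample lines, not our actual tooling.

```python
import re

LINE_RE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) (?P<slot>[^/\s]+)/(?P<program>\w+)"
    r"\[(?P<pid>\d+)\]: auditlog: (?P<action>\w+) (?P<fields>.*)$"
)
FIELD_RE = re.compile(r"(\w+)=<([^>]*)>")


def parse_auditlog(line):
    """Return a dict describing one auditlog line, or None if it isn't one."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # key=<value> pairs: sessionid, mailbox, uniqueid, uid, guid, cid
    event = dict(FIELD_RE.findall(match.group("fields")))
    for key in ("time", "host", "slot", "program", "action"):
        event[key] = match.group(key)
    return event


sample = (
    "2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: "
    "auditlog: append sessionid=<sloti20t12-23349-1394435464-1> "
    "mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> "
    "guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>"
)
event = parse_auditlog(sample)
print(event["program"], event["action"], event["mailbox"], event["uid"])
```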

Over on my New York replica slot:

2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

And of course, the same in Iceland:

2014-03-10T03:11:06.127098-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.575200-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: expunge sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.833070-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>

Anyway, we found something else interesting in the log files. For nearly 24 hours before the hard crash, there had been random transient failures in the replication logs. It would attempt to replicate a change, get a checksum error, try to replicate the exact same change, and the second time it would work! The machine was obviously intermittently failing well before it finally crashed.

We also found that two of the other master slots had encountered replication errors before the hard crash. The errors had caused replication to repeatedly bail out and they were also falling behind – this was enough noise that anyone would have noticed something was wrong – but it started about 2am our time, and everyone was asleep.

Based on user tickets and the state of the machine before the crash, we knew that there was still email missing. It was time to try to identify it. We determined the following algorithm to find messages that needed to be recovered.

  1. If there was an ‘append’ on imap21 via something that was NOT sync_client or syncserver, but no matching append on the replica with the same guid, that’s bad.
  2. If there was an ‘expunge’ on the replica by a sync program that can’t be matched to an expunge by an imap or pop3 process, then that was a message which shouldn’t have been wiped (but see above about the re-UIDing process; if we find another append to the same folder with the same GUID in either case, then the lower-UID one isn’t interesting any more).
  3. If a folder was deleted by a sync tool without a matching intended delete, then we have a problem.
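Rules 1 and 2 can be sketched over already-parsed audit events. This is an illustration under assumptions rather than the scripts we ran: each event is a dict with `action`, `program`, `mailbox`, `uid` and `guid` keys, events are matched on mailbox plus guid (and uid for expunges), and the re-UID exception from rule 2 is left out for brevity. All names are hypothetical.

```python
SYNC_PROGRAMS = {"sync_client", "syncserver"}
USER_PROGRAMS = {"imap", "pop3"}


def unreplicated_appends(master_events, replica_events):
    """Rule 1: non-sync appends on the master with no matching replica append."""
    replicated = {
        (e["mailbox"], e["guid"])
        for e in replica_events
        if e["action"] == "append"
    }
    return [
        e for e in master_events
        if e["action"] == "append"
        and e["program"] not in SYNC_PROGRAMS
        and (e["mailbox"], e["guid"]) not in replicated
    ]


def unjustified_expunges(master_events, replica_events):
    """Rule 2: sync-driven replica expunges with no user expunge behind them."""
    intended = {
        (e["mailbox"], e["uid"], e["guid"])
        for e in master_events
        if e["action"] == "expunge" and e["program"] in USER_PROGRAMS
    }
    return [
        e for e in replica_events
        if e["action"] == "expunge"
        and e["program"] in SYNC_PROGRAMS
        and (e["mailbox"], e["uid"], e["guid"]) not in intended
    ]
```

Both functions only produce candidates for recovery; finding the actual message content is a separate step.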

Replication – the bad bits

I discovered, to my horror, that when replicating an expunge event from the master, the sync protocol ignored the sha1 of the message. It should either have checked that it was the same message file that was expunged on the replica as well, or just bailed out. This meant it would expunge a never-seen message from the replica if the same UID had been seen and then deleted on the master. Ouch.
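The missing check is small. As a sketch (again modelling a replica mailbox as a dict of UID to sha1, with the hypothetical name `apply_expunge`, and not the real sync protocol code), the replica should refuse to expunge a UID that holds a different message:

```python
def apply_expunge(replica, uid, guid):
    """Apply a replicated expunge only if the replica holds the same message.

    replica maps uid -> guid and is modified in place.
    Returns "expunged", "absent" or "mismatch".
    """
    if uid not in replica:
        return "absent"
    if replica[uid] != guid:
        # Same UID, different content: a split brain. Leave the message
        # alone and let the caller bail out or repair the mailbox.
        return "mismatch"
    del replica[uid]
    return "expunged"


replica = {10: "aaa", 11: "zzz"}
print(apply_expunge(replica, 10, "aaa"))  # same message on both sides
print(apply_expunge(replica, 11, "bbb"))  # never-seen message is kept
print(replica)
```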

This is also a flaw with folder deletes. When a deleted folder is replicated, it just removes the folder and message files from the disk on the replica. That’s wrong and should never be allowed. To ensure we can always recover accidentally deleted mail, any deleted folders are supposed to be moved to a special DELETED namespace. They’re only permanently deleted a week later by a cleanup process.

The end result of all this: some new folders created after the failover from imap21 were irreversibly deleted when we tried to sync over the missing messages from imap21.

Recovery of messages

In either case, an append that never synced or an expunge that wasn’t justifiable, the fix is the same. Find a file with the same sha1 in any of that user’s mailboxes, or their backup, or on disk on imap21, and that’s your message. We did that for nearly half a million emails in total, over half of those belonging to 3 users who had just imported a large amount of email from external services.

After that, we were left trying to recover a few folders which got wiped. The fix in that case is to recover from backup (hopefully most of the messages) and then search for any appends since the highest UID in the backup, and append those as well (unless a justifiable expunge can be found for them).

The great thing about the log file containing a GUID which is the sha1 is that you can determine with cryptographic certainty that you’ve found the right file. We pulled files from imap21 (we don’t have to care about corruption, because the sha1 will check it), files from the replicas, and even the backup server for the user to look for matching messages.
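The search itself is conceptually simple: hash every candidate file and keep the one whose sha1 equals the GUID from the log. Here is a minimal sketch assuming the candidates are plain files under some directories; `find_message_by_guid` and the directory layout are hypothetical, and a real run over this much mail would want an index rather than re-hashing files for every GUID.

```python
import hashlib
from pathlib import Path


def find_message_by_guid(guid, search_dirs):
    """Return the first file under search_dirs whose SHA-1 equals guid."""
    for root in search_dirs:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            # A corrupted copy simply fails to match and is skipped.
            if hashlib.sha1(path.read_bytes()).hexdigest() == guid:
                return path
    return None
```

Listing the most trusted source first (for example a healthy replica before the faulty machine) only affects which copy is returned, not its correctness, since any match has the right content.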

Some statistics

  • 370 affected users
  • 458471 found messages
  • 29837 lost messages (no sha1 file able to be found)
  • 32 lost folders (deleted, nothing in backup)
  • 49 users with SOMETHING lost

We have emailed all affected users with details of exactly what was lost and recovered for their account.

Making sure this does not happen again

We take our uptime and reliability very seriously. Losing anything is a major failure for us, and we are determined to take steps to ensure this situation cannot happen again.

Operational changes

  • We now page the on-duty admin if replication falls more than 5 minutes behind on all replicas.
  • We alert (non-emergency) if ANY replica falls more than 5 minutes behind. This alert is made to look different from the Cyrus notices that can happen during regular maintenance, so everybody will be aware that there is an issue.
  • We will no longer attempt to reconstruct or keep replication going if we suspect a faulty server (at least until replication is super-safe). We’ll be much quicker to just declare the machine faulty and recover messages independently.
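The two alerting rules above amount to an "all" versus "any" test over per-replica lag. A minimal sketch, assuming lag is available as seconds behind per replica (how it is measured is not shown, and `replication_alert` is a hypothetical name):

```python
THRESHOLD_SECONDS = 5 * 60


def replication_alert(lag_seconds_by_replica):
    """Return "page", "alert" or "none" for one store."""
    lags = list(lag_seconds_by_replica.values())
    if not lags:
        return "none"  # nothing to measure in this sketch
    if all(lag > THRESHOLD_SECONDS for lag in lags):
        return "page"   # every replica is behind: wake someone up
    if any(lag > THRESHOLD_SECONDS for lag in lags):
        return "alert"  # at least one replica is behind: non-emergency
    return "none"


print(replication_alert({"ny2": 12, "iceland": 4}))
print(replication_alert({"ny2": 12, "iceland": 900}))
print(replication_alert({"ny2": 4000, "iceland": 900}))
```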

Software changes

  • The immediate fix is to do the GUID test even if the message was expunged on the master. This avoids the nasty case of expunging an unseen message during split brain.
  • The larger fix is two-pronged. First make reconstruct a lot less likely to damage existing mailboxes in case of corruption. There’s no point re-injecting a corrupt message file, nobody wants a file full of random blocks off disk. If the sha1 doesn’t match, then abort and get the admin to fix the permission problems or check the disk first.
  • The second prong is changes to the replication system, so that it’s impossible to make the replica delete message files immediately via the replication protocol. The most you will be able to do is make it move them aside for later removal on the scheduled rotation (we run on Saturdays and delete things that have been marked for deletion for at least one week at that point – so between 1 and 2 weeks).
  • Finally, we want to integrate the ability to fetch files from a replica to repair a corrupt local copy. We already have a separate magic perl script that can do this, but it runs outside Cyrus, with its own ugly locking tricks to force Cyrus to accept what it’s doing. It would be much neater to have this integrated into reconstruct, so that it’s replication-aware.
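The "move aside, remove later" idea from the replication change can be sketched as a two-step process: record when something was marked for deletion, and have the weekly cleanup remove only what has been marked for at least a week. This is an illustration with hypothetical names, not the planned Cyrus change itself.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)


def purge_due(marked_at_by_name, now):
    """Return names marked for deletion at least RETENTION before now."""
    return sorted(
        name
        for name, marked_at in marked_at_by_name.items()
        if now - marked_at >= RETENTION
    )


marked = {
    "DELETED.user.example.Old": datetime(2014, 3, 8),
    "DELETED.user.example.Newer": datetime(2014, 3, 9),
}
# A cleanup run on Saturday 15 March 2014 removes only the first entry;
# the second waits for the following Saturday, almost two weeks after
# it was marked.
print(purge_due(marked, datetime(2014, 3, 15)))
```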


This has been a long saga (2 weeks from failure to restoring everything we could), and a learning experience for everyone involved. Our detailed logging and checksums meant we were able to recover the vast majority of messages affected by the corruption, but we are obviously unhappy that we lost anything, and we will be taking the steps outlined above to prevent this issue happening again.

We apologise to the 370 users who have been without some email for over a week while we took our time making sure we fully understood what had happened and what needed to be recovered. We’re especially sorry to the small number of people for whom we lost emails irretrievably. We have contacted these people individually.

We’re proud of our reliability track record at FastMail and we respect the trust that people place in us to store their email. We are working hard to restore and maintain that trust, both by being open about exactly what happened with this incident and by updating the system and processes we have in place to improve our resilience in the future.

Posted in Technical. Comments Off
