Over the past week we have changed how deletion works in our new modern interface. In this blog post I will explain what those changes were, and some actions we have taken to ensure no emails are accidentally lost.
This is a technical blog post, so it contains a moderate level of technical detail.
I will address how our backup and disaster recovery system works, and how we used it this week to recover emails which we suspected had been accidentally deleted.
Last week we rolled out the new conversations-enabled interface. However, we discovered we had under-estimated the impact of conversations on users' existing workflow.
In particular, many users did not realise that when they selected a single item in a folder, it represented the entire conversation (all related messages, including those the user had sent).
When they pressed 'Delete' with one or more conversations selected, it deleted all messages in those conversations, including messages in other folders. For example, deleting a conversation in Inbox could also delete messages from "Sent Items" and "Important - Keep".
We have altered the 'Delete' action to be safer in these ways:
Rather than leaving users to hunt for which emails were affected, we wrote a tool to data-mine our mail server logs. We log every create and delete of emails, along with enough data to identify which ones were "Delete to Trash". We can also identify whether the action came from an IMAP client or the web interface.
We found emails which could have been accidentally deleted using the following algorithm:
All the emails which matched these criteria were restored back into the folders they were originally deleted from, with a custom keyword added. This makes it easy for users to find them again. Every affected user has been emailed with instructions on how to identify the restored emails.
When you delete an email on the FastMail servers, it isn't immediately removed from disk, even if you manually expunge via IMAP. We do this:
So we actually batch up all deletes and run them once per week at the least busy time for our servers: Saturday night in the USA. It's the weekend everywhere in the world then.
We also never remove email files within one week after deletion, so that our "Restore" feature can work as advertised.
This is, of course, in addition to the safety provided by replication to an offsite datacentre, and daily backups to a different server running a different operating system.
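The one-week minimum retention described above can be sketched as a simple age check. This is only an illustration, not our actual cleanup code: the function name `eligible_for_cleanup` and the `MIN_RETENTION` constant are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption for illustration: files must have been deleted for at least
# seven days before the weekly cleanup job is allowed to remove them.
MIN_RETENTION = timedelta(days=7)


def eligible_for_cleanup(deleted_at, cleanup_run_at, min_retention=MIN_RETENTION):
    """Return True if a deleted email file is old enough to be removed."""
    return cleanup_run_at - deleted_at >= min_retention


# An email deleted on a Friday is skipped by that Saturday's cleanup run
# and only becomes eligible at the following Saturday's run.
deleted = datetime(2012, 11, 2, 12, 0)
first_run = datetime(2012, 11, 3, 23, 0)
second_run = datetime(2012, 11, 10, 23, 0)
print(eligible_for_cleanup(deleted, first_run))
print(eligible_for_cleanup(deleted, second_run))
```

With a weekly job, this means a deleted file stays on disk for somewhere between one and two weeks, depending on which day it was deleted.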
As soon as we realised we may have to restore emails, we disabled the automated weekend cleanup job, and started collecting data from our servers.
The problem is that it is hard to know that an email is not there unless you actively look for it. We could disable cleanup temporarily, but not forever. Our turnover is about 2% of total email volume per week, so the disks would fill up if we never deleted anything ever again.
We decided the safe way forwards was to undo every deletion which had even the slightest chance of being accidental.
That way, if no action is taken, a few extra emails sit on disk gathering dust. It's possible at any later time to discover them and clean them up. There is no requirement to act quickly.
We log every single time a message is added to or expunged from any folder on our backend servers. We collected an initial dataset of nearly 30 million "Delete to Trash" events from the log files.
The next step was identifying which of these were a single action involving more than one message from the same conversation. Every message was tagged with a session identifier and timestamp as well as the folder and IMAP "UID" which uniquely identifies it, but we were not logging the CID (conversation identifier). We do now, but that doesn't help with log lines from the past!
Finding the CID involved writing custom code to read the index file on disk (which still contains the deleted record) and extract the CID field for every deleted message.
Finally, of course, we had to process the logs for every single connection from the web servers over that time frame and find which deletes were related to each other. There's nothing in the log to show that two deletes came from the same command, so we applied a heuristic of "within 10 seconds" to account for the outside case of a busy server and large folders being processed.
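The "within 10 seconds" heuristic can be sketched roughly as follows. This is a simplified illustration rather than the tool we actually ran: the `DeleteEvent` record, its field names, and `find_suspect_deletes` are hypothetical, and it assumes the CID has already been recovered for each event.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteEvent:
    session: str      # session identifier from the log line
    timestamp: float  # seconds since the epoch
    folder: str
    uid: int          # IMAP UID within the folder
    cid: str          # conversation identifier, recovered separately


WINDOW_SECONDS = 10


def find_suspect_deletes(events, window=WINDOW_SECONDS):
    """Return deletes that share a session and CID with another delete
    logged at most `window` seconds before or after them."""
    by_key = defaultdict(list)
    for ev in events:
        by_key[(ev.session, ev.cid)].append(ev)

    suspects = []
    for group in by_key.values():
        group.sort(key=lambda ev: ev.timestamp)
        burst = [group[0]]
        for ev in group[1:]:
            # Each event is compared with the previous one, so a long run
            # of closely spaced deletes is treated as a single action.
            if ev.timestamp - burst[-1].timestamp <= window:
                burst.append(ev)
            else:
                if len(burst) > 1:
                    suspects.extend(burst)
                burst = [ev]
        if len(burst) > 1:
            suspects.extend(burst)
    return suspects
```

A lone delete of a single message never forms a burst, so it is left alone; only deletes that touched several messages of one conversation in quick succession are flagged.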
We use the Cyrus IMAP server. One of the utilities included is called 'unexpunge', and it can be used to recover deleted emails. This is different from our usual restore command, which extracts messages from various sources and appends them to a new temporary folder.
In this case we want to restore messages permanently, so unexpunge is the right tool... except that we want to tag every message with a keyword, and we want the tagging to be reliable. Finding the messages afterwards in order to tag them is messy. We chose to add a new feature to unexpunge, setting a user-defined keyword on each message as it is restored. It is robust, and there's no gap where messages appear without the keyword.
The chosen keyword is RESTORED-20121107. Our web interface already supports global keyword search with "flag:$name", so the email to users includes a pre-generated URL which will perform a global search on all that user's folders for messages which were restored.
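As a rough sketch, a pre-generated search link could be built like this. Only the keyword and the "flag:$name" search syntax come from the text above; the base URL and the `q` query parameter are placeholders invented for illustration, not our real URL scheme.

```python
from urllib.parse import urlencode

RESTORE_KEYWORD = "RESTORED-20121107"


def restored_search_url(base_url, keyword=RESTORE_KEYWORD):
    """Build a link that runs a global "flag:<keyword>" search.

    `base_url` and the `q` parameter name are hypothetical.
    """
    query = urlencode({"q": f"flag:{keyword}"})
    return f"{base_url}?{query}"


print(restored_search_url("https://mail.example.com/search"))
```

Percent-encoding the search string keeps the colon in "flag:" from being misread when the link is pasted into an email.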
Restores are in progress now. Once they are completed, thousands of users will have some messages restored. This is almost certain to include messages which were intended to be deleted, but we must err on the side of safety here.
We have built a very robust infrastructure because of our strong commitment to data safety. These restores are in line with this commitment. It is easier to delete unwanted messages again than to recover messages which no longer exist.