Dec 7: Automated installation

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 6th was about how we authenticate users. The following post is on our rich text email editor.

Technical level: medium

Any server in our entire system can be reinstalled in under 10 minutes, without users being aware.

That was a very ambitious goal in 2004 when I started at FastMail. DevOps was in its infancy and the tools available to us weren’t very good yet – but the alternative of staying with what we had was not scalable. Every machine was hand-built by following a “script” – a set of instructions on our internal wiki – and hoping you got every step perfect. Every machine was slightly different.

We chose Debian Linux as the base system for the new automated installs, using FAI to install the machines from network boot to a full operating system with all our software in place.

Our machines are listed by name in a large central configuration file, which maps from the hardware addresses of the ethernet cards (easy to detect during startup, and globally unique) to a set of roles. The installation process uses those roles to decide what software to install.
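
To make the idea concrete, here’s a minimal sketch in Perl of what such a mapping looks like conceptually (the MAC addresses, hostnames and role names are invented – our real file has more fields than this):

# Hypothetical sketch of a hosts map keyed by ethernet MAC address.
# The installer looks up the booting machine's MAC and gets back a
# hostname plus the roles that drive package and config selection.
my %hosts = (
    '00:25:90:ab:cd:01' => { name => 'imap14', roles => [qw(cyrus backend)] },
    '00:25:90:ab:cd:02' => { name => 'web07',  roles => [qw(nginx compute)] },
    '00:25:90:ab:cd:03' => { name => 'db01',   roles => [qw(mysql)] },
);

sub roles_for_mac {
    my ($mac) = @_;
    my $host = $hosts{ lc $mac } or die "unknown machine: $mac\n";
    return ($host->{name}, @{ $host->{roles} });
}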

Aside: the three types of data

I am a believer in this taxonomy, which splits data into three different types for clarity of thinking:

  1. Own Output – creative effort you have produced yourself. In theory you can reproduce it, though anyone who has lost hours of work to a crash knows just how disheartening it can be to repeat yourself.
  2. Primary Copy of others’ creative output. Unreproducable. Lose this, it’s gone forever.
  3. Secondary Copy of anything. Cache. Can always be re-fetched.

There is a bit of blurring between categories in practice, particularly as you might find secondary copies in a disaster and get back primary data you thought was lost. But for planning, these categories are very valuable for deciding how to care for the data.

  • Own Output – stick it in version control. Always.

    Since the effort that goes into creating is so high compared to data storage cost, there’s no reason to discard anything, ever. Version control software is designed for precisely this purpose.

    The repository then becomes a Primary Copy, and we fall through to:

  • Primary Copy – back it up. Replicate it. Everything you can to ensure it is never lost. This stuff is gold.

    In FastMail’s case as an email host, it is other people’s precious memories. We store emails on RAIDed drives with battery-backed RAID units on every backend server, and each copy is replicated to two other servers, giving a total of 3 copies on RAID1 – 6 disks in total, each with a full copy of every message.

    One of those copies is in a datacentre a third of the distance around the world from the other two.

    On top of this, we run nightly backups to a completely separate format on a separate system.

  • Secondary Copy – disposable. Who cares. You can always get it again.

    Actually, we do keep backups of Debian package repositories for every package we use just in case we want to reinstall and the mirror is down. And we keep a local cache of the repository for fast reinstalls in each datacentre too.

It’s amazing how much stuff on a computer is just cache. For example, operating system installs. It is so frustrating, when installing a home computer, how intermingled the operating system and updates (all cache) become with your preference selections and personal data (own creative output or primary copy). You find yourself backing up a lot more than is strictly necessary.

Operating system is purely cache

We avoid the need to do full server backups at FastMail by never changing config files directly. All configuration goes in version-controlled template files. No ifs, no buts. It took a while to train ourselves with good habits here – reinstalling frequently and throwing out anything that wasn’t stored in version control until everyone got the hint.

The process of installing a machine is a netboot with FAI which wipes the system drive, installs the operating system, and then builds the config from git onto the system and reboots ready-to-go. This process is entirely repeatable, meaning the OS install and system partition is 100% disposable, on every machine.

If we were starting today, we would probably build on puppet or one of the other automation toolkits that didn’t exist or weren’t complete enough when I first built this. Right now we still use Makefiles and perl’s Template-Toolkit to generate the configuration files. You can run make diff on each configuration directory to see what’s different between a running machine and the new configuration, then make install to upgrade the config files and restart the related service. It works fine.
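
For the curious, the Template-Toolkit side is nothing magic. A minimal sketch (the template name and variables here are invented, not our real config):

#!/usr/bin/perl
use strict;
use warnings;
use Template;

# Render one config file from a version-controlled template plus
# per-host variables. Names below are illustrative only.
my $vars = {
    hostname   => 'web07',
    backend_ip => '10.202.80.1',
    ssl_cert   => '/etc/ssl/web07.pem',
};

my $tt = Template->new({ INCLUDE_PATH => 'templates' })
    or die Template->error;

# "make install" would then diff/copy build/nginx.conf into place and
# restart the related service.
$tt->process('nginx.conf.tt', $vars, 'build/nginx.conf')
    or die $tt->error;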

It doesn’t matter what exact toolkit is used to automate system installs, so long as it exists. It’s the same process regardless of whether we just want to reinstall to ensure a clean system, are recovering from a potential system compromise, are replacing failed hardware, or we have new machines to add to our cluster.

User data on separate filesystems

Most of our machines are totally stateless. They perform compute roles, generating web pages, scanning for spam, routing email. We don’t store any data on them except cached copies of frequently accessed files.

The places where user data is stored are:

  • Email storage backends (of course!)
  • File storage backends
  • Database servers
  • Backup servers
  • Outbound mail queue (this one is a bit of a special case – email can be held for hours because the receiving server is down, misconfigured, or temporarily blocking us. We use drbd between two machines for the outbound spool, because postfix doesn’t like it when the inode changes)

The reinstall leaves these partitions untouched. We create data partitions using either LUKS or the built-in encryption of our SSDs, and then create partitions with labels so they can be automatically mounted. All the data partitions are currently created with the ext4 filesystem, which we have found to be the most stable and reliable choice on Linux for our workload.
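
The automatic mounting is the simple part. Something like this sketch (which ignores the LUKS unlock step, and invents a label pattern and mountpoint convention to match the examples later in this series):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: find labelled data partitions and mount any that aren't
# mounted yet.
my $by_label = '/dev/disk/by-label';

open(my $mounts, '<', '/proc/mounts') or die "read /proc/mounts: $!\n";
my %mounted = map { (split ' ')[1] => 1 } <$mounts>;
close $mounts;

opendir(my $dh, $by_label) or die "read $by_label: $!\n";
for my $label (grep { /^i\d+\w*t\d+$/ } readdir $dh) {   # e.g. "i14t03"
    my $mountpoint = "/mnt/$label";
    next if $mounted{$mountpoint};                        # already mounted
    mkdir $mountpoint unless -d $mountpoint;
    system('mount', '-t', 'ext4', "$by_label/$label", $mountpoint) == 0
        or warn "mount of $label failed\n";
}
closedir $dh;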

All data is on multiple machines

We use different replication systems for different data. As mentioned in the Integrity post, we use an application-level replication system for email data so we can get strong integrity guarantees. We use a multi-master replication system for our MySQL database, which we will write about in this series as well. I’d love to write about the email backup protocol as well, but I’m not sure I’ll have time in this series! And the file storage backend is another protocol again.

The important thing is that every type of data is replicated over multiple machines, so with just a couple of minutes’ notice you can take a machine out of production and reinstall or perform maintenance on it (the slowest part of shutting down an IMAP server these days is copying the search databases from tmpfs to real disk so we don’t have to rebuild them after the reboot).

Our own work

We use the git version control system for all our own software. When I started at FastMail we used CVS, and we converted to Subversion and then finally to Git.

We have a reasonably complex workflow, involving per-host branches, per-user development branches, and a master branch where everything eventually winds up. The important thing is that nothing is considered “done” until it’s in git. Even for simple one-off tasks, we will write a script and archive it in git for later reference. The amount of code that a single person can write is so small these days compared to the size of disks that it makes sense to keep everything we ever do, just in case.

Dec 6: User authentication

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 5th was about the importance of data integrity. The following post on December 7th is about how we install servers.

Technical level: medium/high

Today we talk about the internals of our authentication system, which is how we decide that you are who you say you are and from there, figure out what you’re allowed to do.

On every server we have a service running called “saslperld”. As with many of our internal services, its name is now only loosely related to its function. saslperld is our authentication service, and exists to answer the question “can this user, with this password, access this service?”.

Making an authentication request

Each piece of server software we have has some way of calling out to an external service for authentication information. Each server tends to implement its own protocol for doing this, so saslperld implements a range of different methods to receive authentication questions.

The simplest of these is the original saslauthd interface, used by Cyrus. It has three fields – username, password, and service name – and returns a simple yes/no answer. These days it’s barely used, and only really in internal services, because it can’t really be extended, so we can’t pass in other interesting information about the authentication attempt (which I’ll talk about further down).

The real workhorse is the HTTP interface, used by nginx. Briefly, nginx is our frontend server through which all user web and mail connections go. It takes care of authentication before anything ever touches a backend server. Since it’s fundamentally a system for web requests, its callouts are all done over HTTP. That’s useful, because it means we can demonstrate an authentication handshake with simple command-line tools.

Here’s a successful authentication exchange to saslperld, using curl:

$ curl -i -H 'Auth-User: robn@fastmail.fm' -H 'Auth-Pass: ********' -H 'Auth-Protocol: imap' http://localhost:7777
HTTP/1.0 200 OK
Auth-Pass: ********
Auth-Port: 2143
Auth-Server: 10.202.80.1
Auth-Status: OK
Auth-User: robn@fastmail.fm

The response here is interesting. Because we can use any HTTP headers we like in the request and the response, we can return other useful information. In this case, `Auth-Status: OK` is the “yes, the user may login” response. The other stuff helps nginx decide where to proxy the user’s connection to. Auth-Server and Auth-Port are the location of the backend Cyrus server where my mail is currently stored. In a failover situation, these can be different. Auth-User and Auth-Pass are a username and password that will work for the login to the backend server. Auth-User will usually, but not always, be the same as the username that was logged in with (it can be different on a couple of domains that allow login with aliases). Armed with this information, nginx can set up the connection and then let the mail flow.

The failure case has a much simpler result:

HTTP/1.0 200 OK
Auth-Status: Incorrect username or password.
Auth-Wait: 3

Any status that isn’t “OK” is an error string to return to the user (if possible, not all protocols have a way to report this). Auth-Wait is a number of seconds to pause the connection before returning a result. That’s a throttling mechanism to help protect against password brute-forcing attacks.

If the user’s backend is currently down, we return:

HTTP/1.0 200 OK
Auth-Status: WAIT
Auth-Wait: 1

This tells nginx to wait one second (“Auth-Wait: 1”) and then retry the authentication. This allows the blocking saslperld daemon to answer other requests, while not returning a response to the user. This is what we use when doing a controlled failover between backend servers, so there is no user-visible downtime even though we shut down the first backend and then force replication to complete before allowing access to the other backend.

This is a simple example. In reality we pass some additional headers in, including the remote IP address, whether the connection is on an SSL port or not, and so on. This information contributes to the authentication result. For example, if we’ve blocked a particular IP, then we will always return “auth failed”, even if the user could have logged in otherwise. There’s a lot of flexibility in this. We also do some rate tracking and limiting based on the IP address, to protect against misbehaving clients and other things. This is all handled by another service called “ratetrack” (finally, something named correctly!) which all saslperlds communicate with. We won’t talk about that any more today.

There’s a couple of other methods available to make an authentication request, but they’re quite specialised and not as powerful as the HTTP method. We won’t talk about those because they’re really not that interesting.
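
To give a feel for how small the nginx side of this is, here’s a rough sketch of an auth responder in Perl. It is not our actual saslperld – check_password() and find_backend() are stand-ins for all the real logic described below – but the header handling really is this simple:

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Sketch only: answer nginx auth_http callouts on port 7777.
my $listener = IO::Socket::INET->new(
    LocalPort => 7777, Listen => 10, Reuse => 1) or die $!;

while (my $client = $listener->accept) {
    # Read the request headers nginx sends with the callout.
    my %hdr;
    while (my $line = <$client>) {
        last if $line =~ /^\r?\n$/;                     # end of headers
        $hdr{lc $1} = $2 if $line =~ /^([\w-]+):\s*(.*?)\r?$/;
    }
    my ($user, $pass, $proto) = @hdr{qw(auth-user auth-pass auth-protocol)};

    if (check_password($user, $pass, $proto)) {
        my ($ip, $port) = find_backend($user, $proto);
        print $client "HTTP/1.0 200 OK\r\n",
            "Auth-Status: OK\r\n",
            "Auth-Server: $ip\r\nAuth-Port: $port\r\n",
            "Auth-User: $user\r\nAuth-Pass: $pass\r\n\r\n";
    } else {
        print $client "HTTP/1.0 200 OK\r\n",
            "Auth-Status: Incorrect username or password.\r\n",
            "Auth-Wait: 3\r\n\r\n";
    }
    close $client;
}

sub check_password { return 0 }                  # stub for the sketch
sub find_backend   { return ('10.202.80.1', 2143) }   # stub for the sketch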

Checking the password

Once saslperld has received an authentication request, it first has to make sure that the correct password has been provided for the given username. That should be simple, but our alternate login system can make it quite involved.

The first test is the simplest – make sure the user exists! If it doesn’t, obviously authentication fails.

Next, we check the provided password against the user’s “master” password. There’s nothing unusual here, it’s just a regular password compare (we use the bcrypt function for our passwords). If it succeeds, which it does for most users that only have a single master password set, then the authentication succeeds.
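
The compare itself looks something like this sketch (using Crypt::Eksblowfish::Bcrypt for illustration – we’re not claiming it’s the exact module we use). The stored hash carries its own salt and cost, so verification is just a re-hash and compare:

use Crypt::Eksblowfish::Bcrypt qw(bcrypt);

# Re-hash the supplied password using the salt and cost embedded in the
# stored hash; if the result matches the stored hash, the password is right.
sub master_password_ok {
    my ($given_password, $stored_hash) = @_;
    return bcrypt($given_password, $stored_hash) eq $stored_hash;
}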

If it fails, we look at the alternate logins the user has configured. For each one available, we do the appropriate work for its type. For most of these the provided password is a base password plus some two-factor token. We check the base password, and then perform the appropriate check against the token, for example a Yubikey server check, or comparing against the list of generated one-time passwords, and so on. The SMS 1-hour token is particularly interesting – we add a code to our database, SMS the code to the user, and then fail the authentication. When the user then uses the code, we do a complete one-time-password check.

At this point if any of the user’s authentication methods have succeeded, we can move on to the next step. Otherwise, authentication has failed, and we report this back to the requesting service and it does whatever it does to report that to the user.

Authorising the user

At this point we’ve verified that the user is who they say they are. Now we need to find out if they’re allowed to have access to the service they asked for.

First, we do some basic sanity checking on the request. For example, if you’ve tried to do an IMAP login to something other than mail.messagingengine.com, or you try to do a non-SSL login to something that isn’t on the “insecure” service, then we’ll refuse the login with an appropriate error. These don’t tend to happen very often now that we have separate service IPs for most things, but the code is still there.

Next, we check if the user is allowed to login to the given service. Each user has a set of flags indicating which services they’re allowed to login to. We can set these flags on a case-by-case basis, usually in response to some support issue. If the user is not explicitly blocked in this way, we then check their service level to see if the requested service is allowed at that service level. A great example here is CalDAV, which is not available to Lite accounts. An attempt by a Lite user to do a CalDAV login will fail at this point. Finally, we make sure that the service is allowed according to the login type. This is how “restricted” logins are implemented – there’s a list of “allowed” services for restricted accounts, and the requested service has to be in that list.

(It’s here that we also send the “you’re not allowed to use that service” email).

Once we’ve confirmed that the user is allowed to access the requested service we do a rate limit check as mentioned above. If that passes, the user is allowed in, and we return success to the requesting service.

The rest

To make things fast, we cache user information and results quite aggressively, so we can decide what to do very quickly. The downside of this is that if you change your password, add an alternate login method or upgrade your account all the saslperlds need to know this and refresh their knowledge of you. This is done by the service that made the change (usually a web server or some internal tool) by sending a network broadcast with the username in it. saslperld is listening for this broadcast and drops its cache when it receives it.
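
The broadcast itself is tiny. A sketch of both ends (the port number and message format here are made up):

use strict;
use warnings;
use IO::Socket::INET;

# Sender side (e.g. the web server, after a password change):
sub broadcast_user_changed {
    my ($username) = @_;
    my $sock = IO::Socket::INET->new(
        Proto     => 'udp',
        PeerAddr  => '255.255.255.255',
        PeerPort  => 7778,
        Broadcast => 1,
    ) or die "broadcast socket: $!\n";
    $sock->send("user-changed $username");
}

# Receiver side (inside each saslperld): drop the cached entry for that user.
my %auth_cache;
my $listen = IO::Socket::INET->new(
    Proto => 'udp', LocalPort => 7778, Reuse => 1) or die $!;
my $msg;
while (defined $listen->recv($msg, 1024)) {
    delete $auth_cache{$1} if $msg =~ /^user-changed (\S+)$/;
}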

There’s not much else to say. It’s a single piece of our infrastructure that does a single task very well, with a lot of flexibility and power built in. We have a few other services in our infrastructure that could be described similarly. We’ll be writing more about some of those this month.


Dec 5: Security – Integrity

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 4th was about how we build our mail servers. The next post on December 6th is about how we authenticate users.

Technical level: medium

On Tuesday I started this series of posts on security with an overview of the elements of security: Confidentiality, Integrity and Availability.

Integrity is, in my opinion, the most important part of security when it comes to email, so I’m starting there.

I believe that email is your electronic memory. I spoke about this at Oslo University back in 2011, where I answered the question “is email dead” with the following points:

  • Compatibility
  • Unchangeable
  • Privacy
  • Business / Orders / Receipts

Email is built on standards. It’s the world’s most interoperable network.

Once you get an email, it’s your own immutable copy. Your own private immutable copy. It can’t be retracted or edited, all the sender can do is send you a new email asking you to disregard the last one. You never get a situation where you remember seeing something, but it doesn’t exist any more. Unless they link out to a website, then the content can disappear far too easily. Forget about diamonds, email is forever.

At that talk, I addressed the idea that social networks with their private-garden messaging systems would replace email. I use social networks – I organise to catch up with friends via Facebook. I even use it to find people to cover classes (my other job: teaching gym classes). But I wouldn’t use social networks for business receipts, or orders, or something I wanted to remember forever. Email is still the gold standard here (unless you have a fax machine).

Recently, I was trying to find an old conversation that a friend and I had via Facebook messages. We have a few from 2007, and then a gap through until 2010. Nothing from the years in between. That whole conversation is lost, because we didn’t copy it anywhere else, and Facebook didn’t keep it.

My email memory goes back to a little before I moved everything to FastMail – because I messed up and lost everything with a stupid mistake in 2002. I don’t expect to ever “forget” anything I’ve received since then.

Email is, by the design of both the POP3 and IMAP protocols, immutable at the message level. You are not allowed to change the contents of a message once it’s been seen by a client.

In the Cyrus mail server, we take this a step further, by storing the sha1 digest of every raw email message in the index file, and hence being able to detect any accidental corruption or malicious modification of the file on disk.
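
Our own integrity checkers do the equivalent of this sketch in various places (the spool path and digest below are illustrative; how Cyrus actually stores the digest is internal to the index format):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Recompute the sha1 of a message file on disk and compare it with the
# digest recorded at delivery time.
sub message_is_intact {
    my ($spool_file, $recorded_sha1) = @_;
    my $sha = Digest::SHA->new(1);      # SHA-1
    $sha->addfile($spool_file, 'b');    # raw bytes, no newline translation
    return lc $sha->hexdigest eq lc $recorded_sha1;
}

die "usage: $0 <message file> <expected sha1>\n" unless @ARGV == 2;
my ($file, $expected) = @ARGV;
warn "corruption detected in $file!\n"
    unless message_is_intact($file, $expected);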

Integrity at FastMail

This is where we really shine. We’re fanatical about data integrity. We only blog about the cases where things go wrong, such as the 2011 and 2014 incidents described below.

Even with the nasty bug in 2011, a single misplaced comma which caused all our disks to fill up super-fast and required a full index reconstruct, we didn’t lose anyone’s email, because we had enough sanity checks in place. It just took a while to rebuild indexes. We don’t corrupt indexes on full disk any more. In the 2014 incident we lost a handful of emails for 70 users because of a further bug in handling emails which were expunged at one end of a replica pair and not at the other. That bug is now fixed as well.

If you look at the change history on the Cyrus IMAPd server leading up to the 2.4 release, and even earlier as well, you’ll see us adding integrity checks at every level of the Cyrus data structures. You’ll also see patches to the replication system as we, over months, tracked down every case where it was incomplete or incorrect, and fixed them until our replicas were perfect. Our “checkreplication” script still runs weekly over all users looking for mismatches.

And then for 2.4 we rewrote Cyrus replication completely, to be more efficient so we could have replicas in another country – and to make the replicas also do integrity checks on the data coming over the wire, so you can never replicate a broken mailbox and break the other copy as well.

This is why we replicate at the application level rather than at the filesystem level using something like DRBD – because at the application level we have enough information to ensure that only consistent mailboxes are replicated.

Our backup system also does separate checks, using the same index record, but a completely different piece of code (written in Perl rather than C).

This wasn’t built in response to the idea that some attacker would come in and subtly change your emails (though it does provide some very strong protections against those attacks), it was written in response to risks like the faulty RAM we saw in early 2014, or bugs in the kernel silently corrupting files.

Our backup system is available to all our customers, self serve. Just click on a button and a restore will be run for you in the background. So even if you delete email by mistake, you have at least a week to get it back.

Interestingly, some choices reduce integrity in exchange for other things, for example storing all email on encrypted filesystems is a risk to data integrity – it’s much more likely that we will lose everything on a filesystem in face of a partial failure. Data recovery is less possible – so we’re relying more heavily on replicas and backups. The tradeoff here is that if we discard a failed disk, or one of our servers is accidentally sold off on ebay (hey, it’s happened) with user data still on it, then it won’t be readable. It’s a confidentiality vs integrity tradeoff that we are comfortable making.

Integrity and Hosting Jurisdictions

There’s not a huge risk of a government-sponsored or other well-funded attacker trying to modify your email on our servers. The chance of detection by our regular integrity checking systems is very high (and we can tell the difference between an email with 4096 bytes of garbage where a block was corrupted on disk and one with subtly changed wording), and the benefits are low.

As for the accidental corruption that we do sometimes see – it’s going to come down to dirty vs clean power, and environmental conditions. Temperature fluctuations, humidity, vibration – these are all risks to data integrity, and they are more about a specific datacentre than choice of country. We recently moved our servers to new racks in a cold-containment-aisle area inside our NYI datacentre, which will give consistent cooling up the full height of the rack. All servers have dual power supplies, on two separate circuits, and the power is well filtered by the time it reaches us.

Integrity and The Future

There is one more thing that I want to add to Cyrus to improve integrity even further. At the moment it is possible to fully delete a mailbox or a user on a Cyrus server, and have that delete replicate immediately. In future, I will make it so that it is not possible, even with a single Cyrus server compromised, to permanently delete anything from its replicas. Removing a user will have to be done explicitly on each copy.

I also want to extend the backup system to be something “standard”, at least within the Cyrus world, and open source for everybody. For now it’s quite specific to our systems. A standard interchange format for mailbox archives would make life better for everyone. I have some draft notes from a meeting with David Carter at Cambridge (the original author of the Cyrus replication code), but haven’t finished it yet.

And finally, I want to back up everything else about a user, to the point where it has the same integrity guarantees as the email. Often if someone has deleted their entire account by mistake or let it lapse, we can recover the email – but some of the database-backed items are lost forever.

This will also allow mothballing accounts for cases like a poor fellow I answered a support request for recently. His father had Alzheimer’s Disease and forgot to renew his FastMail account. By the time the son realised that the account had been closed, all email history had been cleaned off our servers. By keeping full backups for a much longer time in the case of payment lapses rather than deliberate account closure, we could save people from losing email in these cases.

As I said at the start, your email is your memory. We take our job of keeping that memory intact very seriously.

aside: for those concerned about sha1 collision attacks, not only is there no known sha1 collision at all yet, it’s very hard to cause a collision by sending an email, because many headers are added between SMTP delivery and final injection into the mailbox, and they are hard to predict. Not impossible, which is why we’re working on a series of patches to include a random string in the Received header added by Cyrus.

Finally, it’s possible to change the hash algorithm with a simple index upgrade, checking the old one against sha1 and then calculating a new hash. We’ve already done this once from md5 to sha1.


Dec 4: Standalone Mail Servers

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We’ve written a lot about our slots/stores architecture before – so I’ll refer you to our documentation rather than rehashing the details here.

We have evolved over the years, and particularly during the Opera time, I had to resist the forces suggesting a “put all your storage on a SAN and your processing on compute nodes” design, or “why don’t you just virtualise it”, as if that’s a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: Berkeley University had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capability limits of their hardware that their mail architecture couldn’t handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, this meant every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google – search for “Berkeley Cyrus” and you’ll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade…

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won’t spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I’ve been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it’s worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been one of those things for us.

Early hardware – meta data on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas, we had backups, but it took a week to get everyone’s email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn’t have access to their old email. We didn’t want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks – 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! – massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way – standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box – I won’t talk any more about these, they’re all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.

We also added checksums to verify data integrity after various corruptions were detected between replicas which showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) – and tooling to allow the damage to be fixed by pulling back the affected email from either a replica or the backup system and restoring it into place.

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use SSDs, but cyrus.cache was just too big for the SSDs we could afford. It’s also only used for search and some sort commands, but the architecture of Cyrus meant that you had to MMAP the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There’s another small directory which contains the global databases – mailboxes database, seen and subscription files for each user, sieve scripts, etc. It’s a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year concentrating on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which make it a robust QRESYNC/CONDSTORE-compatible server (new standards which required stronger guarantees than the Cyrus 2.3 data structures could provide), and which also mean that cache isn’t loaded until it’s actually needed.

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera’s sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along. Search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that almost all users get enough email in a week that their initial mailbox view can be generated from a week’s worth of email.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree, it’s not in upstream Cyrus. I plan to add it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD, and some not.

We purchased larger SSDs (Intel DC3700 – 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called ‘t15’, because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a “teraslot”, as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 drives in a RAID6 for 16Tb of available space: 1Tb for the operating system and 15 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sdb1 917G 691G 227G 76% /mnt/i14t01
/dev/mapper/sdb2 917G 588G 329G 65% /mnt/i14t02
/dev/mapper/sdb3 917G 789G 129G 86% /mnt/i14t03
/dev/mapper/sdb4 917G 72M 917G 1% /mnt/i14t04
/dev/mapper/sdb5 917G 721G 197G 79% /mnt/i14t05
/dev/mapper/sdb6 917G 805G 112G 88% /mnt/i14t06
/dev/mapper/sdb7 917G 750G 168G 82% /mnt/i14t07
/dev/mapper/sdb8 917G 765G 152G 84% /mnt/i14t08
/dev/mapper/sdb9 917G 72M 917G 1% /mnt/i14t09
/dev/mapper/sdb10 917G 800G 118G 88% /mnt/i14t10
/dev/mapper/sdb11 917G 755G 163G 83% /mnt/i14t11
/dev/mapper/sdb12 917G 778G 140G 85% /mnt/i14t12
/dev/mapper/sdb13 917G 789G 129G 87% /mnt/i14t13
/dev/mapper/sdb14 917G 783G 134G 86% /mnt/i14t14
/dev/mapper/sdb15 917G 745G 173G 82% /mnt/i14t15
/dev/mapper/sdc1 1.8T 977G 857G 54% /mnt/i14search
/dev/md0 367G 248G 120G 68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I’ll talk more about this when we get to the confidentiality post in the security series.

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller:

(photo: the 24 front drive bays)

And 12 drives in the back connected directly to the motherboard SATA:

(photo: the 12 rear drives)

The front drives are divided into two lots of 2Tb x 12 drive RAID6 sets, for 20 teraslots each.

In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS – we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn’t so important, but cache certainly is!

Here’s a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:


servername: sloti33d1t01

configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve

duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db

partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive

tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive

The disks themselves – we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:

$ utils/saslist.pl
1 - HDD 500G RDY sdc 3QG023NC
2 - HDD 500G RDY sdd 3QG023TR
3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
4 - HDD 2T RDY sdf md2/0 WDWMAY04568236
5 - HDD 2T RDY sdg md3/0 WDWMAY04585688
6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
7 - HDD 2T RDY sdi md2/1 WDWMAY04606266
8 - HDD 2T RDY sdj md3/1 WDWMAY04567563
9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
10 - HDD 2T RDY sdl md2/2 WDWMAY00250279
11 - HDD 2T RDY sdm md3/2 WDWMAY04567237
12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN
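
saslist.pl itself isn’t anything fancy. A sketch of the kind of /sys poking involved (simplified, and leaving out the RAID membership and state columns):

#!/usr/bin/perl
use strict;
use warnings;

# Report device name, model and rough size for each SATA-attached disk.
for my $dev (sort glob '/sys/block/sd*') {
    my ($name)  = $dev =~ m{/(sd\w+)$};
    my $model   = read_file("$dev/device/model");
    my $sectors = read_file("$dev/size");        # 512-byte sectors
    my $gb      = int($sectors * 512 / 1_000_000_000);
    printf "%-4s %-20s %5d GB\n", $name, $model, $gb;
}

sub read_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or return '';
    chomp(my $line = <$fh> // '');
    return $line;
}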

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
# Name Raid Name Level Capacity Ch/Id/Lun State
===============================================================================
1 i33d1spool i33d1spool Raid6 20000.0GB 00/00/00 Normal
2 i33d2spool i33d2spool Raid6 20000.0GB 00/01/00 Normal
===============================================================================
GuiErrMsg: Success.
$ utils/cli64 disk info
# Enc# Slot# ModelName Capacity Usage
===============================================================================
1 01 Slot#1 N.A. 0.0GB N.A.
2 01 Slot#2 N.A. 0.0GB N.A.
3 01 Slot#3 N.A. 0.0GB N.A.
4 01 Slot#4 N.A. 0.0GB N.A.
5 01 Slot#5 N.A. 0.0GB N.A.
6 01 Slot#6 N.A. 0.0GB N.A.
7 01 Slot#7 N.A. 0.0GB N.A.
8 01 Slot#8 N.A. 0.0GB N.A.
9 02 Slot 01 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
10 02 Slot 02 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
11 02 Slot 03 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d1spool
12 02 Slot 04 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
13 02 Slot 05 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
14 02 Slot 06 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
15 02 Slot 07 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
16 02 Slot 08 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
17 02 Slot 09 WDC WD2000F9YZ-09N20L0 2000.4GB i33d1spool
18 02 Slot 10 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
19 02 Slot 11 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
20 02 Slot 12 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
21 02 Slot 13 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
22 02 Slot 14 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
23 02 Slot 15 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
24 02 Slot 16 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d2spool
25 02 Slot 17 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
26 02 Slot 18 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
27 02 Slot 19 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
28 02 Slot 20 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
29 02 Slot 21 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
30 02 Slot 22 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
31 02 Slot 23 WDC WD2002FYPS-01U1B1 2000.4GB i33d2spool
32 02 Slot 24 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
===============================================================================
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don’t have enough hardware!


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/md2 2.7T 977G 1.8T 36% /mnt/i33d1search
/dev/mapper/md3 2.7T 936G 1.8T 35% /mnt/i33d2search
/dev/mapper/sda1 917G 730G 188G 80% /mnt/i33d1t01
/dev/mapper/sda2 917G 805G 113G 88% /mnt/i33d1t02
/dev/mapper/sda3 917G 709G 208G 78% /mnt/i33d1t03
/dev/mapper/sda4 917G 684G 234G 75% /mnt/i33d1t04
/dev/mapper/sda5 917G 825G 92G 91% /mnt/i33d1t05
/dev/mapper/sda6 917G 722G 195G 79% /mnt/i33d1t06
/dev/mapper/sda7 917G 804G 113G 88% /mnt/i33d1t07
/dev/mapper/sda8 917G 788G 129G 86% /mnt/i33d1t08
/dev/mapper/sda9 917G 661G 257G 73% /mnt/i33d1t09
/dev/mapper/sda10 917G 799G 119G 88% /mnt/i33d1t10
/dev/mapper/sda11 917G 691G 227G 76% /mnt/i33d1t11
/dev/mapper/sda12 917G 755G 162G 83% /mnt/i33d1t12
/dev/mapper/sda13 917G 746G 172G 82% /mnt/i33d1t13
/dev/mapper/sda14 917G 802G 115G 88% /mnt/i33d1t14
/dev/mapper/sda15 917G 159G 759G 18% /mnt/i33d1t15
/dev/mapper/sda16 917G 72M 917G 1% /mnt/i33d1t16
/dev/mapper/sda17 917G 706G 211G 78% /mnt/i33d1t17
/dev/mapper/sda18 917G 72M 917G 1% /mnt/i33d1t18
/dev/mapper/sda19 917G 72M 917G 1% /mnt/i33d1t19
/dev/mapper/sda20 917G 72M 917G 1% /mnt/i33d1t20
/dev/mapper/sdb1 917G 740G 178G 81% /mnt/i33d2t01
/dev/mapper/sdb2 917G 772G 146G 85% /mnt/i33d2t02
/dev/mapper/sdb3 917G 797G 120G 87% /mnt/i33d2t03
/dev/mapper/sdb4 917G 762G 155G 84% /mnt/i33d2t04
/dev/mapper/sdb5 917G 730G 187G 80% /mnt/i33d2t05
/dev/mapper/sdb6 917G 803G 114G 88% /mnt/i33d2t06
/dev/mapper/sdb7 917G 806G 112G 88% /mnt/i33d2t07
/dev/mapper/sdb8 917G 786G 131G 86% /mnt/i33d2t08
/dev/mapper/sdb9 917G 663G 254G 73% /mnt/i33d2t09
/dev/mapper/sdb10 917G 776G 142G 85% /mnt/i33d2t10
/dev/mapper/sdb11 917G 743G 174G 82% /mnt/i33d2t11
/dev/mapper/sdb12 917G 750G 168G 82% /mnt/i33d2t12
/dev/mapper/sdb13 917G 743G 174G 82% /mnt/i33d2t13
/dev/mapper/sdb14 917G 196G 722G 22% /mnt/i33d2t14
/dev/mapper/sdb15 917G 477G 441G 52% /mnt/i33d2t15
/dev/mapper/sdb16 917G 539G 378G 59% /mnt/i33d2t16
/dev/mapper/sdb17 917G 72M 917G 1% /mnt/i33d2t17
/dev/mapper/sdb18 917G 72M 917G 1% /mnt/i33d2t18
/dev/mapper/sdb19 917G 72M 917G 1% /mnt/i33d2t19
/dev/mapper/sdb20 917G 72M 917G 1% /mnt/i33d2t20
/dev/md0 367G 301G 67G 82% /mnt/ssd33d1
/dev/md1 367G 300G 67G 82% /mnt/ssd33d2

Some more copy’n’paste:


$ free
total used free shared buffers cached
Mem: 198201540 197649396 552144 0 7084596 120032948
-/+ buffers/cache: 70531852 127669688
Swap: 2040248 1263264 776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size…


$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very low on CPU usage. We don’t need a super-powerful CPU to drive this box. The CPU load is always low, it’s mostly IO wait – so we just have a pair of 4 core CPUs.

The future?

We’re in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally “forever”. They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn’t been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.


Dec 3: Push it real good

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 2nd was an intro to our approach to security. The next post on December 4th is about our IMAP server hardware.

Technical level: lots

Bron (one of our team, you’ve heard of him) does this great demo when he shows FastMail to other people. He opens his inbox up on the screen, and gets someone to send him an email. A nice animation slides the new message into view on the screen. At the same time, his phone plays a sound. He uses his watch to delete the email, and it disappears from the screen too. It’s a huge “wow” moment, made possible by our push notification system. Today we’ll talk about exactly how we let you know when something interesting happens in your mailbox.

Cyrus has two mechanisms for telling the world that something has changed: idled and mboxevent. idled is the simpler of the two. When your mail client issues an IDLE command, it is saying “put me to sleep and tell me when something changes”. idled is the server component that manages this, holding the connection open and sending a response when something changes. An example protocol exchange looks something like this (taken from RFC 2177, the relevant protocol spec):

C: A001 SELECT INBOX
S: * FLAGS (Deleted Seen)
S: * 3 EXISTS
S: * 0 RECENT
S: * OK [UIDVALIDITY 1]
S: A001 OK SELECT completed
C: A002 IDLE
S: + idling
...time passes; new mail arrives...
S: * 4 EXISTS
C: DONE
S: A002 OK IDLE terminated

It’s a fairly simple mechanism, only designed for use with IMAP. We’ll say no more about it.

Of far more interest is Cyrus’ “mboxevent” mechanism, which is based in part on RFC 5423. Cyrus can be configured to send events to another program any time something changes in a mailbox. The event contains details about the type of action that occurred, identifying information about the message and other useful information. Cyrus generates events for pretty much everything – every user action, data change, and other interesting things like calendar alarms. For example, here’s a delivery event for a system notification message I received a few minutes ago:

{
 "event" : "MessageNew",
 "messages" : 1068,
 "modseq" : 1777287, 
 "pid" : 2087223,
 "serverFQDN" : "sloti30t01",
 "service" : "lmtp",
 "uidnext" : 40818,
 "uri" : "imap://robn@fastmail.fm@sloti30t01/INBOX;UIDVALIDITY=1335827579/;UID=40817",
 "vnd.cmu.envelope" : "(\"Wed, 03 Dec 2014 08:49:40 +1100\" \"Re: Blog day 3: Push it real good\" ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((NIL NIL \"staff\" \"fastmail.fm\")) NIL NIL \"<1417521960.1705207.197773721.1B9AB1E1.3158739225@webmail.messagingengine.com>\" \"<1417556980.2079156.198025625.704CB1C1@webmail.messagingengine.com>\")",
 "vnd.cmu.mailboxACL" : "robn@fastmail.fm\tlrswipkxtecdn\tadmin\tlrswipkxtecdan\tanyone\tp\t",
 "vnd.cmu.mbtype" : "",
 "vnd.cmu.unseenMessages" : 90,
 "vnd.fastmail.cid" : "b9384d25e93fc71c",
 "vnd.fastmail.convExists" : 413,
 "vnd.fastmail.convUnseen" : 82,
 "vnd.fastmail.counters" : "0 1777287 1777287 1761500 1758760 1416223082",
 "vnd.fastmail.sessionId" : "sloti30t01-2087223-1417556981-1"
}

The event contains all sorts of information: the action that happened, the message involved, how many messages I have, the count of unread messages and conversations, the folder involved, stuff about the message headers, and more. This information isn’t particularly useful in its raw form, but we can use it to do all kinds of interesting things.

We have a pair of programs that do various processing on the events that Cyrus produces. They are called “pusher” and “notifyd”, and run on every server where Cyrus runs. Their names don’t quite convey their purpose as they’ve grown up over time into their current forms.

pusher is the program that receives all events coming from Cyrus. It actions many events itself, but only the ones it can handle “fast”. All other events are handed off to notifyd, which handles the “slow” events. The line between the two is a little fuzzy, but the rule of thumb is that if an event needs to access the central database or send mail, it’s “slow” and should be handled by notifyd. Sending stuff to open connections or making simple HTTP calls are “fast”, and are handled in pusher. I’ve got scare quotes around “slow” and “fast” because the slow events aren’t actually slow. It’s more about how well each program responds when flooded with events (for programmers: pusher is non-blocking, notifyd can block).
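
In outline, the dispatch in pusher looks something like this sketch (the event classification and the helper functions are placeholders, not our real code):

#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Events that need the database or outbound mail go to notifyd ("slow");
# everything else is pushed immediately ("fast"). Names are illustrative.
my %slow_event = map { $_ => 1 } qw(CalendarAlarm SieveNotify);

sub handle_event {
    my ($json) = @_;
    my $event = decode_json($json);

    if ($slow_event{ $event->{event} }) {
        send_to_notifyd($json);          # slow path: may block
        return;
    }

    push_to_eventsource($event);         # fast path: open web sessions
    push_to_devices($event) if $event->{event} =~ /^Message/;   # APNS/GCM
}

# Stubs so the sketch runs standalone.
sub send_to_notifyd     { warn "-> notifyd\n" }
sub push_to_eventsource { warn "-> eventsource\n" }
sub push_to_devices     { warn "-> device push\n" }

# For the sketch, read one JSON event per line on stdin.
while (my $line = <STDIN>) {
    handle_event($line);
}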

We’ll talk about notifyd first, because it’s the simpler of the two. The two main styles of events it handles are calendar email notifications and Sieve (filter/rule) notifications.

Calendar email notifications are what you get when you say, for example, “10 minutes before this event, send me an email”. All the information that is needed to generate an email is placed into the event that comes from Cyrus, including the name of the event, the start and end time, the attendees, location, and so on. notifyd constructs an email and sends it to the recipient. It does database work to look up the recipient’s language settings to try and localise the email it sends. Here we see database and email work, and so it goes on the slow path.

The Sieve filter system we use has a notification mechanism (also see RFC 5435) where you can write rules that cause new emails or SMS messages to be sent. notifyd handles these too, taking the appropriate action on the email (summarise, squeeze) and then sending an email or posting to a SMS provider. For SMS, it needs to fetch the recipient’s number and SMS credit from the database so again, it’s on the slow path.

On to pusher, which is where the really interesting stuff happens. The two main outputs it has are the EventSource facility used by the web client, and the device push support used by the Android and iOS apps.

EventSource is a facility available in most modern web browsers that allows the browser to create a long-lived connection to a server and receive a constant stream of events. The events we send are very simple. For the above event, pusher would send the following to all web sessions I currently have open:

event: push
id: 1760138
data: {"mailModSeq":1777287,"calendarModSeq":1761500,"contactsModSeq":1758760}

These are the various “modseq” (modification sequence, see also RFC 7162) numbers for mail, calendar and contacts portions of my mail store. The basic idea behind a modseq is that every time something changes, the modseq number goes up. If the client sees the number change, it knows that it needs to request an update from the server. By sending the old modseq number in this request, it receives only the changes that have happened since then, making this a very efficient operation.

If you’re interested, we’ve written about our how we use EventSource in a lot more detail in this blog post from a couple of years ago. Some of the details have changed since then, but the ideas are still the same.

The other thing that pusher handles is pushing updates to the mobile apps. The basic idea is the same. When you log in to one of the apps, they obtain a device token from the device’s push service (Apple Push Notification Service (APNS) for iOS or Google Cloud Messaging (GCM) for Android), and then make a special call to our servers to register that token with pusher. When the inbox changes in some way, a push event is created and sent along with the device token to the push service (Apple’s or Google’s, depending on the token type).

On iOS, a new message event must contain the actual text that is displayed in the notification panel, so pusher extracts that information from the “vnd.cmu.envelope” parameter in the event it received from Cyrus. It also includes an unread count, which is used to update the red “badge” on the app icon, and a URL which is passed to the app when the notification is tapped. An example APNS push event might look like:

{
  "aps" : {
    "alert" : "Robert Norris\nHoliday pics",
    "badge" : 82,
    "sound" : "default"
  },
  "url" : "?u=12345678#/mail/Inbox/eb26398990c4b29b-f45463209u40683
}

For other inbox changes, like deleting a message, we send a much simpler event to update the badge:

{
  "aps" : {
    "badge" : 81
  }
}

The Android push operates a little differently. On Android it’s possible to have a service running in the background. So instead of sending message details in the push, we only send the user ID.

{
  "data" : {
    "uparam" : "12345678"
  }
}

(The Android app actually only uses the user ID to avoid a particular bug, but it will be useful in the future when we support multiple accounts).

On receiving this, the background service makes a server call to get any new messages since the last time it checked. It’s not unlike what the web client does, but simpler. If it finds new messages, it constructs a system notification and displays it in the notification panel. If it sees a message has been deleted and it currently has it visible in the notification, it removes it. It also adds a couple of buttons (archive and delete) which result in server actions being taken.

So that’s all the individual moving parts. If you put them all together, then you get some really impressive results. When Bron uses the “delete” notification action on his watch (an extension of the phone notification system), it causes the app to send a delete instruction to the server. Cyrus deletes the message and sends a “MessageDelete” event to pusher. pusher sends a modseq update via EventSource to the web clients which respond by requesting an update from the server, noting the message is deleted and removing it from the message list. pusher also notices this is an inbox-related change, so sends a new push to any registered Android devices and, because it’s not a “MessageNew” event, sends a badge update to registered iOS devices.

One of the things I find most interesting about all of this is that in a lot of ways it wasn’t actually planned, but has evolved over time. notifyd is over ten years old and existed just to support Sieve notifications. Then pusher came along when we started the current web client and it needed live updates. Calendar notifications came later and most recently, device push. It’s really nice having an easy, obvious place to work with mailbox events. I fully expect that we’ll extend these tools further in the future to support new kinds of realtime updates and notifications.


Dec 2: Security – Confidentiality, Integrity and Availability

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 1st was about the Email Search System. The next post on December 3rd is all about how we do real-time push notifications.

Technical level: low

This is the first of a series of blog posts on security, both FastMail’s approach to various threats, and how the location of our servers interacts with security risks. We’re not digging into the technical details yet, just looking at an overview of what security means.

I always recommend that people read this humorous paper by James Mickens at Microsoft Research (pdf). There are a ton of security in-jokes there, but he makes some really good points.

Another great place to learn more about security best practices is Bruce Schneier’s blog. He’s been thinking about this stuff for a long time, and is one of the world’s acknowledged experts on computer security.

Security consists of three things: Confidentiality, Integrity and Availability. There’s a good writeup on wikipedia and also a fairly good post on blog overflow – except that it falls for the trap of defining integrity as only protecting information from being modified by unauthorized parties.

Honestly, the biggest “security risk” to data integrity in the history of email has been the unreliable hard drives in people’s home computers dying, and all the email downloaded by POP3 over the years being lost or corrupted badly in a single screeching head-crash. For us, the biggest integrity risk is hardware or disk failures corrupting data, and I’ll write more about some of the corruption cases we’ve dealt with as well.

We care about all three security components at FastMail, and work to strike a sensible balance between them. There’s a joke that to perfectly secure a server you need to encase it in concrete deep under ground, and then cut off the power and network cables. It’s funny because there’s a hint of truth.

To be useful, a server has to be online. And that server is running imperfect software on imperfect hardware, which may have even been covertly modified (not just by the NSA either – anyone with a big enough budget and no regard for the law can pull off something like that).

Thankfully, the same security processes and architectures that defend against system failures are also good for protecting against active attackers. We follow best practices like running separate physical networks for internal traffic, restrictive firewalls that only allow expected traffic into our servers, following security announcement mailing lists for all our software, and only choosing software with a good security record.

That’s the baseline of good security. In the following blogs, we will look at some of the specific things that FastMail does to protect our systems and our users’ data.


Dec 1: Email Search System

This blog post is part of the FastMail 2014 Advent Calendar.

The next post on December 2nd is Security – Confidentiality, Integrity and Availability.

Technical level: medium

Our email search system was originally written by Greg Banks, who has moved on to a role at LinkedIn, so I maintain it now. It’s a custom extension to the Cyrus IMAPd mail server.

Fast search was a core required feature for our new web interface. My work account has over half a million emails in it, and even though our interface allows fast scroll, it’s still impossible to find anything more than a few weeks old without either knowing exactly when it was sent, or having a powerful search facility.

Greg tried a few different engines, and settled on the Xapian project as the best fit for the one-database-per-user that we wanted.

We tried indexing new emails as they arrived, even directly to fast SSDs, and discovered that the load was just too high. Our servers were overloaded trying to index in time – because adding a single email causes a lot of updates.

Luckily, Xapian supports searching from multiple databases at once, so we came up with the idea of a tiered database structure.

New messages get indexed to tmpfs in a small database. A job runs every hour to see if tmpfs is getting too full (over 50% of the defined size); if it is, it compacts immediately, otherwise we compact automatically during the quiet time of the day. Compacted databases are more efficient, but read-only.

This allows us to index all email immediately, and return a message that arrived just a second ago in your search results complete with highlighted search terms, yet not overload the servers. It also means that search data can be stored on inexpensive disks, keeping the costs of our accounts down.
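
The hourly check is about as simple as it sounds. A sketch (the user list helper is hypothetical, the exact tier choice is illustrative, and the squatter flags follow the same pattern as the real example further down):

#!/usr/bin/perl
use strict;
use warnings;

my $tmpfs = '/var/run/cyrus/search-sloti30t01';
my $conf  = '/etc/cyrus/imapd-sloti30t01.conf';

# How full is the tmpfs search partition? (df -P keeps the format stable)
my ($line) = grep { /\Q$tmpfs\E/ } `df -P $tmpfs`;
die "can't stat $tmpfs\n" unless defined $line;
my ($use_pct) = $line =~ /(\d+)%/;

if ($use_pct > 50) {
    # Over half full: compact the temp tier down now rather than
    # waiting for the quiet time of day.
    for my $user (users_on_this_slot()) {        # hypothetical helper
        system('/usr/cyrus/bin/squatter', '-C', $conf, '-v',
               '-z', 'data', '-t', 'temp,meta', '-u', $user);
    }
}

sub users_on_this_slot { return qw(brong) }      # stub for the sketch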

Technical level: extreme

Here’s some very technical information about how the tiers are implemented, and an example of running the compaction.

We have 4 tiers at FastMail, though we don’t actually use the ‘meta’ one (SSD) at the moment:

  • temp
  • meta
  • data
  • archive

The temp level is on tmpfs, purely in memory. Meta is on SSD, but we don’t use that except during shutdown. Data is the main version, and we re-compact all the data level indexes once per week. Finally archive is never automatically updated, but we build it when users are moved or renamed, or can create it manually.

Both external locking (Xapian isn’t always happy with multiple writers on one database) and the compaction logic are managed via a separate file called xapianactive. The xapianactive looks like this:

% cat /mnt/ssd30/sloti30t01/store23/conf/user/b/brong.xapianactive
temp:264 archive:2 data:37

The first item in the active file is always the writable index – all the others are read-only.

These map to paths on disk according to the config file:

% grep search /etc/cyrus/imapd-sloti30t01.conf
search_engine: xapian
search_index_headers: no
search_batchsize: 8192
defaultsearchtier: temp
tempsearchpartition-default: /var/run/cyrus/search-sloti30t01
metasearchpartition-default: /mnt/ssd30/sloti30t01/store23/search
datasearchpartition-default: /mnt/i30search/sloti30t01/store23/search
archivesearchpartition-default: /mnt/i30search/sloti30t01/store23/search-archive

(the ‘default tier’ is to tell the system where to create a new search item)

So based on these paths, we find:

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
3328 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264
1520432 /mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37
3365336 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2

I haven’t compacted to archive for a while. Let’s watch one of those. I’m selecting all the tiers, and compressing to a single tier. The process is as follows:

  1. take an exclusive lock on the xapianactive file
  2. insert a new default tier database on the front (in this example it will be temp:265) and unlock xapianactive again
  3. start compacting all the selected databases to a single database on the given tier
  4. take an exclusive lock on the xapianactive file again
  5. if the xapianactive file has changed, discard all our work (we lock against this, but it’s a sanity check) and exit
  6. replace all the source databases for the compact with a reference to the destination database and unlock xapianactive again
  7. delete all now-unused databases

Note that the xapianactive file is only locked for two VERY SHORT times. All the rest of the time, the compact runs in parallel, and both searching on the read-only source databases and indexing to the new temp database can continue.
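
The real implementation is C (imap/search_xapian.c), but the shape of steps 1 and 2 is easy to show in a few lines of Perl (paths hypothetical, error handling trimmed):

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

my $active = '/mnt/ssd30/sloti30t01/store23/conf/user/b/brong.xapianactive';

open(my $fh, '+<', $active) or die "open $active: $!\n";
flock($fh, LOCK_EX) or die "flock: $!\n";           # step 1: exclusive lock

my @tiers = split ' ', (<$fh> // '');               # e.g. temp:264 archive:2 data:37
my ($gen) = ($tiers[0] // 'temp:0') =~ /:(\d+)$/;
unshift @tiers, 'temp:' . ($gen + 1);               # step 2: new writable tier in front

seek($fh, 0, 0);
truncate($fh, 0);
print $fh join(' ', @tiers), "\n";

flock($fh, LOCK_UN);                                # unlock; the compaction then
close $fh;                                          # runs in parallel, unlocked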

This allows us to only ever have a single thread compacting to disk, so our search drives are mostly idle, and able to serve customer search requests very quickly.

When holding an exclusive xapianactive lock, it’s always safe to delete any databases which aren’t mentioned in the file – at worst you will race against another task which is also deleting the same databases, so this system is self-cleaning after any failures.

Here goes:

% time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti30t01.conf -v -z archive -t temp,meta,data,archive -u brong
compressing temp:264,archive:2,data:37 to archive:3 for user.brong (active temp:264,archive:2,data:37)
adding new initial search location temp:265
compacting databases
Compressing messages for brong
done /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3.NEW
renaming tempdir into place
finished compact of user.brong (active temp:265,archive:3)

real 4m52.285s
user 2m29.348s
sys 0m13.948s

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
368 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.265
du: cannot access `/mnt/i30search/sloti30t01/store23/search/b/user/brong/*': No such file or directory
4614368 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3

If you want to look at the code, it’s all open source. I push the fastmail branch to github regularly. The xapianactive code is in imap/search_xapian.c and the C++ wrapper in imap/xapian_wrap.cpp.
