Classic interface moving to

Since we released our new interface over 3 years ago, the vast majority of users have moved to using it and love the improvements that it has brought.

In a small but regular number of cases, however, people accidentally log in to the old classic interface when they meant to log in to the new one. This tends to cause confusion and unnecessary support tickets.

To make it clearer, we’ve decided to separate the login screens for the current interface and the classic interface by putting them on separate domains. From Monday, 23 November 2015, you will only be able to log in to the regular interface at its own address; to log in to the classic interface, you will have to go to the separate classic login domain. The classic login domain is available immediately, so if you’re a classic interface user, we recommend you go there now, bookmark it and use it going forward.


An open source JMAP proxy, JavaScript library and webmail demo

Last December we announced the JMAP project, our effort to develop a new open protocol for mail, calendar and contact clients that’s faster and more powerful than the current standards. Since then, we’ve continued to refine the specification, and other companies have come on board to help build the future of email. Atmail is using JMAP to power their next-generation mobile apps. The next version of Roundcube, the world’s most popular open-source webmail, will be built on JMAP, with support from Kolab. Thunderbird has started looking at integrating support.

As we’re now reaching the stage where clients and servers are beginning to be developed, the need for an existing server (if you’re building a client) or client (if you’re building a server) to test against becomes vital. To help with this, we are proud to be open sourcing several new JMAP projects, all under the liberal MIT license.
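
If you haven’t read the spec yet, the short version is that a JMAP exchange is a batch of JSON method calls sent over HTTPS, each call a [name, arguments, client-id] triple. Under the draft spec as it stood at the time of writing, fetching your mailboxes looks roughly like this (treat the exact method and property names as illustrative rather than normative):

Request:
[["getMailboxes", {}, "#0"]]

Response:
[["mailboxes", {
    "state": "m123456",
    "list": [
        { "id": "mb1", "name": "Inbox", "unreadMessages": 4 },
        { "id": "mb2", "name": "Archive", "unreadMessages": 0 }
    ]
}, "#0"]]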

JMAP proxy

The proxy can sit in front of any IMAP server (although it works best with those supporting CONDSTORE), CardDAV server and CalDAV server, and provides an almost complete JMAP interface to them (JMAP authentication is not implemented yet, and there may be one or two other minor issues). The proxy is reasonably stable and suitable for testing new JMAP client implementations against.

There is a hosted version available to use at – log in with a FastMail, iCloud, Gmail or other IMAP test account to start playing with JMAP today! Alternatively, if you want to run it yourself, the code is available at

JMAP JavaScript client library

The JS client library is a full implementation of the JMAP mail, calendar and contacts model in JavaScript. It supports asynchronous local changes and is tolerant of network failure – actions will update the UI instantly, then synchronise changes back to the server when it can. It also has multi-level undo/redo support.

The library is the basis for the next-generation FastMail interface, and has a sole dependency on a subset of our Overture library.

JMAP demo webmail

So you can try the client library and proxy in action today, we built a simple – but in some ways quite sophisticated – webmail demo on top of the JS client library. Combined with the JMAP proxy, this provides the ability to try JMAP now, with a real account. The webmail demo does not support compose, but does support:

  • Drag-drop to move between folders
  • Instant actions (the UI updates immediately when you archive etc.)
  • Multi-level undo/redo (Ctrl-Shift-Z to redo)
  • Full mailbox access with no paging (and you can jump to an arbitrary point in the scroll)
  • Conversation threading
  • Simple keyboard shortcuts (j/k to move next/previous)
  • A simple calendar view

We do not intend to add further features to this demo ourselves, but we welcome others to fork it and develop it into a more fully fledged client. The webmail is installed on our hosted version of the proxy, so you can try it today with a real account.

The future

In addition to these new projects, we’re also working hard to build JMAP into the open-source Cyrus IMAP server, the bedrock of FastMail and many other email providers around the world. This will (probably) be the first fully production-ready implementation and will of course be what we run here at FastMail. We already have a version of our own web UI running on JMAP internally (via the proxy we have open sourced) and we’re excited by the extra speed and efficiency it gives us over our (already very fast) webmail today.

If you run your own email service and want to get involved with the JMAP project, you can find full details of the specification, implementation advice and a link to the code on the JMAP website. If you would like to get involved in finalising the spec, or have questions or comments about implementing JMAP, then please join the JMAP mailing list.


New secondary MX and nameserver IPs

We’re in the process of moving some services to our new datacentre in Amsterdam. I’ve just pointed our secondary MX and nameserver there, which means that the IP addresses of and have changed.

The new addresses will propagate over the next 24 hours, after which time the old servers in Iceland will be shut down.

Unless you’ve hard-coded the old addresses somewhere, you shouldn’t see any difference.


Increased user security, upgrading Diffie-Hellman parameters to 2048 bits

For non-technical users, the short version is that if you’re using a modern, up-to-date web browser, mobile device or mail client to access FastMail, then there’s nothing you need to do.

If you’re using old or unusual software to access or send via FastMail, you might be affected and should read on.

What’s happening?

On 30 March 2015 we will be increasing the size of the DH parameters for DHE ciphers to 2048 bits. This will cause connection problems for old software that cannot handle DH parameters greater than 1024 bits.

1024-bit RSA crypto has generally been regarded as insecure, and has been in the process of being phased out, for at least the last five years.

Breaking DH parameters is generally understood to require about the same amount of computation as breaking an RSA key of the same size. The recommendation is therefore to increase the size of DH parameters in step with the size of RSA keys.

If we don’t upgrade our crypto to 2048 bits for the general case, we’re compromising the security of all our users for a few that have old clients. We don’t consider that to be acceptable.
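
If you run your own servers and want to do the same, generating a 2048-bit DH group is a one-liner with OpenSSL (shown purely to illustrate the parameter-size change, not as a description of our deployment process):

$ openssl dhparam -out dhparam-2048.pem 2048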

Will this affect me?

The main software we’re aware of that will be affected is iOS 5 and Java 6 and 7 (which often means business software that sends through our authenticated SMTP service).

If you’re unsure if you’re affected, you can test right now by pointing your software at (web) or (everything else). These servers are using the new config that will be rolled out on the 30th. Note that you shouldn’t use these names permanently; this is a test service and does not have the same redundancy as the main FastMail services.

If you can access your mail as normal using these servers, then you have nothing to worry about.

If you can’t connect through the beta servers but can through the main servers, then it’s quite likely that you are affected and you will need to either upgrade or reconfigure your software, or switch to our “insecure” services at (web) or (everything else). Using the insecure service is not recommended, as it uses encryption that is known to be weak or broken.

Please note that we’re unable to help you upgrade or reconfigure your software, particularly for those Java business apps. You’ll need to contact your software vendor for that.
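
If you’d like to check from the command line what a server will negotiate, a reasonably recent OpenSSL (1.0.2 or later prints the ephemeral key details) can be forced to use a DHE cipher suite; the hostname below is just a placeholder:

$ openssl s_client -connect mail.example.com:993 -cipher DHE-RSA-AES256-SHA < /dev/null

Look for the “Server Temp Key:” line in the output, which reports the size of the DH key the server offered.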

Further reading

If you’d like to read more about perfect forward secrecy and DH param lengths, the following technical articles may be interesting to you:


Dec 21: File Storage

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 20th December saw us open source our core JavaScript library. The following post on 22nd December is the long-awaited beta release of our contact syncing solution.

Technical level: high

Why does an email service need a file storage system?

For many years, well before cloud file storage became an everyday thing, FastMail has had a file storage feature. Like many features, it started as a logical extension of what we were already doing. People were accessing their email anywhere via a web browser, so it would be really nice to be able to access important files anywhere as well. And there are obvious email integration points too: being able to save attachments from emails to somewhere more structured, or having files that you commonly want to attach to emails (e.g. your resume) stored at FastMail, rather than having to upload them again each time.

Lots of FastMail’s early features exist because they’re the kind of thing that we wanted for ourselves!
It also turns out that having a generic file storage system is useful for other features, as we discovered later.

A generic file service

The first implementation of our file storage system used a real filesystem on a large RAID array. To make the data available to our web servers, we used NFS. While in theory a nice and straightforward solution, in practice it all ended up being fairly horrible. We used the NFS server built into the Linux kernel at that time, and although it was supposed to be stable, that was not our experience at all. While all our other servers had many months of uptime, the file storage server running NFS would freeze or crash somewhere between every few days and every week. This was particularly surprising to us because we weren’t actually stressing it much; the workload wasn’t high compared to the IO that some file servers perform.

Having the NFS server go offline and people losing access to their files until it was rebooted was bad enough, but there was a much worse problem. Any process that tried to access a file on the NFS mount would freeze up until the server came back. Since the number of processes handling web requests was limited, all it took was a few hundred requests from users trying to access their file storage, and suddenly there were no processes left to handle other web requests, and ALL web requests would start failing, meaning no one was able to access the FastMail site at all. Not nice. We tried a combination of soft mounts and other flags, but couldn’t find a combination that was both consistently reliable and failure safe.

Apparently I have suppressed the memories — something to do with being woken by both young children AND broken servers, but Rob M remembers, and he says: “In one of those great moments of frustration at being woken up again by a crashed NFS server, Bron wanted to do a complete rewrite, and to use an entirely different way of storing the files. Instead of storing the file system structure in a real filesystem, we decided to use a database. However we didn’t want to store the actual file data in the database, that would result in a massive monolithic database with all user file data in it, not easy to manage. So the approach he came up with is a rather neat hybrid that has worked really well.” So there you go.

One of my first major projects at FastMail was this new file storage service. I was fresh from building data management tools for late-phase clinical trials (drugs, man) for Quintiles in New Jersey (that’s right, I moved from living in New Jersey and working on servers in Melbourne to living in Melbourne and working on servers in New York). I over-engineered our file storage system using many of the same ideas I had used for clinical data.

Interestingly, a lot of the work I’d been doing at Quintiles looked very similar in design to git, though it was years before git came out. Data addressed by digest (sha1), signatures on digests of lists of digests to provide integrity and authenticity checks over large blocks of data. That product doesn’t seem to exist any more though.

The original build of the file service was based on many of the same concepts: file version forking and merging (which was too complex and got scrapped), with a very clever cache hierarchy and invalidation scheme. Blobs (the file contents themselves) are stored in a key-value pool spread over multiple machines, with a push to multiple copies before the store is considered successful, and a background cleaning task that ensures they are spread everywhere and garbage collected when no longer referenced.

The blob storage system is very simple – we could definitely build or grab off the shelf something a lot faster and better these days, but it’s very robust, and that matters to us more than speed.

Interestingly enough, while the caching system was great when there was a lower volume of changes and slow database servers, it eventually became faster to remove a layer of caching entirely as our needs and technology changed.

Database backed nodes

The same basic architecture still exists today. The file storage is a giant database table in our central mysql database. Every entry is a “Node”, with a primary key called NodeId, and a “ParentNodeId”. Node number 1 is treated specially, and is of class ‘RootNode’. It’s the top of the tree.
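
As a rough sketch of the shape of that table in MySQL – anything here beyond NodeId, ParentNodeId, the node class and the IsDeleted marker (which comes up later in this post) is a guess for illustration, not our real schema:

CREATE TABLE Node (
    NodeId       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ParentNodeId BIGINT UNSIGNED NOT NULL,           -- NodeId 1 is the special RootNode
    Class        VARCHAR(32)     NOT NULL,           -- e.g. 'RootNode'
    Name         VARCHAR(255)    NOT NULL,
    IsDeleted    BIGINT UNSIGNED NOT NULL DEFAULT 0, -- NodeId of the top node of a delete, 0 if live
    ModTime      DATETIME        NOT NULL,
    PRIMARY KEY (NodeId),
    KEY          (ParentNodeId)
) ENGINE=InnoDB;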

Because there are hundreds of thousands of top level nodes (ParentNodeId == 1), it would be crazy to read the key ‘ND:1’ (node directories, parent 1) for normal operations. Instead, we fetch “UA:$UserId” which is the ACL for the user’s userid, and then walk the tree back up from each ACL which grants the user any access, building a tree that way.

For example:

$ DEBUG_VFS=1 vfs -u ls /
Fetching ND:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching N:504452
d---   504452 2005-09-14 04:31:35
d---  1394098 2006-01-20 00:44:04
d---        2 2005-09-13 07:46:04

Whereas if we’re inside an ACL path we walk the tree normally from that ACL (we still need to check the other ACLs to see if they also impact the data we’re looking at…):

$ DEBUG_VFS=1 vfs -u ls '~/websites'
Fetching ND:504452
Fetching N:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching UT:485617
darw  6548549 2007-03-27 06:03:19 cherubfest/
darw 39907741 2008-11-25 10:27:55 custom-ui/
darw 335168869 2014-10-19 23:51:38 talks/
Fetching NFZ:504465

This structure also allows us to keep deleted files for a week, because when a node is “deleted”, it’s not actually deleted – it just gets an integer field “IsDeleted” set to the NodeId of the top node being deleted.

Let’s try that too:

[brong@utility2 ~]$ vfs -u mkdir '~/test/blog'
[brong@utility2 ~]$ echo "hello world" > /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog' /tmp/hello.txt
Failed to write: Could not open / for writing: Is a directory 
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt'
hello world
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u rm '~/test/blog/hello.txt'
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u undel 387950177
restored 387950177 (hello.txt) as /
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt'
hello world

And the versioning:

[brong@utility2 ~]$ echo "hello world v2" > /tmp/hello.txt 
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' /tmp/hello.txt
Failed to write: Could not open / for writing: File exists
[brong@utility2 ~]$ vfs -u -f cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' 
hello world v2
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u undel 387950177 oldhello.txt
restored 387950177 (hello.txt) as /
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)

This means that if you delete a folder and all its contents, you can undelete it without also undeleting everything ELSE that had been deleted from it in earlier, separate actions – just by setting nodes whose IsDeleted field matches the top node back to zero again.

[brong@utility2 ~]$ vfs -u rm '~/test/blog/oldhello.txt'
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)
[brong@utility2 ~]$ vfs -u rm '~/test/blog'
[brong@utility2 ~]$ vfs -u lsdel '~/test/'
/ (deleted)
darw 387950021 2014-12-21 00:38:20 blog/
[brong@utility2 ~]$ vfs -u undel 387950021
restored 387950021 (blog) as /
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)

So only the file that was in the folder when it was deleted is now present in the restored copy.
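
Under the hood, the delete and undelete in that session boil down to something like the following (simplified to a single level of nesting, and using the hypothetical table sketched above – the real code walks arbitrary depth):

-- 'vfs rm ~/test/blog' stamps the folder (NodeId 387950021) and its live contents:
UPDATE Node SET IsDeleted = 387950021
 WHERE IsDeleted = 0
   AND (NodeId = 387950021 OR ParentNodeId = 387950021);

-- 'vfs undel 387950021' clears just that marker; oldhello.txt, deleted in an
-- earlier action and stamped with a different marker, stays deleted:
UPDATE Node SET IsDeleted = 0 WHERE IsDeleted = 387950021;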

That’s the node tree. Alongside this is “Properties”, which are used not only for storing the MIME types of files (you can edit these in the file storage) and the expansion state of directories in the left view on the Files screen, but also, via our WebDAV server, for setting any generic fields you like. There are ACLs and Locks which can also be used via DAV, and a Websites table which is used to manage our customer static-file web hosting.

The cache module can present any file as a regular Perl filehandle, including an append filehandle which already has the old content pre-loaded. Calling “close” on that filehandle returns a complete filestorage node on which other actions can be taken – and it does quota accounting on the fly. The whole system is object oriented and a joy to work with at the top level, despite all the weird bits underneath.

One thing – if you upload over the same filename again and again (webcams for example) we will only keep the most recent 30 copies, because we were finding that it slowed the directory listings too much otherwise.

File storage as a service

A long time ago, if you were in the middle of composing a draft, and you had uploaded a file, we stored that file in a temporary location on the web server’s filesystem. This was fine normally, but if we had to shut down one of our web servers, the session would move to another server, and the email would be sent without the attachment, because it couldn’t find the file.

We now keep uploaded files in a separate VFS directory for each user – and if you’re composing and we switch our web servers, you don’t even notice:

[brong@imap20 hm]$ vfs -u ls /
darw   504453 2014-10-18 16:13:19 files/
d-rw  6825201 2014-12-16 21:36:11 temp/

The home directory that you see in the Files screen is “/username.domain/files/” in the global filesystem. Your temporary files go into “/username.domain/temp/” with a special flag that says when they will be automatically deleted (default, 1 week from upload).

This has been very useful for other things too. We now upload imported calendar/addressbook files into VFS first via our general upload mechanism, and just use a VFS address for everything internally, meaning that the upload and the processing phase can be separate.

Calendar attachments are likewise done with an upload as a temporary file, and then a conversion into a long-lived URL later. Even on Mailr, which doesn’t have a file storage service for users, the file storage engine is used underneath for attachments and calendar.

Multiple ways to access

We use Perl as our backend language. We built both FTP and DAV servers, as well as our Files screen within the app. Another very valuable use of filestorage (indeed, one of the original reasons) is that you can copy attachments from emails into file storage, and attach from file storage into new emails. This is a great way to save having to download and upload giant attachments if you want to send them on to someone else.

As you can see from my example above, where my top level had 3 different directories, it’s possible to share file storage within a business. There’s a very powerful access control system to grant access only to individual folders, and separate read and write roles.

You can create websites, including password protected websites. We even have support for basic photo galleries.

Open Source?

A lot of the tooling on top of the VFS is based on open source modules, but hacked so much that they can’t really be shared easily back to the original authors. I have been more lax with this than I should have been, particularly with the DAV modules, where I took the Net::DAV::Server module and almost entirely rewrote it based on the RFC, including a ton of testing against the clients of the day and adding things like Apple’s quota support.

I don’t have time during this blog series to release anything nice, but there are no real secrets in the tree, so I’ve put up a tar file at which contains the current version of the files I’ve talked about in this post. Send me an email at the obvious address if you have comments or suggestions. The file itself is hosted on a FastMail static website!

A road not taken

All of this was built in 2004/2005. These days there are tons of “filesystem as a service” products out there. We could have built one – our VFS has some nice properties which would have made syncing with clients very viable – but our focus is on email, and we didn’t make the time to learn how to build good clients.

We actually have an experimental Linux FUSE module which can talk directly to our VFS, and I built some experimental Windows and Mac clients as well, but we never took it anywhere — though we did talk about the potential to pivot the company into the file storage business.

These days we have gone the other way, with Dropbox integration on the compose screen and hooks to add other services too. It makes sense to integrate with services which do what they are good at, and concentrate our efforts on being the best at our field, email.


Dec 12: FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMSs to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (if you haven’t heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something differently here – we don’t use its built-in replication system and instead opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason against it is the lack of redundancy. If your database catches on fire or, as happens more often, the oom-killer chooses to zap your database server because it’s usually the biggest memory hog on the machine, none of your applications can continue without the data they need, and your service is taken offline. By using multiple databases, when a single database server goes down, your applications still have the others to choose from and connect to.

Another reason against using a single database for your online service is degraded performance – as more and more applications connect and perform work, your database server’s load increases. Once the server can’t take the requested load any longer, you’re left with query timeouts and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm, reducing the work any single database server has to cope with while gaining a performance boost across the board for free.

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell you horror stories from the past as if you’re sitting around a camp fire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt, possibly taking you offline until every one of your database servers is back up and replicating happily again. We never wanted to be put in a situation like this, and so wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read from and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes with a very high cost – it is a nightmare to manage if a single master goes down.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers to be configured in a ring network topology. Every time an update occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on – much like participants playing pass-the-parcel. This works nicely and is in place in many companies. The nightmares begin, however, if a single database server goes down, breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network and the data on every database server begins to go stale.

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon, which serves out its local database updates, and a separate MySQL::Replication client for each master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and apply it locally. The advantage here is that when a database server goes down, only that single feed is broken. All other communication paths continue as normal, and query flow across the database replication network continues as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text file which is not atomically synced to disk. Once a slave dies and comes back online, its files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but the position file wasn’t, the slave will restart replication from an incorrect position and so will replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it’s not, since there is no binlogging of MySQL::Replication queries). Because position updates are done within the same transaction that the replicated queries are executed in, writes are completely atomic. Once a slave dies and comes back online, we are still in a consistent state, since either the transaction was committed or it will be rolled back. It’s a nice place to be in.
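
Conceptually, applying a single replicated event therefore looks something like this (table and column names are invented for illustration; this is not MySQL::Replication’s actual schema):

START TRANSACTION;

-- apply the query received from the master's feed
UPDATE SomeTable SET Quota = Quota + 100 WHERE UserId = 42;

-- record how far through that master's binlog we are, in the same transaction
UPDATE ReplicationPosition
   SET BinlogFile = 'mysql-bin.000123', BinlogPos = 4711
 WHERE MasterHost = 'db2.example.internal';

COMMIT;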

MySQL::Replication – multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.


Dec 4: Standalone Mail Servers

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We’ve written a lot about our slots/stores architecture before – so I’ll refer you to our documentation rather than rehashing the details here.

We have evolved over the years, and particularly during the Opera time, I had to resist the forces suggesting a “put all your storage on a SAN and your processing on compute nodes” design, or “why don’t you just virtualise it”, as if that’s a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: Berkeley University had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capability limits of their hardware that their mail architecture couldn’t handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, this meant every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google – search for “Berkeley Cyrus” and you’ll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade…

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won’t spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I’ve been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it’s worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been one of those things for us.

Early hardware – metadata on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas, we had backups, but it took a week to get everyone’s email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn’t have access to their old email. We didn’t want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks – 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! – massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way – standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box – I won’t talk any more about these, they’re all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.

We also added checksums to verify data integrity after comparisons between replicas detected a small rate of bitrot (on the order of 20 damaged emails per year across our entire system), along with tooling to allow the damage to be fixed by pulling the affected email back from either a replica or the backup system and restoring it into place.

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use them, but cyrus.cache was just too big for the SSDs we could afford. It’s also only used for search and some sort commands, but the architecture of Cyrus meant that you had to MMAP the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There’s another small directory which contains the global databases – mailboxes database, seen and subscription files for each user, sieve scripts, etc. It’s a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year concentrating on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which make it a robust QRESYNC/CONDSTORE compatible server (new standards which required stronger guarantees than the Cyrus 2.3 data structures could provide), and also meant that the cache wasn’t loaded until it was actually needed.

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera’s sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along: search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that almost all users get enough email in a week that their initial mailbox view can be generated from a week’s worth of email.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree; it’s not in upstream Cyrus. I plan to add it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD and some not.

We purchased larger SSDs (Intel DC3700 – 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.
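
The policy itself is just a couple of knobs. In Cyrus releases where archive partitions are available, it is expressed with imapd.conf options roughly like the following (a sketch based on the upstream documentation as I understand it; the exact names may differ in our patched tree):

# move messages off the SSD spool once they are a week old or larger than ~1Mb
archive_enabled: yes
archive_days: 7
# archive_maxsize is in kilobytes
archive_maxsize: 1024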

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called ‘t15’, because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a “teraslot”, as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 drives in a RAID6 for 16Tb of available space: 1Tb for the operating system and 15 x 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sdb1 917G 691G 227G 76% /mnt/i14t01
/dev/mapper/sdb2 917G 588G 329G 65% /mnt/i14t02
/dev/mapper/sdb3 917G 789G 129G 86% /mnt/i14t03
/dev/mapper/sdb4 917G 72M 917G 1% /mnt/i14t04
/dev/mapper/sdb5 917G 721G 197G 79% /mnt/i14t05
/dev/mapper/sdb6 917G 805G 112G 88% /mnt/i14t06
/dev/mapper/sdb7 917G 750G 168G 82% /mnt/i14t07
/dev/mapper/sdb8 917G 765G 152G 84% /mnt/i14t08
/dev/mapper/sdb9 917G 72M 917G 1% /mnt/i14t09
/dev/mapper/sdb10 917G 800G 118G 88% /mnt/i14t10
/dev/mapper/sdb11 917G 755G 163G 83% /mnt/i14t11
/dev/mapper/sdb12 917G 778G 140G 85% /mnt/i14t12
/dev/mapper/sdb13 917G 789G 129G 87% /mnt/i14t13
/dev/mapper/sdb14 917G 783G 134G 86% /mnt/i14t14
/dev/mapper/sdb15 917G 745G 173G 82% /mnt/i14t15
/dev/mapper/sdc1 1.8T 977G 857G 54% /mnt/i14search
/dev/md0 367G 248G 120G 68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I’ll talk more about this when we get to the confidentiality post in the security series.
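
The /dev/mapper names in the listing above are simply what dm-crypt exposes once a LUKS volume has been opened; roughly like this (device and mount names are placeholders, not our actual provisioning scripts):

$ cryptsetup luksFormat /dev/sdb1        # one-off: create the LUKS container
$ cryptsetup luksOpen /dev/sdb1 sdb1     # unlocked volume appears as /dev/mapper/sdb1
$ mount /dev/mapper/sdb1 /mnt/i14t01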

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller, and 12 drives in the back connected directly to the motherboard SATA.

The front drives are divided into two lots of 2Tb x 12 drive RAID6 sets, for 20 teraslots each.

In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS – we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn’t so important, but cache certainly is!

Here’s a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:

servername: sloti33d1t01

configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve

duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db

partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive

tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive

The disks themselves – we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:

$ utils/
1 - HDD 500G RDY sdc 3QG023NC
2 - HDD 500G RDY sdd 3QG023TR
3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
4 - HDD 2T RDY sdf md2/0 WDWMAY04568236
5 - HDD 2T RDY sdg md3/0 WDWMAY04585688
6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
7 - HDD 2T RDY sdi md2/1 WDWMAY04606266
8 - HDD 2T RDY sdj md3/1 WDWMAY04567563
9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
10 - HDD 2T RDY sdl md2/2 WDWMAY00250279
11 - HDD 2T RDY sdm md3/2 WDWMAY04567237
12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
# Name Raid Name Level Capacity Ch/Id/Lun State
1 i33d1spool i33d1spool Raid6 20000.0GB 00/00/00 Normal
2 i33d2spool i33d2spool Raid6 20000.0GB 00/01/00 Normal
GuiErrMsg: Success.
$ utils/cli64 disk info
# Enc# Slot# ModelName Capacity Usage
1 01 Slot#1 N.A. 0.0GB N.A.
2 01 Slot#2 N.A. 0.0GB N.A.
3 01 Slot#3 N.A. 0.0GB N.A.
4 01 Slot#4 N.A. 0.0GB N.A.
5 01 Slot#5 N.A. 0.0GB N.A.
6 01 Slot#6 N.A. 0.0GB N.A.
7 01 Slot#7 N.A. 0.0GB N.A.
8 01 Slot#8 N.A. 0.0GB N.A.
9 02 Slot 01 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
10 02 Slot 02 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
11 02 Slot 03 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d1spool
12 02 Slot 04 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
13 02 Slot 05 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
14 02 Slot 06 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
15 02 Slot 07 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
16 02 Slot 08 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
17 02 Slot 09 WDC WD2000F9YZ-09N20L0 2000.4GB i33d1spool
18 02 Slot 10 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
19 02 Slot 11 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
20 02 Slot 12 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
21 02 Slot 13 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
22 02 Slot 14 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
23 02 Slot 15 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
24 02 Slot 16 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d2spool
25 02 Slot 17 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
26 02 Slot 18 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
27 02 Slot 19 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
28 02 Slot 20 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
29 02 Slot 21 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
30 02 Slot 22 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
31 02 Slot 23 WDC WD2002FYPS-01U1B1 2000.4GB i33d2spool
32 02 Slot 24 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don’t have enough hardware!

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/md2 2.7T 977G 1.8T 36% /mnt/i33d1search
/dev/mapper/md3 2.7T 936G 1.8T 35% /mnt/i33d2search
/dev/mapper/sda1 917G 730G 188G 80% /mnt/i33d1t01
/dev/mapper/sda2 917G 805G 113G 88% /mnt/i33d1t02
/dev/mapper/sda3 917G 709G 208G 78% /mnt/i33d1t03
/dev/mapper/sda4 917G 684G 234G 75% /mnt/i33d1t04
/dev/mapper/sda5 917G 825G 92G 91% /mnt/i33d1t05
/dev/mapper/sda6 917G 722G 195G 79% /mnt/i33d1t06
/dev/mapper/sda7 917G 804G 113G 88% /mnt/i33d1t07
/dev/mapper/sda8 917G 788G 129G 86% /mnt/i33d1t08
/dev/mapper/sda9 917G 661G 257G 73% /mnt/i33d1t09
/dev/mapper/sda10 917G 799G 119G 88% /mnt/i33d1t10
/dev/mapper/sda11 917G 691G 227G 76% /mnt/i33d1t11
/dev/mapper/sda12 917G 755G 162G 83% /mnt/i33d1t12
/dev/mapper/sda13 917G 746G 172G 82% /mnt/i33d1t13
/dev/mapper/sda14 917G 802G 115G 88% /mnt/i33d1t14
/dev/mapper/sda15 917G 159G 759G 18% /mnt/i33d1t15
/dev/mapper/sda16 917G 72M 917G 1% /mnt/i33d1t16
/dev/mapper/sda17 917G 706G 211G 78% /mnt/i33d1t17
/dev/mapper/sda18 917G 72M 917G 1% /mnt/i33d1t18
/dev/mapper/sda19 917G 72M 917G 1% /mnt/i33d1t19
/dev/mapper/sda20 917G 72M 917G 1% /mnt/i33d1t20
/dev/mapper/sdb1 917G 740G 178G 81% /mnt/i33d2t01
/dev/mapper/sdb2 917G 772G 146G 85% /mnt/i33d2t02
/dev/mapper/sdb3 917G 797G 120G 87% /mnt/i33d2t03
/dev/mapper/sdb4 917G 762G 155G 84% /mnt/i33d2t04
/dev/mapper/sdb5 917G 730G 187G 80% /mnt/i33d2t05
/dev/mapper/sdb6 917G 803G 114G 88% /mnt/i33d2t06
/dev/mapper/sdb7 917G 806G 112G 88% /mnt/i33d2t07
/dev/mapper/sdb8 917G 786G 131G 86% /mnt/i33d2t08
/dev/mapper/sdb9 917G 663G 254G 73% /mnt/i33d2t09
/dev/mapper/sdb10 917G 776G 142G 85% /mnt/i33d2t10
/dev/mapper/sdb11 917G 743G 174G 82% /mnt/i33d2t11
/dev/mapper/sdb12 917G 750G 168G 82% /mnt/i33d2t12
/dev/mapper/sdb13 917G 743G 174G 82% /mnt/i33d2t13
/dev/mapper/sdb14 917G 196G 722G 22% /mnt/i33d2t14
/dev/mapper/sdb15 917G 477G 441G 52% /mnt/i33d2t15
/dev/mapper/sdb16 917G 539G 378G 59% /mnt/i33d2t16
/dev/mapper/sdb17 917G 72M 917G 1% /mnt/i33d2t17
/dev/mapper/sdb18 917G 72M 917G 1% /mnt/i33d2t18
/dev/mapper/sdb19 917G 72M 917G 1% /mnt/i33d2t19
/dev/mapper/sdb20 917G 72M 917G 1% /mnt/i33d2t20
/dev/md0 367G 301G 67G 82% /mnt/ssd33d1
/dev/md1 367G 300G 67G 82% /mnt/ssd33d2

Some more copy’n’paste:

$ free
total used free shared buffers cached
Mem: 198201540 197649396 552144 0 7084596 120032948
-/+ buffers/cache: 70531852 127669688
Swap: 2040248 1263264 776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size…

$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very low on CPU usage. We don’t need a super-powerful CPU to drive this box. The CPU load is always low, it’s mostly IO wait – so we just have a pair of 4 core CPUs.

The future?

We’re in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally “forever”. They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.
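
The frontend piece is essentially nginx’s standard mail proxy setup, where an auth_http endpoint decides which backend a given login should be sent to – which is where the lookup against that central database fits in. A minimal sketch (the auth endpoint is a placeholder, not our actual config):

mail {
    # hypothetical endpoint that queries the central database and answers
    # with Auth-Server/Auth-Port headers pointing at the right backend
    auth_http auth.internal:8080/mailauth;

    server {
        listen   143;
        protocol imap;
    }
    server {
        listen   110;
        protocol pop3;
    }
    server {
        listen   587;
        protocol smtp;
    }
}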

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn’t been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.

