Dec 12: FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMSes to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (if you haven’t heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something differently here – we don’t use its built-in replication system; instead we opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason against this is the lack of redundancy. If your database catches fire, or (as happens more often) the oom-killer chooses to zap your database server because it’s usually the biggest memory hog on a machine, none of your applications can continue without the data they need, and your service is taken offline. By using multiple databases, when a single database server goes down, your applications still have the others to choose from and connect to.

Another reason against using a single database for your online service is degraded performance – as more and more applications connect and perform work, your database server’s load increases. Once a server can’t take the requested load any longer, you’re left with query timeouts and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm, reducing the work any single database server has to cope with while gaining a performance boost across the board.
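To make that concrete, here’s a minimal sketch (not FastMail’s actual code; the hostnames and credentials are made up) of an application picking a healthy server from a pool, so reads and writes are spread across the farm and a downed server is simply skipped:

use DBI;
use List::Util qw(shuffle);

# hypothetical pool of database servers
my @pool = qw(db1.internal db2.internal db3.internal);

# try the servers in random order and return the first connection that works
sub connect_to_any_db {
    for my $host (shuffle @pool) {
        my $dbh = DBI->connect("dbi:mysql:database=app;host=$host",
                               'app_user', 'app_pass',
                               { RaiseError => 0, PrintError => 0 });
        return $dbh if $dbh;
    }
    die "no database servers available\n";
}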

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell you horror stories from the past as if you’re sitting around a camp fire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt and possibly take you offline until every one of your database servers is back up and replicating happily again. We never wanted to be put in a situation like this, and so wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read from and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes with a very high cost – it is a nightmare to manage if a single master is downed.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers to be configured in a ring network topology. Every time an update occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on – much like participants playing pass-the-parcel. This works nicely and is in place in many companies. The nightmares begin, however, if a single database server is downed, thus breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network and the data on every database server begins to go stale.

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon which serves out its local database updates, and also runs a separate MySQL::Replication client for each master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and apply it locally. The advantage here is that when a database server is downed, only that single feed is broken. All other communication paths continue as normal, and query flow across the database replication network continues as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text file called relay-log.info, which is not atomically synced to disk. Once a slave dies and comes back online, these files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but relay-log.info wasn’t, the slave will restart replication from an incorrect position and replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it’s not, since there is no binlogging of MySQL::Replication queries). Because position updates are done within the same transaction that replicated queries are executed in, writes are completely atomic. If a slave dies and comes back online, we are still in a consistent state, since the transaction was either committed or it will be rolled back. It’s a nice place to be in.
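As a rough illustration of the idea (a sketch, not the MySQL::Replication source; the table and column names are invented), applying a replicated query and recording the new position inside one transaction looks like this:

use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb;host=localhost',
                       'repl_user', 'repl_pass',
                       { RaiseError => 1, AutoCommit => 0 });

# apply a replicated query and record the new position in the SAME transaction,
# so a crash leaves us either fully before the query or fully after it
sub apply_replicated_query {
    my ($master_id, $query, $binlog_file, $binlog_pos) = @_;
    eval {
        $dbh->do($query);
        $dbh->do('UPDATE replication_position
                     SET binlog_file = ?, binlog_pos = ?
                   WHERE master_id = ?',
                 undef, $binlog_file, $binlog_pos, $master_id);
        $dbh->commit;
    };
    if ($@) {
        $dbh->rollback;
        die "replication apply failed for master $master_id: $@";
    }
}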

MySQL::Replication – multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.

Dec 4: Standalone Mail Servers

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We’ve written a lot about our slots/stores architecture before – so I’ll refer you to our documentation rather than rehashing the details here.

We have evolved over the years, and particularly during the Opera years, I had to resist the forces suggesting a “put all your storage on a SAN and your processing on compute nodes” design, or “why don’t you just virtualise it?”, as if that’s a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: UC Berkeley had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capacity limits of their hardware that their mail architecture couldn’t handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google – search for “Berkeley Cyrus” and you’ll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade…

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won’t spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I’ve been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it’s worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been one of those things for us.

Early hardware – meta data on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas, we had backups, but it took a week to get everyone’s email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn’t have access to their old email. We didn’t want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks – 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! – massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way – standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box. I won’t talk any more about these; they’re all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.

We also added checksums to verify data integrity after various corruptions were detected between replicas which showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) – and tooling to allow the damage to be fixed by pulling back the affected email from either a replica or the backup system and restoring it into place.

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use SSDs, but cyrus.cache was just too big for the SSDs we could afford. It’s also only used for search and some sort commands, but the architecture of Cyrus meant that you had to mmap the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There’s another small directory which contains the global databases – mailboxes database, seen and subscription files for each user, sieve scripts, etc. It’s a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year concentrating on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which actually make it a robust QRESYNC/CONDSTORE compatible server (new standards which required stronger guarantees than the Cyrus 2.3 data structures could provide), and also meant that cache wasn’t loaded until it was actually needed.

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera’s sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along: search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We figured that almost all users get enough email in a week that their initial mailbox view can be generated from a week’s worth of email.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree; it’s not in upstream Cyrus. I plan to upstream it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD, and some not.

We purchased larger SSDs (Intel DC3700 – 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.
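The policy itself is simple. In spirit it boils down to a predicate like this (a sketch only, using the thresholds above; the message structure is invented):

use constant ARCHIVE_SIZE_BYTES => 1 * 1024 * 1024;     # bigger than 1Mb
use constant ARCHIVE_AGE_SECS   => 7 * 24 * 60 * 60;    # older than 7 days

# should this message be moved from the SSD spool to the slow archive spool?
# $msg is assumed to carry its size in bytes and arrival time as an epoch
sub should_archive {
    my ($msg) = @_;
    return 1 if $msg->{size} > ARCHIVE_SIZE_BYTES;
    return 1 if time() - $msg->{internaldate} > ARCHIVE_AGE_SECS;
    return 0;
}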

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called ‘t15’, because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a “teraslot”, as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 drives in a RAID6 giving 16Tb of available space: 1Tb for the operating system and 15 x 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sdb1 917G 691G 227G 76% /mnt/i14t01
/dev/mapper/sdb2 917G 588G 329G 65% /mnt/i14t02
/dev/mapper/sdb3 917G 789G 129G 86% /mnt/i14t03
/dev/mapper/sdb4 917G 72M 917G 1% /mnt/i14t04
/dev/mapper/sdb5 917G 721G 197G 79% /mnt/i14t05
/dev/mapper/sdb6 917G 805G 112G 88% /mnt/i14t06
/dev/mapper/sdb7 917G 750G 168G 82% /mnt/i14t07
/dev/mapper/sdb8 917G 765G 152G 84% /mnt/i14t08
/dev/mapper/sdb9 917G 72M 917G 1% /mnt/i14t09
/dev/mapper/sdb10 917G 800G 118G 88% /mnt/i14t10
/dev/mapper/sdb11 917G 755G 163G 83% /mnt/i14t11
/dev/mapper/sdb12 917G 778G 140G 85% /mnt/i14t12
/dev/mapper/sdb13 917G 789G 129G 87% /mnt/i14t13
/dev/mapper/sdb14 917G 783G 134G 86% /mnt/i14t14
/dev/mapper/sdb15 917G 745G 173G 82% /mnt/i14t15
/dev/mapper/sdc1 1.8T 977G 857G 54% /mnt/i14search
/dev/md0 367G 248G 120G 68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I’ll talk more about this when we get to the confidentiality post in the security series.

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller:

[photo: the 24 front drive bays]

And 12 drives in the back connected directly to the motherboard SATA:

[photo: the 12 rear drives]

The front drives are divided into two RAID6 sets of 12 x 2Tb drives each, giving 20 teraslots per set.

In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS – we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn’t so important, but cache certainly is!

Here’s a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:


servername: sloti33d1t01

configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve

duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db

partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive

tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive

The disks themselves – we have a tool that spits out the drive config of the SATA-attached drives. It just pokes around in /sys for details:

$ utils/saslist.pl
1 - HDD 500G RDY sdc 3QG023NC
2 - HDD 500G RDY sdd 3QG023TR
3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
4 - HDD 2T RDY sdf md2/0 WDWMAY04568236
5 - HDD 2T RDY sdg md3/0 WDWMAY04585688
6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
7 - HDD 2T RDY sdi md2/1 WDWMAY04606266
8 - HDD 2T RDY sdj md3/1 WDWMAY04567563
9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
10 - HDD 2T RDY sdl md2/2 WDWMAY00250279
11 - HDD 2T RDY sdm md3/2 WDWMAY04567237
12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
# Name Raid Name Level Capacity Ch/Id/Lun State
===============================================================================
1 i33d1spool i33d1spool Raid6 20000.0GB 00/00/00 Normal
2 i33d2spool i33d2spool Raid6 20000.0GB 00/01/00 Normal
===============================================================================
GuiErrMsg: Success.
$ utils/cli64 disk info
# Enc# Slot# ModelName Capacity Usage
===============================================================================
1 01 Slot#1 N.A. 0.0GB N.A.
2 01 Slot#2 N.A. 0.0GB N.A.
3 01 Slot#3 N.A. 0.0GB N.A.
4 01 Slot#4 N.A. 0.0GB N.A.
5 01 Slot#5 N.A. 0.0GB N.A.
6 01 Slot#6 N.A. 0.0GB N.A.
7 01 Slot#7 N.A. 0.0GB N.A.
8 01 Slot#8 N.A. 0.0GB N.A.
9 02 Slot 01 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
10 02 Slot 02 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
11 02 Slot 03 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d1spool
12 02 Slot 04 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
13 02 Slot 05 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
14 02 Slot 06 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
15 02 Slot 07 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
16 02 Slot 08 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
17 02 Slot 09 WDC WD2000F9YZ-09N20L0 2000.4GB i33d1spool
18 02 Slot 10 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
19 02 Slot 11 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
20 02 Slot 12 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
21 02 Slot 13 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
22 02 Slot 14 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
23 02 Slot 15 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
24 02 Slot 16 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d2spool
25 02 Slot 17 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
26 02 Slot 18 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
27 02 Slot 19 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
28 02 Slot 20 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
29 02 Slot 21 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
30 02 Slot 22 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
31 02 Slot 23 WDC WD2002FYPS-01U1B1 2000.4GB i33d2spool
32 02 Slot 24 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
===============================================================================
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don’t have enough hardware!


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/md2 2.7T 977G 1.8T 36% /mnt/i33d1search
/dev/mapper/md3 2.7T 936G 1.8T 35% /mnt/i33d2search
/dev/mapper/sda1 917G 730G 188G 80% /mnt/i33d1t01
/dev/mapper/sda2 917G 805G 113G 88% /mnt/i33d1t02
/dev/mapper/sda3 917G 709G 208G 78% /mnt/i33d1t03
/dev/mapper/sda4 917G 684G 234G 75% /mnt/i33d1t04
/dev/mapper/sda5 917G 825G 92G 91% /mnt/i33d1t05
/dev/mapper/sda6 917G 722G 195G 79% /mnt/i33d1t06
/dev/mapper/sda7 917G 804G 113G 88% /mnt/i33d1t07
/dev/mapper/sda8 917G 788G 129G 86% /mnt/i33d1t08
/dev/mapper/sda9 917G 661G 257G 73% /mnt/i33d1t09
/dev/mapper/sda10 917G 799G 119G 88% /mnt/i33d1t10
/dev/mapper/sda11 917G 691G 227G 76% /mnt/i33d1t11
/dev/mapper/sda12 917G 755G 162G 83% /mnt/i33d1t12
/dev/mapper/sda13 917G 746G 172G 82% /mnt/i33d1t13
/dev/mapper/sda14 917G 802G 115G 88% /mnt/i33d1t14
/dev/mapper/sda15 917G 159G 759G 18% /mnt/i33d1t15
/dev/mapper/sda16 917G 72M 917G 1% /mnt/i33d1t16
/dev/mapper/sda17 917G 706G 211G 78% /mnt/i33d1t17
/dev/mapper/sda18 917G 72M 917G 1% /mnt/i33d1t18
/dev/mapper/sda19 917G 72M 917G 1% /mnt/i33d1t19
/dev/mapper/sda20 917G 72M 917G 1% /mnt/i33d1t20
/dev/mapper/sdb1 917G 740G 178G 81% /mnt/i33d2t01
/dev/mapper/sdb2 917G 772G 146G 85% /mnt/i33d2t02
/dev/mapper/sdb3 917G 797G 120G 87% /mnt/i33d2t03
/dev/mapper/sdb4 917G 762G 155G 84% /mnt/i33d2t04
/dev/mapper/sdb5 917G 730G 187G 80% /mnt/i33d2t05
/dev/mapper/sdb6 917G 803G 114G 88% /mnt/i33d2t06
/dev/mapper/sdb7 917G 806G 112G 88% /mnt/i33d2t07
/dev/mapper/sdb8 917G 786G 131G 86% /mnt/i33d2t08
/dev/mapper/sdb9 917G 663G 254G 73% /mnt/i33d2t09
/dev/mapper/sdb10 917G 776G 142G 85% /mnt/i33d2t10
/dev/mapper/sdb11 917G 743G 174G 82% /mnt/i33d2t11
/dev/mapper/sdb12 917G 750G 168G 82% /mnt/i33d2t12
/dev/mapper/sdb13 917G 743G 174G 82% /mnt/i33d2t13
/dev/mapper/sdb14 917G 196G 722G 22% /mnt/i33d2t14
/dev/mapper/sdb15 917G 477G 441G 52% /mnt/i33d2t15
/dev/mapper/sdb16 917G 539G 378G 59% /mnt/i33d2t16
/dev/mapper/sdb17 917G 72M 917G 1% /mnt/i33d2t17
/dev/mapper/sdb18 917G 72M 917G 1% /mnt/i33d2t18
/dev/mapper/sdb19 917G 72M 917G 1% /mnt/i33d2t19
/dev/mapper/sdb20 917G 72M 917G 1% /mnt/i33d2t20
/dev/md0 367G 301G 67G 82% /mnt/ssd33d1
/dev/md1 367G 300G 67G 82% /mnt/ssd33d2

Some more copy’n’paste:


$ free
total used free shared buffers cached
Mem: 198201540 197649396 552144 0 7084596 120032948
-/+ buffers/cache: 70531852 127669688
Swap: 2040248 1263264 776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size…


$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very low on CPU usage. We don’t need a super-powerful CPU to drive this box. The CPU load is always low, it’s mostly IO wait – so we just have a pair of 4 core CPUs.

The future?

We’re in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally “forever”. They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn’t been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.


Dec 3: Push it real good

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 2nd was an intro to our approach to security. The next post on December 4th is about our IMAP server hardware.

Technical level: lots

Bron (one of our team, you’ve heard of him) does this great demo when he shows FastMail to other people. He opens his inbox up on the screen, and gets someone to send him an email. A nice animation slides the new message into view on the screen. At the same time, his phone plays a sound. He uses his watch to delete the email, and it disappears from the screen too. It’s a huge “wow” moment, made possible by our push notification system. Today we’ll talk about exactly how we let you know when something interesting happens in your mailbox.

Cyrus has two mechanisms for telling the world that something has changed, idled and mboxevent. idled is the simpler of the two. When your mail client issues an IDLE command, it is saying “put me to sleep and tell me when something changes”. idled is the server component that manages this, holding the connection open and sending a response when something changes. An example protocol exchange looks something like this (taken from RFC 2177, the relevant protocol spec):

C: A001 SELECT INBOX
S: * FLAGS (Deleted Seen)
S: * 3 EXISTS
S: * 0 RECENT
S: * OK [UIDVALIDITY 1]
S: A001 OK SELECT completed
C: A002 IDLE
S: + idling
...time passes; new mail arrives...
S: * 4 EXISTS
C: DONE
S: A002 OK IDLE terminated

It’s a fairly simple mechanism, only designed for use with IMAP. We’ll say no more about it.

Of far more interest is Cyrus’ “mboxevent” mechanism, which is based in part on RFC 5423. Cyrus can be configured to send events to another program any time something changes in a mailbox. The event contains details about the type of action that occurred, identifying information about the message and other useful information. Cyrus generates events for pretty much everything – every user action, data change, and other interesting things like calendar alarms. For example, here’s a delivery event for a system notification message I received a few minutes ago:

{
 "event" : "MessageNew",
 "messages" : 1068,
 "modseq" : 1777287, 
 "pid" : 2087223,
 "serverFQDN" : "sloti30t01",
 "service" : "lmtp",
 "uidnext" : 40818,
 "uri" : "imap://robn@fastmail.fm@sloti30t01/INBOX;UIDVALIDITY=1335827579/;UID=40817",
 "vnd.cmu.envelope" : "(\"Wed, 03 Dec 2014 08:49:40 +1100\" \"Re: Blog day 3: Push it real good\" ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((NIL NIL \"staff\" \"fastmail.fm\")) NIL NIL \"<1417521960.1705207.197773721.1B9AB1E1.3158739225@webmail.messagingengine.com>\" \"<1417556980.2079156.198025625.704CB1C1@webmail.messagingengine.com>\")",
 "vnd.cmu.mailboxACL" : "robn@fastmail.fm\tlrswipkxtecdn\tadmin\tlrswipkxtecdan\tanyone\tp\t",
 "vnd.cmu.mbtype" : "",
 "vnd.cmu.unseenMessages" : 90,
 "vnd.fastmail.cid" : "b9384d25e93fc71c",
 "vnd.fastmail.convExists" : 413,
 "vnd.fastmail.convUnseen" : 82,
 "vnd.fastmail.counters" : "0 1777287 1777287 1761500 1758760 1416223082",
 "vnd.fastmail.sessionId" : "sloti30t01-2087223-1417556981-1"
}

The event contains all sorts of information: the action that happened, the message involved, how many messages I have, the count of unread messages and conversations, the folder involved, stuff about the message headers, and more. This information isn’t particularly useful in its raw form, but we can use it to do all kinds of interesting things.

We have a pair of programs that do various processing on the events that Cyrus produces. They are called “pusher” and “notifyd”, and they run on every server where Cyrus runs. Their names don’t quite convey their purpose, as they’ve grown up over time into their current forms.

pusher is the program that receives all events coming from Cyrus. It actions many events itself, but only the ones it can handle “fast”. All other events are handed off to notifyd, which handles the “slow” events. The line between the two is a little fuzzy, but the rule of thumb is that if an event needs to access the central database or send mail, it’s “slow” and should be handled by notifyd. Sending stuff to open connections or making simple HTTP calls are “fast”, and are handled in pusher. I’ve got scare quotes around “slow” and “fast” because the slow events aren’t actually slow. It’s more about how well each program responds when flooded with events (for programmers: pusher is non-blocking, notifyd can block).
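In outline, the split looks something like this (a sketch only, with invented event names and stand-in subs rather than the real pusher code):

# events that need the central database or outbound email/SMS go to notifyd
# ("slow"); everything else is handled in-process ("fast"); names illustrative
my %SLOW_EVENTS = map { $_ => 1 } qw(CalendarAlarm SieveNotify);

sub dispatch_event {
    my ($event) = @_;
    if ($SLOW_EVENTS{ $event->{event} }) {
        hand_off_to_notifyd($event);    # may block: database lookups, sending mail
    }
    else {
        handle_locally($event);         # non-blocking: EventSource writes, push HTTP calls
    }
}

sub hand_off_to_notifyd { warn "to notifyd: $_[0]{event}\n" }   # stand-in
sub handle_locally      { warn "fast path:  $_[0]{event}\n" }   # stand-in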

We’ll talk about notifyd first, because it’s the simpler of the two. The two main styles of events it handles are calendar email notifications and Sieve (filter/rule) notifications.

Calendar email notifications are what you get when you say, for example, “10 minutes before this event, send me an email”. All the information that is needed to generate an email is placed into the event that comes from Cyrus, including the name of the event, the start and end time, the attendees, location, and so on. notifyd constructs an email and sends it to the recipient. It does database work to look up the recipient’s language settings to try and localise the email it sends. Here we see database and email work, and so it goes on the slow path.

The Sieve filter system we use has a notification mechanism (also see RFC 5435) where you can write rules that cause new emails or SMS messages to be sent. notifyd handles these too, taking the appropriate action on the email (summarise, squeeze) and then sending an email or posting to a SMS provider. For SMS, it needs to fetch the recipient’s number and SMS credit from the database so again, it’s on the slow path.

On to pusher, which is where the really interesting stuff happens. The two main outputs it has are the EventSource facility used by the web client, and the device push support used by the Android and iOS apps.

EventSource is a facility available in most modern web browsers that allows the browser to create a long-lived connection to a server and receive a constant stream of events. The events we send are very simple. For the above event, pusher would send the following to all web sessions I currently have open:

event: push
id: 1760138
data: {"mailModSeq":1777287,"calendarModSeq":1761500,"contactsModSeq":1758760}

These are the various “modseq” (modification sequence, see also RFC 7162) numbers for mail, calendar and contacts portions of my mail store. The basic idea behind a modseq is that every time something changes, the modseq number goes up. If the client sees the number change, it knows that it needs to request an update from the server. By sending the old modseq number in this request, it receives only the changes that have happened since then, making this a very efficient operation.
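On the wire that update is just a few lines of text in the EventSource format. Building a frame is trivial; here’s a sketch using core Perl’s JSON::PP (not the actual pusher code):

use JSON::PP qw(encode_json);

# format one EventSource frame carrying the current modseq values
sub eventsource_frame {
    my ($id, $modseqs) = @_;
    return "event: push\n"
         . "id: $id\n"
         . "data: " . encode_json($modseqs) . "\n\n";
}

print eventsource_frame(1760138, {
    mailModSeq     => 1777287,
    calendarModSeq => 1761500,
    contactsModSeq => 1758760,
});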

If you’re interested, we’ve written about how we use EventSource in a lot more detail in this blog post from a couple of years ago. Some of the details have changed since then, but the ideas are still the same.

The other thing that pusher handles is pushing updates to the mobile apps. The basic idea is the same. When you log in to one of the apps, they obtain a device token from the device’s push service (Apple Push Notification Service (APNS) for iOS or Google Cloud Messaging (GCM) for Android), and then make a special call to our servers to register that token with pusher. When the inbox changes in some way, a push event is created and sent along with the device token to the push service (Apple’s or Google’s, depending on the token type).

On iOS, a new message event must contain the actual text that is displayed in the notification panel, so pusher extracts that information from the “vnd.cmu.envelope” parameter in the event it received from Cyrus. It also includes an unread count, which is used to update the red “badge” on the app icon, and a URL which is passed to the app when the notification is tapped. An example APNS push event might look like:

{
  "aps" : {
    "alert" : "Robert Norris\nHoliday pics",
    "badge" : 82,
    "sound" : "default"
  },
  "url" : "?u=12345678#/mail/Inbox/eb26398990c4b29b-f45463209u40683"
}

For other inbox changes, like deleting a message, we send a much simpler event to update the badge:

{
  "aps" : {
    "badge" : 81
  }
}

The Android push operates a little differently. On Android it’s possible to have a service running in the background. So instead of sending message details in the push, we only send the user ID.

{
  "data" : {
    "uparam" : "12345678"
  }
}

(The Android app actually only uses the user ID to avoid a particular bug, but it will be useful in the future when we support multiple accounts).

On receiving this, the background service makes a server call to get any new messages since the last time it checked. It’s not unlike what the web client does, but simpler. If it finds new messages, it constructs a system notification and displays it in the notification panel. If it sees a message has been deleted and it currently has it visible in the notification, it removes it. It also adds a couple of buttons (archive and delete) which result in server actions being taken.

So that’s all the individual moving parts. If you put them all together, then you get some really impressive results. When Bron uses the “delete” notification action on his watch (an extension of the phone notification system), it causes the app to send a delete instruction to the server. Cyrus deletes the message and sends a “MessageDelete” event to pusher. pusher sends a modseq update via EventSource to the web clients, which respond by requesting an update from the server, noting the message is deleted and removing it from the message list. pusher also notices this is an inbox-related change, so it sends a new push to any registered Android devices and, because it’s not a “MessageNew” event, sends a badge update to registered iOS devices.

One of the things I find most interesting about all of this is that in a lot of ways it wasn’t actually planned, but has evolved over time. notifyd is over ten years old and existed just to support Sieve notifications. Then pusher came along when we started the current web client and it needed live updates. Calendar notifications came later and most recently, device push. It’s really nice having an easy, obvious place to work with mailbox events. I fully expect that we’ll extend these tools further in the future to support new kinds of realtime updates and notifications.


Dec 1: Email Search System

This blog post is part of the FastMail 2014 Advent Calendar.

The next post on December 2nd is Security – Confidentiality, Integrity and Availability.

Technical level: medium

Our email search system was originally written by Greg Banks, who has moved on to a role at LinkedIn, so I maintain it now. It’s a custom extension to the Cyrus IMAPd mail server.

Fast search was a core required feature for our new web interface. My work account has over half a million emails in it, and even though our interface allows fast scroll, it’s still impossible to find anything more than a few weeks old without either knowing exactly when it was sent, or having a powerful search facility.

Greg tried a few different engines, and settled on the Xapian project as the best fit for the one-database-per-user model that we wanted.

We tried indexing new emails as they arrived, even directly to fast SSDs, and discovered that the load was just too high. Our servers were overloaded trying to index in time – because adding a single email causes a lot of updates.

Luckily, Xapian supports searching from multiple databases at once, so we came up with the idea of a tiered database structure.

New messages get indexed to a small database on tmpfs. A job runs every hour to see if tmpfs is getting too full (over 50% of the defined size); if so, it compacts immediately, otherwise we automatically compact during the quiet time of the day. Compacted databases are more efficient, but read-only.
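The hourly check is nothing fancy. In spirit it’s just this (a sketch with an invented size limit, assuming GNU du):

# is the tmpfs search tier over half of its defined size?
my $TMPFS_LIMIT_BYTES = 4 * 1024**3;    # the "defined size" - illustrative only

sub tmpfs_usage_bytes {
    my ($dir) = @_;
    my ($bytes) = split /\s+/, qx(du -sb $dir);   # GNU du reports bytes with -b
    return $bytes;
}

sub should_compact_now {
    my ($dir) = @_;
    return tmpfs_usage_bytes($dir) > 0.5 * $TMPFS_LIMIT_BYTES;
}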

This allows us to index all email immediately, and return a message that arrived just a second ago in your search results complete with highlighted search terms, yet not overload the servers. It also means that search data can be stored on inexpensive disks, keeping the costs of our accounts down.

Technical level: extreme

Here’s some very technical information about how the tiers are implemented, and an example of running the compaction.

We have 4 tiers at FastMail, though we don’t actually use the ‘meta’ one (SSD) at the moment:

  • temp
  • meta
  • data
  • archive

The temp level is on tmpfs, purely in memory. Meta is on SSD, but we don’t use that except during shutdown. Data is the main version, and we re-compact all the data level indexes once per week. Finally, archive is never automatically updated; we build it when users are moved or renamed, or create it manually.

Both external locking (Xapian isn’t always happy with multiple writers on one database) and the compaction logic are managed via a separate file called xapianactive. The xapianactive file looks like this:

% cat /mnt/ssd30/sloti30t01/store23/conf/user/b/brong.xapianactive
temp:264 archive:2 data:37

The first item in the active file is always the writable index – all the others are read-only.

These map to paths on disk according to the config file:

% grep search /etc/cyrus/imapd-sloti30t01.conf
search_engine: xapian
search_index_headers: no
search_batchsize: 8192
defaultsearchtier: temp
tempsearchpartition-default: /var/run/cyrus/search-sloti30t01
metasearchpartition-default: /mnt/ssd30/sloti30t01/store23/search
datasearchpartition-default: /mnt/i30search/sloti30t01/store23/search
archivesearchpartition-default: /mnt/i30search/sloti30t01/store23/search-archive

(the ‘default tier’ tells the system where new search databases get created)

So based on these paths, we find:

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
3328 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264
1520432 /mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37
3365336 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2
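Those three databases are exactly what gets stacked together at search time, using Xapian’s multi-database support. Here’s a hedged sketch using the Search::Xapian Perl bindings (illustrative only, not the squatter source):

use Search::Xapian;

# stack the tiers (newest first) so one query searches all of them at once
my @tiers = (
    '/var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264',
    '/mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37',
    '/mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2',
);

my $db = Search::Xapian::Database->new(shift @tiers);
$db->add_database(Search::Xapian::Database->new($_)) for @tiers;

my $enquire = Search::Xapian::Enquire->new($db);
$enquire->set_query(Search::Xapian::Query->new('holiday'));

for my $match ($enquire->matches(0, 10)) {
    printf "docid %d (%d%%)\n", $match->get_docid, $match->get_percent;
}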

I haven’t compacted to archive for a while, so let’s watch a full compaction. I’m selecting all the tiers and compacting them to a single tier. The process is as follows:

  1. take an exclusive lock on the xapianactive file
  2. insert a new default tier database on the front (in this example it will be temp:265) and unlock xapianactive again
  3. start compacting all the selected databases to a single database on the given tier
  4. take an exclusive lock on the xapianactive file again
  5. if the xapianactive file has changed, discard all our work (we lock against this, but it’s a sanity check) and exit
  6. replace all the source databases for the compact with a reference to the destination database and unlock xapianactive again
  7. delete all now-unused databases

Note that the xapianactive file is only locked for two VERY SHORT times. All the rest of the time, the compact runs in parallel, and both searching on the read-only source databases and indexing to the new temp database can continue.

This allows us to only ever have a single thread compacting to disk, so our search drives are mostly idle, and able to serve customer search requests very quickly.

When holding an exclusive xapianactive lock, it’s always safe to delete any databases which aren’t mentioned in the file – at worst you will race against another task which is also deleting the same databases, so this system is self-cleaning after any failures.
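The lock-check-swap step (points 4 to 6 above) can be sketched with a plain flock on the xapianactive file. This is illustrative only, not the squatter source:

use Fcntl qw(:flock);

# atomically replace the xapianactive contents, but only if nobody else
# changed it while we were compacting; otherwise the caller discards its work
sub swap_xapianactive {
    my ($path, $expected, $replacement) = @_;
    open my $fh, '+<', $path or die "open $path: $!";
    flock($fh, LOCK_EX)      or die "flock $path: $!";
    my $current = <$fh>;
    chomp $current;
    if ($current ne $expected) {
        close $fh;                       # releases the lock
        return 0;
    }
    seek($fh, 0, 0);
    truncate($fh, 0);
    print {$fh} "$replacement\n";
    close $fh;                           # unlock
    return 1;
}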

Here goes:

% time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti30t01.conf -v -z archive -t temp,meta,data,archive -u brong
compressing temp:264,archive:2,data:37 to archive:3 for user.brong (active temp:264,archive:2,data:37)
adding new initial search location temp:265
compacting databases
Compressing messages for brong
done /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3.NEW
renaming tempdir into place
finished compact of user.brong (active temp:265,archive:3)

real 4m52.285s
user 2m29.348s
sys 0m13.948s

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
368 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.265
du: cannot access `/mnt/i30search/sloti30t01/store23/search/b/user/brong/*': No such file or directory
4614368 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3

If you want to look at the code, it’s all open source. I push the fastmail branch to GitHub regularly. The xapianactive code is in imap/search_xapian.c and the C++ wrapper is in imap/xapian_wrap.cpp.


Updating our SSL certificates to SHA-256

This is a technical post. The important points to take away are that if, like most of our customers, you’re using FastMail’s web client with a modern, regularly updated browser like Chrome, Firefox, Internet Explorer or Safari, then everything will be fine. If you’re using an old browser or operating system (including long-unsupported mobile devices like old Nokia or WebOS devices), it may start failing to connect to FastMail during December, and you’ll need to make changes to the settings you use to access FastMail. Read on for details.

For many years the standard algorithm used to sign SSL certificates has been SHA-1. Recently, weaknesses have been exposed in that algorithm which make it unsuitable for this kind of cryptographic work. It’s not broken yet, but it’s reasonable to expect that it will be broken within the next year or two.

A replacement algorithm is available, called SHA-256 (part of the SHA-2 family), and it’s been the recommended algorithm for new certificates for the last couple of years.

Back in April, we updated our certificates with new ones that used SHA-256. This caused problems for certain older clients that didn’t have support for SHA-256. After some investigation, we reverted to SHA-1 certificates.

Recently Google announced that they would start deprecating SHA-1 support this year. Chrome 40 (currently in testing, due for release in January) will start showing the padlock icon on fastmail.com as “secure, with minor errors”. Crucially, it will no longer display the green “EV” badge.

As a result, we are intending to update our certificates to SHA-256 during December. It’s something we wanted to do back in April anyway, as we’d much prefer to proactively support modern security best practice rather than scramble frantically to fix things when breaches are discovered.

Unfortunately, this will cause problems for customers using older browsers. Most desktop browsers should not have any problem, though Windows XP users will need to update to Service Pack 3. Many more obscure devices (notably Nokia and WebOS devices) do not support SHA-256 at all, and will not be able to connect to us securely.

We will be attempting to support a SHA-1 certificate on insecure.fastmail.com and insecure.messagingengine.com, but only if our certificate authority will agree to issue one to us. Once we have that information I’ll update this post.

If you have any questions about this change, please contact support.

Further reading:


beta.fastmail.fm now redirects to beta.fastmail.com

In preparation for our move to fastmail.com, we’ll be doing some testing on beta.fastmail.fm. So if you use the beta server, expect some changes and potential issues over the next few days.

Currently that means if you go to beta.fastmail.fm, you’ll immediately be redirected to https://beta.fastmail.com. This is expected. Note that you can’t currently create @fastmail.com aliases or rename your account to @fastmail.com; that will only be available from Thursday, as described in the original blog post.


SSL 3.0 disabled due to security vulnerability

This morning Google published news of a new vulnerability in SSL 3.0. You can read more about it in the original announcement and in CloudFlare’s analysis of the problem.

This is a serious issue that can leak user data. Unfortunately there’s no workaround – the only option we have is to disable SSL 3.0 on our servers entirely. We don’t like having to do this because we want our users to be able to use any client they choose to access their mail, but when there’s a security hole and no way to plug it we have no choice but to break things for some people in order to protect everyone.

Happily, this should not affect the majority of our users. The only significant browser to be affected is Internet Explorer 6 on Windows XP, which will now not be able to connect to www.fastmail.fm at all. Similar changes have been made to our IMAP, POP and other backend services, so you may also have connection issues with older mail clients.

If you are unable or unwilling to upgrade your client software at this time, you can use insecure.fastmail.fm (web) and insecure.messagingengine.com (IMAP/POP/SMTP), both of which support SSL 3.0. As always, we highly discourage the use of these service names because they leave your data open to attack, and we may remove them in the future.

Update 16 Oct 2014 01:00 UTC: We’ve heard of at least two mail clients (Airmail and Windows Phone) that can receive but not send mail. Changing the outgoing settings to use port 587 instead of 465 has resolved the problem for some users. If you’re seeing similar problems, give that a try.
