Dec 4: Standalone Mail Servers

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We’ve written a lot about our slots/stores architecture before – so I’ll refer you to our documentation rather than rehashing the details here.

Our architecture has evolved over the years, and particularly during the Opera time, I had to resist the forces suggesting a “put all your storage on a SAN and your processing on compute nodes” design, or “why don’t you just virtualise it?”, as if that were a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: UC Berkeley had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capacity limits of their hardware that their mail architecture couldn’t handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google – search for “Berkeley Cyrus” and you’ll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade…

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won’t spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I’ve been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it’s worth putting our time into making it the best we can. There are things you can outsource, but for us, hardware design and the mail server itself have never been among them.

Early hardware – metadata on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas; we had backups, but it took a week to get everyone’s email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week, though, users didn’t have access to their old email. We didn’t want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks – 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! – massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way – standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box – I won’t talk any more about these, they’re all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.

We also added checksums to verify data integrity, after comparisons between replicas detected various corruptions and showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) – and tooling to fix the damage by pulling the affected email back from either a replica or the backup system and restoring it into place.
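To make the verification idea concrete, here’s a minimal sketch in Python – it isn’t our actual tool, and the flat spool directories and paths are made up; it just shows the digest-comparison approach:

import hashlib
import os

def spool_digests(spool_dir):
    """Map each message file in a spool directory to the SHA-1 of its contents."""
    digests = {}
    for name in os.listdir(spool_dir):
        path = os.path.join(spool_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                digests[name] = hashlib.sha1(f.read()).hexdigest()
    return digests

def compare_spools(master_dir, replica_dir):
    """Report bitrot (digest mismatch) or missing messages between a master and its replica."""
    master, replica = spool_digests(master_dir), spool_digests(replica_dir)
    for name in sorted(set(master) | set(replica)):
        if name not in replica:
            print(f"{name}: missing on replica")
        elif name not in master:
            print(f"{name}: missing on master")
        elif master[name] != replica[name]:
            print(f"{name}: digest mismatch - restore from replica or backup")

if __name__ == "__main__":
    compare_spools("/mnt/master-spool/user/example", "/mnt/replica-spool/user/example")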

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With SSDs becoming popular around 2008-2009 we wanted to use them, but cyrus.cache was just too big for the SSDs we could afford. It’s also only used for search and some SORT commands, but the architecture of Cyrus meant that you had to mmap the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There’s another small directory which contains the global databases – mailboxes database, seen and subscription files for each user, sieve scripts, etc. It’s a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year concentrating on rewriting the internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which make it a robust QRESYNC/CONDSTORE-compatible server (newer standards that required stronger guarantees than the Cyrus 2.3 data structures could provide), and also mean that the cache isn’t loaded until it’s actually needed.

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera’s sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel X25-E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along: search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that for almost all users, a week’s worth of email is enough to generate their initial mailbox view.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree, not in upstream Cyrus; I plan to submit it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD and some not.

We purchased larger SSDs (Intel DC3700 – 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.
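As a rough illustration of that archiving rule (bigger than 1Mb or older than 7 days), here’s a hedged sketch in Python. The real job runs inside Cyrus so the cyrus.index records are updated as messages move between partitions; this only shows the selection logic, and the flat directories and paths are placeholders:

import os
import shutil
import time

ONE_MB = 1024 * 1024
SEVEN_DAYS = 7 * 24 * 3600

def archive_spool(ssd_spool, archive_spool):
    """Move messages that are bigger than 1Mb or older than 7 days off the SSD."""
    now = time.time()
    for name in os.listdir(ssd_spool):
        src = os.path.join(ssd_spool, name)
        if not os.path.isfile(src):
            continue
        st = os.stat(src)
        if st.st_size > ONE_MB or now - st.st_mtime > SEVEN_DAYS:
            shutil.move(src, os.path.join(archive_spool, name))

if __name__ == "__main__":
    archive_spool("/path/to/ssd-spool", "/path/to/archive-spool")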

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called ‘t15’, because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a “teraslot”, as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 x 2Tb drives in a RAID6 (8 data + 2 parity) for 16Tb of available space: 1Tb for the operating system and 15 x 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sdb1 917G 691G 227G 76% /mnt/i14t01
/dev/mapper/sdb2 917G 588G 329G 65% /mnt/i14t02
/dev/mapper/sdb3 917G 789G 129G 86% /mnt/i14t03
/dev/mapper/sdb4 917G 72M 917G 1% /mnt/i14t04
/dev/mapper/sdb5 917G 721G 197G 79% /mnt/i14t05
/dev/mapper/sdb6 917G 805G 112G 88% /mnt/i14t06
/dev/mapper/sdb7 917G 750G 168G 82% /mnt/i14t07
/dev/mapper/sdb8 917G 765G 152G 84% /mnt/i14t08
/dev/mapper/sdb9 917G 72M 917G 1% /mnt/i14t09
/dev/mapper/sdb10 917G 800G 118G 88% /mnt/i14t10
/dev/mapper/sdb11 917G 755G 163G 83% /mnt/i14t11
/dev/mapper/sdb12 917G 778G 140G 85% /mnt/i14t12
/dev/mapper/sdb13 917G 789G 129G 87% /mnt/i14t13
/dev/mapper/sdb14 917G 783G 134G 86% /mnt/i14t14
/dev/mapper/sdb15 917G 745G 173G 82% /mnt/i14t15
/dev/mapper/sdc1 1.8T 977G 857G 54% /mnt/i14search
/dev/md0 367G 248G 120G 68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I’ll talk more about this when we get to the confidentiality post in the security series.

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller:

[Photo: the 24 drive bays in the front of the 4U chassis]

And 12 drives in the back connected directly to the motherboard SATA:

[Photo: the 12 drives in the back of the chassis]

The front drives are divided into two 12-drive RAID6 sets of 2Tb drives (20Tb usable each), giving 20 teraslots per set.

In the back, there are 6 x 2Tb drives for search, as a pair of software RAID sets (3 drives each, mirrored and striped, for 3Tb usable per set), and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS – we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn’t so important, but cache certainly is!

Here’s a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:


servername: sloti33d1t01

configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve

duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db

partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive

tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive

The disks themselves – we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:

$ utils/saslist.pl
1 - HDD 500G RDY sdc 3QG023NC
2 - HDD 500G RDY sdd 3QG023TR
3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
4 - HDD 2T RDY sdf md2/0 WDWMAY04568236
5 - HDD 2T RDY sdg md3/0 WDWMAY04585688
6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
7 - HDD 2T RDY sdi md2/1 WDWMAY04606266
8 - HDD 2T RDY sdj md3/1 WDWMAY04567563
9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
10 - HDD 2T RDY sdl md2/2 WDWMAY00250279
11 - HDD 2T RDY sdm md3/2 WDWMAY04567237
12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN
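The real tool is Perl, but the idea is simple enough to sketch in Python – the /sys attributes read here (size, device/model, queue/rotational) are the standard Linux ones, though our tool obviously reports more state (RAID membership, serial numbers and so on):

import os

def list_drives():
    """Print type, size and model for each sd* block device, read straight from /sys."""
    for dev in sorted(os.listdir("/sys/block")):
        if not dev.startswith("sd"):
            continue                                   # skip md, loop, ram, ...
        base = os.path.join("/sys/block", dev)
        try:
            with open(os.path.join(base, "size")) as f:
                sectors = int(f.read())                # size is in 512-byte sectors
            with open(os.path.join(base, "device", "model")) as f:
                model = f.read().strip()
            with open(os.path.join(base, "queue", "rotational")) as f:
                kind = "HDD" if f.read().strip() == "1" else "SSD"
        except OSError:
            continue
        print(f"{dev}  {kind}  {sectors * 512 / 1e9:.0f}G  {model}")

if __name__ == "__main__":
    list_drives()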

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
# Name Raid Name Level Capacity Ch/Id/Lun State
===============================================================================
1 i33d1spool i33d1spool Raid6 20000.0GB 00/00/00 Normal
2 i33d2spool i33d2spool Raid6 20000.0GB 00/01/00 Normal
===============================================================================
GuiErrMsg: Success.
$ utils/cli64 disk info
# Enc# Slot# ModelName Capacity Usage
===============================================================================
1 01 Slot#1 N.A. 0.0GB N.A.
2 01 Slot#2 N.A. 0.0GB N.A.
3 01 Slot#3 N.A. 0.0GB N.A.
4 01 Slot#4 N.A. 0.0GB N.A.
5 01 Slot#5 N.A. 0.0GB N.A.
6 01 Slot#6 N.A. 0.0GB N.A.
7 01 Slot#7 N.A. 0.0GB N.A.
8 01 Slot#8 N.A. 0.0GB N.A.
9 02 Slot 01 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
10 02 Slot 02 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
11 02 Slot 03 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d1spool
12 02 Slot 04 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
13 02 Slot 05 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
14 02 Slot 06 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
15 02 Slot 07 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
16 02 Slot 08 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
17 02 Slot 09 WDC WD2000F9YZ-09N20L0 2000.4GB i33d1spool
18 02 Slot 10 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
19 02 Slot 11 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
20 02 Slot 12 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
21 02 Slot 13 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
22 02 Slot 14 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
23 02 Slot 15 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
24 02 Slot 16 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d2spool
25 02 Slot 17 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
26 02 Slot 18 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
27 02 Slot 19 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
28 02 Slot 20 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
29 02 Slot 21 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
30 02 Slot 22 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
31 02 Slot 23 WDC WD2002FYPS-01U1B1 2000.4GB i33d2spool
32 02 Slot 24 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
===============================================================================
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don’t have enough hardware!


Filesystem Size Used Avail Use% Mounted on
/dev/mapper/md2 2.7T 977G 1.8T 36% /mnt/i33d1search
/dev/mapper/md3 2.7T 936G 1.8T 35% /mnt/i33d2search
/dev/mapper/sda1 917G 730G 188G 80% /mnt/i33d1t01
/dev/mapper/sda2 917G 805G 113G 88% /mnt/i33d1t02
/dev/mapper/sda3 917G 709G 208G 78% /mnt/i33d1t03
/dev/mapper/sda4 917G 684G 234G 75% /mnt/i33d1t04
/dev/mapper/sda5 917G 825G 92G 91% /mnt/i33d1t05
/dev/mapper/sda6 917G 722G 195G 79% /mnt/i33d1t06
/dev/mapper/sda7 917G 804G 113G 88% /mnt/i33d1t07
/dev/mapper/sda8 917G 788G 129G 86% /mnt/i33d1t08
/dev/mapper/sda9 917G 661G 257G 73% /mnt/i33d1t09
/dev/mapper/sda10 917G 799G 119G 88% /mnt/i33d1t10
/dev/mapper/sda11 917G 691G 227G 76% /mnt/i33d1t11
/dev/mapper/sda12 917G 755G 162G 83% /mnt/i33d1t12
/dev/mapper/sda13 917G 746G 172G 82% /mnt/i33d1t13
/dev/mapper/sda14 917G 802G 115G 88% /mnt/i33d1t14
/dev/mapper/sda15 917G 159G 759G 18% /mnt/i33d1t15
/dev/mapper/sda16 917G 72M 917G 1% /mnt/i33d1t16
/dev/mapper/sda17 917G 706G 211G 78% /mnt/i33d1t17
/dev/mapper/sda18 917G 72M 917G 1% /mnt/i33d1t18
/dev/mapper/sda19 917G 72M 917G 1% /mnt/i33d1t19
/dev/mapper/sda20 917G 72M 917G 1% /mnt/i33d1t20
/dev/mapper/sdb1 917G 740G 178G 81% /mnt/i33d2t01
/dev/mapper/sdb2 917G 772G 146G 85% /mnt/i33d2t02
/dev/mapper/sdb3 917G 797G 120G 87% /mnt/i33d2t03
/dev/mapper/sdb4 917G 762G 155G 84% /mnt/i33d2t04
/dev/mapper/sdb5 917G 730G 187G 80% /mnt/i33d2t05
/dev/mapper/sdb6 917G 803G 114G 88% /mnt/i33d2t06
/dev/mapper/sdb7 917G 806G 112G 88% /mnt/i33d2t07
/dev/mapper/sdb8 917G 786G 131G 86% /mnt/i33d2t08
/dev/mapper/sdb9 917G 663G 254G 73% /mnt/i33d2t09
/dev/mapper/sdb10 917G 776G 142G 85% /mnt/i33d2t10
/dev/mapper/sdb11 917G 743G 174G 82% /mnt/i33d2t11
/dev/mapper/sdb12 917G 750G 168G 82% /mnt/i33d2t12
/dev/mapper/sdb13 917G 743G 174G 82% /mnt/i33d2t13
/dev/mapper/sdb14 917G 196G 722G 22% /mnt/i33d2t14
/dev/mapper/sdb15 917G 477G 441G 52% /mnt/i33d2t15
/dev/mapper/sdb16 917G 539G 378G 59% /mnt/i33d2t16
/dev/mapper/sdb17 917G 72M 917G 1% /mnt/i33d2t17
/dev/mapper/sdb18 917G 72M 917G 1% /mnt/i33d2t18
/dev/mapper/sdb19 917G 72M 917G 1% /mnt/i33d2t19
/dev/mapper/sdb20 917G 72M 917G 1% /mnt/i33d2t20
/dev/md0 367G 301G 67G 82% /mnt/ssd33d1
/dev/md1 367G 300G 67G 82% /mnt/ssd33d2

Some more copy’n’paste:


$ free
total used free shared buffers cached
Mem: 198201540 197649396 552144 0 7084596 120032948
-/+ buffers/cache: 70531852 127669688
Swap: 2040248 1263264 776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size…


$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very light on CPU usage. We don’t need a super-powerful CPU to drive this box. The CPU load is always low – it’s mostly IO wait – so we just have a pair of 4-core CPUs.

The future?

We’re in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally “forever”. They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.
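For the nginx side, that per-connection lookup is what the mail proxy’s auth_http mechanism is for: nginx asks an HTTP service where the user lives, then proxies the connection there. Here’s a hedged sketch of such a lookup service in Python – the in-memory table stands in for our central database, and real code would verify the password (Auth-Pass) before answering:

from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the central database: user -> (backend IMAP server, port).
BACKENDS = {
    "alice@example.com": ("10.0.0.31", 143),
    "bob@example.com":   ("10.0.0.32", 143),
}

class AuthHandler(BaseHTTPRequestHandler):
    """Answers nginx's auth_http queries: which backend should this connection go to?"""
    def do_GET(self):
        user = self.headers.get("Auth-User", "")
        backend = BACKENDS.get(user)          # real code also checks Auth-Pass here
        self.send_response(200)
        if backend is None:
            self.send_header("Auth-Status", "Invalid login or password")
        else:
            host, port = backend
            self.send_header("Auth-Status", "OK")
            self.send_header("Auth-Server", host)    # nginx proxies the connection here
            self.send_header("Auth-Port", str(port))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 9000), AuthHandler).serve_forever()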

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn’t been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.


Dec 3: Push it real good

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 2nd was an intro to our approach to security. The next post on December 4th is about our IMAP server hardware.

Technical level: lots

Bron (one of our team, you’ve heard of him) does this great demo when he shows FastMail to other people. He opens his inbox up on the screen, and gets someone to send him an email. A nice animation slides the new message into view on the screen. At the same time, his phone plays a sound. He uses his watch to delete the email, and it disappears from the screen too. It’s a huge “wow” moment, made possible by our push notification system. Today we’ll talk about exactly how we let you know when something interesting happens in your mailbox.

Cyrus has two mechanisms for telling the world that something has changed: idled and mboxevent. idled is the simpler of the two. When your mail client issues an IDLE command, it is saying “put me to sleep and tell me when something changes”. idled is the server component that manages this, holding the connection open and sending a response when something changes. An example protocol exchange looks something like this (taken from RFC 2177, the relevant protocol spec):

C: A001 SELECT INBOX
S: * FLAGS (Deleted Seen)
S: * 3 EXISTS
S: * 0 RECENT
S: * OK [UIDVALIDITY 1]
S: A001 OK SELECT completed
C: A002 IDLE
S: + idling
...time passes; new mail arrives...
S: * 4 EXISTS
C: DONE
S: A002 OK IDLE terminated

It’s a fairly simple mechanism, only designed for use with IMAP. We’ll say no more about it.

Of far more interest is Cyrus’ “mboxevent” mechanism, which is based in part on RFC 5423. Cyrus can be configured to send events to another program any time something changes in a mailbox. The event contains details about the type of action that occurred, identifying information about the message and other useful information. Cyrus generates events for pretty much everything – every user action, data change, and other interesting things like calendar alarms. For example, here’s a delivery event for a system notification message I received a few minutes ago:

{
 "event" : "MessageNew",
 "messages" : 1068,
 "modseq" : 1777287, 
 "pid" : 2087223,
 "serverFQDN" : "sloti30t01",
 "service" : "lmtp",
 "uidnext" : 40818,
 "uri" : "imap://robn@fastmail.fm@sloti30t01/INBOX;UIDVALIDITY=1335827579/;UID=40817",
 "vnd.cmu.envelope" : "(\"Wed, 03 Dec 2014 08:49:40 +1100\" \"Re: Blog day 3: Push it real good\" ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((\"Robert Norris\" NIL \"robn\" \"fastmail.com\")) ((NIL NIL \"staff\" \"fastmail.fm\")) NIL NIL \"<1417521960.1705207.197773721.1B9AB1E1.3158739225@webmail.messagingengine.com>\" \"<1417556980.2079156.198025625.704CB1C1@webmail.messagingengine.com>\")",
 "vnd.cmu.mailboxACL" : "robn@fastmail.fm\tlrswipkxtecdn\tadmin\tlrswipkxtecdan\tanyone\tp\t",
 "vnd.cmu.mbtype" : "",
 "vnd.cmu.unseenMessages" : 90,
 "vnd.fastmail.cid" : "b9384d25e93fc71c",
 "vnd.fastmail.convExists" : 413,
 "vnd.fastmail.convUnseen" : 82,
 "vnd.fastmail.counters" : "0 1777287 1777287 1761500 1758760 1416223082",
 "vnd.fastmail.sessionId" : "sloti30t01-2087223-1417556981-1"
}

The event contains all sorts of information: the action that happened, the message involved, how many messages I have, the count of unread messages and conversations, the folder involved, stuff about the message headers, and more. This information isn’t particularly useful in its raw form, but we can use it to do all kinds of interesting things.

We have a pair of programs that do various processing on the events that Cyrus produce. They are called “pusher” and “notifyd”, and run on every server where Cyrus runs. Their names don’t quite convey their purpose as they’ve grown up over time into their current forms.

pusher is the program that receives all events coming from Cyrus. It actions many events itself, but only the ones it can handle “fast”. All other events are handed off to notifyd, which handles the “slow” events. The line between the two is a little fuzzy, but the rule of thumb is that if an event needs to access the central database or send mail, it’s “slow” and should be handled by notifyd. Sending stuff to open connections or making simple HTTP calls are “fast”, and are handled in pusher. I’ve got scare quotes around “slow” and “fast” because the slow events aren’t actually slow. It’s more about how well each program responds when flooded with events (for programmers: pusher is non-blocking, notifyd can block).
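To make the split concrete, here’s a hedged sketch of the hand-off in Python – not the real pusher code, and the event names in SLOW_EVENTS are illustrative rather than Cyrus’s exact types:

import json

SLOW_EVENTS = {"CalendarAlarm", "MessageNotify"}   # anything needing the database or SMTP

def handle_event(raw_event, notifyd_sock):
    """pusher receives every event; slow ones are handed straight to notifyd."""
    event = json.loads(raw_event)
    if event.get("event") in SLOW_EVENTS:
        notifyd_sock.sendall(raw_event + b"\n")   # slow path: notifyd may block on DB or SMTP
    else:
        push_fast(event)                           # fast path: open connections / simple HTTP

def push_fast(event):
    """Stand-in for pusher's own work: EventSource streams and APNS/GCM pushes."""
    pass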

We’ll talk about notifyd first, because it’s the simpler of the two. The two main styles of events it handles are calendar email notifications and Sieve (filter/rule) notifications.

Calendar email notifications are what you get when you say, for example, “10 minutes before this event, send me an email”. All the information that is needed to generate an email is placed into the event that comes from Cyrus, including the name of the event, the start and end time, the attendees, location, and so on. notifyd constructs an email and sends it to the recipient. It does database work to look up the recipient’s language settings to try and localise the email it sends. Here we see database and email work, and so it goes on the slow path.

The Sieve filter system we use has a notification mechanism (also see RFC 5435) where you can write rules that cause new emails or SMS messages to be sent. notifyd handles these too, taking the appropriate action on the email (summarise, squeeze) and then sending an email or posting to a SMS provider. For SMS, it needs to fetch the recipient’s number and SMS credit from the database so again, it’s on the slow path.

On to pusher, which is where the really interesting stuff happens. The two main outputs it has are the EventSource facility used by the web client, and the device push support used by the Android and iOS apps.

EventSource is a facility available in most modern web browsers that allows a page to create a long-lived connection to a server and receive a constant stream of events. The events we send are very simple. For the above event, pusher would send the following to all web sessions I currently have open:

event: push
id: 1760138
data: {"mailModSeq":1777287,"calendarModSeq":1761500,"contactsModSeq":1758760}

These are the various “modseq” (modification sequence, see also RFC 7162) numbers for mail, calendar and contacts portions of my mail store. The basic idea behind a modseq is that every time something changes, the modseq number goes up. If the client sees the number change, it knows that it needs to request an update from the server. By sending the old modseq number in this request, it receives only the changes that have happened since then, making this a very efficient operation.
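The client-side logic is correspondingly small. Here’s a rough sketch of it (in Python rather than the JavaScript our web client actually uses; fetch_changes and apply_changes are stand-ins for the real API calls):

last_seen = {"mail": 0, "calendar": 0, "contacts": 0}

def fetch_changes(kind, since):
    """Stand-in for the real 'give me everything changed since modseq X' server call."""
    return []

def apply_changes(kind, changes):
    """Stand-in for updating the local cache and UI."""
    pass

def on_push(data):
    """Handle one EventSource push, e.g. {"mailModSeq": 1777287, "calendarModSeq": ..., ...}."""
    keys = {"mail": "mailModSeq", "calendar": "calendarModSeq", "contacts": "contactsModSeq"}
    for kind, key in keys.items():
        new = data.get(key, 0)
        if new > last_seen[kind]:
            # Only deltas since last_seen[kind] come back, which keeps this cheap.
            apply_changes(kind, fetch_changes(kind, since=last_seen[kind]))
            last_seen[kind] = new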

If you’re interested, we’ve written about how we use EventSource in a lot more detail in this blog post from a couple of years ago. Some of the details have changed since then, but the ideas are still the same.

The other thing that pusher handles is pushing updates to the mobile apps. The basic idea is the same. When you log in to one of the apps, they obtain a device token from the device’s push service (Apple Push Notification Service (APNS) for iOS or Google Cloud Messaging (GCM) for Android), and then make a special call to our servers to register that token with pusher. When the inbox changes in some way, a push event is created and sent along with the device token to the push service (Apple’s or Google’s, depending on the token type).

On iOS, a new message event must contain the actual text that is displayed in the notification panel, so pusher extracts that information from the “vnd.cmu.envelope” parameter in the event it received from Cyrus. It also includes an unread count, which is used to update the red “badge” on the app icon, and a URL which is passed to the app when the notification is tapped. An example APNS push event might look like:

{
  "aps" : {
    "alert" : "Robert Norris\nHoliday pics",
    "badge" : 82,
    "sound" : "default"
  },
  "url" : "?u=12345678#/mail/Inbox/eb26398990c4b29b-f45463209u40683
}

For other inbox changes, like deleting a message, we send a much simpler event to update the badge:

{
  "aps" : {
    "badge" : 81
  }
}
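Putting those two shapes together, the payload construction amounts to something like the sketch below – hedged, because the real code in pusher extracts the sender and subject by parsing the vnd.cmu.envelope field shown earlier, which is skipped here:

def apns_new_message(sender, subject, unseen, url):
    """MessageNew push: alert text for the notification panel, badge count, tap-through URL."""
    return {
        "aps": {
            "alert": f"{sender}\n{subject}",
            "badge": unseen,          # e.g. vnd.fastmail.convUnseen from the Cyrus event
            "sound": "default",
        },
        "url": url,
    }

def apns_badge_update(unseen):
    """Any other inbox change (delete, read, ...): just refresh the badge."""
    return {"aps": {"badge": unseen}}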

The Android push operates a little differently. On Android it’s possible to have a service running in the background. So instead of sending message details in the push, we only send the user ID.

{
  "data" : {
    "uparam" : "12345678"
  }
}

(The Android app actually only uses the user ID to avoid a particular bug, but it will be useful in the future when we support multiple accounts).

On receiving this, the background service makes a server call to get any new messages since the last time it checked. It’s not unlike what the web client does, but simpler. If it finds new messages, it constructs a system notification and displays it in the notification panel. If it sees a message has been deleted and it currently has it visible in the notification, it removes it. It also adds a couple of buttons (archive and delete) which result in server actions being taken.

So that’s all the individual moving parts. If you put them all together, you get some really impressive results. When Bron uses the “delete” notification action on his watch (an extension of the phone notification system), it causes the app to send a delete instruction to the server. Cyrus deletes the message and sends a “MessageDelete” event to pusher. pusher sends a modseq update via EventSource to the web clients, which respond by requesting an update from the server, noting the message is deleted and removing it from the message list. pusher also notices this is an inbox-related change, so sends a new push to any registered Android devices and, because it’s not a “MessageNew” event, sends a badge update to registered iOS devices.

One of the things I find most interesting about all of this is that in a lot of ways it wasn’t actually planned, but has evolved over time. notifyd is over ten years old and existed just to support Sieve notifications. Then pusher came along when we started the current web client and it needed live updates. Calendar notifications came later and most recently, device push. It’s really nice having an easy, obvious place to work with mailbox events. I fully expect that we’ll extend these tools further in the future to support new kinds of realtime updates and notifications.


Dec 2: Security – Confidentiality, Integrity and Availability

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 1st was about the Email Search System. The next post on December 3rd is all about how we do real-time push notifications.

Technical level: low

This is the first of a series of blog posts on security, both FastMail’s approach to various threats, and how the location of our servers interacts with security risks. We’re not digging into the technical details yet, just looking at an overview of what security means.

I always recommend that people read this humorous paper by James Mickens at Microsoft Research (pdf). There are a ton of security in-jokes there, but he makes some really good points.

Another great place to learn more about security best practices is Bruce Schneier’s blog. He’s been thinking about this stuff for a long time, and is one of the world’s acknowledged experts on computer security.

Security consists of three things: Confidentiality, Integrity and Availability. There’s a good writeup on wikipedia and also a fairly good post on blog overflow – except that it falls for the trap of defining integrity as only protecting information from being modified by unauthorized parties.

Honestly, the biggest “security risk” to data integrity in the history of email has been the unreliable hard drives in people’s home computers dying, and all the email downloaded by POP3 over the years being lost or corrupted badly in a single screeching head-crash. For us, the biggest integrity risk is hardware or disk failures corrupting data, and I’ll write more about some of the corruption cases we’ve dealt with as well.

We care about all three security components at FastMail, and work to strike a sensible balance between them. There’s a joke that to perfectly secure a server you need to encase it in concrete deep under ground, and then cut off the power and network cables. It’s funny because there’s a hint of truth.

To be useful, a server has to be online. And that server is running imperfect software on imperfect hardware, which may have even been covertly modified (not just by the NSA either – anyone with a big enough budget and no regard for the law can pull off something like that).

Thankfully, the same security processes and architectures that defend against system failures are also good for protecting against active attackers. We follow best practices like running separate physical networks for internal traffic, restrictive firewalls that only allow expected traffic into our servers, following security announcement mailing lists for all our software, and only choosing software with a good security record.

That’s the baseline of good security. In the following posts, we will look at some of the specific things that FastMail does to protect our systems and our users’ data.


Dec 1: Email Search System

This blog post is part of the FastMail 2014 Advent Calendar.

The next post on December 2nd is Security – Confidentiality, Integrity and Availability.

Technical level: medium

Our email search system was originally written by Greg Banks, who has moved on to a role at LinkedIn, so I maintain it now. It’s a custom extension to the Cyrus IMAPd mail server.

Fast search was a core required feature for our new web interface. My work account has over half a million emails in it, and even though our interface allows fast scroll, it’s still impossible to find anything more than a few weeks old without either knowing exactly when it was sent, or having a powerful search facility.

Greg tried a few different engines, and settled on the Xapian project as the best fit for the one-database-per-user that we wanted.

We tried indexing new emails as they arrived, even directly to fast SSDs, and discovered that the load was just too high. Our servers were overloaded trying to index in time – because adding a single email causes a lot of updates.

Luckily, Xapian supports searching from multiple databases at once, so we came up with the idea of a tiered database structure.

New messages get indexed into a small database on tmpfs. A job runs every hour to check whether the tmpfs database is getting too full (over 50% of its defined size); if so, it compacts immediately, otherwise we compact automatically during the quiet time of the day. Compacted databases are more efficient, but read-only.
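The hourly check itself is tiny – roughly this sketch (the tmpfs path matches the example further down; on a tmpfs, the filesystem’s reported total size is its defined size):

import shutil

def should_compact_now(tmpfs_path="/var/run/cyrus/search-sloti30t01"):
    """True if the temp search tier is over 50% of its defined size."""
    usage = shutil.disk_usage(tmpfs_path)
    return usage.used > usage.total * 0.5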

This allows us to index all email immediately, and return a message that arrived just a second ago in your search results complete with highlighted search terms, yet not overload the servers. It also means that search data can be stored on inexpensive disks, keeping the costs of our accounts down.

Technical level: extreme

Here’s some very technical information about how the tiers are implemented, and an example of running the compaction.

We have 4 tiers at FastMail, though we don’t actually use the ‘meta’ one (SSD) at the moment:

  • temp
  • meta
  • data
  • archive

The temp level is on tmpfs, purely in memory. Meta is on SSD, but we don’t use that except during shutdown. Data is the main version, and we re-compact all the data level indexes once per week. Finally archive is never automatically updated, but we build it when users are moved or renamed, or can create it manually.

Both external locking (Xapian isn’t always happy with multiple writers on one database) and the compaction logic are managed via a separate file called xapianactive. The xapianactive looks like this:

% cat /mnt/ssd30/sloti30t01/store23/conf/user/b/brong.xapianactive
temp:264 archive:2 data:37

The first item in the active file is always the writable index – all the others are read-only.

These map to paths on disk according to the config file:

% grep search /etc/cyrus/imapd-sloti30t01.conf
search_engine: xapian
search_index_headers: no
search_batchsize: 8192
defaultsearchtier: temp
tempsearchpartition-default: /var/run/cyrus/search-sloti30t01
metasearchpartition-default: /mnt/ssd30/sloti30t01/store23/search
datasearchpartition-default: /mnt/i30search/sloti30t01/store23/search
archivesearchpartition-default: /mnt/i30search/sloti30t01/store23/search-archive

(the ‘default tier’ is to tell the system where to create a new search item)

So based on these paths, we find:

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
3328 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264
1520432 /mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37
3365336 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2
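Searching across those three databases at once is Xapian’s multi-database support doing its thing. Cyrus goes through the C++ API (imap/xapian_wrap.cpp), but purely as an illustration, the equivalent with the Xapian Python bindings on the paths above looks roughly like this:

import xapian  # the Xapian Python bindings; Cyrus itself uses the C++ API

# Stack the tiers listed above into one logical database.
paths = [
    "/var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264",                  # temp (tmpfs)
    "/mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37",           # data
    "/mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2",    # archive
]
db = xapian.Database()
for path in paths:
    db.add_database(xapian.Database(path))

# One query transparently searches every tier.
enquire = xapian.Enquire(db)
enquire.set_query(xapian.QueryParser().parse_query("holiday pics"))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.percent)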

I haven’t compacted to archive for a while. Let’s watch one of those. I’m selecting all the tiers, and compressing to a single tier. The process is as follows:

  1. take an exclusive lock on the xapianactive file
  2. insert a new default tier database on the front (in this example it will be temp:265) and unlock xapianactive again
  3. start compacting all the selected databases to a single database on the given tier
  4. take an exclusive lock on the xapianactive file again
  5. if the xapianactive file has changed, discard all our work (we lock against this, but it’s a sanity check) and exit
  6. replace all the source databases for the compact with a reference to the destination database and unlock xapianactive again
  7. delete all now-unused databases

Note that the xapianactive file is only locked for two VERY SHORT times. All the rest of the time, the compact runs in parallel, and both searching on the read-only source databases and indexing to the new temp database can continue.

This allows us to only ever have a single thread compacting to disk, so our search drives are mostly idle, and able to serve customer search requests very quickly.

When holding an exclusive xapianactive lock, it’s always safe to delete any databases which aren’t mentioned in the file – at worst you will race against another task which is also deleting the same databases, so this system is self-cleaning after any failures.
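Here’s a much-simplified Python rendering of the locking steps above, just to show their shape – the real implementation is C in imap/search_xapian.c, and the actual compaction (step 3) happens between these two calls, outside any lock:

import fcntl

def start_compact(active_path, default_tier="temp"):
    """Steps 1-2: under an exclusive lock, push a fresh writable database onto the front."""
    with open(active_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        items = f.read().split()                      # e.g. ["temp:264", "archive:2", "data:37"]
        gens = [int(i.split(":")[1]) for i in items if i.startswith(default_tier + ":")]
        new_head = f"{default_tier}:{max(gens, default=0) + 1}"   # temp:264 -> temp:265
        f.seek(0)
        f.truncate()
        f.write(" ".join([new_head] + items) + "\n")
    return items                                      # now all read-only: the compact sources

def finish_compact(active_path, sources, destination):
    """Steps 4-6: re-lock, sanity-check, and swap the sources for the compacted destination."""
    with open(active_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        items = f.read().split()
        if not all(s in items for s in sources):
            return False                              # file changed under us: discard our work
        items = [i for i in items if i not in sources] + [destination]
        f.seek(0)
        f.truncate()
        f.write(" ".join(items) + "\n")
    return True                                       # caller now deletes the unused databases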

Here goes:

% time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti30t01.conf -v -z archive -t temp,meta,data,archive -u brong
compressing temp:264,archive:2,data:37 to archive:3 for user.brong (active temp:264,archive:2,data:37)
adding new initial search location temp:265
compacting databases
Compressing messages for brong
done /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3.NEW
renaming tempdir into place
finished compact of user.brong (active temp:265,archive:3)

real 4m52.285s
user 2m29.348s
sys 0m13.948s

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
368 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.265
du: cannot access `/mnt/i30search/sloti30t01/store23/search/b/user/brong/*': No such file or directory
4614368 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3

If you want to look at the code, it’s all open source. I push the fastmail branch to github regularly. The xapianactive code is in imap/search_xapian.c and the C++ wrapper in imap/xapian_wrap.cpp.


FastMail Advent 2014

Welcome to the inaugural FastMail Advent Blog. One post per day for the next 24 days.

The idea came from a response I made to a question on reddit about physical locations of servers and their impact on security.

I wrote that up in more detail, with links to blog posts about what FastMail does to address various categories of security risk, and suddenly found myself with something much too long for a single blog post. I wanted to split it up into separate posts, and then started thinking about frequency – and meanwhile my daughters were asking about advent calendars for the year, and it clicked.

We’ve been promising to blog more about some of our technology, and also about some of the less-well-known features – here was a perfect opportunity. You won’t just be hearing from me: I’m going to try to get everyone to write up something about their areas of expertise.

There’s a fine internet tradition in what we’re doing here – check out the Perl Advent Calendar, for example. They’ve been doing it for years.

All the days are now complete. Here are links to the individual posts:

  1. Email Search System
  2. Security – Confidentiality, Integrity and Availability
  3. Push it real good
  4. Standalone Mail Servers
  5. Security – Integrity
  6. User authentication
  7. Automated installation
  8. Squire: FastMail’s rich text editor
  9. Email Authentication
  10. Security – Availability
  11. FastMail Support
  12. FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three
  13. FastMail DNS hosting
  14. On Duty!
  15. Putting the fast in FastMail: Loading your mailbox quickly
  16. Security – Confidentiality
  17. Testing
  18. Billing and Payments — a potted history
  19. Mailr
  20. Open-sourcing OvertureJS – the JS lib that powers FastMail
  21. File Storage
  22. CardDAV Beta Release
  23. JMAP — A better way to email
  24. Working at FastMail

Updating our SSL certificates to SHA-256

This is a technical post. The important points to take away are that if, like most of our customers, you’re using FastMail’s web client with a modern, regularly updated browser like Chrome, Firefox, Internet Explorer or Safari, then everything will be fine. If you’re using an old browser or operating system (including long-unsupported mobile devices like old Nokia or WebOS devices), it may start failing to connect to FastMail during December, and you’ll need to make changes to the settings you use to access FastMail. Read on for details.

For many years the standard algorithm used to sign SSL certificates has been SHA-1. Recently, weaknesses have been found in that algorithm which make it unsuitable for cryptographic use. It’s not broken yet, but it’s reasonable to expect that it will be broken within the next year or two.

A replacement algorithm is available, called SHA-256 (sometimes called SHA-2), and it’s been the recommended algorithm for new certificates for the last couple of years.

Back in April, we updated our certificates with new ones that used SHA-256. This caused problems for certain older clients that didn’t have support for SHA-256. After some investigation, we reverted to SHA-1 certificates.

Recently Google announced that they would start deprecating SHA-1 support this year. Chrome 40 (currently in testing, due for release in January) will start showing the padlock icon on fastmail.com as “secure, with minor errors”. Crucially, it will no longer display the green “EV” badge.

As a result, we are intending to update our certificates to SHA-256 during December. It’s something we wanted to do back in April anyway, as we’d much prefer to proactively support modern security best practice rather than scramble frantically to fix things when breaches are discovered.

Unfortunately, this will cause problems for customers using older browsers. Most desktop browsers should not have any problem, though Windows XP users will need to update to Service Pack 3. Many more obscure devices (notably Nokia and WebOS devices) do not support SHA-256 at all, and will not be able to connect to us securely.
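If you’re curious which algorithm a particular server’s certificate is signed with, it’s easy to check yourself. Here’s a small Python sketch using the third-party cryptography package (recent versions) – it will print something like “sha1” or “sha256”:

import ssl
from cryptography import x509          # third-party: pip install cryptography

def signature_hash(host, port=443):
    """Fetch a server's certificate and report which hash its signature uses."""
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    return cert.signature_hash_algorithm.name

if __name__ == "__main__":
    print(signature_hash("www.fastmail.com"))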

We will be attempting to support a SHA-1 certificate on insecure.fastmail.com and insecure.messagingengine.com, but only if our certificate authority will agree to issue one to us. Once we have that information I’ll update this post.

If you have any questions about this change, please contact support.

Further reading:


Recent interviews on Rocketship.fm and DomainSherpa.com

I was recently interviewed by two separate sites. Since these interviews cover some of the history of FastMail, the purchase by Opera and re-sale back to the staff, and our recent acquisition of fastmail.com, I thought it might be interesting to some of our users.

Why You Should Charge from Day One

http://rocketship.fm/episodes/ep-79-rob-mueller/

After 15 Years, FastMail Finally Acquires Their .Com – With Rob Mueller

http://www.domainsherpa.com/rob-mueller-fastmail-interview/
