Copyright © 1999–2017 FastMail Pty Ltd
This blog post is part of the FastMail 2014 Advent Calendar.
Technical level: highly technical
We've written a lot about our slots/stores architecture before - so I'll refer you to our documentation rather than rehashing the details here.
We have evolved over the years. During the Opera time in particular, I had to resist the forces suggesting a "put all your storage on a SAN and your processing on compute nodes" design, or "why don't you just virtualise it", as if that were a magic wand that solves all your scalability and IO challenges.
Luckily I had a great example to point to: Berkeley University had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capability limits of their hardware that their mail architecture couldn't handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, this meant every single user was offline.
I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google - search for "Berkeley Cyrus" and you'll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade...
So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won't spread to the users on other machines.
The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I've been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.
Email is the core of what we do, and it's worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been among them for us.
Early hardware - meta data on spinning disks
When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.
One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas. We did have backups, but it took a week to get everyone's email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn't have access to their old email. We didn't want this to ever happen again.
Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks - 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! - massive for the day) in two sets of RAID5 for email spool.
Even then I knew this was the right way - standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.
We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box. I won't talk any more about these; they're all retired too.
This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.
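A minimal sketch of that kind of comparison, not the actual FastMail tool. It assumes the IMAP-visible state of each message has already been fetched into a small record keyed by UID. The names `MessageState` and `compare_mailbox`, and the choice of fields, are hypothetical.

```python
from typing import Dict, List, NamedTuple


class MessageState(NamedTuple):
    """IMAP-visible state of one message (the field choice is an assumption)."""
    flags: frozenset
    size: int
    digest: str  # any content hash of the message body


def compare_mailbox(master: Dict[int, MessageState],
                    replica: Dict[int, MessageState]) -> List[str]:
    """Return a list of differences between master and replica, ordered by UID."""
    problems = []
    for uid in sorted(master.keys() | replica.keys()):
        if uid not in replica:
            problems.append(f"uid {uid}: missing on replica")
        elif uid not in master:
            problems.append(f"uid {uid}: only on replica")
        elif master[uid] != replica[uid]:
            problems.append(f"uid {uid}: state differs")
    return problems


master = {
    1: MessageState(frozenset({"\\Seen"}), 1024, "aa"),
    2: MessageState(frozenset(), 2048, "bb"),
}
replica = {1: MessageState(frozenset(), 1024, "aa")}
for line in compare_mailbox(master, replica):
    print(line)
```

An empty result means the two copies agree on everything the sketch looks at. A real checker would also need to compare mailbox-level state, which is left out here.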
We also added checksums to verify data integrity after various corruptions were detected between replicas which showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) - and tooling to allow the damage to be fixed by pulling back the affected email from either a replica or the backup system and restoring it into place.
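The detection half of this can be sketched as follows. This is an illustration under assumptions, not the Cyrus implementation: it assumes a digest was recorded for each message file at delivery time, and it uses SHA-1 purely as an example hash. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_sha1(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large messages are not read into memory at once."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_damaged(recorded: Dict[Path, str]) -> List[Path]:
    """Return files that are missing or whose content no longer matches the recorded digest."""
    return [
        path for path, digest in recorded.items()
        if not path.exists() or file_sha1(path) != digest
    ]
```

Anything returned by `find_damaged` would then be a candidate for restoring from a replica or from backup, as described above.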
Metadata on SSD
Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use SSDs, but cyrus.cache was just too big for the SSDs we could afford. It's also only used for search and some sort commands, but the architecture of Cyrus meant that you had to MMAP the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.
There's another small directory which contains the global databases - mailboxes database, seen and subscription files for each user, sieve scripts, etc. It's a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.
So I spent a year concentrating on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which actually make it a robust QRESYNC/CONDSTORE compatible server (new standards which required stronger guarantees than the Cyrus 2.3 data structures could provide), and also meant that cache wasn't loaded until it was actually needed.
This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera's sysadmin team.
These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.
Email on SSD
We ran happily for years with the 5 x 2Tb split, but something else came along. Search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that almost all users get enough email in a week that their initial mailbox view is going to be able to be generated from a week's worth of email.
So I patched Cyrus again. For now, this set of patches is only in the FastMail tree, not in upstream Cyrus. I plan to add it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD, and some not.
We purchased larger SSDs (Intel DC3700 - 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.
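The archive rule itself is simple enough to sketch. This is an illustration, not the actual job: it assumes "1Mb" means 1024 x 1024 bytes and that age is measured from delivery time, neither of which is stated above. The names are hypothetical.

```python
from datetime import datetime, timedelta

ARCHIVE_MAX_SIZE = 1024 * 1024        # "bigger than 1Mb" (byte interpretation is an assumption)
ARCHIVE_MAX_AGE = timedelta(days=7)   # "older than 7 days"


def should_archive(size_bytes: int, delivered: datetime, now: datetime) -> bool:
    """True if a message should move from the SSD to the slow drives."""
    too_big = size_bytes > ARCHIVE_MAX_SIZE
    too_old = (now - delivered) > ARCHIVE_MAX_AGE
    return too_big or too_old


now = datetime(2014, 12, 21, 12, 0)
print(should_archive(5 * 1024 * 1024, now, now))               # True: large, even though new
print(should_archive(20_000, now - timedelta(days=30), now))   # True: small but old
print(should_archive(20_000, now - timedelta(days=2), now))    # False: small and recent
```

Either condition alone is enough, so a large attachment leaves the SSD on the next daily run even if it arrived that day.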
This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called 't15', because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a "teraslot", as opposed to our earlier 300Gb and 500Gb slot sizes.
They have 10 drives in a RAID6 for 16Tb available space, 1Tb for operating system and 15 1Tb slots.
They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.
    Filesystem          Size  Used Avail Use% Mounted on
    /dev/mapper/sdb1    917G  691G  227G  76% /mnt/i14t01
    /dev/mapper/sdb2    917G  588G  329G  65% /mnt/i14t02
    /dev/mapper/sdb3    917G  789G  129G  86% /mnt/i14t03
    /dev/mapper/sdb4    917G   72M  917G   1% /mnt/i14t04
    /dev/mapper/sdb5    917G  721G  197G  79% /mnt/i14t05
    /dev/mapper/sdb6    917G  805G  112G  88% /mnt/i14t06
    /dev/mapper/sdb7    917G  750G  168G  82% /mnt/i14t07
    /dev/mapper/sdb8    917G  765G  152G  84% /mnt/i14t08
    /dev/mapper/sdb9    917G   72M  917G   1% /mnt/i14t09
    /dev/mapper/sdb10   917G  800G  118G  88% /mnt/i14t10
    /dev/mapper/sdb11   917G  755G  163G  83% /mnt/i14t11
    /dev/mapper/sdb12   917G  778G  140G  85% /mnt/i14t12
    /dev/mapper/sdb13   917G  789G  129G  87% /mnt/i14t13
    /dev/mapper/sdb14   917G  783G  134G  86% /mnt/i14t14
    /dev/mapper/sdb15   917G  745G  173G  82% /mnt/i14t15
    /dev/mapper/sdc1    1.8T  977G  857G  54% /mnt/i14search
    /dev/md0            367G  248G  120G  68% /mnt/ssd14
The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I'll talk more about this when we get to the confidentiality post in the security series.
The current generation
Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.
Also, we only get 15 teraslots per 2U.
So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller, and 12 drives in the back connected directly to the motherboard SATA.
The front drives are divided into two lots of 2Tb x 12 drive RAID6 sets, for 20 teraslots each.
In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS - we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.
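The capacity figures for both generations follow from simple arithmetic, sketched here. RAID6 gives up two drives' worth of space to parity. The mirrored calculation assumes two copies of every block spread across the three drives, which is an inference from the "3Tb usable" figure rather than something stated above. The function names are hypothetical.

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable space of a RAID6 set: two drives' worth of capacity goes to parity."""
    if drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (drives - 2) * drive_tb


def mirrored_usable_tb(drives: int, drive_tb: float, copies: int = 2) -> float:
    """Usable space when every block is stored `copies` times across the set."""
    return drives * drive_tb / copies


def teraslots(usable_tb: float, reserved_tb: float = 0.0, slot_tb: float = 1.0) -> int:
    """Number of whole slots that fit after reserving space for other uses."""
    return int((usable_tb - reserved_tb) // slot_tb)


print(raid6_usable_tb(10, 2))            # 2U box: 16
print(teraslots(16, reserved_tb=1))      # 2U box, 1Tb kept for the OS: 15
print(raid6_usable_tb(12, 2))            # each 4U front set: 20
print(teraslots(20) * 2)                 # 4U box, two sets: 40
print(mirrored_usable_tb(3, 2))          # each 4U search set: 3.0
```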
Finally, they have 192Gb of RAM. Processor isn't so important, but cache certainly is!
Here's a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:
    servername: sloti33d1t01
    configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
    sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve
    duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
    statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db
    partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
    archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive
    tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
    metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
    datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
    archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive
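To make the split easier to see, here is a small sketch that parses `key: value` lines like the ones above and labels each path by the kind of storage it sits on. The tier names and the classification rule (based only on the mount point names shown in this post) are assumptions for illustration.

```python
from typing import Dict


def parse_conf(text: str) -> Dict[str, str]:
    """Parse simple 'key: value' lines, skipping blanks and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        options[key.strip()] = value.strip()
    return options


def storage_tier(path: str) -> str:
    """Guess the storage tier from the mount point naming used in this post."""
    parts = path.split("/")
    if path.startswith("/var/run/"):
        return "run"
    if len(parts) > 2 and parts[1] == "mnt":
        mount = parts[2]
        if mount.startswith("ssd"):
            return "ssd"
        if mount.endswith("search"):
            return "search-disk"
        return "spool-disk"
    return "other"


sample = """
partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
"""
for key, value in parse_conf(sample).items():
    print(f"{key:30} {storage_tier(value)}")
```

Run against the full snippet, this shows the config directory, the live spool and the search metadata on SSD, the archive spool on the RAID6 slot, and the bulk search data on the search drives.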
The disks themselves - we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:
    $ utils/saslist.pl
     1 - HDD 500G RDY sdc       3QG023NC
     2 - HDD 500G RDY sdd       3QG023TR
     3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
     4 - HDD   2T RDY sdf md2/0 WDWMAY04568236
     5 - HDD   2T RDY sdg md3/0 WDWMAY04585688
     6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
     7 - HDD   2T RDY sdi md2/1 WDWMAY04606266
     8 - HDD   2T RDY sdj md3/1 WDWMAY04567563
     9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
    10 - HDD   2T RDY sdl md2/2 WDWMAY00250279
    11 - HDD   2T RDY sdm md3/2 WDWMAY04567237
    12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN
And the Areca tools work for the drives in front:
    $ utils/cli64 vsf info
      # Name        Raid Name   Level  Capacity   Ch/Id/Lun  State
    ===============================================================================
      1 i33d1spool  i33d1spool  Raid6  20000.0GB  00/00/00   Normal
      2 i33d2spool  i33d2spool  Raid6  20000.0GB  00/01/00   Normal
    ===============================================================================
    GuiErrMsg: Success.

    $ utils/cli64 disk info
      # Enc# Slot#    ModelName                 Capacity  Usage
    ===============================================================================
      1 01   Slot#1   N.A.                         0.0GB  N.A.
      2 01   Slot#2   N.A.                         0.0GB  N.A.
      3 01   Slot#3   N.A.                         0.0GB  N.A.
      4 01   Slot#4   N.A.                         0.0GB  N.A.
      5 01   Slot#5   N.A.                         0.0GB  N.A.
      6 01   Slot#6   N.A.                         0.0GB  N.A.
      7 01   Slot#7   N.A.                         0.0GB  N.A.
      8 01   Slot#8   N.A.                         0.0GB  N.A.
      9 02   Slot 01  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     10 02   Slot 02  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     11 02   Slot 03  WDC WD2000FYYZ-01UL1B0    2000.4GB  i33d1spool
     12 02   Slot 04  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     13 02   Slot 05  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     14 02   Slot 06  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     15 02   Slot 07  WDC WD2003FYYS-02W0B1     2000.4GB  i33d1spool
     16 02   Slot 08  WDC WD2003FYYS-02W0B0     2000.4GB  i33d1spool
     17 02   Slot 09  WDC WD2000F9YZ-09N20L0    2000.4GB  i33d1spool
     18 02   Slot 10  WDC WD2003FYYS-02W0B1     2000.4GB  i33d1spool
     19 02   Slot 11  WDC WD2003FYYS-02W0B1     2000.4GB  i33d1spool
     20 02   Slot 12  WDC WD2003FYYS-02W0B1     2000.4GB  i33d1spool
     21 02   Slot 13  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     22 02   Slot 14  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     23 02   Slot 15  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     24 02   Slot 16  WDC WD2000FYYZ-01UL1B0    2000.4GB  i33d2spool
     25 02   Slot 17  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     26 02   Slot 18  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     27 02   Slot 19  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     28 02   Slot 20  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     29 02   Slot 21  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     30 02   Slot 22  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
     31 02   Slot 23  WDC WD2002FYPS-01U1B1     2000.4GB  i33d2spool
     32 02   Slot 24  WDC WD2003FYYS-02W0B0     2000.4GB  i33d2spool
    ===============================================================================
    GuiErrMsg: Success.
We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don't have enough hardware!
    Filesystem          Size  Used Avail Use% Mounted on
    /dev/mapper/md2     2.7T  977G  1.8T  36% /mnt/i33d1search
    /dev/mapper/md3     2.7T  936G  1.8T  35% /mnt/i33d2search
    /dev/mapper/sda1    917G  730G  188G  80% /mnt/i33d1t01
    /dev/mapper/sda2    917G  805G  113G  88% /mnt/i33d1t02
    /dev/mapper/sda3    917G  709G  208G  78% /mnt/i33d1t03
    /dev/mapper/sda4    917G  684G  234G  75% /mnt/i33d1t04
    /dev/mapper/sda5    917G  825G   92G  91% /mnt/i33d1t05
    /dev/mapper/sda6    917G  722G  195G  79% /mnt/i33d1t06
    /dev/mapper/sda7    917G  804G  113G  88% /mnt/i33d1t07
    /dev/mapper/sda8    917G  788G  129G  86% /mnt/i33d1t08
    /dev/mapper/sda9    917G  661G  257G  73% /mnt/i33d1t09
    /dev/mapper/sda10   917G  799G  119G  88% /mnt/i33d1t10
    /dev/mapper/sda11   917G  691G  227G  76% /mnt/i33d1t11
    /dev/mapper/sda12   917G  755G  162G  83% /mnt/i33d1t12
    /dev/mapper/sda13   917G  746G  172G  82% /mnt/i33d1t13
    /dev/mapper/sda14   917G  802G  115G  88% /mnt/i33d1t14
    /dev/mapper/sda15   917G  159G  759G  18% /mnt/i33d1t15
    /dev/mapper/sda16   917G   72M  917G   1% /mnt/i33d1t16
    /dev/mapper/sda17   917G  706G  211G  78% /mnt/i33d1t17
    /dev/mapper/sda18   917G   72M  917G   1% /mnt/i33d1t18
    /dev/mapper/sda19   917G   72M  917G   1% /mnt/i33d1t19
    /dev/mapper/sda20   917G   72M  917G   1% /mnt/i33d1t20
    /dev/mapper/sdb1    917G  740G  178G  81% /mnt/i33d2t01
    /dev/mapper/sdb2    917G  772G  146G  85% /mnt/i33d2t02
    /dev/mapper/sdb3    917G  797G  120G  87% /mnt/i33d2t03
    /dev/mapper/sdb4    917G  762G  155G  84% /mnt/i33d2t04
    /dev/mapper/sdb5    917G  730G  187G  80% /mnt/i33d2t05
    /dev/mapper/sdb6    917G  803G  114G  88% /mnt/i33d2t06
    /dev/mapper/sdb7    917G  806G  112G  88% /mnt/i33d2t07
    /dev/mapper/sdb8    917G  786G  131G  86% /mnt/i33d2t08
    /dev/mapper/sdb9    917G  663G  254G  73% /mnt/i33d2t09
    /dev/mapper/sdb10   917G  776G  142G  85% /mnt/i33d2t10
    /dev/mapper/sdb11   917G  743G  174G  82% /mnt/i33d2t11
    /dev/mapper/sdb12   917G  750G  168G  82% /mnt/i33d2t12
    /dev/mapper/sdb13   917G  743G  174G  82% /mnt/i33d2t13
    /dev/mapper/sdb14   917G  196G  722G  22% /mnt/i33d2t14
    /dev/mapper/sdb15   917G  477G  441G  52% /mnt/i33d2t15
    /dev/mapper/sdb16   917G  539G  378G  59% /mnt/i33d2t16
    /dev/mapper/sdb17   917G   72M  917G   1% /mnt/i33d2t17
    /dev/mapper/sdb18   917G   72M  917G   1% /mnt/i33d2t18
    /dev/mapper/sdb19   917G   72M  917G   1% /mnt/i33d2t19
    /dev/mapper/sdb20   917G   72M  917G   1% /mnt/i33d2t20
    /dev/md0            367G  301G   67G  82% /mnt/ssd33d1
    /dev/md1            367G  300G   67G  82% /mnt/ssd33d2
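The "enough free slots" rule can be written down as a quick check. This is a simplified sketch: it treats a slot as free when it holds under 1G (the empty slots in the listing show 72M), and it ignores any placement constraints such as keeping replicas on separate machines. The names and the threshold are assumptions.

```python
from typing import Dict, List

EMPTY_THRESHOLD_GB = 1.0  # assumption: an unused slot holds only filesystem overhead


def free_slots(used_gb: Dict[str, float]) -> List[str]:
    """Names of slots whose usage is below the empty threshold."""
    return sorted(name for name, used in used_gb.items() if used < EMPTY_THRESHOLD_GB)


def can_absorb(slots_in_use_on_failed: int, free_by_machine: Dict[str, int]) -> bool:
    """True if the surviving machines have enough free slots between them."""
    return sum(free_by_machine.values()) >= slots_in_use_on_failed


# A few rows from the listing (72M is roughly 0.07G).
i33d1 = {"i33d1t15": 159, "i33d1t16": 0.07, "i33d1t17": 706, "i33d1t18": 0.07}
print(free_slots(i33d1))  # ['i33d1t16', 'i33d1t18']

# Hypothetical fleet: could the others take over 13 in-use slots from a failed 2U box?
print(can_absorb(13, {"machine-a": 8, "machine-b": 6}))  # True
```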
Some more copy'n'paste:
$ free total used free shared buffers cached Mem: 198201540 197649396 552144 0 7084596 120032948 -/+ buffers/cache: 70531852 127669688 Swap: 2040248 1263264 776984
Yes, we use that cache! Of course, the Swap is a little pointless at that size...
$ grep 'model name' /proc/cpuinfo model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
IMAP serving is actually very low on CPU usage. We don't need a super-powerful CPU to drive this box. The CPU load is always low, and most of it is IO wait, so we just have a pair of 4 core CPUs.
We're in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally "forever". They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.
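The per-connection lookup amounts to two mappings: user to slot, and slot to the machine currently serving it. The sketch below uses in-memory dictionaries to stand in for the central database. The table contents, hostnames and function name are hypothetical, and this is not the actual proxy integration.

```python
from typing import Dict, NamedTuple


class Backend(NamedTuple):
    host: str
    port: int


# Hypothetical stand-ins for tables in the central database.
USER_TO_SLOT = {"alice@example.com": "sloti33d1t01"}
SLOT_TO_BACKEND = {"sloti33d1t01": Backend("imap33.internal.example", 143)}


def lookup_backend(username: str,
                   user_to_slot: Dict[str, str],
                   slot_to_backend: Dict[str, Backend]) -> Backend:
    """Resolve which backend should handle a connection for this user."""
    slot = user_to_slot.get(username.lower())
    if slot is None:
        raise KeyError(f"unknown user: {username}")
    return slot_to_backend[slot]


print(lookup_backend("Alice@example.com", USER_TO_SLOT, SLOT_TO_BACKEND))
```

Keeping the indirection through the slot name means a slot can move to another machine by updating one row, without touching any user records.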
For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.
We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn't been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.