Increased user security, upgrading Diffie-Hellman parameters to 2048 bits

For non-technical users, the short version is that if you’re using a modern, up-to-date web browser, mobile device or mail client to access FastMail, then there’s nothing you need to do.

If you’re using old or unusual software to access or send via FastMail, you might be affected and should read on.

What’s happening?

On 30 March 2015 we will be increasing the size of the DH parameters for DHE ciphers to 2048 bits. This will cause connection problems for old software that cannot handle DH parameters greater than 1024 bits.

1024-bit RSA crypto is generally being phased out as insecure and has been for at least the last five years.

Breaking DH parameters is generally understood to require the same amount of computation as a RSA key of equivalent size. Therefore, the recommendation is to increase the size of DH parameters in step with the size of RSA keys.

If we don’t upgrade our crypto to 2048 bits for the general case, we’re compromising the security of all our users for a few that have old clients. We don’t consider that to be acceptable.

Will this affect me?

The main software we’re aware of that will be affected is iOS 5 and Java 6 and 7 (which often means business software that sends through our authenticated SMTP service).

If you’re unsure if you’re affected, you can test right now by pointing your software at (web) or (everything else). These servers are using the new config that will be rolled out on the 30th. Note that you shouldn’t use these names permanently; this is a test service and does not have the same redundancy as the main FastMail services.

If you can access your mail as normal using these servers, then you have nothing to worry about.

If you can’t connect through the beta servers but can through the main servers then its quite likely that you are affected and you will need to either upgrade or reconfigure your software, or switch to our “insecure” services at (web) or (everything else). Using the insecure service is not recommended as it uses encryption that is known to be weak or broken.

Please note that we’re unable to help you upgrade or reconfigure your software, particularly for those Java business apps. You’ll need to contact your software vendor for that.

Further reading

If you’d like to read more about perfect forward secrecy and DH param lengths, the following technical articles may be interesting to you:

Posted in Technical. Comments Off

Dec 21: File Storage

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 20th December saw us open source our core javascript library. The following post on 22nd December is the long awaited beta release of our contact syncing solution.

Technical level: high

Why does an email service need a file storage system?

For many years, well before cloud file storage became an everyday thing, FastMail has had a file storage feature. Like many features, it started as a logical extension of what we were already doing. People were accessing their email anywhere via a web browser, it would be really nice to be able to access important files everywhere as well. And there’s the obvious email integration points as well, being able to save attachments from emails to somewhere more structured, or having files that you commonly want to attach to emails (e.g. your resume) stored at FastMail, rather than having to upload it again each time.

Lots of FastMail’s early features exist because they’re the kind of thing that we wanted for ourselves!
It also turns out that have a generic file storage is useful for other features, as we discovered later.

A generic file service

The first implementation of our file storage system used a real filesystem on a large RAID array. To make the data available to our web servers, we used NFS. While in theory and nice and straight forward solution, unfortunately it all ended up being fairly horrible. We used the NFS server built into the Linux kernel at that time, and although it was supposed to be stable, that was not our experience at all. While all our other servers had many months of uptime, the file storage server running NFS would freeze/crash somewhere between every few days and every week. This was particularly surprising to us because we weren’t actually stressing it much, the workload wasn’t high compared to the IO that some file servers perform.

Having the NFS server go offline and people losing access to their files until it is rebooted was bad enough, but there was a much worse problem. Any process that tried to access a file on the NFS mount would freeze up until the server came back. Since the number of processes handling web requests was limited, all it took was a few 100 requests by users trying to access their file storage, and suddenly there were no processes left to handle other web requests, and ALL web requests would start failing, meaning no one was able to access the FastMail site at all. Not nice. We tried a combination of soft mounts and other flags, but couldn’t find a combination that was both consistently reliable and failure safe.

Apparently I have suppressed the memories — something to do with being woken by both young children AND broken servers, but Rob M remembers, and he says: “In one of those great moments of frustration at being woken up again by a crashed NFS server, Bron wanted to do a complete rewrite, and to use an entirely different way of storing the files. Instead of storing the file system structure in a real filesystem, we decided to use a database. However we didn’t want to store the actual file data in the database, that would result in a massive monolithic database with all user file data in it, not easy to manage. So the approach he came up with is a rather neat hybrid that has worked really well.” So there you go.

One of my first major projects at FastMail was this new file storage service. I was fresh from building data management tools for late-phase clinical trials (drugs, man) for Quintiles in New Jersey (that’s right, I moved from living in New Jersey and working on servers in Melbourne to living in Melbourne and working on servers in New York). I over-engineered our file storage system using many of the same ideas I had used for clinical data.

Interestingly, a lot of the work I’d been doing at Quintiles looked very similar in design to git, though it was years before git came out. Data addressed by digest (sha1), signatures on digests of lists of digests to provide integrity and authenticity checks over large blocks of data. That product doesn’t seem to exist any more though.

The original build of the file service was based on many of the same concepts. File version forking and merging (which was too complex and got scrapped) with very clever cache hierarchy and invalidation scheme. Blobs (file contents themselves) are stored in a key-value pool spread over multiple machines, with push to many copies before the store is successful, and a background cleaning task that ensures they are spread everywhere and garbage collected when no longer referenced.

The blob storage system is very simple – we could definitely build or grab off the shelf something a lot faster and better these days, but it’s very robust, and that matters to us more than speed.

Interestingly enough, while the caching system was great when there was a lower volume of changes and slow database servers, it eventually became faster to remove a layer of caching entirely as our needs and technology changed.

Database backed nodes

The same basic architecture still exists today. The file storage is a giant database table in our central mysql database. Every entry is a “Node”, with a primary key called NodeId, and a “ParentNodeId”. Node number 1 is treated specially, and is of class ‘RootNode’. It’s the top of the tree.

Because there are hundreds of thousands of top level nodes (ParentNodeId == 1), it would be crazy to read the key ‘ND:1′ (node directories, parent 1) for normal operations. Instead, we fetch “UA:$UserId” which is the ACL for the user’s userid, and then walk the tree back up from each ACL which grants the user any access, building a tree that way.

For example:

$ DEBUG_VFS=1 vfs -u ls /
Fetching ND:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching N:504452
d---   504452 2005-09-14 04:31:35
d---  1394098 2006-01-20 00:44:04
d---        2 2005-09-13 07:46:04

Whereas if we’re inside an ACL path we walk the tree normally from that ACL (we still need to check the other ACLs to see if they also impact the data we’re looking at…):

$ DEBUG_VFS=1 vfs -u ls '~/websites'
Fetching ND:504452
Fetching N:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching UT:485617
darw  6548549 2007-03-27 06:03:19 cherubfest/
darw 39907741 2008-11-25 10:27:55 custom-ui/
darw 335168869 2014-10-19 23:51:38 talks/
Fetching NFZ:504465

This structure also allows us to keep deleted files for a week, because when a node is “deleted”, it’s not actually deleted – it just gets an integer field “IsDeleted” set to the NodeId of the top node being deleted.

Let’s try that too:

[brong@utility2 ~]$ vfs -u mkdir '~/test/blog'
[brong@utility2 ~]$ echo "hello world" > /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog' /tmp/hello.txt
Failed to write: Could not open / for writing: Is a directory 
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt'
hello world
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u rm '~/test/blog/hello.txt'
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u undel 387950177
restored 387950177 (hello.txt) as /
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt'
hello world

And the versioning:

[brong@utility2 ~]$ echo "hello world v2" > /tmp/hello.txt 
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' /tmp/hello.txt
Failed to write: Could not open / for writing: File exists
[brong@utility2 ~]$ vfs -u -f cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u cat '~/test/blog/hello.txt' 
hello world v2
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u undel 387950177 oldhello.txt
restored 387950177 (hello.txt) as /
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)

This means that if you delete a folder and all its contents, you can undelete it without also undeleting everything ELSE that was deleted in the same action, just by setting nodes with the same IsDeleted field as the top node back to zero again.

[brong@utility2 ~]$ vfs -u rm '~/test/blog/oldhello.txt'
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u lsdel '~/test/blog/'
/ (deleted)
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)
[brong@utility2 ~]$ vfs -u rm '~/test/blog'
[brong@utility2 ~]$ vfs -u lsdel '~/test/'
/ (deleted)
darw 387950021 2014-12-21 00:38:20 blog/
[brong@utility2 ~]$ vfs -u undel 387950021
restored 387950021 (blog) as /
[brong@utility2 ~]$ vfs -u ls '~/test/blog/'
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)

So only the file that was in the folder when it was deleted is now present in the restored copy.

That’s the node tree. Alongside this is “Properties”, which are used not only for storing mime types of files (you can edit these in the filestorage) the expansion state of directories in the left view on the Files screen, but can be used via our WebDAV server to set any generic fields you like. There are ACLs and Locks which can also be used via DAV, and a Websites table which is used to manage our customer static-file web hosting.

The cache module can present any file as a regular Perl filehandle, including an append filehandle which already has the old content pre-loaded. Calling “close” on that filehandle returns a complete filestorage node on which other actions can be taken – and it does quota accounting on the fly. The whole system is object oriented and a joy to work with at the top level, despite all the weird bits underneath.

One thing – if you upload over the same filename again and again (webcams for example) we will only keep the most recent 30 copies, because we were finding that it slowed the directory listings too much otherwise.

File storage as a service

A long time ago, if you were in the middle of composing a draft, and you had uploaded a file, we stored that file in a temporary location on the web server’s filesystem. This was fine normally, but if we had to shut down one of our web servers, the session would move to another server, and the email would be sent without the attachment, because it couldn’t find the file.

We now keep uploaded files in a separate VFS directory for each user – and if you’re composing and we switch our web servers, you don’t even notice:

[brong@imap20 hm]$ vfs -u ls /
darw   504453 2014-10-18 16:13:19 files/
d-rw  6825201 2014-12-16 21:36:11 temp/

The home directory that you see in the Files screen is “/username.domain/files/” in the global filesystem. Your temporary files go into “/username.domain/temp/” with a special flag that says when they will be automatically deleted (default, 1 week from upload).

This has been very useful for other things too. We now upload imported calendar/addressbook files into VFS first via our general upload mechanism, and just use a VFS address for everything internally, meaning that the upload and the processing phase can be separate.

Calendar attachments are likewise done with an upload as a temporary file, and then a conversion into a long-lived URL later. Even on Mailr, which doesn’t have a file storage service for users, the file storage engine is used underneath for attachments and calendar.

Multiple ways to access

We use Perl as our backend language. We built both FTP and DAV servers, as well as our Files screen within the app. Another very valuable use of filestorage (indeed, one of the original reasons) is that you can copy attachments from emails into file storage, and attach from file storage into new emails. This is a great way to save having to download and upload giant attachments if you want to send them on to someone else.

As you can see from my example above, where my top level had 3 different directories, it’s possible to share file storage within a business. There’s a very powerful access control system to grant access only to individual folders, and separate read and write roles.

You can create websites, including password protected websites. We even have support for basic photo galleries.

Open Source?

A lot of the tooling on top of the VFS is based on open source modules, but hacked so much that they can’t really be shared easily back to the original authors. I have been more lax with this than I should have, particularly the DAV modules where I took the Net::DAV::Server module and almost entirely rewrote it based on the RFC, including a ton of testing against the clients of the day, and adding things like Apple’s quota support.

I don’t have time during this blog series to release anything nice, but there are no real secrets in the tree, so I’ve put up a tar file at which contains the current version of the files I’ve talked about in this post, and send me an email at the obvious address if you have comments or suggestions. The file itself is hosted on a FastMail static website!

A road not taken

All of this was built in 2004/2005. These days there are tons of “filesystem as a service” products out there. We could have built one – our VFS has some nice properties which would have made syncing with clients very viable – but our focus is on email, and we didn’t make the time to learn how to build good clients.

We actually have an experimental Linux FUSE module which can talk directly to our VFS, and I built some experimental Windows and Mac clients as well, but we never took it anywhere — though we did talk about the potential to pivot the company into the file storage business.

These days we have gone the other way, with Dropbox integration on the compose screen and hooks to add other services too. It makes sense to integrate with services which do what they are good at, and concentrate our efforts on being the best at our field, email.

Posted in Advent 2014, Technical. Comments Off

Dec 12: FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMs to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (If you haven’t heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something differently here – we don’t use its built-in replication system and instead opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason against this is the lack of redundancy. If your database catches on fire or as what happens more often the oom-killer chooses to zap your database server as it’s usually the biggest memory hog on a machine, none of your applications can continue without their needed data and so your service is taken offline. By using multiple databases, when a single database server is downed, your applications still have the others to chose from and connect to.

Another reason against using a single database for your online service is degraded performance – as more and more applications connect and perform work, your database’s server load increases. Once a server can’t take the requested load any longer, you’re left with query timeout responses and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm thus reducing the work a single database server has to cope with, while gaining an upshot in performance for free across the board.

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell your horror stores from the past as if you’re sitting around a camp fire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt and possibly take you offline until every one of your database servers back up and replicating happily again. We never wanted to be put in a situation like this and so wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes with a very high cost – it is a nightmare to manage if a single master was downed.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers be configured in a ring network topology. Every time an updated occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on – much like participants playing pass-the-parcel. And this works nicely and is in place in many companies. The nightmares begin however if a single database server is downed, thus breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network and data across every database server begin to get stale.

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon which serves out its local database updates. They then run a separate MySQL::Replication client for each master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and applies it locally. The advantage here is that when a database server is downed, only that single feed is broken. All other communication paths continue as normal, and query flow across the database replication network continue as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text-file called which is not atomically synced to disk. Once a slave dies and comes back online, files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but wasn’t, the slave will restart replication from an incorrect position and so will replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it’s not since there is no binlogging of MySQL::Replication queries). As updates are done within the same transaction as replicated queries are executed in, writes are completely atomic. Once a slave dies and comes back online, we are still in a consistent state since either the transaction was committed or it will be rolled back. It’s a nice place to be in.

MySQL::Replication – multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.

Posted in Advent 2014, MySQL. Comments Off

Dec 4: Standalone Mail Servers

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We’ve written a lot about our slots/stores architecture before – so I’ll refer you to our documentation rather than rehashing the details here.

We have evolved over the years, and particularly during the Opera time, I had to resist the forces suggesting a “put all your storage on a SAN and your processing on compute nodes” design, or “why don’t you just virtualise it”, as if that’s a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: Berkeley University had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capability limits of their hardware that their mail architecture couldn’t handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, this meant every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google – search for “Berkeley Cyrus” and you’ll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade…

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won’t spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I’ve been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it’s worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been one of those things for us.

Early hardware – meta data on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas, we had backups, but it took a week to get everyone’s email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn’t have access to their old email. We didn’t want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks – 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! – massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way – standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box – I won’t talk any more about these, they’re all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.

We also added checksums to verify data integrity after various corruptions were detected between replicas which showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) – and tooling to allow the damage to be fixed by pulling back the affected email from either a replica or the backup system and restoring it into place.

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days) cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use SSDs, but cyrus.cache was just too big for the SSDs we could afford. It’s also only used for search and some sort commands, but the architecture of Cyrus meant that you had to MMAP the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There’s another small directory which contains the global databases – mailboxes database, seen and subscription files for each user, sieve scripts, etc. It’s a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year and concentrated on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which actually make it a robust QRESYNC/CONDSTORE compatible server (new standards which required stronger guarantees than the cyrus 2.3 datastructures could provide), and also meant that cache wasn’t loaded until it was actually needed.

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera’s sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along. Search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that almost all users get enough email in a week that their initial mailbox view is going to be able to be generated from a week’s worth of email.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree, it’s not in upstream Cyrus. I plan to add it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD, and some not.

We purchased larger SSDs (Intel DC3700 – 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called ‘t15′, because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a “teraslot”,
as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 drives in a RAID6 for 16Tb available space, 1Tb for operating system and 15 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/sdb1 917G 691G 227G 76% /mnt/i14t01
/dev/mapper/sdb2 917G 588G 329G 65% /mnt/i14t02
/dev/mapper/sdb3 917G 789G 129G 86% /mnt/i14t03
/dev/mapper/sdb4 917G 72M 917G 1% /mnt/i14t04
/dev/mapper/sdb5 917G 721G 197G 79% /mnt/i14t05
/dev/mapper/sdb6 917G 805G 112G 88% /mnt/i14t06
/dev/mapper/sdb7 917G 750G 168G 82% /mnt/i14t07
/dev/mapper/sdb8 917G 765G 152G 84% /mnt/i14t08
/dev/mapper/sdb9 917G 72M 917G 1% /mnt/i14t09
/dev/mapper/sdb10 917G 800G 118G 88% /mnt/i14t10
/dev/mapper/sdb11 917G 755G 163G 83% /mnt/i14t11
/dev/mapper/sdb12 917G 778G 140G 85% /mnt/i14t12
/dev/mapper/sdb13 917G 789G 129G 87% /mnt/i14t13
/dev/mapper/sdb14 917G 783G 134G 86% /mnt/i14t14
/dev/mapper/sdb15 917G 745G 173G 82% /mnt/i14t15
/dev/mapper/sdc1 1.8T 977G 857G 54% /mnt/i14search
/dev/md0 367G 248G 120G 68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I’ll talk more about this when we get to the confidentiality post in the security series.

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller:


And 12 drives in the back connected directly to the motherboard SATA:


The front drives are divided into two lots of 2Tb x 12 drive RAID6 sets, for 20 teraslots each.

In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS – we have tons of old 500Gb drives, so we may well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn’t so important, but cache certainly is!

Here’s a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:

servername: sloti33d1t01

configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve

duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db

partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive

tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive

The disks themselves – we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:

$ utils/
1 - HDD 500G RDY sdc 3QG023NC
2 - HDD 500G RDY sdd 3QG023TR
3 E SSD 400G RDY sde md0/0 BTTV332303FA400HGN
4 - HDD 2T RDY sdf md2/0 WDWMAY04568236
5 - HDD 2T RDY sdg md3/0 WDWMAY04585688
6 E SSD 400G RDY sdh md0/1 BTTV3322038L400HGN
7 - HDD 2T RDY sdi md2/1 WDWMAY04606266
8 - HDD 2T RDY sdj md3/1 WDWMAY04567563
9 E SSD 400G RDY sdk md1/0 BTTV323101EM400HGN
10 - HDD 2T RDY sdl md2/2 WDWMAY00250279
11 - HDD 2T RDY sdm md3/2 WDWMAY04567237
12 E SSD 400G RDY sdn md1/1 BTTV324100F9400HGN

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
# Name Raid Name Level Capacity Ch/Id/Lun State
1 i33d1spool i33d1spool Raid6 20000.0GB 00/00/00 Normal
2 i33d2spool i33d2spool Raid6 20000.0GB 00/01/00 Normal
GuiErrMsg: Success.
$ utils/cli64 disk info
# Enc# Slot# ModelName Capacity Usage
1 01 Slot#1 N.A. 0.0GB N.A.
2 01 Slot#2 N.A. 0.0GB N.A.
3 01 Slot#3 N.A. 0.0GB N.A.
4 01 Slot#4 N.A. 0.0GB N.A.
5 01 Slot#5 N.A. 0.0GB N.A.
6 01 Slot#6 N.A. 0.0GB N.A.
7 01 Slot#7 N.A. 0.0GB N.A.
8 01 Slot#8 N.A. 0.0GB N.A.
9 02 Slot 01 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
10 02 Slot 02 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
11 02 Slot 03 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d1spool
12 02 Slot 04 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
13 02 Slot 05 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
14 02 Slot 06 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
15 02 Slot 07 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
16 02 Slot 08 WDC WD2003FYYS-02W0B0 2000.4GB i33d1spool
17 02 Slot 09 WDC WD2000F9YZ-09N20L0 2000.4GB i33d1spool
18 02 Slot 10 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
19 02 Slot 11 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
20 02 Slot 12 WDC WD2003FYYS-02W0B1 2000.4GB i33d1spool
21 02 Slot 13 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
22 02 Slot 14 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
23 02 Slot 15 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
24 02 Slot 16 WDC WD2000FYYZ-01UL1B0 2000.4GB i33d2spool
25 02 Slot 17 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
26 02 Slot 18 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
27 02 Slot 19 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
28 02 Slot 20 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
29 02 Slot 21 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
30 02 Slot 22 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
31 02 Slot 23 WDC WD2002FYPS-01U1B1 2000.4GB i33d2spool
32 02 Slot 24 WDC WD2003FYYS-02W0B0 2000.4GB i33d2spool
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don’t have enough hardware!

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/md2 2.7T 977G 1.8T 36% /mnt/i33d1search
/dev/mapper/md3 2.7T 936G 1.8T 35% /mnt/i33d2search
/dev/mapper/sda1 917G 730G 188G 80% /mnt/i33d1t01
/dev/mapper/sda2 917G 805G 113G 88% /mnt/i33d1t02
/dev/mapper/sda3 917G 709G 208G 78% /mnt/i33d1t03
/dev/mapper/sda4 917G 684G 234G 75% /mnt/i33d1t04
/dev/mapper/sda5 917G 825G 92G 91% /mnt/i33d1t05
/dev/mapper/sda6 917G 722G 195G 79% /mnt/i33d1t06
/dev/mapper/sda7 917G 804G 113G 88% /mnt/i33d1t07
/dev/mapper/sda8 917G 788G 129G 86% /mnt/i33d1t08
/dev/mapper/sda9 917G 661G 257G 73% /mnt/i33d1t09
/dev/mapper/sda10 917G 799G 119G 88% /mnt/i33d1t10
/dev/mapper/sda11 917G 691G 227G 76% /mnt/i33d1t11
/dev/mapper/sda12 917G 755G 162G 83% /mnt/i33d1t12
/dev/mapper/sda13 917G 746G 172G 82% /mnt/i33d1t13
/dev/mapper/sda14 917G 802G 115G 88% /mnt/i33d1t14
/dev/mapper/sda15 917G 159G 759G 18% /mnt/i33d1t15
/dev/mapper/sda16 917G 72M 917G 1% /mnt/i33d1t16
/dev/mapper/sda17 917G 706G 211G 78% /mnt/i33d1t17
/dev/mapper/sda18 917G 72M 917G 1% /mnt/i33d1t18
/dev/mapper/sda19 917G 72M 917G 1% /mnt/i33d1t19
/dev/mapper/sda20 917G 72M 917G 1% /mnt/i33d1t20
/dev/mapper/sdb1 917G 740G 178G 81% /mnt/i33d2t01
/dev/mapper/sdb2 917G 772G 146G 85% /mnt/i33d2t02
/dev/mapper/sdb3 917G 797G 120G 87% /mnt/i33d2t03
/dev/mapper/sdb4 917G 762G 155G 84% /mnt/i33d2t04
/dev/mapper/sdb5 917G 730G 187G 80% /mnt/i33d2t05
/dev/mapper/sdb6 917G 803G 114G 88% /mnt/i33d2t06
/dev/mapper/sdb7 917G 806G 112G 88% /mnt/i33d2t07
/dev/mapper/sdb8 917G 786G 131G 86% /mnt/i33d2t08
/dev/mapper/sdb9 917G 663G 254G 73% /mnt/i33d2t09
/dev/mapper/sdb10 917G 776G 142G 85% /mnt/i33d2t10
/dev/mapper/sdb11 917G 743G 174G 82% /mnt/i33d2t11
/dev/mapper/sdb12 917G 750G 168G 82% /mnt/i33d2t12
/dev/mapper/sdb13 917G 743G 174G 82% /mnt/i33d2t13
/dev/mapper/sdb14 917G 196G 722G 22% /mnt/i33d2t14
/dev/mapper/sdb15 917G 477G 441G 52% /mnt/i33d2t15
/dev/mapper/sdb16 917G 539G 378G 59% /mnt/i33d2t16
/dev/mapper/sdb17 917G 72M 917G 1% /mnt/i33d2t17
/dev/mapper/sdb18 917G 72M 917G 1% /mnt/i33d2t18
/dev/mapper/sdb19 917G 72M 917G 1% /mnt/i33d2t19
/dev/mapper/sdb20 917G 72M 917G 1% /mnt/i33d2t20
/dev/md0 367G 301G 67G 82% /mnt/ssd33d1
/dev/md1 367G 300G 67G 82% /mnt/ssd33d2

Some more copy’n’paste:

$ free
total used free shared buffers cached
Mem: 198201540 197649396 552144 0 7084596 120032948
-/+ buffers/cache: 70531852 127669688
Swap: 2040248 1263264 776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size…

$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very low on CPU usage. We don’t need a super-powerful CPU to drive this box. The CPU load is always low, it’s mostly IO wait – so we just have a pair of 4 core CPUs.

The future?

We’re in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally “forever”. They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn’t been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.

Posted in Advent 2014, Technical. Comments Off

Dec 3: Push it real good

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 2nd was an intro to our approach to security. The next post on December 4th is about our IMAP server hardware.

Technical level: lots

Bron (one of our team, you’ve heard of him) does this great demo when he shows FastMail to other people. He opens his inbox up on the screen, and gets someone to send him an email. A nice animation slides the new message into view on the screen. At the same time, his phone plays a sound. He uses his watch to delete the email, and it disappears from the screen too. It’s a huge “wow” moment, made possible by our push notification system. Today we’ll talk about exactly how we let you know when something interesting happens in your mailbox.

Cyrus has two mechanisms for telling the world that something has changed, idled and mboxevent. idled is the simpler of the two. When your mail client issues an IDLE command, it is saying “put me to sleep and tell me when something changes”. idled is the server component that manages this, holding the connection open and sending a response when something changes. An example protocol exchange looks like something like this (taken from RFC 2177, the relevant protocol spec):

S: * FLAGS (Deleted Seen)
S: A001 OK SELECT completed
C: A002 IDLE
S: + idling
...time passes; new mail arrives...
S: A002 OK IDLE terminated

It’s a fairly simple mechanism, only designed for use with IMAP. We’ll say no more about it.

Of far more interest is Cyrus’ “mboxevent” mechanism, which is based in part on RFC 5423. Cyrus can be configured to send events to another program any time something changes in a mailbox. The event contains details about the type of action that occurred, identifying information about the message and other useful information. Cyrus generates events for pretty much everything – every user action, data change, and other interesting things like calendar alarms. For example, here’s a delivery event for a system notification message I received a few minutes ago:

 "event" : "MessageNew",
 "messages" : 1068,
 "modseq" : 1777287, 
 "pid" : 2087223,
 "serverFQDN" : "sloti30t01",
 "service" : "lmtp",
 "uidnext" : 40818,
 "uri" : "imap://;UIDVALIDITY=1335827579/;UID=40817",
 "vnd.cmu.envelope" : "(\"Wed, 03 Dec 2014 08:49:40 +1100\" \"Re: Blog day 3: Push it real good\" ((\"Robert Norris\" NIL \"robn\" \"\")) ((\"Robert Norris\" NIL \"robn\" \"\")) ((\"Robert Norris\" NIL \"robn\" \"\")) ((NIL NIL \"staff\" \"\")) NIL NIL \"<>\" \"<>\")",
 "vnd.cmu.mailboxACL" : "\tlrswipkxtecdn\tadmin\tlrswipkxtecdan\tanyone\tp\t",
 "vnd.cmu.mbtype" : "",
 "vnd.cmu.unseenMessages" : 90,
 "vnd.fastmail.cid" : "b9384d25e93fc71c",
 "vnd.fastmail.convExists" : 413,
 "vnd.fastmail.convUnseen" : 82,
 "vnd.fastmail.counters" : "0 1777287 1777287 1761500 1758760 1416223082",
 "vnd.fastmail.sessionId" : "sloti30t01-2087223-1417556981-1"

The event contains all sorts of information: the action that happened, the message involved, how many messages I have, the count of unread messages and conversations, the folder involved, stuff about the message headers, and more. This information isn’t particularly useful in its raw form, but we can use it to do all kinds of interesting things.

We have a pair of programs that do various processing on the events that Cyrus produce. They are called “pusher” and “notifyd”, and run on every server where Cyrus runs. Their names don’t quite convey their purpose as they’ve grown up over time into their current forms.

pusher is the program that receives all events coming from Cyrus. It actions many events itself, but only the ones it can handle “fast”. All other events are handed off to notifyd, which handles the “slow” events. The line between the two is a little fuzzy, but the rule of thumb is that if an event needs to access the central database or send mail, it’s “slow” and should be handled by notifyd. Sending stuff to open connections or making simple HTTP calls are “fast”, and are handled in pusher. I’ve got scare quotes around “slow” and “fast” because the slow events aren’t actually slow. It’s more about how well each program responds when flooded with events (for programmers: pusher is non-blocking, notifyd can block).

We’ll talk about notifyd first, because it’s the simpler of the two. The two main styles of events it handles are calendar email notifications and Sieve (filter/rule) notifications.

Calendar email notifications are what you get when you say, for example, “10 minutes before this event, send me an email”. All the information that is needed to generate an email is placed into the event that comes from Cyrus, including the name of the event, the start and end time, the attendees, location, and so on. notifyd constructs an email and sends it to the recipient. It does database work to look up the recipient’s language settings to try and localise the email it sends. Here we see database and email work, and so it goes on the slow path.

The Sieve filter system we use has a notification mechanism (also see RFC 5435) where you can write rules that cause new emails or SMS messages to be sent. notifyd handles these too, taking the appropriate action on the email (summarise, squeeze) and then sending an email or posting to a SMS provider. For SMS, it needs to fetch the recipient’s number and SMS credit from the database so again, it’s on the slow path.

On to pusher, which is where the really interesting stuff happens. The two main outputs it has are the EventSource facility used by the web client, and the device push support used by the Android and iOS apps.

EventSource is a facility available in most modern web browsers that allow it to create a long-lived connection to a server and receive a constant stream of events. The events we send are very simple. For the above event, pusher would send the following to all web sessions I currently have open:

event: push
id: 1760138
data: {"mailModSeq":1777287,"calendarModSeq":1761500,"contactsModSeq":1758760}

These are the various “modseq” (modification sequence, see also RFC 7162) numbers for mail, calendar and contacts portions of my mail store. The basic idea behind a modseq is that every time something changes, the modseq number goes up. If the client sees the number change, it knows that it needs to request an update from the server. By sending the old modseq number in this request, it receives only the changes that have happened since then, making this a very efficient operation.

If you’re interested, we’ve written about our how we use EventSource in a lot more detail in this blog post from a couple of years ago. Some of the details have changed since then, but the ideas are still the same.

The other thing that pusher handles is pushing updates to the mobile apps. The basic idea is the same. When you log in to one of the apps, they obtain a device token from the device’s push service (Apple Push Notification Service (APNS) for iOS or Google Cloud Messaging (GCM) for Android), and then make a special call to our servers to register that token with pusher. When the inbox changes in some way, a push event is created and sent along with the device token to the push service (Apple’s or Google’s, depending on the token type).

On iOS, a new message event must contain the actual text that is displayed in the notification panel, so pusher extracts that information from the “vnd.cmu.envelope” parameter in the event it received from Cyrus. It also includes an unread count, which is used to update the red “badge” on the app icon, and a URL which is passed to the app when the notification is tapped. An example APNS push event might look like:

  "aps" : {
    "alert" : "Robert Norris\nHoliday pics",
    "badge" : 82,
    "sound" : "default"
  "url" : "?u=12345678#/mail/Inbox/eb26398990c4b29b-f45463209u40683

For other inbox changes, like deleting a message, we send a much simpler event to update the badge:

  "aps" : {
    "badge" : 81

The Android push operates a little differently. On Android it’s possible to have a service running in the background. So instead of sending message details in the push, we only send the user ID.

  "data" : {
    "uparam" : "12345678"

(The Android app actually only uses the user ID to avoid a particular bug, but it will be useful in the future when we support multiple accounts).

On receiving this, the background service makes a server call to get any new messages since the last time it checked. It’s not unlike what the web client does, but simpler. If it finds new messages, it constructs a system notification and displays it in the notification panel. If it sees a message has been deleted and it currently has it visible in the notification, it removes it. It also adds a couple of buttons (archive and delete) which result in server actions being taken.

So that’s all the individual moving parts. If you put them all together, then you get some really impressive results. When Bron uses the “delete” notification action on his watch (an extension of the phone notification system), it causes the app to send a delete instruction the the server. Cyrus deletes the message and sends a “MessageDelete” event to pusher. pusher sends a modseq update via EventSource to the web clients which respond by requesting an update from the server, noting the message is deleted and removing it from the message list. pusher also notices this is an inbox-related change, so sends a new push to any registered Android devices and, because it’s not a “MessageNew” event, sends a badge update to registered iOS devices.

One of the things I find most interesting about all of this is that in a lot of ways it wasn’t actually planned, but has evolved over time. notifyd is over ten years old and existed just to support Sieve notifications. Then pusher came along when we started the current web client and it needed live updates. Calendar notifications came later and most recently, device push. It’s really nice having an easy, obvious place to work with mailbox events. I fully expect that we’ll extend these tools further in the future to support new kinds of realtime updates and notifications.

Posted in Advent 2014, Technical. Comments Off

Dec 1: Email Search System

This blog post is part of the FastMail 2014 Advent Calendar.

The next post on December 2nd is Security – Confidentiality, Integrity and Availability.

Technical level: medium

Our email search system was originally written by Greg Banks, who has moved on to a role at Linked In, so I maintain it now. It’s a custom extension to the Cyrus IMAPd mail server.

Fast search was a core required feature for our new web interface. My work account has over half a million emails in it, and even though our interface allows fast scroll, it’s still impossible to find anything more than a few weeks old without either knowing exactly when it was sent, or having a powerful search facility.

Greg tried a few different engines, and settled on the Xapian project as the best fit for the one-database-per-user that we wanted.

We tried indexing new emails as they arrived, even directly to fast SSDs, and discovered that the load was just too high. Our servers were overloaded trying to index in time – because adding a single email causes a lot of updates.

Luckily, Xapian supports searching from multiple databases at once, so we came up with the idea of a tiered database structure.

New messages get indexed to tmpfs in a small database. A job runs every hour to see if tmpfs is getting too full (over 50% of the defined size), in which case it compacts immediately, otherwise we automatically compact during the quiet time of the day. Compacted databases are more efficient, but read only.

This allows us to index all email immediately, and return a message that arrived just a second ago in your search results complete with highlighted search terms, yet not overload the servers. It also means that search data can be stored on inexpensive disks, keeping the costs of our accounts down.

Technical level: extreme

Here’s some very technical information about how the tiers are implemented, and an example of running the compaction.

We have 4 tiers at FastMail, though we don’t actually use the ‘meta’ one (SSD) at the moment:

  • temp
  • meta
  • data
  • archive

The temp level is on tmpfs, purely in memory. Meta is on SSD, but we don’t use that except during shutdown. Data is the main version, and we re-compact all the data level indexes once per week. Finally archive is never automatically updated, but we build it when users are moved or renamed, or can create it manually.

Both external locking (Xapian isn’t always happy with multiple writers on one database) and the compaction logic are managed via a separate file called xapianactive. The xapianactive looks like this:

% cat /mnt/ssd30/sloti30t01/store23/conf/user/b/brong.xapianactive
temp:264 archive:2 data:37

The first item in the active file is always the writable index – all the others are read-only.

These map to paths on disk according to the config file:

% grep search /etc/cyrus/imapd-sloti30t01.conf
search_engine: xapian
search_index_headers: no
search_batchsize: 8192
defaultsearchtier: temp
tempsearchpartition-default: /var/run/cyrus/search-sloti30t01
metasearchpartition-default: /mnt/ssd30/sloti30t01/store23/search
datasearchpartition-default: /mnt/i30search/sloti30t01/store23/search
archivesearchpartition-default: /mnt/i30search/sloti30t01/store23/search-archive

(the ‘default tier’ is to tell the system where to create a new search item)

So based on these paths, we find.

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
3328 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.264
1520432 /mnt/i30search/sloti30t01/store23/search/b/user/brong/xapian.37
3365336 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.2

I haven’t compacted to archive for a while. Let’s watch one of those. I’m selecting all the tiers, and compressing to a single tier. The process is as follows:

  1. take an exclusive lock on the xapianactive file
  2. insert a new default tier database on the front (in this example it will be temp:265) and unlock xapianactive again
  3. start compacting all the selected databases to a single database on the given tier
  4. take an exclusive lock on the xapianactive file again
  5. if the xapianactive file has changed, discard all our work (we lock against this, but it’s a sanity check) and exit
  6. replace all the source databases for the compact with a reference to the destination database and unlock xapianactive again
  7. delete all now-unused databases

Note that the xapianactive file is only locked for two VERY SHORT times. All the rest of the time, the compact runs in parallel, and both searching on the read-only source databases and indexing to the new temp database can continue.

This allows us to only ever have a single thread compacting to disk, so our search drives are mostly idle, and able to serve
customer search requests very quickly.

When holding an exclusive xapianactive lock, it’s always safe to delete any databases which aren’t mentioned in the file – at worst you will race against another task which is also deleting the same databases, so this system is self-cleaning after any failures.

Here goes:

% time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti30t01.conf -v -z archive -t temp,meta,data,archive -u brong
compressing temp:264,archive:2,data:37 to archive:3 for user.brong (active temp:264,archive:2,data:37)
adding new initial search location temp:265
compacting databases
Compressing messages for brong
done /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3.NEW
renaming tempdir into place
finished compact of user.brong (active temp:265,archive:3)

real 4m52.285s
user 2m29.348s
sys 0m13.948s

% du -s /var/run/cyrus/search-sloti30t01/b/user/brong/* /mnt/i30search/sloti30t01/store23/search/b/user/brong/* /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/*
368 /var/run/cyrus/search-sloti30t01/b/user/brong/xapian.265
du: cannot access `/mnt/i30search/sloti30t01/store23/search/b/user/brong/*': No such file or directory
4614368 /mnt/i30search/sloti30t01/store23/search-archive/b/user/brong/xapian.3

If you want to look at the code, it’s all open source. I push the fastmail branch to github regularly. The xapianactive code is in imap/search_xapian.c and the C++ wrapper in imap/xapian_wrap.cpp.

Posted in Advent 2014, Technical. Comments Off

Updating our SSL certificates to SHA-256

This is a technical post. The important points to take away are that if, like most of our customers, you’re using FastMail’s web client with a modern, regularly updated browser like Chrome, Firefox, Internet Explorer or Safari, then everything will be fine. If you’re using an old browser or operating system (including long-unsupported mobile devices like old Nokia or WebOS devices), it may start failing to connect to FastMail during December, and you’ll need to make changes to the settings you use to access FastMail. Read on for details.

For many years the standard algorithm used to sign SSL certificates has been SHA-1. Recently, weaknesses have been exposed in that algorithm which make it unsuitable for encryption work. It’s not broken yet, but it’s reasonable to expect that it will be broken within the next year or two.

A replacement algorithm is available, called SHA-256 (sometimes called SHA-2), and its been the recommended algorithm for new certificates for the last couple of years.

Back in April, we updated our certificates with new ones that used SHA-256. This caused problems for certain older clients that didn’t have support for SHA-256. After some investigation, we reverted to SHA-1 certificates.

Recently Google announced that they would start deprecating SHA-1 support this year. Chrome 40 (currently in testing, due for release in January) will start showing the padlock icon on as “secure, with minor errors”. Crucially, it will no longer display the green “EV” badge.

As a result, we are intending to update our certificates to SHA-256 during December. Its something we wanted to do back in April anyway, as we’d much prefer to proactively support modern security best practice rather than scramble frantically to fix things when breaches are discovered.

Unfortunately, this will cause problems for customers using older browsers. Most desktop browsers should not have any problem, though Windows XP users will need to update to Service Pack 3. Many more obscure devices (notably Nokia and WebOS devices) do not support SHA-256 at all, and will not be able to connect to us securely.

We will be attempting to support a SHA-1 certificate on and, but only if our certificate authority will agree to issue one to us. Once we have that information I’ll update this post.

If you have any questions about this change, please contact support.

Further reading:

Posted in Technical. Comments Off

Get every new post delivered to your Inbox.

Join 6,295 other followers