Dec 11: FastMail Support

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 10th December was the second security post, on availability. The following post on 12th December is about our multi-master database replication.

Technical level: low

Support system

FastMail has a comprehensive support system. We have a well-written and well-maintained online help system, where you can read in detail about every aspect of the service. This is complemented by our blog (likely where you are reading this), which you can subscribe to, and where you can read about what we are working on, where the service is generally heading, and any recent changes.

For the live running status of the various FastMail services, see our own status page. That page also gives you an idea of how the service has been operating over the past week. If you would like an uptime report on FastMail from an independent third party, there’s always the status report from Pingdom.

But in spite of all that, sometimes you will need to get in touch with a human being who knows the service really well. In that case, you can contact a member of our friendly support team using our ticket system.

Support team and support process

Back in early 2000-ish, I recognised IMAP as The True Path while the mainstream practice was POPanism, and searched far and wide for a good provider. I found FastMail to be the best IMAP provider on the planet, and signed up for an account. And I would say the same thing today – as far as the ‘best IMAP provider’ bit goes. As for ‘The True Path’, I am a born-again IMAPian, and would say JMAP fits the bill these days as we transition from the Desktopian age to the AndroiPhonean age. Our new apps for iPhone and Android use JMAP to do their work, and you can see how well they work, especially on high-latency connections.

But I digress.

So, these were early days, and soon FastMail advertised a tech-support position. I had built a basic email client as a side project, mainly to learn email protocols, so when the opportunity came along I was very keen, applied, and soon landed the job.

I was joined by Vinodh and Yassar around 2008, and from then on they handled front-line requests until recently.

Nowadays, you will find a good chunk of your support ticket requests handled by our new recruit Afsal (and soon by Anto as well), especially during US daytime.

New support tickets are handled by those techs in front-line support. Escalated issues will usually be dealt with by Yassar, and further escalations are handled by me. Yassar and I escalate issues to an engineer on duty, or, if we are sure the issue is related to somebody’s area of expertise, directly to that developer or admin.

Future plans

We plan to provide 24-hour support coverage in the future, and we are working hard towards that. We will extend support coverage to US daytime first, after which we will start extending coverage to other timezones as well. Recruiting and training new people will take time, but we’ll get there eventually.

Most frequently asked questions

As today’s post is about support, I’ll list here the most common questions that we see in support tickets, all of which you can get quick help for in our help documentation.

Remember: there is a wealth of information in our online support system, so that’s a good first place to go to learn about FastMail. But our support team is always at hand, should you have questions!

See you again!

It is very satisfying to engage with our users and help them make the most of their FastMail accounts. Sometimes it’s a 75-year-old grandma in the US asking how best to share her new recipes with a select set of contacts (think address book groups), and sometimes it’s an uber-geek from across the globe asking what scheme we use to hash our passwords (we use bcrypt)! Our customers are diverse, their questions interesting, and the experience satisfying!

FastMail has seen tremendous growth in the past few years, and we are working hard on scaling our support team to match. You should see the results of this work in the months to come.


Dec 10: Security – Availability

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 9th December was about email authentication. The following post is from our support team.

Technical level: medium

Availability is the ability for authorised users to gain access to their data in a timely manner.

While both Confidentiality and Integrity are important, they are not noticed unless something goes wrong. Availability, on the other hand, is super visible. If you have an outage, it will be all through the media – like Yahoo, or Microsoft, or Google, or indeed FastMail.

Availability at FastMail

Our record really speaks for itself: our public Pingdom page and status page show how reliably available we are. FastMail has great uptime.

We achieve this by reducing single points of failure, and having all data replicated in close to real time.

I was in New York earlier this year consolidating our machines, removing some really old ones, and also moving everything to new cabinets which have a better cooling system and more reliable power. We had out-grown the capacity in our existing cabinets, and didn’t have enough power to run completely on just half our circuits any more.

Our new cabinets have redundant power – a strip up each side of the rack – and every server is wired to both strips, and able to run from just one. Each strip has the capacity to run the entire rack by itself.

[Photos: power feeds into the rack PDUs, and rear-of-rack cabling]

The servers are laid out in such a way that we can shut down any one cabinet. In fact, we can shut down half the cabinets at a time without impacting production users. In 2014 it’s not such a big deal to be able to reinstall any one of your machines in just a few minutes – but in 2005 when we switched to fully automated installation of all our machines, only a few big sites were doing it. For the past few years, we’ve been at the point where we can shut down any machine with a couple of minutes’ notice to move service off it, and users don’t even notice that it’s gone. We can then fully reinstall the operating system.

We have learned some hard lessons about availability over the years. The 2011 incident took a week to recover from because it hit every server at exactly the same time. We couldn’t mitigate it by moving load to the replicas. We are careful not to upgrade everywhere at once any more, no matter how obvious and safe the change looks!

Availability and Jurisdiction

People often ask why we’re not running production out of our Iceland datacentre. We only host secondary MX and DNS there, plus an offsite replica of all data.

While we work hard on the reliability of our systems, a lot of the credit for our uptime has to go to our awesome hosting provider, NYI. They provide rock-solid power and network. To give you some examples:

  • During Hurricane Sandy, when other datacentres were bucketing fuel up the staircases and having outages, we lost power on ONE circuit for 30 seconds. It took out two units which hadn’t been cabled correctly, but they weren’t user facing anyway.
  • We had a massive DDoS attempted against us using the NTP flaw a while ago. NYI blocked just the NTP port to the one host being attacked, and informed us of the attack while they asked their upstream providers to push the block out onto the network to kill it off. Our customers didn’t even notice.
  • They provide 24/7 onsite technical staff. Once when they were busy with another emergency, I had to wait 30 minutes for a response on an issue. The CEO apologised to me personally for having to wait. Normal response times are within 2 minutes.

The only outage we’ve had this year that can be attributed to NYI at all is a 5 minute outage when they switched the network uplink from copper to fibre, and managed to set the wrong routing information on the new link. Five minutes in a year is pretty good.

The sad truth is, we just don’t have the reliability from our Iceland datacentre to provide the uptime that our users expect of us.

  • Network stats to New York: you can see that the only time it dropped below 99.99% was July, when I moved all the servers, and there was the outage on the 26th (actually 5 minutes by my watch). As far as I can tell, the outages on the 31st were actually a Pingdom error rather than a problem at NYI.
  • Network stats to Iceland: ignore the 5 hour outage in August, because that was actually me in the datacentre. We don’t have dual cabinet redundancy there, so I couldn’t keep services up while I replaced parts. Even so, there are multiple outages longer than 10 minutes. These would have been very visible if production users were hosted there. As it is, they just page the poor on-call engineer.

If we were to run production traffic to another datacentre, we would have to be convinced that they provide a similar level of quality to that provided by NYI. Availability is the life-blood of our customers. They need email to be up, all the time.

Human error

Once you get the underlying hardware and infrastructure to the level of reliability we have, the normal cause of problems is human error.

We have put a lot of work this year into processes to help avoid human errors causing production outages. There will be more on the testing process and beta => qa => production rollout stages in a later blog post. We’ve also had to change our development style slightly to deal with the fact that we now have two fully separate instances of our platform running in production – we’ll also blog about that, since it’s been a major project this year.

General internet issues

Of course, the internet itself is never 100% reliable, as our Optus and Vodafone customers in Australia saw recently. Optus was providing a route back from NYI which went through Singtel, and it wasn’t passing packets. There was nothing we could do; we had to wait for Optus to figure out what was wrong and fix it at their end.

We had a similar situation with Virgin Media in the UK back in 2013, but then we managed to route traffic via a proxy in our Iceland datacentre. This wouldn’t have worked for Australia, because traffic from Australia to Iceland travels through New York too.

We are looking at what is required to run up a proxy in Australia for Asia-Pacific region traffic if there are routing problems from this part of the world again. Of course, that depends on the traffic from our proxy being able to get through.

One of the nastiest network issues we’ve ever had was when traffic to/from Iceland was being sent through two different network switches in London, depending on the exact source/destination address pair – and one of the switches was faulty – so only half our traffic was getting through. That one took 6 hours to be resolved. Thankfully, there was no production traffic to Iceland, so users didn’t notice.


Dec 9: Email authentication

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 8th December was about our rich text email editor. The following post on 10th December is the second security post, on availability.

Technical level: medium.

What is it?

Email authentication is a collection of techniques to verify that an email is legitimately from the sender it claims to be from.

History

Back when email was first designed, the world was a much more trusting place. The early internet consisted mainly of government agencies and educational institutions, and the possibility that anyone would forge the sender details in an email simply wasn’t considered.

Although standards have moved on since then, backwards compatibility is always a concern when implementing new ones, and as a result email standards still do not have sender authentication at their core.

In today’s internet, spam and phishing emails are of course a big problem, so email authentication has become something which needs to be addressed, and a number of techniques have been developed to help achieve this. Here is a brief summary of some of the popular ones, and how FastMail uses each one.

Real-time Blackhole List (RBL)

RBL, also known as a DNS-based Blackhole List check, is a method where the IP addresses of inbound connections are checked against a list of known bad servers. This gives an early indication of which mail is more likely to be spam.

In addition to checking incoming mail against these lists, we also monitor our own IP addresses on the lists as an early warning system. When one of our addresses is listed, we take steps to find and remove the problem account, while sending mail via one of our other outgoing servers to minimise disruption to our legitimate users.
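
To make the mechanism concrete, here is a minimal TypeScript sketch of a DNSBL lookup. The list zone dnsbl.example.org is a hypothetical stand-in (real deployments query published lists), and this is not our actual filtering code:

import { promises as dns } from 'node:dns';

// DNSBLs are queried by reversing the IPv4 octets and prepending them to
// the list's zone: 192.0.2.99 -> 99.2.0.192.dnsbl.example.org
async function isListed(ip: string, zone = 'dnsbl.example.org'): Promise<boolean> {
  const name = ip.split('.').reverse().join('.') + '.' + zone;
  try {
    await dns.resolve4(name); // a listed address resolves, typically to 127.0.0.x
    return true;
  } catch {
    return false;             // NXDOMAIN means the address is not on the list
  }
}

isListed('192.0.2.99').then(listed => console.log(listed ? 'listed' : 'not listed'));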

Sender Policy Framework (SPF)

SPF is a domain based authentication system. It allows a recipient to verify that the sending server is authorised to send email for a particular domain.

Domain owners can publish a list of valid addresses in their DNS records, and can suggest how to deal with email which does not come from an address on that list.

Unfortunately, the sender address verified by SPF is not necessarily the sender address that you see in your email client. The address checked by SPF is the return path sent in the SMTP transaction, which may differ from the address in the email’s From: header.

Microsoft attempted to fix this by introducing SenderID, which is similar to SPF but can verify the From address header. However, there were numerous problems with this as a standard and it isn’t widely implemented.

At FastMail we use SPF as one of the many factors when spam filtering incoming mail. For outgoing mail we specify our servers explicitly so that they get a positive score for successful SPF, but also say “?all” to allow for other systems to send from addresses on our domains. If you have your own domain with us, then of course you can set up your own SPF records to be as strict or as liberal as you need.
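
A hypothetical record in that style (not our actual one) looks like this:

example.com.  IN TXT  "v=spf1 ip4:192.0.2.0/24 include:spf.example.net ?all"

Here ip4: explicitly lists permitted sending addresses, include: pulls in another domain’s list, and ?all marks mail from anywhere else as neutral rather than failing it.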

DomainKeys Identified Mail (DKIM)

DKIM is a merger of two older standards: DomainKeys, and Identified Internet Mail. The intention of DKIM is to verify that an email associated with a particular domain was sent by an authorised agent of that domain, and has not been modified since being sent. Based on public key cryptography, a domain owner creates a public and private key pair, publishes the public part, and then uses the private part to sign the body and selected headers of an email. The receiver of an email is then able to check that signature against the public part of the key, to verify that the sender of the email had access to the private part of the key, and is therefore authorised to send email on behalf of that domain.
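
A signed message carries a header along these lines (an illustrative example with a hypothetical domain and selector, and the hash and signature values elided):

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=example.com;
    s=mail; h=from:to:subject:date; bh=...; b=...

The d= tag names the signing domain and s= the selector: the receiver fetches the public key from DNS at mail._domainkey.example.com. h= lists the signed headers, bh= is the hash of the canonicalised body, and b= is the signature itself.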

Problems: Again, there is no stipulation that the domain in the From: header signs the email. It is possible for an email to be signed by any domain and pass basic DKIM checks.

At FastMail we sign all outgoing emails with a key for our messagingengine.com domain, and also with a key for the domain of your email address (e.g. fastmail.com).

If you use your own domain with us, we will automatically sign emails with a DKIM key if you host your DNS with us. We also make it super easy to set up both SPF and DKIM.

For incoming mail, again, we use DKIM as a tool in spam filtering. DKIM is also used to validate official emails from FastMail: we use the DKIM signature combined with the headers to validate that the email was sent by an official FastMail staff account, and add a green tick next to legitimate emails.

Author Domain Signing Practices (ADSP)

ADSP is an extension to DKIM whereby a domain owner can publish a policy stating how email from their domain should be signed. A domain owner can state one of the following:

  1. Legitimate email from this domain may or may not be signed by the domain.
  2. Legitimate email from this domain will be signed by the domain.
  3. Legitimate email from this domain will be signed by the domain, and unsigned email should be discarded.

The domain used for ADSP is the domain in the From header of the email, which is the one the recipient is most likely to see.
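
The policy is published as a DNS TXT record; the three statements above correspond to the values unknown, all, and discardable. A hypothetical record for the strictest policy:

_adsp._domainkey.example.com.  IN TXT  "dkim=discardable"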

Domain-based Message Authentication, Reporting & Conformance (DMARC)

DMARC brings together the SPF and DKIM checks and ties them in with the sender shown in the From address of the email.

In order to be considered ‘DMARC aligned’ an email must pass SPF and DKIM checks, the SPF domain must match the domain of the From address, and at least one DKIM signature must also match that domain. This provides a good level of certainty that the email is not forged.

Domain owners who choose to publish DMARC records can suggest what should be done with messages which do not pass DMARC tests: reject outright, or quarantine (treat as spam).

The reporting part of DMARC is a tool for domain owners rather than end users. Email receivers who fully implement DMARC build reports on email received, and send reports to domain owners who request them (via the published DMARC record). This report shows some basic aggregate information on number of emails received, the servers they were received from, and their SPF and DKIM status. This allows domain owners to discover how their domain is being used, which can then inform decisions on the best SPF/DKIM/DMARC policies to publish.
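
A hypothetical DMARC record combining both the policy and reporting parts (not one of ours):

_dmarc.example.com.  IN TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"

Here p= suggests what receivers should do with failing mail, and rua= is the address that aggregate reports are sent to.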

We are investigating how we can use DMARC to benefit FastMail users. Given the wide range of ways in which our users use our services, it would be bad to publish reject or quarantine policies: the number of legitimate emails which would be blocked would far outweigh any possible benefit. Implementing overly restrictive policies in an environment where email is used in such diverse ways would result in pain for users.

Challenges

The biggest issue by far is Mailing Lists. An email sent to a mailing list will typically be re-sent from the mail server of the list, breaking SPF, and usually has some alterations made to it (such as subject changes or unsubscribe links added), which break DKIM signatures.

It is also fairly common for a third party to send email on your behalf. For example, a company might contract out their support and ticketing system to a third party, and emails from the ticketing system would be sent from the company’s domain. Care needs to be taken to ensure that these emails are also considered in SPF policies, and that the third party is able to properly DKIM-sign these messages.

Third-party senders have been one of the challenges we faced while implementing our green tick and phishing warning system. This blog is hosted by WordPress, and sends email on behalf of FastMail (the blog notifications). We needed to make sure that these emails from WordPress could be identified and validated against the WordPress DKIM signature, check that they were sent from our WordPress blog, and make sure those emails were not marked with the phishing warning box. This needs to be done on a case by case basis, as what identifies emails on WordPress isn’t likely to be the same as what identifies emails on other services such as Twitter.

Another common source of SPF failures is forwarding. If, for example, a user has migrated to FastMail from another provider, and has set up their old provider to forward email to their FastMail address, then we would see the IP address of the forwarding server, not the originating server, and this could result in an SPF failure for an otherwise legitimate email. There are some standards which attempt to address this, such as the Sender Rewriting Scheme (SRS), which involves rewriting the envelope sender to one at the forwarding domain. This fixes the SPF problem, but if the originating domain uses DMARC, then the email will no longer be aligned, as the From addresses no longer match.
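
For example, SRS rewrites the envelope sender of a forwarded message along these lines (with the hash and timestamp fields shown as placeholders):

user@example.com  ->  SRS0=<hash>=<timestamp>=example.com=user@forwarder.example.net

The original address is recoverable and bounces still find their way back, but the envelope sender domain is now the forwarder’s.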

The trouble of course is that phishing emails can be sent from entirely unrelated domains and are still successful. So sender authentication doesn’t always help.

Another problem with email authentication is a misunderstanding of what is being verified. We can take technical steps to verify that an email did come from a legitimate email account, but we can make no claim over how trustworthy the author of that email is. Anybody can purchase a domain, set it up properly with SPF, DKIM, and DMARC, and then use it to send spam or phishing emails. Also any service, including FastMail, faces the problem that accounts can be compromised and used to send bad content. Detecting and dealing with this is a whole other blog post.


Dec 8: Squire: FastMail’s rich text editor

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 7th December was about automated installation. The following post on 9th December is on email authentication.

Technical level: low-medium.

We’re going to take a break from talking about our backend infrastructure in this post and switch over to discussing our webmail.

In the beginning, there was text. And really, it was pretty good. You could *emphasise* things, SHOUT AT PEOPLE, and generally convey the nuance of what you had to say. But then came HTML email. Now you could make big bold statements, or small interesting asides. Your paragraphs were no longer hard-wrapped, but instead flowed according to the size of your screen. Despite some grumblings from a dedicated band of luddites (including a few of the FastMail team :-)), most people decided that this was, in fact, better.

To support rich text editing in our previous interface, we used CKEditor. While not a bad choice, like most other editors out there it was designed for creating websites, not writing emails. As such, simply inserting an image by default presented a dialog with three tabs and more options than you could believe possible. Meanwhile, support for quoting, crucial in email, was severely limited. It also came with its own UI toolkit and framework, which we would have had to heavily customise to fit in with the rest of the new UI we were building – a pain to maintain.

With our focus on speed and performance, we were also concerned about the code size. The version of CKEditor we use for our previous (classic) UI, which only includes the plugins we need, is a 159 KB download (when gzipped; uncompressed it’s 441 KB). That’s just the code, excluding styles and images. To put this in perspective, in the current interface the combined code weight required to load the whole compose screen, including our awesome base library (more on that in a future post…), the mail/contacts model code and all the UI code to render the entire screen comes to only 149.4 KB (459.7 KB uncompressed).

After considering various options, we therefore decided to strike out on our own and wrote Squire.

Making a rich text editor is notoriously difficult due to the fact that different browsers are extremely inconsistent in this area. The APIs were all introduced by Microsoft back in the IE heyday, and were then copied by the other vendors in various incompatible ways. The result of applying document.execCommand to simply bold the selected text is likely to be different in every browser you try.

To deal with this, most rich text editors execute a command, then try to clean up the mess the browser created. With Squire, we neatly bypass this by simply not using the browser’s built-in commands. Instead, we manipulate the DOM directly, only using the selection and range APIs. This turns out to be easier and require less code than letting the browser do any of the work!

For example, to bold some text, we use the following simple algorithm (actually, this applies more generally to any inline style, such as setting a font or colour too); a simplified sketch in code follows the list:

  1. Iterate through the text nodes in the DOM that are part of the current selection.
  2. For each text node, check if it’s already got a parent <b> tag. If it does, there’s nothing to do. If not, create a new <b> element and wrap the text node in it. If the text node was only partially in the selection, split it first so only the selected part gets wrapped.
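
Here is what that might look like as a TypeScript sketch. This is an illustration of the approach, not Squire’s actual code – the real editor handles many more edge cases:

// Apply <b> to the current selection by direct DOM manipulation.
function boldRange(range: Range): void {
  const root = range.commonAncestorContainer;
  const selected: Text[] = [];
  if (root.nodeType === Node.TEXT_NODE) {
    selected.push(root as Text); // selection within a single text node
  } else {
    // Step 1: collect the text nodes that are part of the selection.
    const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    for (let node = walker.nextNode(); node; node = walker.nextNode()) {
      if (range.intersectsNode(node)) selected.push(node as Text);
    }
  }
  // Step 2: wrap each node, splitting partially selected nodes first.
  for (let text of selected) {
    if (text === range.endContainer && range.endOffset < text.length) {
      text.splitText(range.endOffset);           // detach the unselected tail
    }
    if (text === range.startContainer && range.startOffset > 0) {
      text = text.splitText(range.startOffset);  // keep only the selected part
    }
    if (text.parentElement?.closest('b')) continue; // already bold: nothing to do
    const b = document.createElement('b');
    text.replaceWith(b);
    b.appendChild(text);
  }
}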

I’d like to give a quick shout out to the under-appreciated TreeWalker API for iterating through the DOM. True story: when first developing Squire, I came across a bug in Opera’s TreeWalker implementation. The first comment on my report from the Presto developer team (this was pre-WebKit days at Opera) was, and I quote verbatim, “First TreeWalker bug ever. First TreeWalker usage ever? :)”. Sadly, due to the lack of common use, other browsers have also had the occasional bug with this API too, so to be on the safe side I just reimplemented the bits of the API I needed in JavaScript. The idea is sound though.

Squire also completely takes over certain keys that are handled badly by default, such as enter and delete. This lets us get a consistent result, and allows us to add the features we want, such as breaking nested quotes if you hit enter on a blank line. And of course we’ve added our own keyboard shortcuts too for actions like changing quote level or starting a bullet list.

At only 11.5 KB of JavaScript after minification and gzip (34.7 KB uncompressed) and with no dependencies, Squire is extremely lightweight. If you’re building your own webmail client, or something else that needs to be able to edit rich text, give it a go! Squire is MIT licensed and available on GitHub.


Dec 7: Automated installation

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 6th was about how we authenticate users. The following post is on our rich text email editor.

Technical level: medium

Any server in our entire system can be reinstalled in under 10 minutes, without users being aware.

That was a very ambitious goal in 2004 when I started at FastMail. DevOps was in its infancy, and the tools available to us weren’t very good yet – but the alternative of staying with what we had was not scalable. Every machine was hand-built by following a “script” – a set of instructions on our internal wiki – and hoping you got every step perfect. Every machine was slightly different.

We chose Debian Linux as the base system for the new automated installs, using FAI to install the machines from network boot to a full operating system with all our software in place.

Our machines are listed by name in a large central configuration file, which maps from the hardware addresses of the ethernet cards (easy to detect during startup, and globally unique) to a set of roles. The installation process uses those roles to decide what software to install.
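
As an illustration, the mapping looks something like this (the format and values here are made up, not our actual file):

# MAC address        hostname  roles
00:16:3e:2a:4f:01    imap4     backend,imap,search
00:16:3e:2a:4f:02    web2      compute,web
00:16:3e:2a:4f:03    mx1       compute,mx,spamscan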

Aside: the three types of data

I am a believer in this taxonomy, which splits data into three different types for clarity of thinking:

  1. Own Output – creative effort you have produced yourself. In theory you can reproduce it, though anyone who has lost hours of work to a crash knows just how disheartening it can be to repeat yourself.
  2. Primary Copy of others’ creative output. Unreproducible. Lose this, it’s gone forever.
  3. Secondary Copy of anything. Cache. Can always be re-fetched.

There is a bit of blurring between categories in practice, particularly as you might find secondary copies in a disaster and get back primary data you thought was lost. But for planning, these categories are very valuable for deciding how to care for the data.

  • Own Output – stick it in version control. Always.

    Since the effort that goes into creating is so high compared to data storage cost, there’s no reason to discard anything, ever. Version control software is designed for precisely this purpose.

    The repository then becomes a Primary Copy, and we fall through to:

  • Primary Copy – back it up. Replicate it. Everything you can to ensure it is never lost. This stuff is gold.

    In FastMail’s case as an email host, it is other people’s precious memories. We store emails on RAIDed drives with battery-backed RAID units on every backend server, and each copy is replicated to two other servers, giving a total of 3 copies, each on RAID1 – 6 disks in total, each holding a full copy of every message.

    One of those copies is in a datacentre a third of the distance around the world from the other two.

    On top of this, we run nightly backups to a completely separate format on a separate system.

  • Secondary Copy – disposable. Who cares. You can always get it again.

    Actually, we do keep backups of Debian package repositories for every package we use just in case we want to reinstall and the mirror is down. And we keep a local cache of the repository for fast reinstalls in each datacentre too.

It’s amazing how much stuff on a computer is just cache. For example, operating system installs. It is so frustrating, when installing a home computer, how intermingled the operating system and updates (all cache) become with your preference selections and personal data (own creative output or primary copy). You find yourself backing up a lot more than is strictly necessary.

Operating system is purely cache

We avoid the need to do full server backups at FastMail by never changing config files directly. All configuration goes in version-controlled template files. No ifs, no buts. It took a while to train ourselves into good habits here – reinstalling frequently, and throwing out anything that wasn’t stored in version control, until everyone got the hint.

The process of installing a machine is a netboot with FAI which wipes the system drive, installs the operating system, and then builds the config from git onto the system and reboots ready-to-go. This process is entirely repeatable, meaning the OS install and system partition is 100% disposable, on every machine.

If we were starting today, we would probably build on puppet or one of the other automation toolkits that didn’t exist or weren’t complete enough when I first built this. Right now we still use Makefiles and perl’s Template-Toolkit to generate the configuration files. You can run make diff on each configuration directory to see what’s different between a running machine and the new configuration, then make install to upgrade the config files and restart the related service. It works fine.
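
A typical config change then looks something like this (the directory name is illustrative):

$ cd conf/nginx        # hypothetical configuration directory
$ vi nginx.conf.tmpl   # edit the version-controlled template
$ make diff            # compare the generated config against what’s running
$ make install         # install the new config files and restart the service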

It doesn’t matter what exact toolkit is used to automate system installs, so long as it exists. It’s the same process regardless of whether we just want to reinstall to ensure a clean system, are recovering from a potential system compromise, are replacing failed hardware, or we have new machines to add to our cluster.

User data on separate filesystems

Most of our machines are totally stateless. They perform compute roles, generating web pages, scanning for spam, routing email. We don’t store any data on them except cached copies of frequently accessed files.

The places that user data are stored are:

  • Email storage backends (of course!)
  • File storage backends
  • Database servers
  • Backup servers
  • Outbound mail queue (this one is a bit of a special case – email can be held for hours because the receiving server is down, misconfigured, or temporarily blocking us. We use drbd between two machines for the outbound spool, because postfix doesn’t like it when the inode changes)

The reinstall leaves these partitions untouched. We create data partitions using either LUKS or the built-in encryption of our SSDs, and then create partitions with labels so they can be automatically mounted. All the data partitions are currently created with the ext4 filesystem, which we have found to be the most stable and reliable choice on Linux for our workload.
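
For example, a labelled data partition can be mounted automatically with an fstab entry along these lines (the label and mount point are illustrative):

# label set at filesystem creation time with: mkfs.ext4 -L data1 /dev/sdb1
LABEL=data1  /mnt/data1  ext4  defaults,noatime  0  2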

All data is on multiple machines

We use different replication systems for different data. As mentioned in the Integrity post, we use an application-level replication system for email data, so we can get strong integrity guarantees. We use a multi-master replication system for our MySQL database, which we will write about in this series as well. I’d love to write about the email backup protocol too, but I’m not sure I’ll have time in this series! And the file storage backend is another protocol again.

The important thing is that every type of data is replicated over multiple machines, so with just a couple of minutes’ notice you can take a machine out of production and reinstall or perform maintenance on it (the slowest part of shutting down an IMAP server these days is copying the search databases from tmpfs to real disk, so we don’t have to rebuild them after the reboot).

Our own work

We use the git version control system for all our own software. When I started at FastMail we used CVS; we converted to Subversion, and then finally to Git.

We have a reasonably complex workflow, involving per-host branches, per-user development branches, and a master branch where everything eventually winds up. The important thing is that nothing is considered “done” until it’s in git. Even for simple one-off tasks, we will write a script and archive it in git for later reference. The amount of code that a single person can write is so small these days compared to the size of disks that it makes sense to keep everything we ever do, just in case.


Dec 6: User authentication

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 5th was about the importance of data integrity. The following post on December 7th is about how we install servers.

Technical level: medium/high

Today we talk about the internals of our authentication system, which is how we decide that you are who you say you are and from there, figure out what you’re allowed to do.

On every server we have a service running called “saslperld”. As with many of our internal services, its name is now only loosely related to its function. saslperld is our authentication service, and exists to answer the question “can this user, with this password, access this service?”.

Making an authentication request

Each piece of server software we have has some way of calling out to an external service for authentication information. Each server tends to implement its own protocol for doing this, so saslperld implements a range of different methods to receive authentication questions.

The simplest of these is the original saslauthd interface, used by Cyrus. It has three fields – username, password, and service name – and returns a simple yes/no answer. These days it’s barely used, and only really by internal services, because it can’t be extended, so we can’t pass in other interesting information about the authentication attempt (which I’ll talk about further down).

The real workhorse is the HTTP interface, used by nginx. Briefly, nginx is our frontend server through which all user web and mail connections go. It takes care of authentication before anything ever touches a backend server. Since it’s fundamentally a system for web requests, its callouts are all done over HTTP. That’s useful, because it means we can demonstrate an authentication handshake with simple command-line tools.

Here’s a successful authentication exchange to saslperld, using curl:

$ curl -i -H 'Auth-User: robn@fastmail.fm' -H 'Auth-Pass: ********' -H 'Auth-Protocol: imap' http://localhost:7777
HTTP/1.0 200 OK
Auth-Pass: ********
Auth-Port: 2143
Auth-Server: 10.202.80.1
Auth-Status: OK
Auth-User: robn@fastmail.fm

The response here is interesting. Because we can use any HTTP headers we like in the request and the response, we can return other useful information. In this case, `Auth-Status: OK` is the “yes, the user may login” response. The other stuff helps nginx decide where to proxy the user’s connection to. Auth-Server and Auth-Port are the location of the backend Cyrus server where my mail is currently stored. In a failover situation, these can be different. Auth-User and Auth-Pass are a username and password that will work for the login to the backend server. Auth-User will usually, but not always, be the same as the username that was logged in with (it can be different on a couple of domains that allow login with aliases). Armed with this information, nginx can set up the connection and then let the mail flow.

The failure case has a much simpler result:

HTTP/1.0 200 OK
Auth-Status: Incorrect username or password.
Auth-Wait: 3

Any status that isn’t “OK” is an error string to return to the user (if possible, not all protocols have a way to report this). Auth-Wait is a number of seconds to pause the connection before returning a result. That’s a throttling mechanism to help protect against password brute-forcing attacks.

If the user’s backend is currently down, we return:

HTTP/1.0 200 OK
Auth-Status: WAIT
Auth-Wait: 1

This tells nginx to wait one second (“Auth-Wait: 1”) and then retry the authentication. This allows the blocking saslperld daemon to answer other requests while not returning a response to the user. It is what we use when doing a controlled failover between backend servers, so there is no user-visible downtime even though we shut down the first backend and then force replication to complete before allowing access to the other backend.
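
To make the protocol concrete, here is a toy callout responder in TypeScript. This is purely illustrative – saslperld itself is Perl, and the real checks are far more involved – with the user, password, and backend address as hardcoded stand-ins:

import http from 'node:http';

// Answer nginx-style Auth-* request headers with Auth-* response headers.
http.createServer((req, res) => {
  const user = req.headers['auth-user'];
  const pass = req.headers['auth-pass'];
  if (user === 'demo@example.com' && pass === 'secret') {
    res.writeHead(200, {
      'Auth-Status': 'OK',        // yes, the user may log in
      'Auth-Server': '10.0.0.1',  // backend to proxy this connection to
      'Auth-Port': '2143',
      'Auth-User': user,          // credentials that work on the backend
      'Auth-Pass': pass,
    });
  } else {
    res.writeHead(200, {
      'Auth-Status': 'Incorrect username or password.',
      'Auth-Wait': '3',           // throttle before reporting the failure
    });
  }
  res.end();
}).listen(7777);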

This is a simple example. In reality we pass some additional headers in, including the remote IP address, whether the connection is on an SSL port or not, and so on. This information contributes to the authentication result. For example, if we’ve blocked a particular IP, then we will always return “auth failed”, even if the user could otherwise have logged in. There’s a lot of flexibility in this. We also do some rate tracking and limiting based on the IP address, to protect against misbehaving clients and other things. This is all handled by another service called “ratetrack” (finally, something named correctly!) which all saslperlds communicate with. We won’t talk about that any more today.

There are a couple of other methods available for making an authentication request, but they’re quite specialised and not as powerful as the HTTP method. We won’t talk about those because they’re really not that interesting.

Checking the password

Once saslperld has received an authentication request, it first has to make sure that the correct password has been provided for the given username. That should be simple, but our alternate login system can make it quite involved.

The first test is the simplest – make sure the user exists! If it doesn’t, obviously authentication fails.

Next, we check the provided password against the user’s “master” password. There’s nothing unusual here, it’s just a regular password compare (we use the bcrypt function for our passwords). If it succeeds, which it does for most users, as most only have a single master password set, then the authentication succeeds.
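
A minimal sketch of that comparison, using the Node bcryptjs library purely for illustration (saslperld itself is Perl):

import bcrypt from 'bcryptjs';

// bcrypt re-hashes the supplied password using the salt and cost factor
// embedded in the stored hash, so equal hashes mean the password matches.
async function checkMasterPassword(supplied: string, storedHash: string): Promise<boolean> {
  return bcrypt.compare(supplied, storedHash);
}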

If it fails, we look at the alternate logins the user has configured. For each one available, we do the appropriate work for its type. For most of these, the provided password is a base password plus some two-factor token. We check the base password, and then perform the appropriate check against the token – for example, a Yubikey server check, or a comparison against the list of generated one-time passwords, and so on. The SMS 1-hour token is particularly interesting: we add a code to our database, SMS the code to the user, and then fail the authentication. When the user then uses the code, we do a complete one-time-password check.

At this point if any of the user’s authentication methods have succeeded, we can move on to the next step. Otherwise, authentication has failed, and we report this back to the requesting service and it does whatever it does to report that to the user.

Authorising the user

At this point we’ve verified that the user is who they say they are. Now we need to find out if they’re allowed to access the service they asked for.

First, we do some basic sanity checking on the request. For example, if you’ve tried to do an IMAP login to something other than mail.messagingengine.com, or you try to do a non-SSL login to something that isn’t on the “insecure” service, then we’ll refuse the login with an appropriate error. These don’t tend to happen very often now that we have separate service IPs for most things, but the code is still there.

Next, we check if the user is allowed to login to the given service. Each user has a set of flags indicating which services they’re allowed to login to. We can set these flags on a case-by-case basis, usually in response to some support issue. If the user is not explicitly blocked in this way, we then check their service level to see if the requested service is allowed at that service level. A great example here is CalDAV, which is not available to Lite accounts. An attempt by a Lite user to do a CalDAV login will fail at this point. Finally, we make sure that the service is allowed according to the login type. This is how “restricted” logins are implemented – there’s a list of “allowed” services for restricted accounts, and the requested service has to be in that list.

(It’s here that we also send the “you’re not allowed to use that service” email).

Once we’ve confirmed that the user is allowed to access the requested service we do a rate limit check as mentioned above. If that passes, the user is allowed in, and we return success to the requesting service.

The rest

To make things fast, we cache user information and results quite aggressively, so we can decide what to do very quickly. The downside of this is that if you change your password, add an alternate login method or upgrade your account all the saslperlds need to know this and refresh their knowledge of you. This is done by the service that made the change (usually a web server or some internal tool) by sending a network broadcast with the username in it. saslperld is listening for this broadcast and drops its cache when it receives it.

There’s not much else to say. It’s a single piece of our infrastructure that does a single task very well, with a lot of flexibility and power built in. We have a few other services in our infrastructure that could be described similarly. We’ll be writing more about some of those this month.


Dec 5: Security – Integrity

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 4th was about how we build our mail servers. The next post on December 6th is about how we authenticate users.

Technical level: medium

On Tuesday I started this series of posts on security with an overview of the elements of security: Confidentiality, Integrity and Availability.

Integrity is, in my opinion, the most important part of security when it comes to email, so I’m starting there.

I believe that email is your electronic memory. I spoke about this at Oslo University back in 2011, where I answered the question “is email dead” with the following points:

  • Compatibility
  • Unchangeable
  • Privacy
  • Business / Orders / Receipts

Email is built on standards. It’s the world’s most interoperable network.

Once you get an email, it’s your own immutable copy. Your own private immutable copy. It can’t be retracted or edited; all the sender can do is send you a new email asking you to disregard the last one. You never get a situation where you remember seeing something, but it doesn’t exist any more. (Unless the email links out to a website – then the content can disappear far too easily.) Forget about diamonds: email is forever.

At that talk, I addressed the idea that social networks with their private-garden messaging systems would replace email. I use social networks – I organise to catch up with friends via Facebook. I even use it to find people to cover classes (my other job: teaching gym classes). But I wouldn’t use social networks for business receipts, or orders, or something I wanted to remember forever. Email is still the gold standard here (unless you have a fax machine).

Recently, I was trying to find an old conversation that a friend and I had via Facebook messages. We have a few from 2007, and then a gap through until 2010. Nothing from the years in between. That whole conversation is lost, because we didn’t copy it anywhere else, and Facebook didn’t keep it.

My email memory goes back to a little before I moved everything to FastMail – because I messed up and lost everything with a stupid mistake in 2002. I don’t expect to ever “forget” anything I’ve received since then.

Email is, by the design of both the POP3 and IMAP protocols, immutable at the message level. You are not allowed to change the contents of a message once it’s been seen by a client.

In the Cyrus mail server, we take this a step further by storing the SHA-1 digest of every raw email message in the index file, which lets us detect any accidental corruption or malicious modification of the file on disk.
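
A sketch of the idea in TypeScript (Cyrus itself is written in C):

import { createHash } from 'node:crypto';

// Digest of the raw message, recorded in the index when the message is
// first appended to the mailbox.
function sha1Hex(raw: Buffer): string {
  return createHash('sha1').update(raw).digest('hex');
}

// Later reads verify the file against the indexed digest; any mismatch
// means the file was corrupted (or tampered with) on disk.
function messageIntact(raw: Buffer, indexedDigest: string): boolean {
  return sha1Hex(raw) === indexedDigest;
}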

Integrity at FastMail

This is where we really shine. We’re fanatical about data integrity – you only hear about it on this blog in the rare cases where something goes wrong.

Even with the nasty bug in 2011 – a single misplaced comma which caused all our disks to fill up super-fast and required a full index reconstruct – we didn’t lose anyone’s email, because we had enough sanity checks in place. It just took a while to rebuild the indexes. We don’t corrupt indexes on full disks any more. In the 2014 bug, we lost a handful of emails for 70 users because of a further bug in handling emails which were expunged at one end of a replica pair and not at the other. That bug is now fixed as well.

If you look at the change history on the Cyrus IMAPd server leading up to the 2.4 release, and even earlier as well, you’ll see us adding integrity checks at every level of the Cyrus data structures. You’ll also see patches to the replication system as we, over months, tracked down every case where it was incomplete or incorrect, and fixed them until our replicas were perfect. Our “checkreplication” script still runs weekly over all users looking for mismatches.

And then for 2.4 we rewrote Cyrus replication completely, to be more efficient so we could have replicas in another country – and to make the replicas also do integrity checks on the data coming over the wire, so you can never replicate a broken mailbox and break the other copy as well.

This is why we replicate at the application level rather than at the filesystem level using something like DRBD – because at the application level we have enough information to ensure that only consistent mailboxes are replicated.

Our backup system also does separate checks, using the same index record, but a completely different piece of code (written in Perl rather than C).

This wasn’t built in response to the idea that some attacker would come in and subtly change your emails (though it does provide some very strong protections against those attacks), it was written in response to risks like the faulty RAM we saw in early 2014, or bugs in the kernel silently corrupting files.

Our backup system is available to all our customers, self serve. Just click on a button and a restore will be run for you in the background. So even if you delete email by mistake, you have at least a week to get it back.

Interestingly, some choices reduce integrity in exchange for other things. For example, storing all email on encrypted filesystems is a risk to data integrity – it’s much more likely that we will lose everything on a filesystem in the face of a partial failure, and data recovery is less possible – so we rely more heavily on replicas and backups. The tradeoff is that if we discard a failed disk, or one of our servers is accidentally sold off on eBay (hey, it’s happened) with user data still on it, then it won’t be readable. It’s a confidentiality vs integrity tradeoff that we are comfortable making.

Integrity and Hosting Jurisdictions

There’s not a huge risk of a government sponsored or other well funded attacker trying to modify your email on our servers. The chance of detection by our regular integrity checking systems is very high (and we can tell the difference between an email with 4096 bytes of garbage where a block was corrupted on disk and one with subtly changed wording), and the benefits are low.

As for the accidental corruption that we do sometimes see – it’s going to come down to dirty vs clean power, and environmental conditions. Temperature fluctuations, humidity, vibration – these are all risks to data integrity, and they are more about a specific datacentre than choice of country. We recently moved our servers to new racks in a cold-containment-aisle area inside our NYI datacentre, which will give consistent cooling up the full height of the rack. All servers have dual power supplies, on two separate circuits, and the power is well filtered by the time it reaches us.

Integrity and The Future

There is one more thing that I want to add to Cyrus to improve integrity even further. At the moment it is possible to fully delete a mailbox or a user on a Cyrus server, and have that delete replicate immediately. In future, I will make it so that it is not possible, even with a single Cyrus server compromised, to permanently delete anything from its replicas. Removing a user will have to be done explicitly on each copy.

I also want to extend the backup system to be something “standard”, at least within the Cyrus world, and open source for everybody. For now it’s quite specific to our systems. A standard interchange format for mailbox archives would make life better for everyone. I have some draft notes from a meeting with David Carter at Cambridge (the original author of the Cyrus replication code), but haven’t finished it yet.

And finally, I want to back up everything else about a user, to the point where it has the same integrity guarantees as the email. Often if someone has deleted their entire account by mistake or let it lapse, we can recover the email – but some of the database-backed items are lost forever.

This will also allow mothballing accounts for cases like a poor fellow I answered a support request for recently. His father had Alzheimer’s Disease and forgot to renew his FastMail account. By the time the son realised that the account had been closed, all email history had been cleaned off our servers. By keeping full backups for a much longer time in the case of payment lapses rather than deliberate account closure, we could save people from losing email in these cases.

As I said at the start, your email is your memory. We take our job of keeping that memory intact very seriously.

Aside: for those concerned about SHA-1 collision attacks – not only is there no known SHA-1 collision at all yet, it’s also very hard to cause a collision by sending an email, because many headers are added between SMTP delivery and final injection into the mailbox, and they are hard to predict. Not impossible, though, which is why we’re working on a series of patches to include a random string in the Received header added by Cyrus.

Finally, it’s possible to change the hash algorithm with a simple index upgrade, verifying each message against its old hash and then calculating the new one. We’ve already done this once, moving from MD5 to SHA-1.
