Dec 14: On Duty!

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 13th December was about our DNS hosting system. The following post on 15th December shows how we load the mailbox screen so quickly.

Technical level: low

BZZZT, BZZZT. Wake up, something is broken! My wife hates the SMS tone on my phone. It triggers bad memories of the middle of the night.

When I started at FastMail in 2004, we already had a great production monitoring system. Like many things, we were doing this well before most internet services. Everyone was on call all the time, and we used two separate SMS services plus the jfax faxing service API to make a phone call (which would squeal at you) just in case the SMSes failed.

The first year was particularly bad – there was a kernel lockup bug which we hadn’t diagnosed. Thankfully Chris Mason debugged it to a problem in reiserfs when there were too many files deleted all at once, and fixed it. Things became less crazy after that.

End-to-end monitoring

Our production monitoring consists of two main jobs, “pingsocket” and “pingfm”. pingsocket runs every 2 minutes; pingfm runs every 10 minutes. There are also a bunch of individual service monitors for replication, which inject data into individual databases or mail stores and check that the same data reaches the replicas in a reasonable timeframe. There are tests for external things like RBL listings and outbound mail queue sizes, and finally tests for server health like temperature, disk failure and power reliability.

pingsocket is the per-machine test. It checks that all services which are supposed to be running, are running. It has some basic abilities to restart services, but mostly it will alert if something is wrong.

pingfm is the full end-to-end test. It logs in via the website and sends itself an email. It checks that the email was delivered correctly and processed by a server-side rule into a subfolder. It then fetches the same email again via POP3 and injects that into yet another folder and confirms receipt. This tests all the major email subsystems and the web interface.
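To make the idea concrete, here is a minimal sketch of an end-to-end probe in the spirit of pingfm. This is not our production code – the hostnames, credentials and timings are placeholders – but it shows the shape of the test: send yourself a uniquely tagged message over SMTP, then poll POP3 until it arrives or a deadline passes.

    #!/usr/bin/perl
    # toy end-to-end mail probe; hostnames and credentials are placeholders
    use strict;
    use warnings;
    use Net::SMTP;
    use Net::POP3;

    my $token = "probe-" . time() . "-$$";   # unique tag for this run

    # 1. send ourselves a tagged message
    my $smtp = Net::SMTP->new('smtp.example.com', Timeout => 30)
        or die "ALERT: SMTP connect failed\n";
    $smtp->mail('probe@example.com');
    $smtp->to('probe@example.com');
    $smtp->data();
    $smtp->datasend("Subject: $token\n\nend-to-end probe\n");
    $smtp->dataend();
    $smtp->quit;

    # 2. poll POP3 until the message shows up, or give up and alert
    my $deadline = time() + 120;
    while (time() < $deadline) {
        if (my $pop = Net::POP3->new('pop.example.com', Timeout => 30)) {
            if (defined $pop->login('probe', 'secret')) {
                my $msgs = $pop->list() || {};          # msgnum => size
                for my $num (keys %$msgs) {
                    my $head = $pop->top($num, 0);      # headers only
                    if (grep { /\Q$token\E/ } @$head) {
                        print "OK: delivered end-to-end\n";
                        exit 0;
                    }
                }
            }
            $pop->quit;
        }
        sleep 5;
    }
    die "ALERT: probe not delivered within 120 seconds\n";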

Any time something breaks that we don’t detect with our monitoring, we add a new test that will check for it. We also think of likely failure modes when we add something new, and test that they are working – for example the calendar service was being tested on every server even before we released it to beta.

Notification levels

The split has always been between “urgent” (SMS) and “non-urgent” (email). Many issues, like disk usage levels, have both a warning level and an alert level. We get emails telling us about the issue first, and only a page if it reaches the alert stage. This is great for getting sleep.

We now have a middle level where it will notify us via our internal IRC channel, so it interrupts more than just an email, but still won’t wake anybody.
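The logic behind the levels is nothing fancy – roughly a pair of thresholds per check. A toy sketch (the threshold numbers and the disk_usage_percent/notify helpers are made up for illustration):

    # two thresholds per check: email at the warning level, page at the alert level
    my %disk = ( warn => 80, alert => 92 );             # percent used (made up)

    my $used = disk_usage_percent('/var/spool/mail');   # hypothetical helper

    if ($used >= $disk{alert}) {
        notify(sms => "disk at ${used}% used");         # urgent: page the on-duty
    }
    elsif ($used >= $disk{warn}) {
        notify(email => "disk at ${used}% used");       # non-urgent: sleep on
    }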

Duty Roster

Until 2011, every alert went to everybody in the company – though it would choose somebody to notify first based on a complex algorithm. If you didn’t want to be paged, you had to turn your phone off. This was hard on anyone who went on holiday, and occasionally we would have everyone go offline at once, which wasn’t great. Once I switched my phone on in Newark Airport to discover that the site was down and nobody else had responded, so I sat perched on a chair with my laptop and fixed things. We knew we needed a better system.

That system is called ‘Lars Botlesen’ or larsbot for short (as a joke on the Opera CEO’s name). The bot has been injected into the notification workflow, and if machines can talk to the bot, they will hand over their notification instead of paging or emailing directly. The bot can then take action.

The best thing we ever added was the 30-second countdown before paging. Every 5 seconds the channel gets a message. Here’s one from last night (with phone numbers and pager keys XXXed out):

<Lars_Botlesen> URGENT gateway2 - [rblcheck: new: 66.111.4.224 dirty, but only one clean IP remaining. Run rblcheck with -f to force] [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n' in 25 secs to brong (XXX)
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n' in 20 secs to brong (XXX)
<brong_> lars ack
<Lars_Botlesen> sms purged: brong (XXX) 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n'
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: OK brong_ acked 1 issues. Ack mode enabled for 30 minutes

We have all set our IRC clients to alert us if there’s a message containing the text LarsSMS – so if anyone is at their computer, they can acknowledge the issue and stop the page. By default the bot stays in acknowledged mode for 30 minutes, suppressing further messages – because the person who has acknowledged is actively working to fix the issue and looking out for new issues as well.
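The countdown itself is conceptually simple. A toy sketch of it (the real bot does much more; irc_say, wait_for_ack and send_sms are hypothetical helpers):

    # announce every 5 seconds; page only if nobody acks within 30 seconds
    sub countdown_and_page {
        my ($oncall, $alert) = @_;
        for (my $left = 30; $left > 0; $left -= 5) {
            irc_say("LarsSMS: '$alert' in $left secs to $oncall");
            if (wait_for_ack(5)) {                  # blocks up to 5s for "lars ack"
                irc_say("sms purged: $oncall '$alert'");
                return;                             # acked: no SMS goes out
            }
        }
        send_sms($oncall, $alert);                  # nobody acked: page the on-duty
    }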

This was particularly valuable while I was living in Oslo. During my work day, the Australians would be asleep – so I could often catch an issue before it woke the person on duty. Also if I broke something, I would be able to avoid waking the others.

If I hadn’t acked, it would have SMSed me (and pushed through a phone notification system called Pushover, which has replaced jfax). Actually, it wouldn’t have because this happened 2 seconds later:

<robn> lars ack
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: no issues to ack. Ack mode enabled for 30 minutes

We often have a flood of acknowledgements if it’s during a time when people are awake. Everyone tries to be first!

If I hadn’t responded within 5 minutes of the first SMS, larsbot would go ahead and page everybody. In the worst case, where the on-duty person doesn’t respond, it is only 5 minutes and 30 seconds before everyone is alerted. This is a major part of how we maintain our high availability.

Duty is no longer automatically allocated – you tell the bot that you’re taking over. You can use the bot to pass duty to someone else as well, but it’s discouraged. The person on duty should be actively and knowingly taking on the responsibility.

We pay a stipend for being on duty, and an additional amount for being actually paged. The amount is enough that you won’t feel too angry about being woken, but low enough that there’s no incentive to break things so you make a fortune. The quid-pro-quo for being paid is that you are required to write up a full incident report detailing not only what broke, but what steps you took to fix it so that everyone else knows what to do next time. Obviously, if you _don’t_ respond to a page and are on duty, you get docked the entire day’s duty payment and it goes to the person who responded instead.

Who watches the watchers?

As well as larsbot, which runs on a single machine, we have two separate machines running something called arbitersmsd. It’s a very simple daemon which pings larsbot every minute and confirms that larsbot is running and happy.

Arbitersmsd is like a spider with tentacles everywhere – it sits on standalone servers with connections to all our networks, internal and external, as well as a special external network uplink which is physically separate from our main connection. It also monitors those links. If larsbot can’t get a network connection to the world, or one of the links goes down, arbitersmsd will scream to everybody through every link it can.
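In spirit, arbitersmsd is just a tight loop like the sketch below. The wire protocol, host name and port here are invented for illustration – the real daemon also watches the network links themselves:

    # toy watchdog: ping the bot once a minute, scream on every link if it's gone
    use strict;
    use warnings;
    use IO::Socket::INET;

    while (1) {
        my $ok = 0;
        if (my $sock = IO::Socket::INET->new(
                PeerAddr => 'larsbot.internal.example.com',  # placeholder name
                PeerPort => 12345,                           # placeholder port
                Timeout  => 10,
        )) {
            print $sock "PING\n";
            my $reply = <$sock>;
            $ok = defined $reply && $reply =~ /^PONG/;
        }
        alert_all_links("larsbot is not answering!") unless $ok;  # hypothetical
        sleep 60;
    }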

There’s also a copy running in Iceland, which we hear from occasionally when the intercontinental links have problems – so we’re keenly aware of how reliable (or not) our Iceland network is.

Automate the response, but have a human decide

At one stage, we decided to try to avoid being woken for some types of failure by using Heartbeat, a high-availability solution for Linux, on our frontend servers. The thing is, our servers are actually really reliable, and we found that Heartbeat failed more often than our systems – so the end result was reduced reliability! It’s counter-intuitive, but automated high-availability often isn’t.

Instead, we focused on making the failover itself as easy as possible, while having a human make the decision to actually perform the action. Combined with our fast-response paging system, this gives us high availability, safely. It also means we have great tools for our everyday operations. I joke that you should be able to use the tools at 3am while not sober and barely awake. There’s a big element of truth there, though – one day it will be 3am, and I won’t be sober – and the tool needs to be usable. It’s a great design goal, and it makes the tools nice to use while awake as well.

A great example of how handy these tools are was when I had to upgrade the BIOS on all our blades because of a bug in our network card firmware which would occasionally flood the entire internal network. This was the cause of most of our outages through 2011-2012, and took months to track down. We had everything configured so that we could shut down an entire bladecentre, because there were failover pairs in our other bladecentre.

It took me 5 cluster commands and about 2 hours to do the whole thing (just because the BIOS upgrades took so long, there were about 20 updates to apply, and I figured I should do them all at once):

  1. as -r bc1 -l fo -a – move all the services to the failover pair
  2. asrv -r bc1 all stop – stop all services on the machines
  3. as -r bc1 -a shutdown -h now – shut down all the blades
  4. This step was a web-based bios update, in which I checked checkboxes on the management console for every blade and every BIOS update. The console fetched the updates from the web and uploaded them onto each blade. This is the bit that took 2 hours. At the end of the updates, every server was powered up automatically.
  5. asrv -r bc1 -a all start – start all the services back up
  6. as -r bc2 -l fo -m – move everything from bladecentre 2 that’s not mastered there back to its primary host

It’s tooling like this which lets us scale quickly without needing a large operations team. For the second bladecentre I decided to reinstall all the blades as well; it took an extra 10 minutes and one more command: as -r bc2 -l utils/ReinstallGrub2.pl -r before the shutdown.

Nobody knows

Some days are worse than others. Often we go weeks without a single incident.

The great thing about our monitoring is that it usually alerts us of potential issues before they are visible. We often fix things, sometimes even in the middle of the night, and our customers aren’t aware. We’re very happy when that happens – if you never know that we’re here, then we’ve done our job well!

BZZZT Twitch – oh, just a text from a friend. Phew.


Dec 13: FastMail DNS hosting

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 12th December was about our multi-master database replication system. The following post on 14th December is about our 24/7 oncall duty roster.

Technical level: low-medium

Part of running any website or email service is that you need to publish DNS records for your domains. DNS is what allows computers to convert names like “fastmail.com” into IP addresses that computers use to actually talk to each other.

In the very early days of FastMail, we used an external service called ZoneEdit to do this. That’s fine when your DNS requirements are simple and don’t change much, but over time our DNS complexity and requirements increased, so we ended up moving to and running our own DNS servers.

For a long time, we used a program called TinyDNS to do our DNS serving. TinyDNS was written by Dan Bernstein (DJB). DJB’s software has a history of being very security conscious and concise, but a bit esoteric in its configuration, setup and handling.

While TinyDNS worked very well for us (extremely reliable, with low resource usage), one issue with it is that it reads its DNS data from a single constant database (cdb) file that is built from a corresponding input text file. That means to make any DNS change, you have to modify the input data file, and then rebuild the entire database file – even for a single change.
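For illustration, the input text file holds one record per line, and the tinydns-data tool compiles the whole thing into the binary data.cdb that TinyDNS serves from (example.com used as a stand-in here):

    # A record (with matching PTR) for www, TTL 300:
    =www.example.com:192.0.2.10:300
    # MX record for the domain, preference 10:
    @example.com::mx1.example.com:10:300

Changing even one of those lines means running tinydns-data again to regenerate the entire data.cdb from scratch.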

That was fine when DNS hosting was just for us, but over time we found more and more people wanted to use their own domain for email, so we opened up DNS hosting to our users. To make it as easy as possible for users, when you add a domain to your FastMail account, we automatically publish some default DNS records, with the option to customise as much as you want.

Allowing people to host their DNS with us is particularly important for email, because there are actually a number of complex email-related standards that rely on DNS records. For websites, it’s mostly about creating the right A (or in some cases, CNAME) record. For email though, there are MX records for routing email for your domain, wildcard records to support subdomain addressing, SPF records for stopping forged SMTP FROM addresses, and DKIM records for controlling signing of email from your domain. There are also SRV records to allow auto-discovery of our servers in email and CalDAV clients, and a CNAME record for mail.yourdomain.com to log in to your FastMail account. In the future, there are also DMARC records we want to allow people to easily publish. For more information on these records and standards, check out our previous post about email authentication.

The problem with TinyDNS was that when people changed their DNS, or new domains were added at FastMail, we couldn’t just immediately rebuild the database file – it’s a single file for ALL domains, so it’s quite large. Instead we’d only rebuild it every hour, and people had to be aware that any DNS changes they made might take up to an hour to propagate to our actual DNS servers. Not ideal.

So a few years back, we decided to tackle that problem. We looked around at the different DNS software available, and settled on PowerDNS. One of the things we particularly liked about PowerDNS was its pluggable backends, and its support for DNSSEC. Using this, we were able to build a pipe-based backend that talks to our internal database structures. This means that DNS changes are nearly immediate (there’s still a small internal caching time).
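A pipe backend is simply a program that speaks a tab-separated question-and-answer protocol on stdin/stdout. A minimal skeleton for ABI version 1 looks something like this – a real backend would answer from the user database rather than a hardcoded record:

    #!/usr/bin/perl
    # bare-bones PowerDNS pipe backend (ABI version 1), for illustration only
    use strict;
    use warnings;
    $| = 1;                      # the pipe protocol needs unbuffered output

    my $helo = <STDIN>;
    die "bad handshake\n" unless defined $helo && $helo =~ /^HELO\t1/;
    print "OK\ttoy backend ready\n";

    while (my $line = <STDIN>) {
        chomp $line;
        my ($tag, $qname, $qclass, $qtype, $id, $remote) = split /\t/, $line;
        if (defined $tag && $tag eq 'Q' && lc($qname) eq 'www.example.com'
            && ($qtype eq 'A' || $qtype eq 'ANY')) {
            print "DATA\t$qname\t$qclass\tA\t300\t$id\t192.0.2.10\n";
        }
        print "END\n";           # finish every answer, even an empty one
    }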

Because DNS is so important, we tested this change very carefully. One of the things we did was to take a snapshot of our database, and capture all the DNS packets to/from our TinyDNS server for an hour. On a separate machine, we tested our PowerDNS based implementation with the same database snapshot, and replayed all the DNS packets to it, and checked that all the responses were the same.

With this confirmation, we were able to roll out the change from TinyDNS to PowerDNS. Unfortunately, even with that testing, we still experienced some problems, and had to roll back for a while. After some more fixes and tests, we finally rolled it out permanently in Feb 2013, and it’s been happily powering DNS for all of our domains (e.g. fastmail.com, fastmail.fm, messagingengine.com, etc) and all user domains since.

Our future DNS plans include DNSSEC support (which then means we can also do DANE properly, which allows server-to-server email sending to be more secure), DMARC record support, and ideally one day Anycast support to make DNS lookups faster.

For users, if you don’t already have your own domain, we definitely recommend it as something to consider. By controlling your own domain, you’ll never be locked to a particular email provider, and they have to work harder to keep your business, something we always aim to do :)

With the new gTLDs that have been released and continue to be released, there’s now a massive number of new domains available. We use and recommend gandi.net and love their no bullshit policy. For around $15-$50/year, your own domain name is a fairly cheap investment in keeping control of your own email address forever into the future, and with FastMail (and an Enhanced or higher personal account, or any family/business account), we’ll do everything else for you: your email, DNS and even simple static website hosting.


Dec 12: FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMSes to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (if you haven’t heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something different here – we don’t use its built-in replication system, and instead opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason is the lack of redundancy. If your database catches on fire – or, as happens more often, the oom-killer chooses to zap your database server because it’s usually the biggest memory hog on the machine – none of your applications can continue without their needed data, and your service is taken offline. By using multiple databases, when a single database server goes down, your applications still have the others to choose from and connect to.

Another reason against using a single database for your online service is degraded performance – as more and more applications connect and perform work, your database server’s load increases. Once a server can’t take the requested load any longer, you’re left with query timeouts and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm, reducing the work any single database server has to cope with, while gaining a performance boost for free across the board.

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell you horror stories from the past as if you’re sitting around a camp fire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt, possibly taking you offline until every one of your database servers is back up and replicating happily again. We never wanted to be put in a situation like this, and so wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read from and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes at a very high cost – it is a nightmare to manage if a single master goes down.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers to be configured in a ring network topology. Every time an update occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on – much like participants playing pass-the-parcel. This works nicely and is in place in many companies. The nightmares begin, however, if a single database server goes down, breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network, and data across every database server begins to go stale.

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon, which serves out its local database updates, and runs a separate MySQL::Replication client for each master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and apply it locally. The advantage here is that when a database server goes down, only that single feed is broken. All other communication paths continue as normal, and queries flow across the database replication network as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text file called relay-log.info, which is not atomically synced to disk. Once a slave dies and comes back online, files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but relay-log.info wasn’t, the slave will restart replication from an incorrect position and so will replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it’s not since there is no binlogging of MySQL::Replication queries). As updates are done within the same transaction as replicated queries are executed in, writes are completely atomic. Once a slave dies and comes back online, we are still in a consistent state since either the transaction was committed or it will be rolled back. It’s a nice place to be in.
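In other words, each replicated statement and the bookkeeping update travel in one InnoDB transaction. A sketch of the apply step using Perl’s DBI (the table and column names are invented for illustration):

    # apply a replicated query and record the new position atomically
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=fm;host=localhost',
                           'repl', 'secret',
                           { RaiseError => 1, AutoCommit => 0 });

    sub apply_replicated {
        my ($master, $pos, $query) = @_;
        eval {
            $dbh->do($query);                      # the replicated statement
            $dbh->do('UPDATE repl_position SET pos = ? WHERE master = ?',
                     undef, $pos, $master);        # position lives in InnoDB too
            $dbh->commit;                          # both changes, or neither
        };
        if ($@) {
            $dbh->rollback;                        # consistent state either way
            die "replication apply failed: $@";
        }
    }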

MySQL::Replication – multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.


SSL certificates updated to SHA-256, RC4 disabled

Today we’re rolling out SHA-256 certificates. We announced this last month, and you can read that post for more information about why this is necessary.

At the same time, we’ve disabled the RC4 cipher suite. RC4 has long been considered broken and the browser security community recently started actively discouraging its use. The SSL Labs test penalises it, and Chrome has started presenting a low-priority warning.
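We haven’t published our exact server configuration, but as an illustration, dropping RC4 is a one-line change to the cipher list in a typical nginx TLS setup:

    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;   # no SSLv3
    ssl_ciphers   HIGH:!aNULL:!RC4;        # exclude all RC4 suites outright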

All this means that we now get an A+ grade on the SSL Labs test, which is a good indicator that when it comes to our SSL/TLS configuration, we’re pretty much in step with current industry best practice.

If, like most of our users, you use the web client in a modern web browser, you won’t notice any difference. In older browsers and some IMAP/POP3/DAV/LDAP clients, you may start seeing disconnection problems if they don’t know how to handle SHA-256 certificates or rely on RC4. In these cases you’re encouraged to upgrade your clients and, if necessary, contact the author of your client for an update. In the meantime, you can use insecure.fastmail.com (web) and insecure.messagingengine.com (IMAP/POP/SMTP), both of which support RC4 and have a SHA-1 certificate. As always, we highly discourage the use of these service names because they leave your data open to attack, and we may remove them in the future.


Dec 11: FastMail Support

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 10th December was the second security post, on availability. The following post on 12th December is about our multi-master database replication.

Technical level: low

Support system

FastMail has a comprehensive support system. We have a well-written and well-maintained online help system, where you can read in detail about every aspect of the service. This is complemented by our blog (which is likely where you are reading this), which you can subscribe to, and where you can read about what we are working on and where the service is generally heading, including any recent changes.

To see the live running status of various FastMail services, check our own status page. That page also gives you an idea of how the service has been operating in the past week. If you would like an uptime report on FastMail from an independent third party, there’s always the status report from Pingdom.

But in spite of all that, sometimes you will need to get in touch with a human being who knows the service really well. In that case, you can get in touch with a member of our friendly support team using our ticket system.

Support team and support process

Back in early 2000-ish, I realized IMAP was The True Path while the mainstream practice was POPanism, and searched far and wide for a good provider. I found FastMail to be the best IMAP provider on the planet, and signed up for an account. I would say the same thing today as well – as far as the ‘best IMAP provider’ bit goes. As for ‘The True Path’, I am a born-again IMAPian, and would say JMAP fits the bill these days as we transition from the Desktopian age to the AndroiPhonean age. Our new apps for iPhone and Android use JMAP to do their work, and you can see how well they work, especially on high-latency connections.

But I digress.

So, these were early days, and soon FastMail advertised a tech-support position. I had done a side-project of developing a basic email client mainly to learn email protocols, so when the opportunity came along, I was very keen and I applied and soon landed the job.

I was joined by Vinodh and Yassar around 2008, and from then on, they have been handling front-line requests until recently.

Nowadays, you will find a good chunk of your support ticket requests handled by our new recruit Afsal (and soon by Anto as well), especially during US daytime.

New support tickets are handled by the techs in front-line support. Any escalated issues will usually be dealt with by Yassar, and further escalations are handled by myself. Yassar and I escalate issues to either an engineer on duty, or, if we are sure the issue is related to somebody’s area of expertise, directly to that developer or admin.

Future plans

We plan to provide 24-hour support coverage in the future, and we are working hard towards that. We will extend support coverage to US daytime first, after which we will start extending coverage to other timezones as well. Recruiting and training new people takes time, but we’ll get there eventually.

Most frequently asked questions

As today’s post is about support, I’ll note that the most common questions we see in support tickets are ones you can get quick help for from our help documentation itself.

Remember: there is a wealth of information in our support system online, so that’s a good first place to go to learn about FastMail. But our support team is always at hand, should you have questions!

See you again!

It is very satisfying to engage with our users and help them make the most of their FastMail accounts. Sometimes it’s a 75-year-old grandma in the US who asks how best to share her new recipes with a selected set of contacts (think address book groups), and sometimes it’s the uber-geek from across the globe who asks what scheme we use to hash our passwords (we use bcrypt)! Our customers are diverse, their questions intimidating, er, interesting, and the experience satisfying!

FastMail has seen tremendous growth in the past few years, and we are working hard on scaling our support team to match. You should see the results of this work in the months to come.


Dec 10: Security – Availability

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 9th December was about email authentication. The following post on 11th December is from our support team.

Technical level: medium

Availability is the ability for authorised users to gain access to their data in a timely manner.

While both Confidentiality and Integrity are important, they are not noticed unless something goes wrong. Availability, on the other hand, is super visible. If you have an outage, it will be all through the media – like Yahoo, or Microsoft, or Google, or indeed FastMail.

Availability at FastMail

Our record really speaks for itself. Our public pingdom page and status page show how reliably available we are. FastMail has great uptime.

We achieve this by reducing single points of failure, and having all data replicated in close to real time.

I was in New York earlier this year consolidating our machines, removing some really old ones, and also moving everything to new cabinets which have a better cooling system and more reliable power. We had out-grown the capacity in our existing cabinets, and didn’t have enough power to run completely on just half our circuits any more.

Our new cabinets have redundant power – a strip up each side of the rack – and every server is wired to both strips, and able to run from just one. Each strip has the capacity to run the entire rack by itself.

[Photos: PDU power feeds, and cabling at the back of the racks]

The servers are laid out in such a way that we can shut down any one cabinet. In fact, we can shut down half the cabinets at a time without impacting production users. In 2014 it’s not such a big deal to be able to reinstall any one of your machines in just a few minutes – but in 2005 when we switched to fully automated installation of all our machines, only a few big sites were doing it. For the past few years, we’ve been at the point where we can shut down any machine with a couple of minutes’ notice to move service off it, and users don’t even notice that it’s gone. We can then fully reinstall the operating system.

We have learned some hard lessons about availability over the years. The 2011 incident took a week to recover from because it hit every server at exactly the same time. We couldn’t mitigate it by moving load to the replicas. We are careful not to upgrade everywhere at once any more, no matter how obvious and safe the change looks!

Availability and Jurisdiction

People often ask why we’re not running production out of our Iceland datacentre. We only host secondary MX and DNS, plus an offsite replica of all data there.

While we work hard on the reliability of our systems, a lot of the credit for our uptime has to go to our awesome hosting provider, NYI. They provide rock-solid power and network. To give you some examples:

  • During Hurricane Sandy, when other datacentres were bucketing fuel up the staircases and having outages, we lost power on ONE circuit for 30 seconds. It took out two units which hadn’t been cabled correctly, but they weren’t user facing anyway.
  • We had a massive DDoS attempted against us using the NTP amplification flaw a while ago. NYI blocked just the NTP port to the one host being attacked, informed us of the attack, and asked their upstream providers to push the block out onto the network to kill off the attack. Our customers didn’t even notice.
  • They provide 24/7 onsite technical staff. Once when they were busy with another emergency, I had to wait 30 minutes for a response on an issue. The CEO apologised to me personally for having to wait. Normal response times are within 2 minutes.

The only outage we’ve had this year that can be attributed to NYI at all is a 5-minute outage when they switched the network uplink from copper to fibre, and managed to set the wrong routing information on the new link. 5 minutes in a year is pretty good.

The sad truth is, we just don’t have the reliability from our Iceland datacentre to provide the uptime that our users expect of us.

  • Network stats to New York: the only time availability drops below 99.99% is July, when I moved all the servers, and there was the outage on the 26th (actually 5 minutes by my watch). As far as I can tell, the outages on the 31st were actually a Pingdom error rather than a problem at NYI.
  • Network stats to Iceland: ignore the 5-hour outage in August, because that was actually me in the datacentre. We don’t have dual-cabinet redundancy there, so I couldn’t keep services up while I replaced parts. Even so, there are multiple outages longer than 10 minutes. These would have been very user-visible if we were serving production traffic there. As it is, they just page the poor on-call engineer.

If we were to run production traffic to another datacentre, we would have to be convinced that they provide a similar level of quality to that provided by NYI. Availability is the life-blood of our customers. They need email to be up, all the time.

Human error

Once you get the underlying hardware and infrastructure to the level of reliability we have, the normal cause of problems is human error.

We have put a lot of work this year into processes to help avoid human errors causing production outages. There will be more on the testing process and beta => qa => production rollout stages in a later blog post. We’ve also had to change our development style slightly to deal with the fact that we now have two fully separate instances of our platform running in production – we’ll also blog about that, since it’s been a major project this year.

General internet issues

Of course, the internet itself is never 100% reliable, as our Optus- and Vodafone-using customers in Australia saw recently. Optus were providing a route back from NYI which went through Singtel, and it wasn’t passing packets. There was nothing we could do; we had to wait for Optus to figure out what was wrong and fix it at their end.

We had a similar situation with Virgin Media in the UK back in 2013, but then we managed to route traffic via a proxy in our Iceland datacentre. This wouldn’t have worked for Australia, because traffic from Australia to Iceland travels through New York too.

We are looking at what is required to run up a proxy in Australia for Asia-Pacific region traffic if there are routing problems from this part of the world again. Of course, that depends on the traffic from our proxy being able to get through.

One of the nastiest network issues we’ve ever had was when traffic to/from Iceland was being sent through two different network switches in London, depending on the exact source/destination address pair – and one of the switches was faulty – so only half our traffic was getting through. That one took 6 hours to be resolved. Thankfully, there was no production traffic to Iceland, so users didn’t notice.


Dec 9: Email authentication

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 8th December was about our rich text email editor. The following post on 10th December is the second security post, on availability.

Technical level: medium.

What is it?

Email authentication is a collection of techniques to verify that an email is legitimately from the sender it claims to be from.

History

Back when email was first designed, the world was a much more trusting place. The early internet consisted mainly of government agencies and educational institutions, and the notion that anyone would forge sender details in an email wasn’t considered. Although standards have moved on since then, compatibility is always a concern when implementing new standards; as a result, email standards do not have sender authentication at their core.

In today’s internet, spam and phishing emails are of course a big problem, so email authentication has become something which needs to be addressed, and a number of techniques have been developed to help achieve this. Here is a brief summary of some of the popular ones, and how FastMail uses each one.

Real-time Blackhole List (RBL)

RBL, also known as a DNS-based blackhole list check, is a method where inbound IP addresses are checked against a list of known bad servers. This gives an early indication of which mail is more likely to be spam.
In addition to checking incoming mail against these lists, we also monitor our own IP addresses on the lists as an early warning system. When one of our addresses is listed, we take steps to find and remove the problem account while sending mail via one of our other outgoing servers to minimise disruption to our legitimate users.
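As an illustration, a lookup against the well-known Spamhaus “zen” list works by reversing the IP’s octets and querying the result as a DNS name; 127.0.0.2 is the standard test address that every list answers for:

    $ dig +short 2.0.0.127.zen.spamhaus.org
    127.0.0.2

An answer in the 127.0.0.0/8 range means “listed”; an NXDOMAIN result means the address is clean.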

Sender Policy Framework (SPF)

SPF is a domain based authentication system. It allows a recipient to verify that the sending server is authorised to send email for a particular domain.

Domain owners can publish a list of valid addresses in their DNS records, and can suggest how to deal with email which does not come from an address on that list.

Unfortunately, the sender address verified by SPF is not necessarily the sender address that you see in your email client. The address checked by SPF is the return path sent in the SMTP transaction, which may differ from the address in the email’s From: header.

Microsoft attempted to fix this by introducing SenderID, which is similar to SPF but can verify the From address header. However, there were numerous problems with this as a standard and it isn’t widely implemented.

At FastMail we use SPF as one of the many factors to spam-filter incoming mail. For outgoing mail, we specify our servers explicitly so that they get a positive score for successful SPF, but also say “?all” to allow other systems to send from addresses on our domains. If you have your own domain with us, then of course you can set up your own SPF records to be as strict or as liberal as you need.
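In DNS terms, that policy is a single TXT record. A hypothetical domain with one outgoing server range and a neutral catch-all would publish something like:

    example.com.  IN TXT  "v=spf1 ip4:192.0.2.0/24 ?all"

Here ip4: lists the servers that should pass, and ?all expresses “no opinion” about everything else; a stricter domain would end with ~all (softfail) or -all (fail) instead.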

DomainKeys Identified Mail (DKIM)

DKIM is a merging of two older standards: DomainKeys, and Identified Internet Mail. The intention of DKIM is to verify that an email associated with a particular domain was sent by an authorised agent of that domain and has not been modified since being sent. Based on public key cryptography, a domain owner creates a public and private key pair, publishes the public part, and then uses the private part to sign the body and selected headers of an email. The receiver of an email is then able to check that signature against the public part of the key to verify that the sender of the email had access to the private part of the key, and is therefore authorised to send email on behalf of that domain.
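Concretely, the public half of the key is published in a TXT record under a “selector”, and each outgoing message carries a signature header naming that selector. An illustrative pair, with the key and signature bodies elided:

    mail._domainkey.example.com.  IN TXT  "v=DKIM1; k=rsa; p=MIGfMA0G..."

    DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=mail;
            h=from:to:subject:date; bh=...; b=...

The receiver looks up s= (the selector) under d= (the domain), then verifies bh= (the body hash) and b= (the signature over the headers listed in h=).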

Problems: Again, there is no stipulation that the domain in the From: header signs the email. It is possible for an email to be signed by any domain and pass basic DKIM checks.

At FastMail we sign all outgoing emails with a key for our messagingengine.com domain, and also with a key for the domain of your email address (e.g. fastmail.com).

If you use your own domain with us, we will automatically sign emails with a DKIM key if you host your DNS with us. We also make it super easy to set up both SPF and DKIM.

For incoming mail, again, we use DKIM as a tool in spam filtering. DKIM is also used to validate official emails from FastMail: we use the DKIM signature combined with the headers to validate that the email was sent by an official FastMail staff account, and add a green tick next to legitimate emails.

Author Domain Signing Practices (ADSP)

ADSP is an extension to DKIM whereby a domain owner can publish a policy stating how email from their domain should be signed. A domain owner can state one of the following:

  1. Legitimate email from this domain may or may not be signed by the domain.
  2. Legitimate email from this domain will be signed by the domain.
  3. Legitimate email from this domain will be signed by the domain, and non signed email should be discarded.

The domain used for ADSP is the domain in the From header of the email, and is the one most likely seen by the recipient.
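The policy itself is one more TXT record under the domain, with dkim=unknown, dkim=all or dkim=discardable corresponding to the three statements above, for example:

    _adsp._domainkey.example.com.  IN TXT  "dkim=all"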

Domain-based Message Authentication, Reporting & Conformance (DMARC)

DMARC brings together the SPF and DKIM checks and ties them in with the sender shown in the From address of the email.

In order to be considered ‘DMARC aligned’ an email must pass SPF and DKIM checks, the SPF domain must match the domain of the From address, and at least one DKIM signature must also match that domain. This provides a good level of certainty that the email is not forged.

Domain owners who choose to publish DMARC records can suggest what should be done with messages which do not pass DMARC tests: reject outright, or quarantine (treat as spam).

The reporting part of DMARC is a tool for domain owners rather than end users. Email receivers who fully implement DMARC build reports on email received, and send reports to domain owners who request them (via the published DMARC record). This report shows some basic aggregate information on number of emails received, the servers they were received from, and their SPF and DKIM status. This allows domain owners to discover how their domain is being used, which can then inform decisions on the best SPF/DKIM/DMARC policies to publish.
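A DMARC policy is likewise published as a TXT record at _dmarc.<domain>. For a hypothetical domain that wants failing mail quarantined and aggregate reports mailed back:

    _dmarc.example.com.  IN TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"

Here p= may be none, quarantine or reject, and rua= is where receivers send the aggregate reports described above.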

We are investigating how we can use DMARC to benefit FastMail users. Given the wide range of ways in which our users use our services, it would be bad for us to publish reject or quarantine policies: the number of legitimate emails which would be blocked would far outweigh any possible benefit. Implementing overly restrictive policies in an environment where email is used in such diverse ways would result in pain for users.

Challenges

The biggest issue by far is mailing lists. An email sent to a mailing list will typically be re-sent from the mail server of the list, breaking SPF, and usually has some alterations made to it (such as subject changes or unsubscribe links added), which break DKIM signatures.

It is also fairly common for a third party to send email on your behalf. For example, a company might contract out their support and ticketing system to a third party, and emails from the ticketing system would be sent from the company’s domain. Care needs to be taken to ensure that these emails are also considered in SPF policies, and that the third party is able to properly DKIM-sign these messages.

Third-party senders have been one of the challenges we faced while implementing our green tick and phishing warnings system. This blog is hosted by WordPress, which sends email on behalf of FastMail (the blog notifications). We needed to make sure that these emails from WordPress could be identified and validated against the WordPress DKIM signature, check that they were sent from our WordPress blog, and make sure those emails were not marked with the phishing warning box. This needs to be done on a case-by-case basis, as what identifies emails on WordPress isn’t likely to be the same as what identifies emails on other services such as Twitter.

Another common source of SPF failures is forwarding. If, for example, a user has migrated to FastMail from another provider, and has set up their old provider to forward email to their FastMail address, then we would see the IP address of the forwarding server, not the originating server, and this could result in an SPF failure for an otherwise legitimate email. There are some standards which attempt to address this, such as the Sender Rewriting Scheme (SRS), which involves rewriting the envelope sender to one at the forwarding domain. This fixes the SPF problem, but if the originating domain uses DMARC, the email will no longer be aligned, as the From addresses no longer match.
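An SRS-rewritten envelope sender looks roughly like this (the hash and timestamp cookies are shown as placeholders):

    user@example.com  ->  SRS0=HHH=TT=example.com=user@forwarder.example.net

The forwarder now owns the envelope domain, so its own SPF record applies and bounces route back through it – but the From: header still says example.com, which no longer matches the envelope domain, and that is exactly what breaks DMARC alignment.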

The trouble of course is that phishing emails can be sent from entirely unrelated domains and are still successful. So sender authentication doesn’t always help.

Another problem with email authentication is a misunderstanding of what is being verified. We can take technical steps to verify that an email did come from a legitimate email account, but we can make no claim over how trustworthy the author of that email is. Anybody can purchase a domain, set it up properly with SPF, DKIM, and DMARC, and then use it to send spam or phishing emails. Also any service, including FastMail, faces the problem that accounts can be compromised and used to send bad content. Detecting and dealing with this is a whole other blog post.
