Dec 18: Billing and Payments — a potted history

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 17th December was about how we test our software. Stay tuned for another post tomorrow.

Technical level: low

Billing is not very glamorous — but it is important.

If you’ve ever run a business, you know that it doesn’t matter how great your service or product is, you have to be able to bill your customers, or eventually you’ll no longer be able to provide that service or product.

The financial side of FastMail is broadly divided into two parts: billing — which is keeping track of which services are being used and how much people have paid, and payments — which is about actually collecting money from customers.

Billing

When FastMail started, our billing system was fairly ad-hoc, and pretty much the bare minimum we needed to keep track of user balances. When a user signed up, or paid, or renewed, we would

  • create a record with the effect on their balance and a description of the event, and
  • update any attributes affected, such as the service level, or subscription expiry date

This did the job for a few years, but as we slowly grew it became apparent that it was not sufficient. Manual adjustments caused problems, and it was difficult to extract information that our accountants needed.

So, we redid things in a kind of engineery-accounting way.

The basic record keeping part of the system has these properties:

  • data in the billing system is never changed or deleted, only added
  • every event is modelled as a bunch of square pulses with a start time, end time, resource type, and height of the pulse
  • a pulse may have no end time for non-expiring resources — in this case the pulse becomes a step function
  • if the resource type is “money” then all steps (money is non-expiring) are paired in such a way that this is equivalent to double-entry bookkeeping
  • each event has a separate time that the event occurred
  • the authoritative billing information is calculated by adding together all the pulses and steps (see the sketch below).
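
To make the pulse-and-step model concrete, here is a minimal sketch in Python. The field names and event types are illustrative rather than our actual schema; the point is that the current state is never stored directly, only derived by summing immutable records.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass(frozen=True)
    class Pulse:
        """One immutable billing record: a square pulse (or a step if end is None)."""
        resource: str            # e.g. "subscription" or "money"
        height: float            # amount of the resource (negative for a debit)
        start: datetime          # when the pulse switches on
        end: Optional[datetime]  # None means a non-expiring step function
        description: str         # human-readable description of the event

    def level_at(pulses, resource, when):
        """Authoritative amount of a resource at any point in time: the sum of
        all pulses and steps that are 'on' at that moment."""
        return sum(p.height for p in pulses
                   if p.resource == resource
                   and p.start <= when
                   and (p.end is None or when < p.end))

    # Example: a one-year subscription paid for in full, recorded as additions
    # only -- nothing is ever updated or deleted. The two "money" steps are the
    # paired entries that make this equivalent to double-entry bookkeeping.
    ledger = [
        Pulse("money", +39.95, datetime(2014, 1, 1), None, "payment received"),
        Pulse("money", -39.95, datetime(2014, 1, 1), None, "subscription purchased"),
        Pulse("subscription", 1, datetime(2014, 1, 1), datetime(2015, 1, 1),
              "Full subscription, one year"),
    ]

    print(level_at(ledger, "subscription", datetime(2014, 6, 1)))  # 1 (active)
    print(level_at(ledger, "money", datetime(2014, 6, 1)))         # 0.0 (balance)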

On top of that record keeping are a bunch of simple views and convenience methods for commonly used queries and actions, such as handling resources that are always linked to the duration of the current subscription (this was the case for Extra Aliases and Extra Domains, when we offered those “a la carte”).

This allowed us to do a number of things that were previously difficult or impossible. We can:

  • See exactly how many resources (e.g. subscriptions) were being used for any time in the past,
  • Reconstruct all the transactions and purchases that led to the current state — this is important to show customers a statement of account,
  • Easily and accurately calculate pro-rata changes, even when prices have changed since the original purchase,
  • Audit our user attributes to make sure that they match the information in our billing system,
  • Reconcile our billing information with our payment gateway, and
  • Report on things that are important to our accountant, like “deferred revenue” and “unrealised FX gain / loss”

This billing system has now been in use for the past 10 years, with only minor changes needed in that time. Part of the reason for the small number of changes is that over time our pricing and product offerings have become progressively simpler.

Aside:

Some of our users don’t like this, because they like to optimise their subscription so that they are paying for exactly what they need, and no more. Most of our users, though, find the all-inclusive model to be much more appealing.

Some things that are still lacking in our billing system are automated “discount coupon” and “special prices for charities” functionality. I’d like to add these one day.

Payments

For a credit card merchant, one of the most important numbers is the chargeback rate — that is, the percentage of your payments that cardholders dispute with their bank. If this rate gets too high, the payment gateways deem you to be risky, and they will charge much higher fees, or cancel your account completely.

When we first started out we used integrated billing only for credit cards. This was done via a payment gateway — we chose Worldpay. In those days, the only payment gateways were banks, and Worldpay was then part of the Royal Bank of Scotland. Banks are pretty conservative by nature and we had to go to some lengths to reassure them that we were a good risk.

All other payment methods were manual — someone would send us a cheque or a PayPal payment or a telegraphic transfer, and we would try to find the FastMail user and apply the credit to that account.

We were fairly successful at keeping the chargeback rate low — we have a number of fraud checks, and we try to refund disputed payments promptly. So, we got along just fine with Worldpay.

Over time, Worldpay’s APIs improved and we were able to improve the user interface, do automated reconciliation, and even perform delayed capture.

However, when FastMail was bought by Opera in 2010, Worldpay would not let us keep our merchant account under the new structure. We would need to reapply for a new merchant account, and Worldpay’s policies had changed, so we would now need to keep a much larger deposit — so large that it wasn’t feasible.

Finding a payment gateway that had acceptable deposit terms turned out to be difficult. At the time, it seemed to be common practice to require merchants to provide a continual deposit of three to twelve months of revenue! The mindset among many of the payment gateways appeared to be that the risk of total bankruptcy of their merchants was so high that they could not tolerate any possible exposure to chargebacks, ever.

We did a lot of searching around and found that even as recently as 2010, it was hard to find a payment gateway that met our requirements, which were:

  • take payments in USD
  • take payments from anywhere in the world
  • pay into a USD bank account in Australia
  • support delayed capture
  • have acceptable deposit terms
  • not require us to see the most PCI-DSS sensitive data

This last requirement was important, because we were not a huge company, and the cost of maintaining (and certifying) compliance with the higher levels of PCI-DSS is significant.

But we still wanted to be able to conduct recurring billing to make it easy for users to renew their subscriptions, and also so that our customers could purchase “a la carte” additions to their account.

The solution was to redirect the user to the payment gateway (in 2010) or to use JavaScript (now) to send card information directly from the user’s web browser to the payment gateway. In both cases, the communication is directly between the user’s web browser and the payment gateway, without passing through our servers. The payment gateway then gives us a token that we can use to process additional payments when necessary.

This means that even if our servers were compromised, the attacker would not be able to steal the card details of our users from our system. The best way to ensure confidentiality is not actually having the data in the first place!
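
As a rough illustration of the server-side half of this flow, here is a minimal sketch in Python. The gateway URL, endpoint and field names are placeholders rather than any particular gateway’s real API; the key property is that our code only ever handles the opaque token, never the card number.

    import requests

    GATEWAY = "https://gateway.example.com"   # placeholder, not a real gateway URL
    API_KEY = "sk_test_placeholder"

    def charge_stored_card(token, amount_cents, currency="USD"):
        """Charge a previously tokenised card.

        The token was created in the user's browser, which sent the card details
        straight to the gateway; our servers never saw the card number."""
        resp = requests.post(
            f"{GATEWAY}/charges",
            auth=(API_KEY, ""),
            data={
                "card_token": token,
                "amount": amount_cents,
                "currency": currency,
                "description": "FastMail subscription renewal",
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()   # e.g. {"charge_id": "...", "status": "succeeded"}

    # Renewal time: all we have on file is the token, which is useless to a thief
    # without the gateway-side authorisation tied to our merchant account.
    # charge_stored_card("tok_abc123", 3995)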

We eventually found a payment gateway that met all of our requirements, and signed up with Global Collect.

An advantage of Global Collect was that it could process payments via many different methods, so in theory it was able to act as an abstraction layer and allow us to easily take payments using almost any scheme, including local bank transfers in many different countries. In practice, there was a fair amount of work needed, and in the end the only additional payment method that we used was automatic payments via PayPal.

A substantial amount of effort was required to make things work with Global Collect — all the integration points were slightly different from Worldpay and the failure modes were different too. There was substantially more work involved in the “non-payment” part of the integration. This includes reconciliation with the payment gateway (to make sure that FastMail has credited all the payments to the right users, even if we didn’t initiate them), and dealing with payments that unexpectedly change status (from Succeeded to Failed or vice versa) some days or months after the actual payment.
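
A minimal sketch of what the reconciliation step looks like, assuming a hypothetical settlement report from the gateway and a local table of payments keyed by the gateway’s payment reference (the report format and field names are invented for illustration):

    def reconcile(gateway_report, local_payments):
        """Compare the gateway's view of payments against our billing records.

        gateway_report: iterable of dicts like {"ref": ..., "amount": ..., "status": ...}
        local_payments: dict mapping gateway ref -> our recorded (amount, status)
        Returns lists of discrepancies for a human to investigate."""
        missing_locally, mismatched = [], []
        for row in gateway_report:
            local = local_payments.get(row["ref"])
            if local is None:
                # e.g. a PayPal payment the user initiated themselves
                missing_locally.append(row)
            elif local != (row["amount"], row["status"]):
                # e.g. a payment that flipped from Succeeded to Failed weeks later
                mismatched.append((row, local))
        return missing_locally, mismatched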

This refactoring was not a lot of fun, so to try to reduce this in the future, a lot of groundwork was laid for our own payment abstraction layer.

We attempted to move all the Worldpay card data into Global Collect, so that customers would not have to re-enter their billing details. This generally worked technically but had some giant wrinkles. As a standing authorisation was created in Global Collect for each user, an “auth” of $1.00 was processed, which never appeared on customers’ card statements. For most customers this was fine. A small number of customers had their bank contact them about the $1.00 charge, which they didn’t know about, and an even smaller number of customers had their cards summarily cancelled because their bank deemed that the charges were probably fraudulent. None of this showed up in our testing with our own credit cards.

The whole point of this effort was to make things as convenient as possible for our customers, but the only people who noticed were those who were so severely inconvenienced that we wished we’d never tried. It was a good lesson for future migrations though.

The “delayed capture” feature worked very well with Global Collect. We would process an “auth” which would reserve the funds, and then a few days later we would “capture” the payment, which would actually take the funds from the user’s card.

Spammers and scammers are often trying to use our systems. Many of these are kept away by the requirement to pay, but a few determined spammers make use of stolen payment credentials to sign up accounts. Often we are alerted to abuse of our system within a few days, and if we find out before the payment is “captured”, then we can cancel the payment. In this case, the cardholder will never see a transaction on their paper statement.
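
A sketch of how auth-then-capture can be driven from our side. The gateway client here is hypothetical, and the three-day delay and fraud check are stand-ins for our real rules:

    from datetime import datetime, timedelta

    CAPTURE_DELAY = timedelta(days=3)

    def process_pending_auths(pending_auths, gateway, is_fraudulent):
        """Run periodically: capture authorised payments once they are old enough,
        or void them if the account has since been flagged as abusive.

        pending_auths: list of dicts like {"auth_id": ..., "user": ..., "created": datetime}
        gateway: object with capture(auth_id) and void(auth_id) methods (hypothetical)
        is_fraudulent: callable taking a user and returning True if abuse was detected"""
        now = datetime.utcnow()
        for auth in pending_auths:
            if is_fraudulent(auth["user"]):
                # Cancel before capture: the cardholder never sees a posted transaction.
                gateway.void(auth["auth_id"])
            elif now - auth["created"] >= CAPTURE_DELAY:
                # Old enough and not flagged: actually take the funds.
                gateway.capture(auth["auth_id"])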

In this situation we know that the cardholder’s credentials have been stolen, and are being actively used for fraud on the internet (because they have just been used to try to defraud us). However, we are repeatedly told by all the payment gateways and banks we have asked that there is NO WAY for us to tell the card issuing bank or the cardholder that those details are compromised.

Fast-forward a few more years to 2013, when FastMail was split from Opera. Unfortunately the merchant agreement was now no longer applicable, so we needed to change to a different payment gateway.

When it seemed likely that we would have to switch payment gateways again, we pushed ahead with realising our payment abstraction layer. This would allow us to easily support multiple payment gateways at the same time, and direct users to the appropriate payment gateway.

Aside:

It’s possible that a service like Chargify might have been able to provide this for us, but that would have required us to refactor all our billing code as well. Also, I didn’t know about Chargify or similar services then.

By this time, there were a number of the new breed of “disruptive” payment gateways around — such as Pin Payments, Stripe, Braintree, and others. These don’t require large deposits from their merchants.

This means they are exposed to chargeback risk, but they have evaluated that risk as small enough that they don’t need every merchant to carry a deposit large enough to cover the maximum possible chargeback exposure.

They effectively keep about a week of revenue as a deposit, by paying out funds a week after they were taken from the merchant’s customers. This is very convenient for a merchant as it means you can just dip a toe into the waters of a payment gateway, without taking a plunge.

After the difficulties encountered when importing data into Global Collect, we decided not to use any data migration schemes this time. This meant our customers who already had a billing agreement with us had to enter their billing details again with a new payment gateway.

We selected Pin Payments to handle the bulk of our credit card payments, and have been pretty happy with them in general.

They had a “delayed capture” feature already, and when we asked if that could be automated (so that payments were automatically captured after 3 days) they were happy to add it. Unfortunately this came back to bite us — it turned out that the payment gateway had added a special case to deal with the automatic delayed capture, and this special case was not fully covered in their internal testing. When they made some internal changes later on, this caused a bug, and a bunch of payments were captured a second time, and a third time and a fourth! This was a giant headache to fix, and made us look pretty bad to the affected customers. As a result, we’ll shortly be doing the delayed capture ourselves — we don’t want to be skipping test cases in our payment gateway.

Speaking of “delayed capture”, it doesn’t work quite as cleanly with Pin Payments as it used to with Global Collect. Depending on the card and the bank, sometimes the “auth” appears as a transaction on the card, and then the “capture” appears as another transaction. Affected customers will see two transactions on an online statement. After a while, the “auth” transaction will completely disappear (and it will never appear on paper statements) but there is a period when customers may be concerned. We have many reports of this occurring with Pin Payments, and the Stripe documentation also mentions that this may happen. We didn’t get any reports of this when we were using Global Collect.

In addition, when we started using Pin Payments they didn’t support American Express payments. Many of our business customers prefer to pay with American Express, so this was a challenge. We worked around this by initially advising American Express cardholders to pay via PayPal, and later by automatically sending those payments via Stripe.

A benefit of having our payment abstraction layer was that it was easy for us to gradually phase in payments via Pin. We started off with a couple of percent of payments, and slowly increased the percentage as we grew more confident that everything was going according to plan.

The abstraction layer makes it straightforward to integrate with PayPal directly, which we’ve had to do since late 2013.

It also means that using new payment gateways is pretty straightforward — an implementation class needs to be written, along with a reconciliation script. As an experiment, we’ve processed a small percentage of payments via Stripe, and we’re confident that we could switch if there was ever a persistent problem with Pin Payments.
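
A heavily simplified sketch of what such an abstraction layer can look like. The class and method names are invented for illustration; the routing rules (American Express to a gateway that accepts it, a small fraction of other payments to a gateway being trialled) mirror the ones described above.

    import random
    from abc import ABC, abstractmethod

    class PaymentGateway(ABC):
        """One implementation class per gateway; reconciliation is a separate script."""

        @abstractmethod
        def authorise(self, card_token, amount_cents):
            ...

        @abstractmethod
        def capture(self, auth_id):
            ...

    class PinPaymentsGateway(PaymentGateway):
        def authorise(self, card_token, amount_cents):
            pass  # real gateway API call goes here

        def capture(self, auth_id):
            pass  # real gateway API call goes here

    class StripeGateway(PaymentGateway):
        def authorise(self, card_token, amount_cents):
            pass  # real gateway API call goes here

        def capture(self, auth_id):
            pass  # real gateway API call goes here

    def choose_gateway(card_type, trial_fraction=0.02):
        """Route a new payment to a gateway: Amex goes to the gateway that accepts
        it, and a small (gradually increased) fraction of everything else goes to
        the gateway currently being trialled."""
        if card_type == "amex":
            return StripeGateway()
        return StripeGateway() if random.random() < trial_fraction else PinPaymentsGateway()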

And, after testing for a few months, today we’ve also added bitcoin payments (via BitPay) in our main web interface.

Dec 17: Testing

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 16th December was about Confidentiality at FastMail. The following post on 18th December is about how we get money.

Technical level: low

You wake up, check your email and notice that a new Calendar feature has appeared overnight.

Behind the scenes, several months of planning and hard work go into providing our customers with dozens of innovative features before they’re released. Hundreds of thousands of customers connect to FastMail using devices ranging from feature phones to the latest iOS devices for business or personal email, web storage, notes, their calendar and address book. Our previous blog post on regular system monitoring explained how we keep systems running with near 100% uptime.

This blog post examines the development process at FastMail, specifically focusing on introducing new features and testing.

Almost every feature you use in the web interface has been written from scratch at some point over the past 15 years, and many features are still being migrated from the classic interface (which we maintain to support backwards compatibility) to the new AJAX interface. We test these features to help determine when we think they’re ready for release.

Each week FastMail’s developers add hundreds of lines of code to our Git repository. Each line could potentially make our service unusable for one or more customers, break a feature on a specific platform, or break a workflow someone has been using for years.

Of course it’s impossible to verify that every feature still works in every possible way for every user every time anyone changes something – we would never release anything.

Anyone who has ever ended up on the advanced screen will know we provide significantly more than the average email service. Given the number of features we have and the number of customer scenarios, there is a potentially unlimited number of things we could test, but only 38 hours in our working week. To prioritise, we do risk-based testing, focusing our effort on what we think is likely to have broken as a result of any significant change. Any developer who is working on a major change, feature or project informs our QA engineer, who tries to break the feature. His job is to consider how a feature may be used by our customers, how someone with less technical experience and knowledge than the person who developed the feature may use it, and how a malicious user may use or abuse the newly introduced or changed feature.

In addition to manual testing, we use Selenium which allows us to test the entire website in real web browsers (e.g. Chrome, Firefox, IE, etc) and control the web browser from code, so we can simulate clicks, mouse movements, etc just like a real person actually using the site would do.

As a small team it isn’t practical to maintain a copy of every version of every operating system with every version of every browser, so we use SauceLabs for most automated functional tests. Selenium WebDriver scripts perform actions such as logging in from a variety of browsers, navigating around the web interface, adding events and sending emails. Our tests run constantly so it becomes obvious who broke what when things go wrong.
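
A minimal sketch of what one of these WebDriver scripts looks like, using the Selenium Python bindings. The remote endpoint, credentials, element names and assertions are placeholders for illustration:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Placeholder remote endpoint and credentials for a hosted browser farm.
    REMOTE = "https://USERNAME:ACCESS_KEY@ondemand.saucelabs.com/wd/hub"

    options = webdriver.ChromeOptions()          # pick the browser to test against
    driver = webdriver.Remote(command_executor=REMOTE, options=options)

    try:
        driver.get("https://beta.fastmail.com")              # load the login page
        driver.find_element(By.NAME, "username").send_keys("testuser@example.com")
        driver.find_element(By.NAME, "password").send_keys("not-a-real-password")
        driver.find_element(By.NAME, "login").click()        # element names are illustrative

        # A simple functional assertion: after logging in, the mailbox should load.
        assert "Inbox" in driver.page_source
    finally:
        driver.quit()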

The Selenium tests are run using Jenkins for each git commit. Jenkins failures are reported to our QA engineer who verifies whether an issue is a genuine failure and not an intended change before informing the appropriate developer.

We take security seriously and are happy to pay out reasonable bounties for serious bugs. This year we introduced a bug bounty program which has seen a number of issues fixed that could have otherwise gone unreported.

FastMail runs a beta version of the webmail service where customers can try new features before they’re available to everyone. Major features are available on beta for several weeks before they go live for all customers. If nothing significant is found internally and Jenkins looks happy, we leave the change on our beta site for a few more days before making the new feature available to everyone. Beta customers provide valuable feedback via our beta feedback email address.

With the exception of emergency fixes, most changes go on to our beta server and are available for staff to test as well as anyone who’s interested in trying the latest features, in exchange for some instability. Calendar for example was available on beta for several months before it was launched in June with hundreds of issues being found and fixed during the beta period.

Most issues are discovered and fixed within a day. Internal communication is either done in person, via email or IRC. Issues that take a bit longer to fix go in to our bug tracking tool. Every software development team needs a bug tracking tool to keep track of known issues and help determine when a project is near completion. For historic reasons we were using JIRA. Unfortunately we’d recently been experiencing significant performance problems with the OnDemand hosted JIRA platform (the time to load the first page each day was >30 seconds in some cases!), and so ended up looking for an alternative. We moved to YouTrack, which so far has been performing much better for us.

We believe we offer the best email service on the market. However, it’s always possible to improve and we value all constructive feedback sent via our beta program. If you would like to try the latest features before they’re released and help improve the quality of service we provide, log in to the beta site at beta.fastmail.com.

Dec 16: Security – Confidentiality

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 15th December was about how we load the initial mailbox view quickly. The following post is about how we test our changes.

Technical level: low-medium

Information wants to be free.

It’s the catch phrase of everyone who wants the latest episode of their favourite show, or the latest song, and knows they can get it for free more easily and in a less encumbered format than if they were to pay for it.

It’s also a fact of life on the internet. An attacker only has to find one flaw, and once they have data, it can be copied endlessly.

The Landscape

Confidentiality is about keeping data private. This includes protecting against threats like pervasive surveillance, identity theft, targeted data theft, and activist hacking.

And that’s just active attacks. If you’re not paying for the service, then your data is probably being sold directly to pay for the costs of keeping it online.

Hosting Jurisdictions and Confidentiality

The headline “security risk” that everyone thinks of when talking about hosting jurisdiction and security is that the NSA or equivalent national spying agency will insert covert data taps, with or without the cooperation of the target service.

In fact, that’s it. The only jurisdiction dependent risk is that the national intelligence agency of the host country wants to access data, but they don’t want it enough to resort to illegal means or just making a special deal for access.

Other avenues of attack are either delivered over the internet (hacking, compromised hardware, viruses) or done by subverting/bribing/blackmailing service or data centre staff. If a determined attacker has the budget and agents to find the right person and apply the right pressure, these risks are present anywhere: any country, any data centre.

Mind you, credential theft (compromise of individual accounts rather than the entire service) happens all the time – whether through keyloggers on individual machines, viruses, password reuse from other sites which have been hacked, or just old-fashioned phishing. We find that compromised accounts are frequently used to send spam (taking advantage of our good reputation as a sender) rather than having their data stolen.

Non-Jurisdiction-Dependent Risks

There are data centre specific risks like physical security, trustworthiness of employees, resistance to social engineering attacks – and then there’s everything else.

The majority of possible attacks can be carried out over the internet, from anywhere.

Confidentiality at FastMail

The most important thing for confidentiality is that all our accounts are paid accounts. We don’t offer Guest accounts any more, and we don’t even offer “pay once, keep forever” Member accounts. Both these account types have a problem – they don’t keep paying for themselves. That leaves us hunting for alternative sources of income. We flirted with advertisements on Guest accounts at one stage, but we were never really comfortable with them – even though they were text only and not targeted. Ads are gone, and they’re not coming back.

We are very clear on where we stand. We provide a great service, and we proudly charge for it. Our loyalties are not divided. Our users pay the bills – we have no need to sell data.

We have spelled out in our privacy policy and public communications that we don’t participate in blanket surveillance. We are an Australian company, and to participate in such programs would be in violation of Australian law.

We frequently blog about measures we take to improve confidentiality for our users.

There are also other things we do which don’t have blog posts of their own:

  • Physically separate networks for internal data and management. We don’t use VLANs in shared switch equipment, there’s an actual air gap between our networks.
  • All the machines which have partitions containing user data (email, database, backups, filestore) are only connected to the internal production and internal management networks, and have no external IP addresses.
  • All user data is encrypted at rest, meaning there is no risk of data being recovered from discarded hard disks.
  • Our firewall rules only allow connections to network ports for services which are explicitly supposed to be on those machines, and the ports are only opened after the correct service is started and confirmed to be operating correctly.
  • Our management connections are via SSH and OpenVPN, and are only allowed from a limited set of IP addresses, reducing our exposure to attacks.
  • We follow all the basic security practices like choosing software with good security records, not allowing password-based login for ssh, applying security patches quickly, and keeping on top of the security and announcement mailing lists for our operating system and key pieces of software.

Our goal is to make the cost of attacking our security much higher than the value of the data that could be stolen. We follow due process when dealing with law enforcement, providing individual data in response to the appropriate Australian warrant, so there is no justification to attempt wholesale surveillance of all our users.

We believe our security is as good as or better than anyone else in the same business. Of course we have had bugs, just like everyone, and we offer a generous bug bounty to encourage security researchers to test our systems. We recently had a respected independent security firm do a security audit, in which they had full access to our source code and a clone of our production systems (but not to any customer data or production security keys). They did not find any significant issues.

We are very happy with the trustworthiness and physical security at the NYI data centre where we host our data. I have visited a few times – I needed to have my name on the list at the lobby of the building to get a pass that would activate the lift, and then be escorted through two separate security doors to gain access to the racks with our servers, which are themselves locked. The staff are excellent.

Balancing Confidentiality, Integrity and Availability, I believe that hosting at NYI provides great security for our users’ data. Moving elsewhere would be purely security theatre, and would discard NYI’s great track record of reliability and availability for no real improvement in confidentiality.

Dec 15: Putting the fast in FastMail: Loading your mailbox quickly

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 14th December was about our 24/7 monitoring and paging system. The following post on 16 December addresses confidentiality and where our servers are hosted.

Technical level: medium

From the moment you log in to your account (or go to the FastMail website and already have a logged in session), we obviously want to load your mailbox as fast as possible. To make that possible, we’ve implemented lots of small and large optimisations and features.

Use fast, reliable and consistent hardware

As we talked about in our email servers post, part of achieving fast performance is the hardware you choose.

In the case of email, the most important thing is making sure you have a fast disk system with lots of input/output (read/write) operations per second (IOPs). It’s worth noting that there is a massive difference between consumer SSDs and enterprise SSDs. Intel rate their enterprise 400 GB DC S3700 as being able to handle 7.30 PB (petabytes) of data written. For comparison, a fairly good quality consumer drive like the Samsung 850 EVO is rated at 150 TB (terabytes), about 1/50th of the write endurance rating. When you’re running a server 24/7 with thousands of reads and writes per second, every second, every day, every month, for years, you need very high reliability.

As well as high data reliability, enterprise drives also retain their IOPs rating for much longer. Many consumer drives advertise a very high IOPs rate. However, that rate is only achieved while the drive is new (or has been hard erased). After you start writing data to the drive, the IOPs rate often drops off significantly. An enterprise drive like the DC S3700 might not have as high an IOPs rating on paper, but it is much more consistent throughout its lifetime.

To maximise the usage of this hardware, it’s important to split your data correctly. The most commonly accessed data needs to be on the fast SSD drives, but you need to keep the cost down for the large archival data of emails from last week/month/year that most people don’t access that often.

Start pre-loading data as soon as possible

When you submit the login page and we authenticate you, we immediately send a special command down to the IMAP server to start pre-loading your email data (under the hood, this uses the posix_fadvise system call to start loading message indexes and other databases from disk). We then return a redirect response to your web browser, which directs your browser to start loading the web application.

This means that during the time it takes to return the redirect response to your browser, and while the browser starts loading the web application, in the background we’re already starting to load your email data from disk into memory.
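
The same trick can be sketched in a few lines of Python. The file paths are stand-ins for the real Cyrus index and cache files, but os.posix_fadvise is the actual system call interface involved:

    import os

    def preload(paths):
        """Hint to the kernel that these files will be read soon, so it starts
        pulling them into the page cache while the browser is still loading
        the web application."""
        for path in paths:
            fd = os.open(path, os.O_RDONLY)
            try:
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)  # length 0 = whole file
            finally:
                os.close(fd)

    # Illustrative paths only -- the real command runs inside the IMAP server
    # against the user's message indexes and other databases.
    # preload(["/var/spool/imap/user/alice/cyrus.index",
    #          "/var/spool/imap/user/alice/cyrus.cache"])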

Minimise, compress and cache the web application

To make loading the web application as quick as possible, we employ a number of techniques.

  • Minimisation – this is the automated rewriting of CSS and JavaScript to use shorter variable names and strip comments and whitespace in order to produce code that is smaller (and so quicker to download) but works exactly the same. This is standard practice and we use uglifyjs on all our code
  • Compression – This is again standard practice. We use gzip to pre-compress all files and the nginx gzip_static extension so we can store the compressed files on disk and serve them directly when a browser supports gzip (almost all of them). This means we can immediately serve the file rather than having to use CPU time to compress it for each download. It also means we can use the maximum gzip compression level to squeeze out every extra byte we can, as we only need to do the compression once rather than many times (see the sketch after this list)
  • Concatenation – we join together all code and data files for a module into a single download. This includes not just javascript code, but CSS styles, icons, fonts and other information encoded into the javascript itself. This significantly reduces the number of separate connections (which each require an SSL/TLS negotiation, TCP window scale up to get to maximum speed, etc) and downloads, and allows as much data as possible to be streamed in one go
  • Modularisation – we build our code into several core code blocks. There is one main block for the application framework (something to talk about in another post), and one block for each part of the application (e.g. mailbox, contacts, calendar, settings, etc). So loading the mailbox application requires downloading a few files: the bootstrap loader, the application framework, the mailbox application and some localisation data. No separate downloads for styles, images, fonts or other content are needed.
  • Caching – each file downloaded includes a hash code of the content in the filename. We store the content of each file in the browser’s local storage area. This makes it easy for us to see if we already have any particular application file downloaded. If we do, we don’t even have to request it from the server, we can just load it directly from the browser local storage.
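
Here is a sketch of the build step behind the compression and caching points above. The file names are invented; the essence is to compress once at build time at the highest level, and to put a hash of the content into the filename so the browser (and our local-storage cache) can tell instantly whether it already has the current version:

    import gzip
    import hashlib
    from pathlib import Path

    def build_asset(path):
        """Pre-compress a minified asset and give it a content-addressed name."""
        data = Path(path).read_bytes()

        # Content hash in the filename: if the file changes, the name changes,
        # so cached copies never go stale and unchanged files never re-download.
        digest = hashlib.sha1(data).hexdigest()[:12]
        out = Path(path).with_name(f"{Path(path).stem}-{digest}{Path(path).suffix}")
        out.write_bytes(data)

        # Compress once, at the maximum level, so the web server (nginx with
        # gzip_static) can serve the .gz file directly without per-request CPU cost.
        with gzip.open(f"{out}.gz", "wb", compresslevel=9) as f:
            f.write(data)

        return out.name   # the name the HTML bootstrap loader will reference

    # Example (hypothetical file): build_asset("build/mail-module.js")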

Multiple API calls in each request

The API the client uses to talk to the server is based on a JSON protocol. It’s not an HTTP based RESTful type API.

The advantage of this is that we can send multiple commands in a single call to the server. This avoids multiple round trips back and forth to request data, which significantly slows things down on high latency connections (e.g. mobile devices, people a long way away from the server, etc).

For instance, our web client requests the user’s preferences, a list of user personalities, a list of user mailbox folders, all mailbox folder message/conversation counts, a list of user saved searches, the initial Inbox mailbox message/conversation list, and the user’s overall mailbox quota state, all in a single call to the server. The server can gather all this data together in one go, and return it to the client in a single call result.
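
For illustration, a single batched request might look roughly like the following. The method names and arguments are made up to show the shape of the protocol, not our actual API:

    import json
    import requests

    # One round trip carries many commands; the server executes them all and
    # returns the results in a single response. Method names are illustrative.
    batch = [
        ["getPreferences",   {}],
        ["getPersonalities", {}],
        ["getMailboxes",     {"withCounts": True}],
        ["getSavedSearches", {}],
        ["getMessageList",   {"mailbox": "Inbox", "limit": 50}],
        ["getQuota",         {}],
    ]

    resp = requests.post("https://www.fastmail.com/api/",   # illustrative endpoint
                         data=json.dumps(batch),
                         headers={"Content-Type": "application/json"})
    results = resp.json()   # one entry per command, in the same order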

Pre-calculate as much as possible

Certain pieces of data can be expensive to calculate. One significant example of this is message previews. Message metadata and headers such as the From address, Subject, etc. are stored in a separate “cache” file in Cyrus (our IMAP server) and are thus quicker to access, but each message body is stored in its own separate file. This means that generating the previews for 20 messages would normally involve opening 20 separate files, reading and parsing out the right bits, etc.

To avoid this, we calculate a preview of each message when it’s delivered to the mailbox using an annotator process, and store that in a separate annotation database.

So fetching the preview for 20 messages now only requires reading from a single database file, and we can start pre-loading that file immediately after login as well.
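
A rough sketch of the same idea, with Python’s standard email and dbm modules standing in for the Cyrus annotator and its annotation database:

    import dbm
    import email
    import email.policy

    PREVIEW_LEN = 200

    def store_preview(msg_bytes, msg_id, db_path="previews.db"):
        """Run once at delivery time: extract a short plain-text preview and store
        it in a single database, so listing a mailbox never has to open and parse
        each individual message file."""
        msg = email.message_from_bytes(msg_bytes, policy=email.policy.default)
        body = msg.get_body(preferencelist=("plain",))
        text = body.get_content() if body else ""
        preview = " ".join(text.split())[:PREVIEW_LEN]

        with dbm.open(db_path, "c") as db:
            db[msg_id] = preview.encode("utf-8")

    def previews_for(msg_ids, db_path="previews.db"):
        """Fetching 20 previews is now 20 key lookups in one file, not 20 file parses."""
        with dbm.open(db_path, "c") as db:
            return {m: db.get(m, b"").decode("utf-8") for m in msg_ids}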


There’s no one single thing that makes FastMail fast; it’s a combination of features, and we’re always working to tune those features even more to make the FastMail experience as fast as possible. When you call yourself FastMail, you’re really signalling one of your core features in the name, and that means we have to always think about it in everything we do.

Dec 14: On Duty!

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 13th December was about our DNS hosting system. The following post on 15th December shows how we load the mailbox screen so quickly.

Technical level: low

BZZZT, BZZZT. Wake up, something is broken! My wife hates the SMS tone on my phone. It triggers bad memories of the middle of the night.

When I started at FastMail in 2004, we already had a great production monitoring system. Like many things, we were doing this well before most internet services. Everyone was on call all the time, and we used two separate SMS services plus the jfax faxing service API to make a phone call (which would squeal at you) just in case the SMSes failed.

The first year was particularly bad – there was a kernel lockup bug which we hadn’t diagnosed. Thankfully Chris Mason debugged it to a problem in reiserfs when there were too many files deleted all at once, and fixed it. Things became less crazy after that.

End-to-end monitoring

Our production monitoring consists of two main jobs, “pingsocket” and “pingfm”. pingsocket runs every 2 minutes, pingfm runs every 10 minutes. There are also a bunch of individual service monitors for replication that inject data into individual databases or mail stores and check that the same data reaches the replicas in a reasonable timeframe. There are tests for external things like RBL listings and outbound mail queue sizes, and finally tests for server health like temperature, disk failure and power reliability.

pingsocket is the per-machine test. It checks that all services which are supposed to be running, are running. It has some basic abilities to restart services, but mostly it will alert if something is wrong.

pingfm is the full end-to-end test. It logs in via the website and sends itself an email. It checks that the email was delivered correctly and processed by a server-side rule into a subfolder. It then fetches the same email again via POP3 and injects that into yet another folder and confirms receipt. This tests all the major email subsystems and the web interface.
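
A toy version of an end-to-end check in this style, using only the Python standard library. The hostnames and credentials are placeholders, and the real pingfm also exercises the website and server-side rules, but the shape is the same: inject a message through one door and confirm it comes out of the others.

    import smtplib
    import poplib
    import time
    import uuid

    HOST = "mail.example.com"            # placeholder for the real servers
    USER, PASSWORD = "monitor@example.com", "secret"

    def end_to_end_check():
        token = str(uuid.uuid4())        # unique marker so we find exactly this message

        # 1. Send ourselves an email through the normal submission path.
        with smtplib.SMTP(HOST, 587) as smtp:
            smtp.starttls()
            smtp.login(USER, PASSWORD)
            smtp.sendmail(USER, [USER],
                          f"Subject: pingfm {token}\r\n\r\nend-to-end test\r\n")

        # 2. Poll via POP3 until the message appears (or we give up and alert).
        deadline = time.time() + 120
        while time.time() < deadline:
            pop = poplib.POP3_SSL(HOST)
            pop.user(USER)
            pop.pass_(PASSWORD)
            count, _ = pop.stat()
            for i in range(1, count + 1):
                if any(token in line.decode("utf-8", "replace")
                       for line in pop.retr(i)[1]):
                    pop.quit()
                    return True          # delivery pipeline is healthy
            pop.quit()
            time.sleep(5)
        return False                     # time to page whoever is on duty

    # if not end_to_end_check(): alert("pingfm failed")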

Any time something breaks that we don’t detect with our monitoring, we add a new test that will check for it. We also think of likely failure modes when we add something new, and test that they are working – for example the calendar service was being tested on every server even before we released it to beta.

Notification levels

The split has always been between “urgent” (SMS) and “non-urgent” (email). Many issues like disk usage levels have both a warning level and an alert level. We will get emails telling us of the issue first, and only a page if the issue hits the alert stage. This is great for getting sleep.

We now have a middle level where it will notify us via our internal IRC channel, so it interrupts more than just an email, but still won’t wake anybody.

Duty Roster

Until 2011, every alert went to everybody in the company – though it would choose somebody first based on a complex algorithm. If you didn’t want to be paged, you had to turn your phone off. This was hard if anyone went on holiday, and occasionally we would have everyone go offline at once, which wasn’t great. Once I switched my phone on in Newark Airport to discover that the site was down and nobody else had responded, so I sat perched on a chair on my laptop and fixed things. We knew we needed a better system.

That system is called ‘Lars Botlesen’ or larsbot for short (as a joke on the Opera CEO’s name). The bot has been injected into the notification workflow, and if machines can talk to the bot, they will hand over their notification instead of paging or emailing directly. The bot can then take action.

The best thing we ever added was the 30 second countdown before paging. Every 5 seconds the channel gets a message. Here’s one from last night (with phone numbers and pager keys XXXed out):

<Lars_Botlesen> URGENT gateway2 - [rblcheck: new: 66.111.4.224 dirty, but only one clean IP remaining. Run rblcheck with -f to force] [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n' in 25 secs to brong (XXX)
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n' in 20 secs to brong (XXX)
<brong_> lars ack
<Lars_Botlesen> sms purged: brong (XXX) 'gateway2: [\n   "listed",\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : 1\n   },\n   {\n      "https://ers.trendmicro.com/reputations/index?ip_address=66.111.4.224" : "listed"\n   }\n]\n'
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: OK brong_ acked 1 issues. Ack mode enabled for 30 minutes

We have all set our IRC clients to alert us if there’s a message containing the text LarsSMS – so if anyone is on their computer, they can acknowledge the issue and stop the page. By default it will stay in the acknowledged mode for 30 minutes, suppressing further messages – because the person who has acknowledged is actively working to fix the issue and looking out for new issues as well.

This was particularly valuable while I was living in Oslo. During my work day, the Australians would be asleep – so I could often catch an issue before it woke the person on duty. Also if I broke something, I would be able to avoid waking the others.

If I hadn’t acked, it would have SMSed me (and pushed through a phone notification system called Pushover, which has replaced jfax). Actually, it wouldn’t have because this happened 2 seconds later:

<robn> lars ack
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: no issues to ack. Ack mode enabled for 30 minutes

We often have a flood of acknowledgements if it’s during a time when people are awake. Everyone tries to be first!

If I hadn’t responded within 5 minutes of the first SMS, larsbot would go ahead and page everybody. In the worst case where the on-duty doesn’t respond, it is only 5 minutes and 30 seconds before everyone is alerted. This is a major part of how we keep our high availability.

Duty is no longer automatically allocated – you tell the bot that you’re taking over. You can use the bot to pass duty to someone else as well, but it’s discouraged. The person on duty should be actively and knowingly taking on the responsibility.

We pay a stipend for being on duty, and an additional amount for being actually paged. The amount is enough that you won’t feel too angry about being woken, but low enough that there’s no incentive to break things so you make a fortune. The quid-pro-quo for being paid is that you are required to write up a full incident report detailing not only what broke, but what steps you took to fix it so that everyone else knows what to do next time. Obviously, if you _don’t_ respond to a page and are on duty, you get docked the entire day’s duty payment and it goes to the person who responded instead.

Who watches the watchers?

As well as larsbot, which runs on a single machine, we have two separate machines running something called arbitersmsd. It’s a very simple daemon which pings larsbot every minute and confirms that larsbot is running and happy.

Arbitersmsd is like a spider with tentacles everywhere – it sits on standalone servers with connections to all our networks, internal and external, as well as a special external network uplink which is physically separate from our main connection. It also monitors those links. If larsbot can’t get a network connection to the world, or one of the links goes down, arbitersmsd will scream to everybody through every link it can.

There’s also a copy running in Iceland, which we hear from occasionally when the intercontinental links have problems – so we’re keenly aware of how reliable (or not) our Iceland network is.

Automate the response, but have a human decide

At one stage, we decided to try to avoid having to be woken for some types of failure by using Heartbeat, a high availability solution for Linux, on our frontend servers. The thing is, our servers are actually really reliable, and we found that heartbeat failed more often than our systems – so the end result was reduced reliability! It’s counter-intuitive, but automated high-availability often isn’t.

Instead, we focused on making the failover itself as easy as possible, but having a human make the decision to actually perform the action. Combined with our fast response paging system, this gives us high availability, safely. It also means we have great tools for our every day operations. I joke that you should be able to use the tools at 3am while not sober and barely awake. There’s a big element of truth though – one day it will be 3am, and I won’t be sober – and the tool needs to be usable. It’s a great design goal, and it makes the tools nice to use while awake as well.

A great example of how handy these tools are was when I had to upgrade the BIOS on all our blades because of a bug in our network card firmware which would occasionally flood the entire internal network. This was the cause of most of our outages through 2011-2012, and took months to track down. We had everything configured so that we could shut down an entire bladecentre, because there were failover pairs in our other bladecentre.

It took me 5 cluster commands and about 2 hours to do the whole thing (just because the BIOS upgrades took so long, there were about 20 updates to apply, and I figured I should do them all at once):

  1. as -r bc1 -l fo -a – move all the services to the failover pair
  2. asrv -r bc1 all stop – stop all services on the machines
  3. as -r bc1 -a shutdown -h now – shut down all the blades
  4. This step was a web-based bios update, in which I checked checkboxes on the management console for every blade and every BIOS update. The console fetched the updates from the web and uploaded them onto each blade. This is the bit that took 2 hours. At the end of the updates, every server was powered up automatically.
  5. asrv -r bc1 -a all start
  6. as -r bc2 -l fo -m – move everything from bladecentre 2 that’s not mastered there back to its primary host

It’s tooling like this which lets us scale quickly without needing a large operations team. For the second bladecentre I decided to reinstall all the blades as well, it took an extra 10 minutes and one more command: as -r bc2 -l utils/ReinstallGrub2.pl -r before the shutdown.

Nobody knows

Some days are worse than others. Often we go weeks without a single incident.

The great thing about our monitoring is that it usually alerts us of potential issues before they are visible. We often fix things, sometimes even in the middle of the night, and our customers aren’t aware. We’re very happy when that happens – if you never know that we’re here, then we’ve done our job well!

BZZZT Twitch – oh, just a text from a friend. Phew.

Dec 13: FastMail DNS hosting

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 12th December was about our multi-master database replication system. The following post on 14th December is about our 24/7 oncall duty roster.

Technical level: low-medium

Part of running any website or email service is that you need to publish DNS records for your domains. DNS is what allows computers to convert names like “fastmail.com” into IP addresses that computers use to actually talk to each other.

In the very early days of FastMail, we used an external service called ZoneEdit to do this. That’s fine when your DNS requirements are simple and don’t change much, but over time our DNS complexity and requirements increased, so we ended up moving to and running our own DNS servers.

For a long time, we used a program called TinyDNS to do our DNS serving. TinyDNS was written by Dan Bernstein (DJB). DJB’s software has a history of being very security conscious and concise, but a bit esoteric in its configuration, setup and handling.

While TinyDNS worked very well for us (extremely reliable and low resource usage), one issue with TinyDNS is that it reads its DNS data from a single constant database (cdb) file that is built from a corresponding input text file. That means that to make any DNS change, you have to modify the input data file and then rebuild the entire database file, even for a single change.

That was fine when DNS hosting was just for us, but over time we found more and more people wanted to use their own domain for email, so we opened up DNS hosting to our users. To make it as easy as possible for users, when you add a domain to your FastMail account, we automatically publish some default DNS records, with the option to customise as much as you want.

Allowing people to host their DNS with us is particularly important for email because there are actually a number of complex email-related standards that rely on DNS records. For websites, it’s mostly about just creating the right A (or in some cases, CNAME) record. For email though, there are MX records for routing email for your domain, wildcard records to support subdomain addressing, SPF records for stopping forged SMTP FROM addresses, and DKIM records for controlling signing of email from your domain. There are also SRV records to allow auto-discovery of our servers in email and CalDAV clients, and a CNAME record for mail.yourdomain.com to log in to your FastMail account. In the future, there are also DMARC records we want to allow people to easily publish. For more information on these records and standards, check out our previous post about email authentication.

The problem with TinyDNS was that whenever people made a change to their DNS, or a new domain was added at FastMail, we couldn’t just immediately rebuild the database file, because it’s a single file for ALL domains and so quite large. Instead we’d only rebuild it every hour, and people had to be aware that any DNS changes they made might take up to an hour to propagate to our actual DNS servers. Not ideal.

So a few years back, we decided to tackle that problem. We looked around at the different DNS software available, and settled on PowerDNS. One of the things we particularly liked about PowerDNS was its pluggable backends, and its support for DNSSEC. Using this, we were able to build a pipe-based backend that talks to our internal database structures. This means that DNS changes are nearly immediate (there’s still a small internal caching time).
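
To give a flavour of how simple the pipe backend protocol is, here is a heavily stripped-down backend sketch in Python (pipe backend ABI version 1: tab-separated queries on stdin, DATA/END answers on stdout). The lookup function returns canned records instead of consulting our internal database structures:

    #!/usr/bin/env python3
    import sys

    def lookup(qname, qtype):
        """Stand-in for the real database lookup; returns (type, ttl, content) tuples."""
        if qtype in ("A", "ANY") and qname == "example.fastmail.com":
            return [("A", 300, "203.0.113.10")]
        return []

    def main():
        # Handshake: PowerDNS sends "HELO\t1"; we must answer OK before any queries.
        if not sys.stdin.readline().startswith("HELO"):
            sys.exit(1)
        print("OK\tFastMail-style pipe backend", flush=True)

        for line in sys.stdin:
            parts = line.rstrip("\n").split("\t")
            if parts[0] != "Q" or len(parts) < 6:
                print("FAIL", flush=True)
                continue
            _, qname, qclass, qtype, qid, remote_ip = parts[:6]
            for rtype, ttl, content in lookup(qname, qtype):
                print(f"DATA\t{qname}\t{qclass}\t{rtype}\t{ttl}\t{qid}\t{content}",
                      flush=True)
            print("END", flush=True)   # always terminate the answer, even if empty

    if __name__ == "__main__":
        main()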

Because DNS is so important, we tested this change very carefully. One of the things we did was to take a snapshot of our database, and capture all the DNS packets to/from our TinyDNS server for an hour. On a separate machine, we tested our PowerDNS based implementation with the same database snapshot, and replayed all the DNS packets to it, and checked that all the responses were the same.

With this confirmation, we were able to roll out the change from TinyDNS to PowerDNS. Unfortunately even with that testing, we still experienced some problems, and had to roll back for a while. After some more fixing and tests, we finally rolled it out permanently in Feb 2013 and it’s been happily powering DNS for all of our domains (e.g. fastmail.com, fastmail.fm, messagingengine.com, etc) and all user domains since.

Our future DNS plans include DNSSEC support (which then means we can also do DANE properly, which allows server-to-server email sending to be more secure), DMARC record support, and ideally one day Anycast support to make DNS lookups faster.

For users, if you don’t already have your own domain, we definitely recommend it as something to consider. By controlling your own domain, you’ll never be locked to a particular email provider, and they have to work harder to keep your business, something we always aim to do :)

With the new GTLDs that have been released and continue to be released, there’s now a massive number of new domains available. We use and recommend gandi.net and love their no bullshit policy. For around $15-$50/year, your own domain name is a fairly cheap investment in keeping control of your own email address forever into the future, and with FastMail (and an Enhanced or higher personal account, or any family/business accounts), we’ll do everything else for you: your email, DNS and even simple static website hosting.

Dec 12: FastMail’s MySQL Replication: Multi-Master, Fault Tolerance, Performance. Pick Any Three

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 11th December was from our support team. The following post on 13th December is about hosting your own domain with us.

Technical level: medium

For those who prefer watching videos over reading, here’s a talk I gave at Melbourne Perl Mongers in 2011 on FastMail’s open-sourced, custom MySQL replication system.

Most online services store their data within a database, and thanks to the culture of Open Source, these days there are plenty of robust RDBMSs to choose from. At FastMail, we use a Percona build of MySQL 5.1 because of their customised tooling and performance patches (if you haven’t heard of Percona, I recommend trying them out). However, even though MySQL 5.1 is a great platform to work with, we do something differently here – we don’t use its built-in replication system and instead opted to roll our own.

First, what’s the problem with running an online service on a single database? The most important reason against this is the lack of redundancy. If your database catches on fire (or, as happens more often, the oom-killer zaps your database server, since it’s usually the biggest memory hog on the machine), none of your applications can continue without the data they need, and so your service is taken offline. By using multiple databases, when a single database server is downed, your applications still have the others to choose from and connect to.

Another reason against using a single database for your online service is degraded performance – as more and more applications connect and perform work, your database server’s load increases. Once a server can’t take the requested load any longer, you’re left with query timeouts and even refused connections, which again takes your service offline. By using multiple database servers, you can tell your applications to spread their load across the database farm, reducing the work a single database server has to cope with while gaining a boost in performance for free across the board.

Clearly the best practice is to have multiple databases, but why re-invent the wheel? Mention replication to a veteran database admin and then prepare yourself a nice cup of hot chocolate while they tell you horror stories from the past as if you’re sitting around a camp fire. We needed to re-invent the wheel because there are a few fundamental issues with MySQL’s built-in replication system.

When you’re working with multiple databases and problems arise in your replication network, your service can grind to a halt and possibly take you offline until every one of your database servers is back up and replicating happily again. We never wanted to be put in a situation like this, and so wanted the database replication network itself to be redundant. By design, MySQL’s built-in replication system couldn’t give us that.

What we wanted was a database replication network where every database server could be a “master”, all at the same time. In other words, all database servers could be read from and written to by all connecting applications. Each time an update occurred on any master, the query would then be replicated to all the other masters. MySQL’s built-in replication system allows for this, but it comes with a very high cost – it is a nightmare to manage if a single master goes down.

To achieve master-master replication with more than two masters, MySQL’s built-in replication system needs the servers to be configured in a ring network topology. Every time an update occurs on a master, it executes the query locally, then passes it off to the next server in the ring, which applies the query to its local database, and so on – much like participants playing pass-the-parcel. This works nicely and is in place in many companies. The nightmares begin, however, if a single database server is downed, thus breaking the ring. Since the path of communication is broken, queries stop travelling around the replication network and data across every database server begins to go stale.

Instead, our MySQL replication system (MySQL::Replication) is based on a peer-to-peer design. Each database server runs its own MySQL::Replication daemon, which serves out its local database updates, and a separate MySQL::Replication client for each master it wants a feed from (think of a mesh network topology). Each time a query is executed on a master, the connected MySQL::Replication clients take a copy and apply it locally. The advantage here is that when a database server is downed, only that single feed is broken. All other communication paths continue as normal, and query flow across the database replication network continues as if nothing ever happened. And once the downed server comes back online, the MySQL::Replication clients notice and continue where they left off. Win-win.

Another issue with MySQL’s built-in replication system is that a slave’s position relative to its master is recorded in a plain text file called relay-log.info, which is not atomically synced to disk. Once a slave dies and comes back online, these files may be in an inconsistent state. If the InnoDB tablespace was flushed to disk before the crash but relay-log.info wasn’t, the slave will restart replication from an incorrect position and so will replay queries, leaving your data in an invalid state.

MySQL::Replication clients store their position relative to their masters inside the InnoDB tablespace itself (sounds recursive, but it’s not since there is no binlogging of MySQL::Replication queries). As updates are done within the same transaction as replicated queries are executed in, writes are completely atomic. Once a slave dies and comes back online, we are still in a consistent state since either the transaction was committed or it will be rolled back. It’s a nice place to be in.
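
A simplified sketch of that apply loop, written against the generic Python DB-API (the real MySQL::Replication is Perl, and the table and column names here are invented). The important part is that the replicated statement and the position update commit or roll back together:

    def apply_from_master(conn, master_id, fetch_next_query):
        """Apply one replicated query and advance our recorded position atomically.

        conn: a DB-API connection to the local MySQL server (autocommit off)
        fetch_next_query: returns (position, sql) from the master's replication daemon"""
        position, sql = fetch_next_query()
        cur = conn.cursor()
        try:
            cur.execute(sql)                               # replay the master's change
            cur.execute(
                "UPDATE replication_position SET pos = %s WHERE master = %s",
                (position, master_id),                     # position lives in InnoDB too
            )
            conn.commit()    # both happen, or neither: a crash can't split them
        except Exception:
            conn.rollback()  # on restart we re-fetch from the old position, no replays
            raise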

MySQL::Replication – multi-master, peer-to-peer, fault tolerant, performant and without the headaches. It can be found here.
