Dec 22: CardDAV beta release

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 21st December was about our file storage system. Stay tuned for another post tomorrow.

Technical level: medium

After more than a year of anticipation we’re very happy to announce today that we’re releasing CardDAV support into public beta test.

CardDAV is a protocol for reading, writing and synchronising contact data. It’s built into iOS devices and available on Android with an inexpensive third-party application. If you’ve ever wanted to have your FastMail contacts available on your mobile device (and vice-versa), then this is exactly what you want.

Obviously, since this is a beta, there are still a few pointy edges and non-working bits. The most notable thing is that the beta is currently only available to personal accounts, not to business or family accounts. This is because support for shared contacts is not ready yet and there’s some potential for data loss and inconsistent behaviour if you try to use shared contacts without proper support. We’re working hard on finishing shared contact support and hope to make the beta available to business and family accounts within the next couple of months.

CardDAV is only available to Full account levels and higher. Member, Guest and Lite accounts will need to upgrade to be able to use CardDAV.

So now all the disclaimers are out of the way, you can sign up for the CardDAV beta here: https://www.fastmail.com/go/carddavbeta.

The contacts story

An address book is a fundamental component of any mail system and FastMail has had one almost since the beginning. It’s always been stored in the MySQL database and available through the web client. For much of its history it’s been confined to the web client. A few years back we did add a read-only LDAP interface, which is useful for desktop mail clients that could support LDAP address books. It works fine, but being read-only severely limits its usefulness. Some time later mobile devices happened, and it became clear that something else was needed.

In 2011 the CardDAV protocol was published, largely developed at Apple to allow device contacts to be synchronised with a server. The protocol is very similar to the earlier CalDAV protocol (which we also use for our calendar) which is good as it allows us to share a lot of code between our calendar and contacts system.

Towards the end of 2012 we started to seriously appreciate the need for both an integrated calendar and device contacts syncing. We weren’t the only ones, as the Cyrus project had started to add support for CalDAV and CardDAV to the Cyrus mail server. We looked at a few options for calendar and contacts and decided that we would implement both on top of the support being baked into Cyrus, and work began in earnest. We decided that calendar was more important because we already had a contacts system and while it wasn’t perfect we preferred to focus our engineering effort on a clear gap in our product lineup. That work took the best part of a year, and we finally released the calendar to production in June 2014. At this point we were able to focus on CardDAV-based contacts.

The CardDAV part of this work is actually fairly simple. Unlike CalDAV, the backend server (Cyrus) doesn’t really need much special knowledge. It mostly just saves and loads contact cards as required. Calendar entries are more complicated; the server needs to know about timezones, recurring events, alarms, etc. CardDAV is much easier and if all we had to do was ship CardDAV support, we probably could have done so months ago.

The thing that made it more difficult came from the fact that we already had a contacts system and plenty of code fairly tightly integrated with it. It’s more than just the two user interfaces. The mail delivery pipeline also makes use of user contacts for spam whitelists and distribution lists, so we needed to teach these systems about a whole new storage system for contacts. Up until this time they had simply hit the database for this information. To make matters worse, we always knew that we’d need to roll out CardDAV to users gradually which meant that both the UI and the delivery code needed to be able to work with either. In short, we needed to abstract away the implementation details of the contacts storage, which took a few months to build, test and deploy. We ended up with a nice abstraction based on the JMAP getContacts/setContacts model, with a database provider behind it.

The next step was to write a CardDAV provider for our contacts abstraction. That was actually pretty easy because most of the code needed to access DAV resources was already available from our CalDAV work.
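To make the provider idea concrete, here is a minimal sketch in Python (not the language FastMail uses, and not our actual code). The class and method names are assumptions, loosely echoing the getContacts/setContacts model; the in-memory backend simply stands in for a database or CardDAV provider.

```python
from abc import ABC, abstractmethod


class ContactsProvider(ABC):
    """Storage-agnostic contacts interface (hypothetical names)."""

    @abstractmethod
    def get_contacts(self, ids=None):
        """Return {id: contact dict}; every contact when ids is None."""

    @abstractmethod
    def set_contacts(self, create=None, update=None, destroy=None):
        """Apply a batch of changes and return the ids that were created."""


class InMemoryProvider(ContactsProvider):
    """Stand-in backend; a real one would talk to MySQL or a CardDAV server."""

    def __init__(self):
        self._contacts = {}
        self._next_id = 1

    def get_contacts(self, ids=None):
        wanted = list(self._contacts) if ids is None else ids
        return {i: dict(self._contacts[i]) for i in wanted if i in self._contacts}

    def set_contacts(self, create=None, update=None, destroy=None):
        created = []
        for contact in create or []:
            cid = str(self._next_id)
            self._next_id += 1
            self._contacts[cid] = dict(contact)
            created.append(cid)
        for cid, changes in (update or {}).items():
            self._contacts[cid].update(changes)
        for cid in destroy or []:
            self._contacts.pop(cid, None)
        return created


def is_whitelisted(provider, sender):
    """A caller such as mail delivery only ever sees the interface."""
    sender = sender.lower()
    return any(
        sender == email.lower()
        for contact in provider.get_contacts().values()
        for email in contact.get("emails", [])
    )
```

Because callers like `is_whitelisted` depend only on the interface, each user can be switched from one backend to the other without the UI or delivery code changing.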

The last piece of the puzzle was the actual data conversion layer. The existing contacts system has a data model that doesn’t match up perfectly with the vCard format used by CardDAV, so we had to develop a mapping. Most of the fields have a 1:1 mapping (addresses, email addresses and phone numbers). What we call “online” fields, however, do not. Our “online” field group includes URLs, Twitter handles and chat IDs. vCard doesn’t group those the same way but more annoyingly, it doesn’t have a standard set of fields for representing these. It took a long time to develop and test a mapping that works most of the time. It’s going to need improvement as we go but it’s not bad for now.

What’s next

The next few months will include a lot more testing, polishing and responding to user feedback and obviously completing the business and family support. That will bring us to a full release where everyone will be quietly and transparently migrated to the CardDAV backend. We can then start to clean up a lot of old code, always a nice thing to do.

If you’re trying the CardDAV beta test, we’d love to hear what you think. Let us know on Twitter or by emailing carddavbeta@fastmail.com.

Dec 21: File Storage

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 20th December saw us open source our core JavaScript library. The following post on 22nd December is the long awaited beta release of our contact syncing solution.

Technical level: high

Why does an email service need a file storage system?

For many years, well before cloud file storage became an everyday thing, FastMail has had a file storage feature. Like many features, it started as a logical extension of what we were already doing. People were accessing their email anywhere via a web browser, so it would be really nice to be able to access important files everywhere as well. And there are the obvious email integration points too: being able to save attachments from emails to somewhere more structured, or having files that you commonly want to attach to emails (e.g. your resume) stored at FastMail, rather than having to upload them again each time.

Lots of FastMail’s early features exist because they’re the kind of thing that we wanted for ourselves!
It also turns out that having a generic file storage system is useful for other features, as we discovered later.

A generic file service

The first implementation of our file storage system used a real filesystem on a large RAID array. To make the data available to our web servers, we used NFS. While in theory a nice and straightforward solution, unfortunately it all ended up being fairly horrible. We used the NFS server built into the Linux kernel at that time, and although it was supposed to be stable, that was not our experience at all. While all our other servers had many months of uptime, the file storage server running NFS would freeze/crash somewhere between every few days and every week. This was particularly surprising to us because we weren’t actually stressing it much; the workload wasn’t high compared to the IO that some file servers perform.

Having the NFS server go offline and people losing access to their files until it was rebooted was bad enough, but there was a much worse problem. Any process that tried to access a file on the NFS mount would freeze up until the server came back. Since the number of processes handling web requests was limited, all it took was a few hundred requests by users trying to access their file storage, and suddenly there were no processes left to handle other web requests, and ALL web requests would start failing, meaning no one was able to access the FastMail site at all. Not nice. We tried a combination of soft mounts and other flags, but couldn’t find a combination that was both consistently reliable and failure safe.

Apparently I have suppressed the memories — something to do with being woken by both young children AND broken servers, but Rob M remembers, and he says: “In one of those great moments of frustration at being woken up again by a crashed NFS server, Bron wanted to do a complete rewrite, and to use an entirely different way of storing the files. Instead of storing the file system structure in a real filesystem, we decided to use a database. However we didn’t want to store the actual file data in the database, that would result in a massive monolithic database with all user file data in it, not easy to manage. So the approach he came up with is a rather neat hybrid that has worked really well.” So there you go.

One of my first major projects at FastMail was this new file storage service. I was fresh from building data management tools for late-phase clinical trials (drugs, man) for Quintiles in New Jersey (that’s right, I moved from living in New Jersey and working on servers in Melbourne to living in Melbourne and working on servers in New York). I over-engineered our file storage system using many of the same ideas I had used for clinical data.

Interestingly, a lot of the work I’d been doing at Quintiles looked very similar in design to git, though it was years before git came out. Data addressed by digest (sha1), signatures on digests of lists of digests to provide integrity and authenticity checks over large blocks of data. That product doesn’t seem to exist any more though.

The original build of the file service was based on many of the same concepts: file version forking and merging (which was too complex and got scrapped), with a very clever cache hierarchy and invalidation scheme. Blobs (the file contents themselves) are stored in a key-value pool spread over multiple machines, with a push to many copies before the store is considered successful, and a background cleaning task that ensures they are spread everywhere and garbage collected when no longer referenced.

The blob storage system is very simple – we could definitely build or grab off the shelf something a lot faster and better these days, but it’s very robust, and that matters to us more than speed.
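The general shape of such a blob pool can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual code: the `BlobPool` class, the use of plain dicts as "machines", the sha1 key and the `min_copies` parameter are all made up for the example.

```python
import hashlib


class BlobPool:
    """Content-addressed blobs replicated across several stores."""

    def __init__(self, stores, min_copies=2):
        self.stores = stores          # each store is a dict: digest -> bytes
        self.min_copies = min_copies  # copies required before a put succeeds

    def put(self, data):
        digest = hashlib.sha1(data).hexdigest()
        written = 0
        for store in self.stores:
            store[digest] = data
            written += 1
            if written >= self.min_copies:
                break
        if written < self.min_copies:
            raise IOError("not enough copies written")
        return digest

    def get(self, digest):
        for store in self.stores:
            data = store.get(digest)
            # Re-hash on read so a corrupt copy is skipped, not returned.
            if data is not None and hashlib.sha1(data).hexdigest() == digest:
                return data
        raise KeyError(digest)

    def clean(self, referenced):
        """Background task: drop unreferenced blobs, then spread the rest."""
        for store in self.stores:
            for digest in list(store):
                if digest not in referenced:
                    del store[digest]
        for digest in referenced:
            try:
                data = self.get(digest)
            except KeyError:
                continue  # no surviving copy to replicate from
            for store in self.stores:
                store.setdefault(digest, data)
```

Keying blobs by a digest of their content means identical uploads share storage, and any copy can be verified on read.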

Interestingly enough, while the caching system was great when there was a lower volume of changes and slow database servers, it eventually became faster to remove a layer of caching entirely as our needs and technology changed.

Database backed nodes

The same basic architecture still exists today. The file storage is a giant database table in our central MySQL database. Every entry is a “Node”, with a primary key called NodeId, and a “ParentNodeId”. Node number 1 is treated specially, and is of class ‘RootNode’. It’s the top of the tree.

Because there are hundreds of thousands of top level nodes (ParentNodeId == 1), it would be crazy to read the key ‘ND:1’ (node directories, parent 1) for normal operations. Instead, we fetch “UA:$UserId” which is the ACL for the user’s userid, and then walk the tree back up from each ACL which grants the user any access, building a tree that way.

For example:

$ DEBUG_VFS=1 vfs -u brong@fastmail.fm ls /
INIT
Fetching ND:504452
/:
--
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching N:504452
d---   504452 2005-09-14 04:31:35 brong.fastmail.fm/
d---  1394098 2006-01-20 00:44:04 admin.fastmail.fm/
d---        2 2005-09-13 07:46:04 robm.fastmail.fm/

Whereas if we’re inside an ACL path we walk the tree normally from that ACL (we still need to check the other ACLs to see if they also impact the data we’re looking at…):

$ DEBUG_VFS=1 vfs -u brong@fastmail.fm ls '~/websites'
INIT
Fetching ND:504452
Fetching N:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching UT:485617
/brong.fastmail.fm/files/websites:
----------------------------------
darw  6548549 2007-03-27 06:03:19 cherubfest/
darw 39907741 2008-11-25 10:27:55 custom-ui/
[...]
darw 335168869 2014-10-19 23:51:38 talks/
Fetching NFZ:504465
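The walk back up from the ACLs can be sketched roughly like this in Python. The function name, the `parent_of` mapping and the node ids in the usage note are hypothetical; the real system fetches nodes through the cache keys shown in the traces above.

```python
def build_visible_tree(acl_nodes, parent_of, root=1):
    """Walk up from every node the user has an ACL on.

    Returns {parent_id: set of child ids} containing only the branches
    that lead to an ACL node, so the huge child list of the root is
    never read.
    """
    children = {}
    for node in acl_nodes:
        while node != root:
            parent = parent_of[node]
            children.setdefault(parent, set()).add(node)
            node = parent
    return children
```

For example, with `parent_of = {10: 1, 11: 10, 20: 1, 21: 20, 22: 21, 30: 1}` and ACLs on nodes 11 and 22, the top level of the resulting tree contains only nodes 10 and 20; node 30 is never touched.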

This structure also allows us to keep deleted files for a week, because when a node is “deleted”, it’s not actually deleted – it just gets an integer field “IsDeleted” set to the NodeId of the top node being deleted.

Let’s try that too:

[brong@utility2 ~]$ vfs -u brong@fastmail.fm mkdir '~/test/blog'
[brong@utility2 ~]$ echo "hello world" > /tmp/hello.txt
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog' /tmp/hello.txt
Failed to write: Could not open /brong.fastmail.fm/files/test/blog for writing: Is a directory 
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog/hello.txt'
hello world
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm rm '~/test/blog/hello.txt'
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
[brong@utility2 ~]$ vfs -u brong@fastmail.fm lsdel '~/test/blog/'
/brong.fastmail.fm/files/test/blog: (deleted)
---------------------------------------------
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm undel 387950177
restored 387950177 (hello.txt) as /brong.fastmail.fm/files/test/blog/hello.txt
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog/hello.txt'
hello world

And the versioning:

[brong@utility2 ~]$ echo "hello world v2" > /tmp/hello.txt 
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog/hello.txt' /tmp/hello.txt
Failed to write: Could not open /brong.fastmail.fm/files/test/blog/hello.txt for writing: File exists
[brong@utility2 ~]$ vfs -u brong@fastmail.fm -f cat '~/test/blog/hello.txt' /tmp/hello.txt
[brong@utility2 ~]$ vfs -u brong@fastmail.fm cat '~/test/blog/hello.txt' 
hello world v2
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm lsdel '~/test/blog/'
/brong.fastmail.fm/files/test/blog: (deleted)
---------------------------------------------
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm undel 387950177 oldhello.txt
restored 387950177 (hello.txt) as /brong.fastmail.fm/files/test/blog/oldhello.txt
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)

This means that if you delete a folder and all its contents, you can undelete it without also undeleting everything else that had already been deleted from inside it earlier, just by setting nodes with the same IsDeleted value as the top node back to zero again.

[brong@utility2 ~]$ vfs -u brong@fastmail.fm rm '~/test/blog/oldhello.txt'
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm lsdel '~/test/blog/'
/brong.fastmail.fm/files/test/blog: (deleted)
---------------------------------------------
-arw 387950177 2014-12-21 00:31:56 hello.txt (12)
-arw 387950745 2014-12-21 00:36:53 oldhello.txt (12)
[brong@utility2 ~]$ vfs -u brong@fastmail.fm rm '~/test/blog'
[brong@utility2 ~]$ vfs -u brong@fastmail.fm lsdel '~/test/'
/brong.fastmail.fm/files/test: (deleted)
----------------------------------------
darw 387950021 2014-12-21 00:38:20 blog/
[brong@utility2 ~]$ vfs -u brong@fastmail.fm undel 387950021
restored 387950021 (blog) as /brong.fastmail.fm/files/test/blog
[brong@utility2 ~]$ vfs -u brong@fastmail.fm ls '~/test/blog/'
/brong.fastmail.fm/files/test/blog:
-----------------------------------
-arw 387950597 2014-12-21 00:35:50 hello.txt (15)

So only the file that was in the folder when it was deleted is now present in the restored copy.
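The IsDeleted trick is easy to model. Below is a minimal Python sketch, assuming an in-memory table in place of the real database; the `NodeTable` class and its method names are made up for illustration.

```python
class NodeTable:
    """Toy node table with soft deletes keyed by the top deleted node."""

    ROOT = 1

    def __init__(self):
        self.rows = {}      # node_id -> {"parent", "name", "is_deleted"}
        self.next_id = 2    # node 1 is the root

    def add(self, parent, name):
        node_id = self.next_id
        self.next_id += 1
        self.rows[node_id] = {"parent": parent, "name": name, "is_deleted": 0}
        return node_id

    def children(self, parent, deleted=False):
        """Names of live children, or of deleted ones when deleted=True."""
        return sorted(
            row["name"]
            for row in self.rows.values()
            if row["parent"] == parent and (row["is_deleted"] != 0) == deleted
        )

    def delete(self, top):
        """Mark `top` and its live descendants with IsDeleted = top."""
        pending = [top]
        while pending:
            node = pending.pop()
            self.rows[node]["is_deleted"] = top
            # Nodes deleted earlier keep their own marker and are skipped.
            pending.extend(
                nid for nid, row in self.rows.items()
                if row["parent"] == node and row["is_deleted"] == 0
            )

    def undelete(self, top):
        """Restore exactly the nodes removed by the delete of `top`."""
        for row in self.rows.values():
            if row["is_deleted"] == top:
                row["is_deleted"] = 0
```

Deleting oldhello.txt, then deleting blog/, then undeleting blog/ leaves oldhello.txt deleted, which matches the session above.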

That’s the node tree. Alongside this is “Properties”, which are used not only for storing mime types of files (you can edit these in the filestorage) and the expansion state of directories in the left view on the Files screen, but can also be used via our WebDAV server to set any generic fields you like. There are ACLs and Locks which can also be used via DAV, and a Websites table which is used to manage our customer static-file web hosting.

The cache module can present any file as a regular Perl filehandle, including an append filehandle which already has the old content pre-loaded. Calling “close” on that filehandle returns a complete filestorage node on which other actions can be taken – and it does quota accounting on the fly. The whole system is object oriented and a joy to work with at the top level, despite all the weird bits underneath.

One thing – if you upload over the same filename again and again (webcams for example) we will only keep the most recent 30 copies, because we were finding that it slowed the directory listings too much otherwise.

File storage as a service

A long time ago, if you were in the middle of composing a draft, and you had uploaded a file, we stored that file in a temporary location on the web server’s filesystem. This was fine normally, but if we had to shut down one of our web servers, the session would move to another server, and the email would be sent without the attachment, because it couldn’t find the file.

We now keep uploaded files in a separate VFS directory for each user – and if you’re composing and we switch our web servers, you don’t even notice:

[brong@imap20 hm]$ vfs -u brong@fastmail.fm ls /brong.fastmail.fm
/brong.fastmail.fm:
-------------------
darw   504453 2014-10-18 16:13:19 files/
d-rw  6825201 2014-12-16 21:36:11 temp/

The home directory that you see in the Files screen is “/username.domain/files/” in the global filesystem. Your temporary files go into “/username.domain/temp/” with a special flag that says when they will be automatically deleted (default, 1 week from upload).

This has been very useful for other things too. We now upload imported calendar/addressbook files into VFS first via our general upload mechanism, and just use a VFS address for everything internally, meaning that the upload and the processing phase can be separate.

Calendar attachments are likewise done with an upload as a temporary file, and then a conversion into a long-lived URL later. Even on Mailr, which doesn’t have a file storage service for users, the file storage engine is used underneath for attachments and calendar.

Multiple ways to access

We use Perl as our backend language. We built both FTP and DAV servers, as well as our Files screen within the app. Another very valuable use of filestorage (indeed, one of the original reasons) is that you can copy attachments from emails into file storage, and attach from file storage into new emails. This is a great way to save having to download and upload giant attachments if you want to send them on to someone else.

As you can see from my example above, where my top level had 3 different directories, it’s possible to share file storage within a business. There’s a very powerful access control system to grant access only to individual folders, and separate read and write roles.

You can create websites, including password protected websites. We even have support for basic photo galleries.

Open Source?

A lot of the tooling on top of the VFS is based on open source modules, but hacked so much that they can’t really be shared easily back to the original authors. I have been more lax with this than I should have been, particularly with the DAV modules, where I took the Net::DAV::Server module and almost entirely rewrote it based on the RFC, including a ton of testing against the clients of the day, and adding things like Apple’s quota support.

I don’t have time during this blog series to release anything nice, but there are no real secrets in the tree, so I’ve put up a tar file at http://opensource.brong.fastmail.fm/ME-VFS.tar.gz which contains the current version of the files I’ve talked about in this post. Send me an email at the obvious address if you have comments or suggestions. The file itself is hosted on a FastMail static website!

A road not taken

All of this was built in 2004/2005. These days there are tons of “filesystem as a service” products out there. We could have built one – our VFS has some nice properties which would have made syncing with clients very viable – but our focus is on email, and we didn’t make the time to learn how to build good clients.

We actually have an experimental Linux FUSE module which can talk directly to our VFS, and I built some experimental Windows and Mac clients as well, but we never took it anywhere — though we did talk about the potential to pivot the company into the file storage business.

These days we have gone the other way, with Dropbox integration on the compose screen and hooks to add other services too. It makes sense to integrate with services which do what they are good at, and concentrate our efforts on being the best at our field, email.

Dec 20: Open-sourcing OvertureJS – the JS lib that powers FastMail

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 19th December was about Mailr. The following post covers our file storage backend.

Technical level: medium

Every so often we’re asked what library we use to build our awesome webmail. We don’t use Ember, Angular, React, or even jQuery. Instead, we use something we’ve developed internally over the last four years. And today, we’re making it open source under the MIT License. It’s called Overture, and it could be the start of your next great web app.

Overture is a library of most of the general-purpose frontend code that powers FastMail. It’s a powerful basis for building really slick web applications, with performance at, or surpassing, native apps.

At its heart, there’s a really powerful object system (originally inspired by SproutCore many years ago), which adds support for computed properties, observables (methods that trigger when properties change), events (which can pass from object to object) and bindings (to keep the property of one object in sync with another). The implementation is solid and extremely efficient, and allows you to write clear declarative-style code.

The datastore module provides the tools you need to manage the data in any CRUD-style application. It can keep track of client-side changes and efficiently synchronise with a server. Live-updating queries on the local cache allow your views to immediately update and feel completely responsive. Support for copy-on-write nested stores and a built-in Undo Manager give complete control over mutating data.

The view system uses the sugared DOM method to render views, which has many advantages over a standard template system, such as speed, XSS-invulnerability and access to the full power of JS, rather than a cut-down second programming language. The implementation in Overture lets you define views, which are essentially self-contained UI components, and easily nest other views inside.

There’s also one-line support for animating views. You declare the layout property and its dependencies, and Overture will handle animating it between the different states. Full support for drag and drop, localisation, keyboard shortcuts, inter-tab communication, routing and more mean you have everything you need to build an awesome app.

Overture is only 65 KB minified and gzipped, and has no dependencies (except for Squire if you want to use the rich text editing component). It supports all the browsers you’d expect including IE8+ (there’s a special little library you need to include for IE8 support).

There’s fairly complete API documentation available on the Overture website, and I’ve thrown together a classic Todo demo app. It’s a good showcase of what you can build in less than a day with Overture, and it’s heavily commented to help you understand what’s going on; use your browser web dev tools, or check out the source in the Overture repo, which is of course hosted on GitHub.

Dec 19: Mailr

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 18th December was about how we take payments. The following post on 20th December saw us open source our core JavaScript library.

Technical level: low

We don’t often mention individual customers, for obvious privacy reasons.

While the negotiations for the sale of FastMail back to the staff in Australia were concluding, another long term contract was completed.

The end result is called Mailr and will be launching early next year. This was one of our largest projects this year, both in terms of development time, and in terms of interesting changes to the FastMail platform.

Co-branding and reselling FastMail

We have always offered great options for partnering with us and reselling our service. At the most basic level, everybody can generate their own affiliate URL and be paid for bringing us customers.

At the business level, value added resellers can create sub-businesses to leverage our systems for their own hosted email solution, and create their own custom landing page.

The deepest standard level of integration was used for the discontinued MyOpera Mail platform. It was a complete co-branding, with custom themes and even custom features. We have always had support for completely white-label co-branding, as evidenced by the annoyingly hard to type messagingengine.com domain in all our settings (don’t worry, we have plans to move to easier-to-type names in the fastmail.com domain soon).

A whole new world

Telenor wanted something even more separated, a completely standalone version of the FastMail system, integrated with their Global Backend to give single signon and a central management and support channel across all their services. There are also jurisdictional hosting requirements for some of their business units which mean that in future, we will need the ability to run separate instances with hosting in particular countries.

To support those requirements, we had to add another layer of abstraction to the FastMail platform. We call each fully standalone copy an “island”, which may consist of machines in multiple datacentres.

Mailr’s island is built with physical and virtual machines at Softlayer. Softlayer is a great fit for email, because email is very IO heavy, and they build real hardware to spec. This allowed us to have machines which are very similar to our regular email servers, with their great IO capabilities. We can also spin up virtual machines quickly for the services with no local storage requirements, allowing fast ramp-up of extra capability.

With Mailr now available as a fully integrated service through Telenor’s global backend, Telenor business units all around the world can very easily integrate a high quality email experience as part of their service offering.

An email service

When I say standalone, Mailr is completely isolated. It is updated separately to our FastMail datacentres in New York and Iceland. The source code is stored in a single git repository still, but the list of staff with access is different, and updates are applied independently.

It also has its very own larsbot in a separate IRC channel. It’s called something different, but still answers to lars because that’s hard-coded into our fingers (OK, I have typed “last ack” more than once…). We don’t offer FastMail as a boxed piece of software – set and forget. Email services just don’t work that way – they have to interact with the rest of the world in real time, and the 24 hour on-call roster and monitoring are a key part of our service.

If your company wants to move away from running your own service, FastMail is a great option – we offer a fantastic user experience, are competitive on price and provide exceptional reliability. Hosting your email with us releases your best technical staff from the day-to-day grind of interacting with the rest of the world’s email servers and allows them time to add value to the things which are core to your business. Drop us a line at sales@fastmail.com.

Dec 18: Billing and Payments — a potted history

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 17th December was about how we test our software. The following post on 19th December is about one of our biggest projects this year.

Technical level: low

Billing is not very glamorous — but it is important.

If you’ve ever run a business, you know that it doesn’t matter how great your service or product is, you have to be able to bill your customers, or eventually you’ll no longer be able to provide that service or product.

The financial side of FastMail is broadly divided into two parts: billing — which is keeping track of which services are being used and how much people have paid, and payments — which is about actually collecting money from customers.

Billing

When FastMail started, our billing system was fairly ad-hoc, and pretty much the bare minimum we needed to keep track of user balances. When a user signed up, or paid, or renewed, we would

  • create a record with the effect on their balance and a description of the event, and
  • update any attributes affected, such as the service level, or subscription expiry date

This did the job for a few years, but as we slowly grew it became apparent that it was not sufficient. Manual adjustments caused problems, and it was difficult to extract information that our accountants needed.

So, we redid things in a kind of engineery-accounting way.

The basic record keeping part of the system has these properties:

  • data in the billing system is never changed or deleted, only added
  • every event is modelled as a bunch of square pulses with a start time, end time, resource type, and height of the pulse
  • a pulse may have no end time for non-expiring resources — in this case the pulse becomes a step function
  • if the resource type is “money” then all steps (money is non-expiring) are paired in such a way that this is equivalent to double-entry bookkeeping
  • each event has a separate time that the event occurred
  • the authoritative billing information is calculated by adding together all the pulses and steps.
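As a rough illustration of the pulse model, here is a small Python sketch. It is not our billing code: the `Pulse` fields, the integer day numbers, the "money/..." account names and the helper functions are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pulse:
    resource: str               # e.g. "subscription" or "money/cash"
    height: int                 # how much of the resource is added
    start: int                  # day number on which the pulse begins
    end: Optional[int] = None   # None means non-expiring (a step function)


def level_at(ledger, resource, when):
    """Authoritative level of a resource: the sum of all active pulses."""
    return sum(
        p.height
        for p in ledger
        if p.resource == resource
        and p.start <= when
        and (p.end is None or when < p.end)
    )


def money_transfer(debit, credit, amount, when):
    """Money steps come in pairs that cancel, as in double-entry books."""
    return [Pulse(debit, amount, when), Pulse(credit, -amount, when)]


# The ledger is append-only: a cancellation is a new negative pulse,
# never an edit to the original record.
ledger = [Pulse("subscription", 1, 0, 365)]
ledger.append(Pulse("subscription", -1, 100, 365))   # cancelled on day 100
ledger += money_transfer("money/cash", "money/deferred_revenue", 40, 0)
```

With this ledger, `level_at(ledger, "subscription", 50)` is 1 and `level_at(ledger, "subscription", 200)` is 0, and the original purchase is still there to be reported on.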

On top of that record keeping are a bunch of simple views and convenience methods for commonly used queries and actions, such as resources which are always linked to the duration of the current subscription (this was the case for Extra Aliases and Extra Domains, when we offered those “a la carte”).

This allowed us to do a number of things that were previously difficult or impossible. We can:

  • See exactly how many resources (e.g. subscriptions) were being used for any time in the past,
  • Reconstruct all the transactions and purchases that led to the current state — this is important to show customers a statement of account,
  • Easily and accurately calculate pro-rata changes, even when prices have changed since the original purchase,
  • Audit our user attributes to make sure that they match the information in our billing system,
  • Reconcile our billing information with our payment gateway, and
  • Report on things that are important to our accountant, like “deferred revenue” and “unrealised FX gain / loss”
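
As an illustration of the pro-rata point, the following Python sketch shows one plausible way to price a mid-period plan change. The function names and the day-based proportion are assumptions made for the example, not a description of the real calculation.

```python
from datetime import date, timedelta
from decimal import Decimal


def unused_fraction(start: date, end: date, today: date) -> Decimal:
    """Fraction of the period [start, end) still unused on `today`."""
    total_days = (end - start).days
    remaining_days = (end - max(today, start)).days
    return Decimal(max(remaining_days, 0)) / Decimal(total_days)


def upgrade_charge(price_paid: Decimal, new_price: Decimal,
                   start: date, end: date, today: date) -> Decimal:
    """Charge for switching plans part-way through a paid period.

    `price_paid` is what was recorded at purchase time, so the credit
    stays correct even if the price list has changed since then.
    """
    fraction = unused_fraction(start, end, today)
    credit = price_paid * fraction   # value of the unused old plan
    cost = new_price * fraction      # new plan for the same remaining time
    return (cost - credit).quantize(Decimal("0.01"))


start = date(2014, 1, 1)
end = start + timedelta(days=100)
today = start + timedelta(days=25)   # 75 of 100 days remain
print(upgrade_charge(Decimal("20"), Decimal("40"), start, end, today))  # 15.00
```

The key design point is that the credit is computed from the recorded purchase, not from whatever the plan costs today.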

This billing system has now been in use for the past 10 years, with only minor changes needed in that time. Part of the reason for the small number of changes is that over time our pricing and product offerings have become progressively simpler.

Aside:

Some of our users don’t like this, because they like to optimise their subscription so that they are paying for exactly what they need, and no more. Most of our users, though, find the all-inclusive model to be much more appealing.

Some things that are still lacking in our billing system are automated “discount coupon” and “special prices for charities” functionality. I’d like to add these one day.

Payments

As a credit card merchant, an important thing is the chargeback rate — that is, the percentage of your payments that the cardholders dispute with their bank. If this rate gets too high, then the payment gateways deem you to be risky, and they will charge much higher fees, or cancel your account completely.

When we first started out we used integrated billing only for credit cards. This was done via a payment gateway — we chose Worldpay. In those days, the only payment gateways were banks, and Worldpay was then part of the Royal Bank of Scotland. Banks are pretty conservative by nature and we had to go to some lengths to reassure them that we were a good risk.

All other payment methods were manual — someone would send us a cheque or a PayPal payment or a telegraphic transfer, and we would try to find the FastMail user and apply the credit to that account.

We were fairly successful at keeping the chargeback rate low — we have a number of fraud checks, and we try to refund disputed payments promptly. So, we got along just fine with Worldpay.

Over time, Worldpay’s APIs improved and we were able to improve the user interface, do automated reconciliation, and even perform delayed capture.

However, when FastMail was bought by Opera in 2010, Worldpay would not let us keep our merchant account under the new structure. We would need to reapply for a new merchant account, and Worldpay’s policies had changed, so we would now need to keep a much larger deposit — so large that it wasn’t feasible.

Finding a payment gateway that had acceptable deposit terms turned out to be difficult. At the time, it seemed to be a common practice to require merchants to provide a continual deposit of between three and twelve months of revenue! The mindset among many of the payment gateways appeared to be that the risk of total bankruptcy of their merchants was so high that they could not tolerate any possible exposure to chargebacks, ever.

We did a lot of searching around and found that even as recently as 2010, it was hard to find a payment gateway that met our requirements, which were:

  • take payments in USD
  • take payments from anywhere in the world
  • pay into a USD bank account in Australia
  • support delayed capture
  • have acceptable deposit terms
  • not require us to see the most PCI-DSS sensitive data

This last requirement was important, because we were not a huge company, and the cost of maintaining (and certifying) compliance with the higher levels of PCI-DSS is significant.

But we still wanted to be able to conduct recurring billing to make it easy for users to renew their subscriptions, and also so that our customers could purchase “a la carte” additions to their account.

The solution is for us to redirect to the payment gateway (in 2010) or use javascript (now) to send card information directly from the user’s web browser to the payment gateway. In both cases, the communication is directly between the user’s web browser and the payment gateway, without passing through our servers. The payment gateway would then give us a token that we could use to process additional payments when necessary.

This means that even if our servers were compromised, the attacker would not be able to steal the card details of our users from our system. The best way to ensure confidentiality is not actually having the data in the first place!

We eventually found a payment gateway that met all of our requirements, and signed up with Global Collect.

An advantage of Global Collect was that it was able to process payments of many different methods, so in theory it was able to act as an abstraction layer, and allow us to easily take payments using almost any scheme, including local bank transfers in many different countries. In practice, there was a fair amount of work needed, and in the end the only additional payment method that we used was automatic payments via PayPal.

A substantial amount of effort was required to make things work with Global Collect: all the integration points were slightly different from Worldpay and the failure modes were different too. There was substantially more work involved in the “non-payment” part of the integration. This includes reconciliation with the payment gateway (to make sure that FastMail has credited all the payments to the right users, even if we didn’t initiate them), and dealing with payments that unexpectedly change status (from Succeeded to Failed or vice versa) some days or months after the actual payment.
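
The reconciliation step can be pictured with a small Python sketch. It assumes, purely for illustration, that both sides can be reduced to a mapping from payment id to status; the function name `reconcile` and the status strings are hypothetical.

```python
from typing import Dict, List, Tuple


def reconcile(ours: Dict[str, str],
              gateway: Dict[str, str]) -> Dict[str, List]:
    """Compare our payment records with the gateway's report.

    Both arguments map a payment id to a status such as "succeeded".
    """
    # Payments the gateway took that we never credited to a user.
    missing_locally = sorted(pid for pid in gateway if pid not in ours)
    # Payments we recorded that the gateway has no trace of.
    unknown_to_gateway = sorted(pid for pid in ours if pid not in gateway)
    # Payments whose status changed after we recorded them.
    status_changed: List[Tuple[str, str, str]] = sorted(
        (pid, ours[pid], gateway[pid])
        for pid in ours.keys() & gateway.keys()
        if ours[pid] != gateway[pid]
    )
    return {
        "missing_locally": missing_locally,
        "unknown_to_gateway": unknown_to_gateway,
        "status_changed": status_changed,
    }


ours = {"p1": "succeeded", "p2": "succeeded"}
gateway = {"p1": "succeeded", "p2": "failed", "p3": "succeeded"}
report = reconcile(ours, gateway)
print(report["missing_locally"])   # ['p3']
print(report["status_changed"])    # [('p2', 'succeeded', 'failed')]
```

Each of the three lists corresponds to a different follow-up: crediting a user, investigating a phantom record, or reversing a payment that later failed.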

This refactoring was not a lot of fun, so to try to reduce this in the future, a lot of groundwork was laid for our own payment abstraction layer.

We attempted to move all the Worldpay card data into Global Collect, so that customers would not have to re-enter their billing details. This generally worked technically but had some giant wrinkles. As a standing authorisation was created in Global Collect for each user, an “auth” of $1.00 was processed, which never appeared on customers’ card statements. For most customers this was fine. A small number of customers had their bank contact them about the $1.00 charge, which they didn’t know about, and an even smaller number of customers had their cards summarily cancelled because their bank deemed that the charges were probably fraudulent. None of this showed up in our testing with our own credit cards.

The whole point of this effort was to make things as convenient as possible for our customers, but the only people who noticed were those who were so severely inconvenienced that we wished we’d never tried. It was a good lesson for future migrations though.

The “delayed capture” feature worked very well with Global Collect. We would process an “auth” which would reserve the funds, and then a few days later we would “capture” the payment, which would actually take the funds from the user’s card.

Spammers and scammers are often trying to use our systems. Many of these are kept away by the requirement to pay, but a few determined spammers make use of stolen payment credentials to sign up accounts. Often we are alerted to abuse of our system within a few days, and if we find out before the payment is “captured”, then we can cancel the payment. In this case, the cardholder will never see a transaction on their paper statement.
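
The auth, capture and cancel flow described above can be sketched as a small state machine in Python. The three-day delay, the class names and the in-memory status change are assumptions for illustration; a real system would call the payment gateway at the marked points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List


class Status(Enum):
    AUTHORISED = "authorised"   # funds reserved, nothing taken yet
    CAPTURED = "captured"       # funds actually taken from the card
    CANCELLED = "cancelled"     # auth released before capture


@dataclass
class Payment:
    payment_id: str
    authorised_at: datetime
    status: Status = Status.AUTHORISED


CAPTURE_DELAY = timedelta(days=3)   # assumed delay between auth and capture


def cancel(payment: Payment) -> None:
    """Release an auth, e.g. after abuse is detected before capture."""
    if payment.status is not Status.AUTHORISED:
        raise ValueError("only an uncaptured auth can be cancelled")
    payment.status = Status.CANCELLED   # a real system calls the gateway here


def capture_due(payments: List[Payment], now: datetime) -> List[str]:
    """Capture every auth that is old enough; return the ids captured."""
    captured = []
    for p in payments:
        if p.status is Status.AUTHORISED and now - p.authorised_at >= CAPTURE_DELAY:
            p.status = Status.CAPTURED  # a real system calls the gateway here
            captured.append(p.payment_id)
    return captured


t0 = datetime(2014, 12, 1)
good, stolen = Payment("p1", t0), Payment("p2", t0)
cancel(stolen)                          # fraud spotted within the delay
print(capture_due([good, stolen], t0 + timedelta(days=3)))  # ['p1']
print(capture_due([good, stolen], t0 + timedelta(days=4)))  # []
```

Checking the status before capturing means that running the job twice captures nothing the second time.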

In this situation we know that the cardholder’s credentials have been stolen, and are being actively used for fraud on the internet (because they have just been used to try to defraud us). However, we are repeatedly told by all the payment gateways and banks we have asked that there is NO WAY for us to tell the card issuing bank or the cardholder that those details are compromised.

Fast-forward a few more years to 2013, when FastMail was split from Opera. Unfortunately the merchant agreement was now no longer applicable, so we needed to change to a different payment gateway.

When it seemed likely that we would have to switch payment gateways again, we pushed ahead with realising our payment abstraction layer. This would allow us to easily support multiple payment gateways at the same time, and direct users to the appropriate payment gateway.

Aside:

It’s possible that a service like Chargify might have been able to provide this for us, but that would have required us to refactor all our billing code as well. Also, I didn’t know about Chargify or similar services then.

By this time, there were a number of the new breed of “disruptive” payment gateways around — such as Pin Payments, Stripe, Braintree, and others. These don’t require large deposits from their merchants.

This means they are exposed to a chargeback risk, but have evaluated risk as being small enough that they don’t need every merchant to carry a deposit large enough to cover the maximum chargeback risk.

They effectively keep about a week of revenue as a deposit, by paying out funds a week after they were taken from the merchant’s customers. This is very convenient for a merchant as it means you can just dip a toe into the waters of a payment gateway, without taking a plunge.

After the difficulties encountered when importing data into Global Collect, we decided not to use any data migration schemes this time. This meant our customers who already had a billing agreement with us had to enter their billing details again with a new payment gateway.

We selected Pin Payments to handle the bulk of our credit card payments, and have been pretty happy with them in general.

They had a “delayed capture” feature already, and when we asked if that could be automated (so that payments were automatically captured after 3 days) they were happy to add it. Unfortunately this came back to bite us — it turned out that the payment gateway had added a special case to deal with the automatic delayed capture, and this special case was not fully covered in their internal testing. When they made some internal changes later on, this caused a bug, and a bunch of payments were captured a second time, and a third time and a fourth! This was a giant headache to fix, and made us look pretty bad to the affected customers. As a result, we’ll shortly be doing the delayed capture ourselves — we don’t want to be skipping test cases in our payment gateway.

Speaking of “delayed capture”, it doesn’t work quite as cleanly with Pin Payments as it used to with Global Collect. Depending on the card and the bank, sometimes the “auth” appears as a transaction on the card, and then the “capture” appears as another transaction. Affected customers will see two transactions on an online statement. After a while, the “auth” transaction will completely disappear (and it will never appear on paper statements) but there is a period when customers may be concerned. We have many reports of this occurring with Pin Payments, and the Stripe documentation also mentions that this may happen. We didn’t get any reports of this when we were using Global Collect.

In addition, when we started using Pin Payments they didn’t support American Express payments. Many of our business customers prefer to pay with American Express, so this was a challenge. We worked around this by initially advising American Express cardholders to pay via PayPal, and later by automatically sending those payments via Stripe.

A benefit of having our payment abstraction layer was that it was easy for us to gradually phase in payments via Pin. We started off with a couple of percent of payments, and slowly increased the percentage as we grew more confident that everything was going according to plan.

The abstraction layer makes it straightforward to integrate with PayPal directly, which we’ve had to do since late 2013.

It also means that using new payment gateways is pretty straightforward: an implementation class needs to be written, along with a reconciliation script. As an experiment, we’ve processed a small percentage of payments via Stripe, and we’re confident that we could switch if there was ever a persistent problem with Pin Payments.
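
A payment abstraction layer of this kind might look roughly like the Python sketch below. Everything in it is hypothetical: the `PaymentGateway` interface, the `FakeGateway` stand-in and the hash-based `Router` are assumptions chosen to show the shape of the idea, not the real implementation.

```python
import hashlib
from abc import ABC, abstractmethod


class PaymentGateway(ABC):
    """What every gateway implementation class has to provide."""

    name: str

    @abstractmethod
    def charge(self, token: str, amount_cents: int) -> str:
        """Charge a stored token and return the gateway's payment id."""


class FakeGateway(PaymentGateway):
    """Stand-in for a real integration; it makes no network calls."""

    def __init__(self, name: str) -> None:
        self.name = name

    def charge(self, token: str, amount_cents: int) -> str:
        return f"{self.name}:{token}:{amount_cents}"


class Router:
    """Send a fixed percentage of users to the new gateway."""

    def __init__(self, old: PaymentGateway, new: PaymentGateway,
                 new_percent: int) -> None:
        self.old, self.new, self.new_percent = old, new, new_percent

    def pick(self, user_id: str) -> PaymentGateway:
        # Hash the user id into a stable bucket from 0 to 99, so the same
        # user always lands on the same gateway for a given percentage.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return self.new if bucket < self.new_percent else self.old


old, new = FakeGateway("old-gateway"), FakeGateway("new-gateway")
router = Router(old, new, new_percent=2)   # start with a couple of percent
gateway = router.pick("user-123")
print(gateway.charge("tok_example", 4000))
```

Raising `new_percent` step by step phases in the new gateway, and the calling code never needs to know which gateway it was given.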

And, after testing for a few months, today we’ve also added bitcoin payments (via BitPay) in our main web interface.

Dec 17: Testing

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 16th December was about Confidentiality at FastMail. The following post on 18th December is about how we get money.

Technical level: low

You wake up, check your email and notice that a new Calendar feature has appeared overnight.

Behind the scenes, several months of planning and hard work go into providing our customers with dozens of innovative features before they’re released. Hundreds of thousands of customers connect to FastMail using devices ranging from feature phones to the latest iOS devices for business or personal email, web storage, notes, their calendar and address book. Our previous blog post on regular system monitoring explained how we keep systems running with near 100% uptime.

This blog post examines the development process at FastMail, specifically focusing on introducing new features and testing.

Almost every feature you use in the web interface has been written from scratch at some point over the past 15 years. Many features are being migrated from the classic interface (which we maintain to support backwards compatibility) to the new AJAX interface. We test these features to help determine when we think they’re ready for release.

Each week FastMail’s developers add hundreds of lines of code to our Git repository. Each line of code could potentially make our service unusable for one or more customers, break a feature on a specific platform or break the work-flow someone has been using for years.

Of course it’s impossible to verify that every feature still works in every possible way for every user every time anyone changes something – we would never release anything.

Anyone who has ever ended up on the advanced screen will know we provide significantly more than the average email service. Given the number of features we have and the number of customer scenarios, there are potentially an unlimited number of things we could test, but only 38 hours in our working week. To prioritise testing, we do risk-based testing to focus our testing effort on what we think is likely to have broken as a result of any significant changes. Any developer who is working on a major change, feature or project informs our QA engineer, who tries to break the feature. His job is to consider how a feature may be used by our customers, how someone with less technical experience and knowledge than the person who developed the feature may use it, and how a malicious user may use or abuse the newly introduced/changed feature.

In addition to manual testing, we use Selenium which allows us to test the entire website in real web browsers (e.g. Chrome, Firefox, IE, etc) and control the web browser from code, so we can simulate clicks, mouse movements, etc just like a real person actually using the site would do.

As a small team it isn’t practical to maintain a copy of every version of every operating system with every version of every browser, so we use SauceLabs for most automated functional tests. Selenium WebDriver scripts perform actions such as logging in from a variety of browsers, navigating around the web interface, adding events and sending emails. Our tests run constantly so it becomes obvious who broke what when things go wrong.

The Selenium tests are run using Jenkins for each git commit. Jenkins failures are reported to our QA engineer who verifies whether an issue is a genuine failure and not an intended change before informing the appropriate developer.

We take security seriously and are happy to pay out reasonable bounties for serious bugs. This year we introduced a bug bounty program which has seen a number of issues fixed that could have otherwise gone unreported.

FastMail run a beta version of the web mail service where customers can try new features before they’re available to everyone. Major features are available on beta for several weeks before they go live for all customers. If nothing significant is found internally and if Jenkins looks happy, we leave the change on our beta site for a few more days before making the new feature available to everyone. Beta customers provide valuable feedback via our beta feedback email address.

With the exception of emergency fixes, most changes go on to our beta server and are available for staff to test as well as anyone who’s interested in trying the latest features, in exchange for some instability. Calendar for example was available on beta for several months before it was launched in June with hundreds of issues being found and fixed during the beta period.

Most issues are discovered and fixed within a day. Internal communication is either done in person, via email or IRC. Issues that take a bit longer to fix go into our bug tracking tool. Every software development team needs a bug tracking tool to keep track of known issues and help determine when a project is near completion. For historic reasons we were using JIRA. Unfortunately we’d recently been experiencing significant performance problems with the OnDemand hosted JIRA platform (the time to load the first page each day was >30 seconds in some cases!), and so ended up looking for an alternative. We moved to YouTrack, which so far has been performing much better for us.

We believe we offer the best email service on the market. However, it’s always possible to improve and we value all constructive feedback sent via our beta program. If you would like to try the latest features before they’re released and help improve the quality of service we provide, log in to the beta site at beta.fastmail.com.

Dec 16: Security – Confidentiality

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 15th December was about how we load the initial mailbox view quickly. The following post is about how we test our changes.

Technical level: low-medium

Information wants to be free.

It’s the catch phrase of everyone who wants the latest episode of their favourite show, or the latest song, and knows they can get it for free more easily and in a less encumbered format than if they were to pay for it.

It’s also a fact of life on the internet. An attacker only has to find one flaw, and once they have data, it can be copied endlessly.

The Landscape

Confidentiality is about keeping data private. This includes protecting against threats like: pervasive surveillance, identity theft, targeted data theft, and activist hacking.

And that’s just active attacks. If you’re not paying for the service, then your data is probably being sold directly to pay for the costs of keeping it online.

Hosting Jurisdictions and Confidentiality

The headline “security risk” that everyone thinks of when talking about hosting jurisdiction and security is that the NSA or equivalent national spying agency will insert covert data taps, with or without the cooperation of the target service.

In fact, that’s it. The only jurisdiction dependent risk is that the national intelligence agency of the host country wants to access data, but they don’t want it enough to resort to illegal means or just making a special deal for access.

Other avenues of attack are either delivered over the internet (hacking, compromised hardware, viruses) or done by subverting/bribing/blackmailing service or data centre staff. If a determined attacker has the budget and agents to find the right person and apply the right pressure, these risks are present anywhere: any country, any data centre.

Mind you, credential theft (compromise of individual accounts rather than the entire service) happens all the time – whether through keyloggers on individual machines, viruses, password reuse from other sites which have been hacked, or just old-fashioned phishing. We find that compromised accounts are frequently used to send spam (taking advantage of our good reputation as a sender) rather than having their data stolen.

Non-Jurisdiction-Dependent Risks

There are data centre specific risks like physical security, trustworthiness of employees, resistance to social engineering attacks – and then there’s everything else.

The majority of possible attacks can be carried out over the internet, from anywhere.

Confidentiality at FastMail

The most important thing for confidentiality is that all our accounts are paid accounts. We don’t offer Guest accounts any more, and we don’t even offer “pay once, keep forever” Member accounts. Both these account types have a problem – they don’t keep paying for themselves. That leaves us hunting for alternative sources of income. We flirted with advertisements on Guest accounts at one stage, but we were never really comfortable with them – even though they were text only and not targeted. Ads are gone, and they’re not coming back.

We are very clear on where we stand. We provide a great service, and we proudly charge for it. Our loyalties are not divided. Our users pay the bills – we have no need to sell data.

We have spelled out in our privacy policy and public communications that we don’t participate in blanket surveillance. We are an Australian company, and to participate in such programs would be in violation of Australian law.

We frequently blog about measures we take to improve confidentiality for our users:

There are also other things we do which don’t have blog posts of their own:

  • Physically separate networks for internal data and management. We don’t use VLANs in shared switch equipment, there’s an actual air gap between our networks.
  • All the machines which have partitions containing user data (email, database, backups, filestore) are only connected to the internal production and internal management networks, and have no external IP addresses.
  • All user data is encrypted at rest, meaning there is no risk of data being recovered from discarded hard disks.
  • Our firewall rules only allow connections to network ports for services which are explicitly supposed to be on those machines, and the ports are only opened after the correct service is started and confirmed to be operating correctly.
  • Our management connections are via SSH and OpenVPN, and are only allowed from a limited set of IP addresses, reducing our exposure to attacks.
  • We follow all the basic security practices like choosing software with good security records, not allowing password-based login for ssh, applying security patches quickly, and keeping on top of the security and announcement mailing lists for our operating system and key pieces of software.

Our goal is to make the cost of attacking our security much higher than the value of the data that could be stolen. We follow due process when dealing with law enforcement, providing individual data in response to the appropriate Australian warrant, so there is no justification to attempt wholesale surveillance of all our users.

We believe our security is as good as or better than anyone else in the same business. Of course we have had bugs, just like everyone, and we offer a generous bug bounty to encourage security researchers to test our systems. We recently had a respected independent security firm do a security audit, in which they had full access to our source code and a clone of our production systems (but not to any customer data or production security keys). They did not find any significant issues.

We are very happy with the trustworthiness and physical security at the NYI data centre where we host our data. I have visited a few times – I needed to have my name on the list at the lobby of the building to get a pass that would activate the lift, and then be escorted through two separate security doors to gain access to the racks with our servers, which are themselves locked. The staff are excellent.

Balancing Confidentiality, Integrity and Availability, I believe that hosting at NYI provides great security for our users’ data. Moving elsewhere would be purely security theatre, and would discard NYI’s great track record of reliability and availability for no real improvement in confidentiality.
