Copyright © 1999–2016 FastMail Pty Ltd
This is a copy of a post I put in our forum explaining why it took us some time to get replication set up.
The initial issue that made us realise we had to implement some form of replication occurred in November 2005 (http://blog.fastmail.fm/?p=521), when corruption on one of our major volumes caused 3 days of downtime. After that, we started working on how we were going to get replication set up. On the whole, the process went slower than expected. I'd put this down to a few things:
1. The cyrus replication code wasn't really production ready.
We knew this when we started, and thought about our options which really were:
- Use cyrus replication and help bring it up to production readiness
- Use some other replication method (e.g. block based replication via DRBD - http://www.drbd.org/)
We decided to go with cyrus replication because with block level replication, you're still not protected from kernel filesystem bugs. If the kernel screws up and writes garbage to a filesystem, both the master and replica are corrupted. Protection against filesystem corruption was one of our major goals with replication.
This wasn't really that crazy because we knew the main replication code itself came from David Carter at Cambridge (http://www-uxsup.csx.cam.ac.uk/~dpc22/), so the original code was used in a university environment. The problems were really to do with integrating those changes into the main cyrus branch and accommodating other new cyrus 2.3 features, so we thought it wouldn't be that much work.
Unfortunately it seemed that not that many people were actually using cyrus 2.3 replication, so ironing out the bugs took longer than expected. Additional problems included CMU adding largish new features (modsequence support) to cyrus within the 2.3 branch itself that totally broke replication.
Still, we spent quite a bit of time setting up small test environments for replication and ironing out the bugs along with a few others. Unfortunately, even after rolling out, there were still other bugs present, and the CMU change that broke replication was damn annoying since it wasn't immediately obvious and caused some downtime when we had to switch to the replica (basically replication appeared to work fine, but when you actually tried to fetch a message from the replica, it turned out to be empty). After that disaster we implemented some code that allows us to run a replication "test" on users, to check that what the master IMAP server presents to the world is exactly the same as what the replica IMAP server presents to the world. We now run that on a regular basis.
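To give a rough idea of what such a replication "test" involves, here is a minimal Python sketch. It is not our actual tool: the snapshot format (folder name to UID to flags and raw message bytes) and the function names `snapshot` and `compare` are assumptions made for illustration, and fetching the data from the two IMAP servers is left out. The point is that comparing a digest of the message content, not just the message list, catches the "message exists but is empty" case described above.

```python
import hashlib


def snapshot(folders):
    """Reduce one server's view of a user to something comparable.

    `folders` is a hypothetical structure:
        {folder_name: {uid: (flags, raw_message_bytes)}}
    Each message becomes (flag set, size, SHA-1 of the raw bytes).
    """
    return {
        name: {
            uid: (frozenset(flags), len(raw), hashlib.sha1(raw).hexdigest())
            for uid, (flags, raw) in messages.items()
        }
        for name, messages in folders.items()
    }


def compare(master, replica):
    """Return a list of human-readable differences between two snapshots."""
    problems = []
    for name in sorted(set(master) | set(replica)):
        if name not in replica:
            problems.append(f"{name}: folder missing on replica")
            continue
        if name not in master:
            problems.append(f"{name}: folder missing on master")
            continue
        m, r = master[name], replica[name]
        for uid in sorted(set(m) | set(r)):
            if uid not in r:
                problems.append(f"{name} uid {uid}: missing on replica")
            elif uid not in m:
                problems.append(f"{name} uid {uid}: missing on master")
            elif m[uid] != r[uid]:
                problems.append(f"{name} uid {uid}: flags or content differ")
    return problems


if __name__ == "__main__":
    master_view = {"INBOX": {1: (["\\Seen"], b"Subject: hi\r\n\r\nhello")}}
    # The replica lists the same message, but its body is empty.
    replica_view = {"INBOX": {1: (["\\Seen"], b"")}}
    for line in compare(snapshot(master_view), snapshot(replica_view)):
        print(line)
```

A check that only compared folder and UID lists would pass in the example above; the size and digest comparison is what flags it.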
A few example postings to the cyrus mailing list with some details
2. Our original replication setup was flawed
There are a number of ways to do replication. The most obvious is to have one machine as the master and a separate one as the replica. That's a waste, however, because the replica doesn't use as many resources as the master (one writer, no readers). So our plan was to have replica pairs, with half masters on one machine replicating to half replicas on the other, and vice-versa. This would provide better performance in the general case when both machines were up.
The problem with this is that it turned out to be a bit inflexible, and when one machine goes down, the "master" load on the other machine doubles. It also means the second machine then becomes a single point of failure until the other machine is restored. Neither of these is nice.
After a bit of rethinking, we came up with the new slots + stores architecture (see Bron's posts elsewhere). Basically everything is now broken into 300G "slots", and a pair of these slots on 2 different machines makes a replicated "store". The nice thing about this approach is that:
- Each machine runs multiple cyrus instances. Each instance is smaller, can be stopped & started independently, can be moved more easily, restored more quickly from backup if needed, volume checked more quickly, etc. Smaller units are just easier to deal with
- By spreading out each store pair to different machines, when one machine dies, the load is spread out to all the other servers evenly
- Even after one machine dies, a second machine dying would only affect maybe one or two slots, rather than a whole machine's worth of users
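The effect of spreading store pairs across machines can be illustrated with a small Python sketch. This is a toy model, not our management tooling: the machine names, the one-store-per-ordered-pair layout produced by `spread_layout`, and all function names are assumptions chosen for illustration. It counts how many stores each surviving machine has to act as master for after a failure, and how many stores have lost both copies.

```python
from collections import Counter
from itertools import permutations


def spread_layout(machines):
    """One store per ordered (master, replica) pair of distinct machines."""
    return list(permutations(machines, 2))


def paired_layout(a, b, stores_each):
    """The original design: two machines, each mastering half the stores."""
    return [(a, b)] * stores_each + [(b, a)] * stores_each


def master_load(stores, failed=None):
    """Stores each machine serves as master; replicas take over for `failed`."""
    load = Counter()
    for master, replica in stores:
        load[replica if master == failed else master] += 1
    return dict(load)


def stores_lost(stores, failed):
    """Stores with both their master and replica slot on failed machines."""
    return sum(1 for m, r in stores if m in failed and r in failed)


if __name__ == "__main__":
    spread = spread_layout(["a", "b", "c", "d"])
    print(master_load(spread))               # 3 master stores per machine
    print(master_load(spread, failed="a"))   # b, c and d go from 3 to 4 each
    print(stores_lost(spread, {"a", "b"}))   # 2 of the 12 stores lose both copies

    paired = paired_layout("a", "b", 3)
    print(master_load(paired, failed="a"))   # b goes from 3 to 6: load doubles
    print(stores_lost(paired, {"a", "b"}))   # all 6 stores lose both copies
```

In this toy model, losing one of four machines raises each survivor's master load by a third instead of doubling it, and a second failure takes out only the two stores shared by that particular pair of machines.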
The downside to this solution is management. There are now many, many slots/stores to deal with, which means we had to write management tools.
Had we gone with this from the start, it would have saved time. On the other hand, it was only really clear that this was a better solution after we went down the first road and saw the effects. Hindsight is a wonderful thing.
3. The original servers we bought proved to be less reliable than expected.
Because we knew we had replication, and because we knew we had a very specific setup we wanted (2U server, 12 drives, 8 x high capacity SATA, 4 x high speed SATA, RAID controller with battery backup, etc) that IBM couldn't deliver, we went with a third party supplier. (http://blog.fastmail.fm/?p=524)
Suffice to say, this was a mistake. There is a big difference between hardware that runs stable for years and hardware that runs stable for months. Replication should be more of a "disaster recovery" scenario, or a "controlled failover" scenario; it shouldn't replace very reliable hardware.
We went back to equipment we trusted (IBM servers + external SATA-to-SCSI storage units). It's a pity IBM are now 2.5 months late on delivering the servers they promised us. Trust me, we've already complained to them pretty severely about this. It's lucky we were able to re-purpose some existing servers for new replicated roles.
So all up, how would I summarise?
Had we followed the "perfect" path from the start, things would have gotten to the fully replicated stage faster, though not enormously so. The debugging and software stage still took quite some time, but it was more the hardware that slowed us down. On the other hand, the "perfect" path is often only visible with the benefit of hindsight. Additionally, by following some dud paths now, you learn not to take them again in the future.
I've mentioned this in other posts now, but I should re-iterate that 85% of users were on replicated stores when this failure occurred. As Bron has mentioned, had it happened 1-2 weeks later, no-one would have noticed because that machine would have been out of service. This is actually part of the reason that, as soon as the restore was done, we could say "everyone was replicated". So it's not like 11 months had passed and nothing had happened:
- We'd chosen, tested and helped debug a replication system
- We'd built 2 actual replication setups, scrapping the first after we realised a better arrangement
- We'd bought and organised 2 sets of extra hardware
- We'd already moved 85% of our user base to completely new servers