All users’ email is now on replicated servers. This means that every email delivered or deleted, and every other email action performed, is replicated within a second to a completely separate server holding its own complete copy of all users’ email.
We now have at least three levels of redundancy and three copies of every email, and all of those copies are themselves on redundant RAID storage.
All users now have their email stored on a system with RAID disks and all servers and RAID arrays have dual power supplies.
This means a single drive or power supply failure should cause no interruption to service at all: we just replace the drive or power supply while the system is live and online. Hard drives and power supplies are the most common failing hardware components in computer systems.
All users now have their email replicated to an identical replica system (RAID drives, dual power supplies, etc.). Each system is completely separate, with its own operating system, filesystem, drives, power, connections, etc. The replication is performed at the semantic email level, not at the filesystem level, so filesystem corruption on the source server will not be replicated. This means that if there is disk or filesystem corruption on a single machine, we can just switch to the replica (failover) and it won’t cause a multi-day outage.
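To illustrate why replicating at the semantic email level keeps corruption from spreading, here is a minimal sketch in Python. It is not our actual replication code: the `MailStore` class, the action tuples, and the `replicate` function are all hypothetical names invented for this example. The idea it shows is that only logical email actions (deliver, delete) travel to the replica, so damage to the bytes on the source disk never becomes part of the stream.

```python
from dataclasses import dataclass, field


@dataclass
class MailStore:
    """One server's independent copy of a mailbox (hypothetical model)."""
    messages: dict = field(default_factory=dict)  # message id -> body

    def apply(self, action):
        """Apply one logical email action of the form (kind, msg_id, body)."""
        kind, msg_id, body = action
        if kind == "deliver":
            self.messages[msg_id] = body
        elif kind == "delete":
            self.messages.pop(msg_id, None)
        else:
            raise ValueError(f"unknown action: {kind}")


def replicate(action_log, replica):
    """Replay the logical actions on the replica's own, separate store."""
    for action in action_log:
        replica.apply(action)


master = MailStore()
replica = MailStore()

action_log = [
    ("deliver", 1, "hello"),
    ("deliver", 2, "world"),
    ("delete", 1, None),
]
for action in action_log:
    master.apply(action)
replicate(action_log, replica)

# Simulate on-disk corruption on the master. This is not an email action,
# so it never enters the action log and never reaches the replica.
master.messages[2] = "\x00garbage"
```

After this runs, the replica still holds the intact copy of message 2, which is what makes switching to it safe.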
The failover is manual, not automatic. Thus, depending on the actual problem that occurs and our ability to analyse and respond, it should take on the order of minutes to an hour to fail over to a replica once we decide it’s needed. In some cases, we may decide it’s easier and safer to reboot a frozen or crashed machine than to fail over to the replica, so outages of up to an hour are still possible. If we believe the outage is going to last longer than that, we will most likely fail over to the replica.
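The rule of thumb above can be written down as a tiny sketch. This is only an illustration: the decision is made by a person, not by code, and the function name `choose_response` and the exact one-hour constant are assumptions taken from the "up to an hour" figure in this post.

```python
from datetime import timedelta

# Assumption: the one-hour figure mentioned above, treated as a hard threshold.
FAILOVER_THRESHOLD = timedelta(hours=1)


def choose_response(estimated_outage: timedelta) -> str:
    """Return the likely response for an estimated outage length."""
    if estimated_outage > FAILOVER_THRESHOLD:
        return "fail over to replica"
    return "repair in place"


short_outage = choose_response(timedelta(minutes=10))
long_outage = choose_response(timedelta(hours=3))
```

In practice the choice also depends on what actually broke, so the threshold is a guide rather than a strict cutoff.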
We can also use the failover ability to do maintenance on machines more easily. If we decide a machine needs servicing (kernel upgrade, hardware change, etc.), we can just fail over to a replica machine safely, do the work, start the machine up again, wait for replication to catch up, then fail back to the machine. For users, the only visible downtime will be the controlled failover itself, which usually takes about a minute.
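The maintenance sequence described above can be sketched as an ordered procedure. This is a hypothetical outline, not our tooling: `run_maintenance`, the `do_work` callback, and the `backlog` iterator (standing in for repeated checks of how many actions are still waiting to be replicated) are all invented for illustration.

```python
from typing import Callable, Iterator, List


def run_maintenance(do_work: Callable[[], None], backlog: Iterator[int]) -> List[str]:
    """Walk through the maintenance steps in order and return them as a list."""
    steps = ["fail over to replica"]  # users are now served by the replica

    do_work()  # e.g. kernel upgrade or hardware change
    steps.append("maintenance done")
    steps.append("machine restarted")

    # Poll the number of pending replicated actions until it reaches zero.
    for pending in backlog:
        if pending == 0:
            break
    else:
        raise RuntimeError("replication never caught up")
    steps.append("replication caught up")

    steps.append("fail back to machine")
    return steps


# Example run: the backlog shrinks over three checks.
steps = run_maintenance(lambda: None, iter([250, 40, 0]))
```

The key ordering constraint is that the fail back only happens after replication has caught up, so no email actions taken on the replica during the maintenance window are lost.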
All users have their email store backed up incrementally each night to a separate system and RAID array. Backups of an email are kept for 1 week after the email is deleted, to allow restoring it in case of accident. In an emergency, if both a master and its replica server should fail catastrophically, we can still perform a restore from this backup.
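As a small worked example of the one-week retention rule, the check below decides whether a deleted email should still be in the backups. The function name `is_restorable` is hypothetical, and treating the boundary as inclusive is an assumption. Because backups run nightly, the real cutoff may differ by up to a day.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=1)  # backups kept for 1 week after deletion


def is_restorable(deleted_at: datetime, now: datetime) -> bool:
    """True if the deletion is recent enough for the backup to still exist."""
    return now - deleted_at <= RETENTION


now = datetime(2024, 1, 15, 12, 0)
recent = is_restorable(datetime(2024, 1, 12, 9, 0), now)   # deleted 3 days ago
too_old = is_restorable(datetime(2024, 1, 1, 9, 0), now)   # deleted 14 days ago
```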
We believe that this will provide us with the highest possible reliability while still allowing us to continue to grow our user base.