Disaster Planning

No matter how good your server software is, sooner or later a hardware failure will occur. This is usually a major catastrophe, because all mail communication in your organisation stops. The problem is even more pronounced with groupware and IMAP, because all of the essential information and company mail is stored on the server, so nobody can read even the mail they have already received. This can completely cripple an organisation. It is therefore worth considering what steps are appropriate to return mail handling to normal in the shortest possible time.

A number of measures can be taken, providing various levels of protection at differing costs. This paper looks at some possible configurations and the recovery times they offer.

All of these considerations focus only on the single point of failure problem. This is the scenario in which only one failure occurs, such as the failure of the mail server's motherboard, rather than the multiple failure scenario, such as a lightning strike that destroys every computer on the network.

Single Point of Failure

The mail server can be considered as three main parts:

  1. The mail server software
    FTGate in our case

  2. The mail server configuration
    The options in use, the mailboxes configured, etc.

  3. The mail store
    The computer's hard drive

The disaster recovery plan should consider how each part should be recovered or protected.

Basic Protection

In the simplest scenario the administrator takes a backup each day using a tape drive or other system. This protects the server software, the configuration and the mail store. In the event of a server failure the backup is restored to either another server or the repaired server.

While this approach is low cost, it can result in extensive system downtime, which may prove expensive in other ways. It also relies on the backup system not being damaged by the failure, and on another PC being available or the original being repaired quickly. In addition, any mail received since the last backup will be lost.

While this is the most common approach it is not considered to be a suitable solution.

Minimal Downtime

Any viable solution for disaster recovery should allow the administrator to recover normal operation in the shortest possible time. Thus it is important that the system in use is protected against the failure of a single server or component of the server. This implies that we should separate those parts onto different machines.

Dual Machines

At this stage it becomes obvious that downtime can be minimised by running two connected servers. At various times of the day the entire mail store and configuration are copied from the main server to the backup server. This results in a machine being available which can, at short notice, be used to replace the original.
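The periodic copy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mirror_to_standby and the directory layout are hypothetical and do not reflect how FTGate actually stores its configuration or mail, and the standby location is assumed to be reachable as an ordinary path (for example a network share). Copying a mail store while the server is writing to it may also produce an inconsistent copy, so a real job would need to allow for that.

```python
import shutil
from pathlib import Path


def mirror_to_standby(sources, standby_root):
    """Copy each source directory into standby_root, keeping its name.

    Returns the list of destination paths that were written.
    All paths are hypothetical; they are not FTGate's real layout.
    """
    standby_root = Path(standby_root)
    standby_root.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sources:
        source = Path(source)
        destination = standby_root / source.name
        # dirs_exist_ok=True lets a repeated run overwrite the previous
        # copy (requires Python 3.8 or later). Files deleted on the main
        # server are NOT removed from the standby copy by this call.
        shutil.copytree(source, destination, dirs_exist_ok=True)
        written.append(destination)
    return written
```

A scheduler on the main server would call this function at each copy interval with the mail store and configuration directories. The shorter the interval, the less mail is at risk between copies, at the cost of more copying.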

In the event of a failure, the IP address of the backup PC is changed to match the original and the mail server software is started. The address change is required because otherwise the users' mail client software will not be able to connect to the new server. The changes needed are quite small and can be made in as little as 15 minutes.

However, this type of system has drawbacks: any mail received or configuration changes made since the last copy will be lost, and the IP address of the backup PC must be altered. Also, while the time taken to switch between machines can be short, if the failure occurs during unmanned hours the actual outage could be very long. Thus, in addition to the backup machine, an MX relay should be incorporated to hold inbound mail in the event of a failure.
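To illustrate how an MX relay catches inbound mail, the sketch below models the order in which a sending server tries the hosts listed in a domain's MX records: the host with the lowest preference value is tried first, and the next one is used only if it cannot be reached. The hostnames and preference values are invented for the example, and the reachability check is a stand-in for a real connection attempt.

```python
def delivery_order(mx_records):
    """Return hostnames in the order a sending server should try them.

    mx_records is a list of (preference, hostname) pairs; hosts with
    lower preference values are tried first.
    """
    return [host for _, host in sorted(mx_records, key=lambda r: r[0])]


def first_reachable(mx_records, is_reachable):
    """Return the first host in delivery order that can be reached."""
    for host in delivery_order(mx_records):
        if is_reachable(host):
            return host
    return None  # nothing reachable: the sender queues and retries later


# Hypothetical records: the main server plus a relay that holds mail.
EXAMPLE_MX = [(20, "relay.example.net"), (10, "mail.example.com")]
```

With the main server down, first_reachable(EXAMPLE_MX, lambda host: host != "mail.example.com") returns "relay.example.net", so inbound mail is held by the relay until the main or backup server is answering again.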

Segmented Cluster

This solution is the most complex and expensive, but it limits the effect of any single failure to a small number of users. In this system the user accounts are spread over different machines, and the failure of any one machine only affects the accounts held on that machine. This also has the advantage that high bandwidth users can be handled by the faster machines.
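The effect of segmenting accounts can be shown with a small sketch. The account and machine names are invented, and the mapping is held in a plain dictionary purely for illustration; it says nothing about how a real cluster would record account placement.

```python
def accounts_affected(placement, failed_machine):
    """Return the accounts hosted on the machine that has failed.

    placement maps each account name to the machine that holds it.
    """
    return sorted(
        account
        for account, machine in placement.items()
        if machine == failed_machine
    )


# Hypothetical placement of four accounts over three machines.
EXAMPLE_PLACEMENT = {
    "alice": "mail1",
    "bob": "mail1",
    "carol": "mail2",
    "dave": "mail3",
}
```

Here the loss of mail2 affects only carol, while the other three accounts continue to work normally.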

A full discussion of this configuration will be given in a separate White Paper.