LISTSERV Maestro Tech Tip: How can I use LISTSERV Maestro in a high-availability environment?

Q: How can I use LISTSERV Maestro in a high-availability environment?

Answer by Liam Kelly
Senior Consulting Analyst, L-Soft

High availability or disaster recovery is a common concern for LISTSERV Maestro customers designing their network architectures. The idea of high availability is to minimize downtime in the event of a network or server outage by having a failover server available to take over for the primary server. In this tech tip, we'll look at different ways to use LISTSERV Maestro in a high-availability context, including how to keep data synchronized between the two servers, how to detect outages, how to perform failover and how to recover from failover back to the primary server.

Assumptions

For the purposes of this tech tip, we'll assume that you are already familiar with the basics of LISTSERV Maestro administration, as well as with networking and filesystem maintenance on your platform of choice. We'll also assume that you're running all of the components of LISTSERV Maestro (the HUB, the LUI, the Tracker, LISTSERV and the DBMS) on the same server. Multi-server distributed installations are more complicated and beyond the scope of this tech tip.

Virtual Machine Clustering

Probably the cleanest way to ensure high availability with LISTSERV Maestro is to run the Maestro server as a virtual machine in a clustered environment. For example, on Windows you might have a Microsoft Hyper-V cluster, with LISTSERV Maestro running on a virtual server within the cluster. Similar clustering solutions exist for both Linux and Solaris. This is the means by which L-Soft maximizes uptime of our own ListPlex Maestro hosting service.

With LISTSERV Maestro running on a virtual server in a clustered environment, the specifics of the clustering are unknown to the LISTSERV Maestro application and the virtual operating system. The file synchronization, outage detection, failover and recovery are all handled by the cluster management software and are transparent from the perspective of the virtual server. In such an environment, LISTSERV Maestro administration is handled exactly as if it were running on a single server.

If you don't have the luxury of a cluster environment in which to run LISTSERV Maestro, but you still require high availability, there are four pieces you need to consider: filesystem synchronization, outage detection, failover and recovery.

Filesystem Synchronization

Suppose we have two servers at our disposal with which to create our high-availability environment for LISTSERV Maestro: a primary server and a failover server. LISTSERV Maestro will be running on the primary server and installed but not running on the failover server. On the failover server, the LISTSERV Maestro services must be set for manual startup so that they are never running at the same time as the services on the primary server. Having both the primary and the failover services running at the same time poses a serious risk of data corruption if there is any cross-talk between the two Maestro services. If, for example, your hostname is MAESTRO.EXAMPLE.ORG and the DNS points that hostname to the primary server, you don't want the failover server to start up with Maestro configured as MAESTRO.EXAMPLE.ORG because it will attempt to connect to the database on the primary server. Two running Maestro instances writing to the same database is a sure recipe for disaster.

In a normal "cold" backup recovery scenario, we would have scheduled a nightly on-disk backup of LISTSERV Maestro that gets copied from the primary to the failover server. In the event of a failure, we would restore the Maestro backup to the failover server and start the Maestro services there, as described in LISTSERV Maestro Admin Tech Doc 2. While this is the standard supported method for LISTSERV Maestro backup and restore, it doesn't satisfy the requirements of high availability because the nightly backup may be as much as 24 hours old, and we could lose as much as a day's worth of data when we restore. In order to avoid that scenario, we need to synchronize files between the primary and failover servers in as close to real time as possible.

Proper synchronization is crucial in the context of LISTSERV Maestro because the application data and the database data need to be consistent with one another. You cannot, for example, restore the application data from one slice in time and the database data from another slice in time because you run the risk of having application objects that don't have corresponding database objects, and the result will be data corruption. In such an instance, the only alternative is to restore from the nightly on-disk Maestro backup, which is the outcome we're trying to avoid.

The safest way to handle the file synchronization between the primary and the failover servers is with block-level replication. This goes by different names on different platforms, but the idea is that whatever gets written to the disk accessed by the primary server gets simultaneously written to the disk accessed by the failover server. They may in fact be the same disk array accessed via a storage area network (SAN), or they may be different disk arrays replicated across a network in real time. The point is that if the primary server becomes unavailable, the failover server has access to an exact copy of the filesystem at the moment of failure. When we later start LISTSERV Maestro on the failover server, it will appear to the application that it is recovering from a local crash, and it will attempt recovery of the open files.

Less preferable, but still possible, is to do incremental synchronization using a technology like rsync. This allows you to copy data from the primary server to the failover server on a scheduled basis, updating only those files that have been created, modified or deleted since the last synchronization. As long as each synchronization event completes successfully, it should result in the failover server having an exact copy of the data on the primary server at the time of synchronization. This method is inferior to block-level replication for three reasons:

1) It requires CPU resources on both the primary and failover servers to calculate and perform the incremental file synchronization.
2) Because it is not a real-time synchronization, you still lose whatever data was updated since the last synchronization event.
3) If an outage should occur during the synchronization, you may end up with an incomplete copy and therefore file inconsistencies and/or database corruption. In that event, the alternative is once again to restore from the on-disk LISTSERV Maestro backup and lose the current day's data.
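
As a rough illustration of the incremental approach, the following Python sketch wraps a single rsync run and reports whether it completed. The directory /opt/maestro and the host failover.example.org are placeholders rather than actual LISTSERV Maestro paths, and the function names are our own. The sketch assumes that rsync is installed on both servers and that the primary server can reach the failover server over SSH.

```python
import subprocess


def build_rsync_command(source_dir: str, dest_host: str, dest_dir: str) -> list[str]:
    """Build an rsync command that mirrors source_dir onto dest_host:dest_dir."""
    return [
        "rsync",
        "--archive",  # recurse and preserve permissions, timestamps and symlinks
        "--delete",   # remove files on the failover that were deleted on the primary
        # Trailing slashes make rsync copy the directory contents, not the directory itself.
        f"{source_dir.rstrip('/')}/",
        f"{dest_host}:{dest_dir.rstrip('/')}/",
    ]


def sync_once(source_dir: str, dest_host: str, dest_dir: str, run=subprocess.run) -> bool:
    """Run one synchronization pass and return True only if rsync exited with status 0.

    A False result means the copy on the failover server may be incomplete
    and should not be trusted for failover until a later pass succeeds.
    """
    result = run(build_rsync_command(source_dir, dest_host, dest_dir), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical values; substitute your own installation directory and host.
    ok = sync_once("/opt/maestro", "failover.example.org", "/opt/maestro")
    print("synchronization complete" if ok else "synchronization FAILED")
```

A scheduler such as cron could invoke this script at whatever interval you choose. Recording the result of each pass matters because of the third reason above: a failed or interrupted pass leaves the failover copy in an unknown state.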

Whichever method we use, the desired goal is for the failover server to contain an exact copy of the disk on the primary server at a single moment in time. In most cases, that should allow for a successful startup on the failover server.

Outage Detection

An important aspect of high availability is outage detection. In short, what constitutes an outage, and how will you determine when one has taken place?

The answer to the first question is as much a policy decision as a technical one. If, for example, someone reboots the LISTSERV Maestro server to apply operating system updates, you may not want to invoke the failover scenario because the effort and expense of performing the failover are greater than the expense of a brief outage while the machine reboots. You'll need to determine the length of outage at which failover becomes worthwhile.

As for the second question, you'll need to be able to determine when the outage conditions have been met. The simplest model would be to run a continual ping from the failover server to the primary server. However, this method is not foolproof because you may have a network outage between the two servers while the primary server is, in fact, still available to the Internet. For example, someone could update a firewall rule so that it blocks ICMP (ping) requests between the two servers, which would make the primary server appear to the failover server to be down. It may make more sense to have a third-party server monitor both the primary and failover servers and invoke the failover scenario only once it has been determined conclusively that the primary server is down and the failover server is available. The specifics of how to do that detection are policy- and network-dependent.
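
To make the threshold idea concrete, here is a minimal Python sketch of the kind of logic a monitoring server might run. It is only one possible approach: it probes a TCP port instead of using ICMP, the class and function names are our own, and the threshold value is something your own policy would have to supply.

```python
import socket
from typing import Optional


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class OutageDetector:
    """Report an outage only after probes have failed continuously for a threshold."""

    def __init__(self, threshold_seconds: float) -> None:
        self.threshold_seconds = threshold_seconds
        self._down_since: Optional[float] = None  # time of the first failed probe in the current run

    def record(self, is_up: bool, now: float) -> bool:
        """Record one probe result; return True once the outage exceeds the threshold."""
        if is_up:
            self._down_since = None  # any successful probe resets the clock
            return False
        if self._down_since is None:
            self._down_since = now
        return now - self._down_since >= self.threshold_seconds


def should_fail_over(primary_outage_confirmed: bool, failover_is_up: bool) -> bool:
    """Fail over only if the primary is confirmed down and the failover is reachable."""
    return primary_outage_confirmed and failover_is_up
```

A monitoring loop would call tcp_probe against each server at a regular interval, pass the primary's result and the current time (for example, from time.monotonic()) to record, and act only when should_fail_over returns True. A short reboot never reaches the threshold, so it does not trigger a failover.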

Failing Over

Once an outage has been detected and we've decided to invoke the failover scenario, there are a few steps we need to take:

1) Make sure that when the primary server comes back online, the LISTSERV Maestro services and the filesystem synchronization are stopped or otherwise unavailable.
2) Update the DNS and/or networking rules to direct traffic to the failover server.
3) Start the LISTSERV Maestro services on the failover server.
4) Verify a successful startup and test connectivity.

First, we need to make sure that the LISTSERV Maestro services do not start back up when the primary server comes back online. The reason is that, if we've moved the hostname MAESTRO.EXAMPLE.ORG to the failover server, we don't want the Maestro services on the primary server to attempt to connect to the database at MAESTRO.EXAMPLE.ORG, which now resides on the failover server instead of the primary server. As noted above, having two LISTSERV Maestro instances connecting to the same database will result in data corruption. So we need to make sure that when the primary server comes back online, either the LISTSERV Maestro services are disabled, or the primary server is unable to contact the failover server. Similarly, we need to make sure that the filesystem synchronization doesn't copy files from the (stopped) primary server to the (now running) failover server when the primary server comes back online, because that would overwrite the current data. How exactly you choose to handle this is platform- and network-specific.

Second, we need to redirect network traffic from the primary server to the failover server. Again, the specifics of this are network-dependent. In some configurations, this may mean transferring the IP address from the primary to the failover server. In other configurations, you may update your DNS to point MAESTRO.EXAMPLE.ORG to the failover server's IP address. In still other cases, you may need to update Network Address Translation (NAT) rules to point to the failover server. If you're using a method in which the failover server will have a different IP address than the primary server, you will need to make sure that any firewall rules for the primary server's IP address also apply to the failover server's IP address. You will also need to pay attention to any IP addresses you may have specified in LISTSERV Maestro's INI files, as well as any IP-based relay rules that you may have set on your SMTP servers.
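
If you redirect traffic by updating DNS, it can be useful to confirm that the hostname actually resolves to the failover server before starting the services there. The following Python sketch shows one way to check this. The hostname comes from the example above; the IP address is a documentation placeholder, and the function names are our own.

```python
import socket


def resolve_ipv4(hostname: str) -> set[str]:
    """Return the set of IPv4 addresses that the local resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (address, port).
    return {info[4][0] for info in infos}


def points_to(hostname: str, expected_ip: str, resolver=resolve_ipv4) -> bool:
    """Return True only if hostname resolves to exactly the expected address."""
    return resolver(hostname) == {expected_ip}


if __name__ == "__main__":
    # Placeholder address for the failover server; substitute your own.
    if points_to("MAESTRO.EXAMPLE.ORG", "192.0.2.20"):
        print("DNS now points to the failover server")
    else:
        print("DNS does not (yet) point only to the failover server")
```

Keep in mind that this reflects only what the local resolver returns. Other clients may continue to see the old address until their cached records expire, so the result here is a necessary check, not a guarantee that all traffic has moved.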

Once the network traffic is directed to the failover server, start the LISTSERV Maestro services on the failover server. If your filesystem synchronization was done correctly, it will appear to LISTSERV Maestro that it experienced an unclean shutdown (because the files were copied from a running instance of Maestro), and it should attempt a recovery. Pay attention to the Maestro log files, particularly the LUI-yyyymmdd.LOG file for the current date. If there are any problems with the database tables discovered on startup, they will be logged there. Once you've verified a successful startup, log into LISTSERV Maestro and send a test mailing. Verify that the job is sent, that bounces are being received, and that click-through tracking is working correctly. Make any needed adjustments to firewall or mail routing rules if you discover problems during testing.
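
Checking the log by hand is the safest approach, but a small script can help point you at lines worth reading. The Python sketch below builds the LUI log file name for a given date and lists lines containing certain keywords. The keywords are purely an assumption on our part, since the actual wording of Maestro's log entries is not covered here, and the log directory is left for you to supply.

```python
from datetime import date
from pathlib import Path

# ASSUMPTION: these keywords are a guess at what a problem line might contain.
# Adjust them after looking at the entries in your own log files.
ASSUMED_KEYWORDS = ("error", "exception")


def lui_log_name(day: date) -> str:
    """Return the LUI log file name for the given date, in the form LUI-yyyymmdd.LOG."""
    return f"LUI-{day:%Y%m%d}.LOG"


def suspicious_lines(log_text: str, keywords=ASSUMED_KEYWORDS) -> list[str]:
    """Return the lines of log_text that contain any keyword, ignoring case."""
    lowered = tuple(keyword.lower() for keyword in keywords)
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line.lower() for keyword in lowered)
    ]


def check_todays_log(log_dir: Path) -> list[str]:
    """Read today's LUI log from log_dir (a path you supply) and return flagged lines."""
    log_path = log_dir / lui_log_name(date.today())
    return suspicious_lines(log_path.read_text(errors="replace"))
```

An empty result from check_todays_log does not prove that the startup was clean. It only means that none of the assumed keywords appeared, so the test mailing described above is still necessary.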

Recovering From Failover

Once we're ready to recover back to the primary server, the process is essentially failover in reverse. Since the recovery is a planned event, we can shut down the LISTSERV Maestro services on both the primary and the failover servers so we don't have to worry about copying open files. We then need to copy any new data from the failover server back to the primary server. Synchronize files back to the primary server in the same way that you had previously synchronized to the failover server. Then restore the previous IP addresses/network rules and start the LISTSERV Maestro services back up on the primary server. As with the failover, check the LISTSERV Maestro log files, send a test mailing, and verify delivery, bounce handling and click-through tracking. Once you are satisfied that things are working correctly, set the LISTSERV Maestro services on the primary server back to automatic startup, and resume your previous filesystem synchronization from the primary to the failover server.

Conclusion

The methods described above should be enough to establish a high-availability or disaster-recovery protocol for LISTSERV Maestro. They are not intended as a replacement for a sensible backup and recovery plan, and they do not remove the need for Maestro's on-disk backups. Those backups are still crucial for cases when crash recovery fails, when there is data corruption because of failed filesystem synchronizations, or when there is data corruption on the primary server itself due to a full disk, a disk failure, a virus, or simple user error. As long as the on-disk backups complete successfully each night, they will serve as a supported and reliable fallback.