Chapter 39. High Availability and Failover

We define high availability as the ability for the system to continue functioning after failure of one or more of the servers.

A part of high availability is failover which we define as the ability for client connections to migrate from one server to another in event of server failure so client applications can continue to operate.

39.1. Live - Backup Pairs

HornetQ allows pairs of servers to be linked together as live - backup pairs. In this release there is a single backup server for each live server. A backup server is owned by only one live server. Backup servers are not operational until failover occurs.

Before failover, only the live server is serving the HornetQ clients while the backup server remains passive. When clients fail over to the backup server, the backup server becomes active and starts to service the HornetQ clients.

39.1.1. HA modes

HornetQ provides two different modes for high availability, either by replicating data from the live server journal to the backup server or using a shared store for both servers.

Note

Only persistent message data will survive failover. Any non persistent message data will not be available after failover.

39.1.1.1. Data Replication

In this mode, data stored in the HornetQ journal are replicated from the live server's journal to the backup server's journal. Note that we do not replicate the entire server state, we only replicate the journal and other persistent operations.

Replication is performed in an asynchronous fashion between live and backup server. Data is replicated one way in a stream, and responses that the data has reached the backup is returned in another stream. Pipelining replications and responses to replications in separate streams allows replication throughput to be much higher than if we synchronously replicated data and waited for a response serially in an RPC manner before replicating the next piece of data.

When the user receives confirmation that a transaction has committed, prepared or rolled back or a durable message has been sent, we can guarantee it has reached the backup server and been persisted.

Data replication introduces some inevitable performance overhead compared to non replicated operation, but has the advantage in that it requires no expensive shared file system (e.g. a SAN) for failover, in other words it is a shared-nothing approach to high availability.

Failover with data replication is also faster than failover using shared storage, since the journal does not have to be reloaded on failover at the backup node.

39.1.1.1.1. Configuration

First, on the live server, in hornetq-configuration.xml, configure the live server with knowledge of its backup server. This is done by specifying a backup-connector-ref element. This element references a connector, also specified on the live server which specifies how to connect to the backup server.

Here's a snippet from live server's hornetq-configuration.xml configured to connect to its backup server:

  <backup-connector-ref connector-name="backup-connector"/>

  <connectors>
     <!-- This connector specifies how to connect to the backup server    -->
     <!-- backup server is located on host "192.168.0.11" and port "5445" -->
     <connector name="backup-connector">
       <factory-class>org.hornetq.integration.transports.netty.NettyConnectorFactory</factory-class>
       <param key="host" value="192.168.0.11"/>
       <param key="port" value="5445"/>
     </connector>
  </connectors>

Secondly, on the backup server, we flag the server as a backup and make sure it has an acceptor that the live server can connect to. We also make sure the shared-store paramater is set to false:

  <backup>true</backup>
  
  <shared-store>false<shared-store>
  
  <acceptors>
     <acceptor name="acceptor">
        <factory-class>org.hornetq.integration.transports.netty.NettyAcceptorFactory</factory-class>
        <param key="host" value="192.168.0.11"/>
        <param key="port" value="5445"/>
     </acceptor>
  </acceptors>               
              

For a backup server to function correctly it's also important that it has the same set of bridges, predefined queues, cluster connections, broadcast groups and discovery groups as defined on the live node. The easiest way to ensure this is to copy the entire server side configuration from live to backup and just make the changes as specified above.

39.1.1.1.2. Synchronizing a Backup Node to a Live Node

In order for live - backup pairs to operate properly, they must be identical replicas. This means you cannot just use any backup server that's previously been used for other purposes as a backup server, since it will have different data in its persistent storage. If you try to do so, you will receive an exception in the logs and the server will fail to start.

To create a backup server for a live server that's already been used for other purposes, it's necessary to copy the data directory from the live server to the backup server. This means the backup server will have an identical persistent store to the backup server.

One a live server has failed over onto a backup server, the old live server becomes invalid and cannot just be restarted. To resynchonize the pair as a working live backup pair again, both servers need to be stopped, the data copied from the live node to the backup node and restarted again.

The next release of HornetQ will provide functionality for automatically synchronizing a new backup node to a live node without having to temporarily bring down the live node.

39.1.1.2. Shared Store

When using a shared store, both live and backup servers share the same journal using a shared file system.

When failover occurs and the backup server takes over, it will load the persistent storage from the shared file system and clients can connect to it.

This style of high availability differs from data replication in that it requires a shared file system which is accessible by both the live and backup nodes. Typically this will be some kind of high performance Storage Area Network (SAN). We do not recommend you use Network Attached Storage (NAS), e.g. NFS mounts to store any shared journal (NFS is slow).

The advantage of shared-store high availability is that no replication occurs between the live and backup nodes, this means it does not suffer any performance penalties due to the overhead of replication during normal operation.

The disadvantage of shared store replication is that it requires a shared file system, and when the backup server activates it needs to load the journal from the shared store which can take some time depending on the amount of data in the store.

If you require the highest performance during normal operation, have access to a fast SAN, and can live with a slightly slower failover (depending on amount of data), we recommend shared store high availability

39.1.1.2.1. Configuration

To configure the live and backup server to share their store, configure both hornetq-configuration.xml:

                   <shared-store>true<shared-store>
                

In order for live - backup pairs to operate properly with a shared store, both servers must have configured the location of journal directory to point to the same shared location (as explained in Section 15.2, “Configuring the message journal”)

If clients will use automatic failover with JMS, the live server will need to configure a connector to the backup server and reference it from its hornetq-jms.xml configuration as explained in Section 39.2.1, “Automatic Client Failover”.

39.1.1.2.2. Synchronizing a Backup Node to a Live Node

As both live and backup servers share the same journal, they do not need to be synchronized. However until, both live and backup servers are up and running, high-availability can not be provided with a single server. After failover, at first opportunity, stop the backup server (which is active) and restart the live and backup servers.

In the next release of HornetQ we will provide functionality to automatically synchronize a new backup server with a running live server without having to temporarily bring the live server down.

39.2. Failover Modes

HornetQ defines two types of client failover:

  • Automatic client failover

  • Application-level client failover

HornetQ also provides 100% transparent automatic reattachment of connections to the same server (e.g. in case of transient network problems). This is similar to failover, except it's reconnecting to the same server and is discussed in Chapter 34, Client Reconnection and Session Reattachment

During failover, if the client has consumers on any non persistent or temporary queues, those queues will be automatically recreated during failover on the backup node, since the backup node will not have any knowledge of non persistent queues.

39.2.1. Automatic Client Failover

HornetQ clients can be configured with knowledge of live and backup servers, so that in event of connection failure at the client - live server connection, the client will detect this and reconnect to the backup server. The backup server will then automatically recreate any sessions and consumers that existed on each connection before failover, thus saving the user from having to hand-code manual reconnection logic.

HornetQ clients detect connection failure when it has not received packets from the server within the time given by client-failure-check-period as explained in section Chapter 17, Detecting Dead Connections. If the client does not receive data in good time, it will assume the connection has failed and attempt failover.

HornetQ clients can be configured with the list of live-backup server pairs in a number of different ways. They can be configured explicitly or probably the most common way of doing this is to use server discovery for the client to automatically discover the list. For full details on how to configure server discovery, please see Section 38.2, “Server discovery”. Alternatively, the clients can explicitly specifies pairs of live-backup server as explained in Section 38.5.2, “Specifying List of Servers to form a Cluster”.

To enable automatic client failover, the client must be configured to allow non-zero reconnection attempts (as explained in Chapter 34, Client Reconnection and Session Reattachment).

Sometimes you want a client to failover onto a backup server even if the live server is just cleanly shutdown rather than having crashed or the connection failed. To configure this you can set the property FailoverOnServerShutdown to true either on the HornetQConnectionFactory if you're using JMS or in the hornetq-jms.xml file when you define the connection factory, or if using core by setting the property directly on the ClientSessionFactoryImpl instance after creation. The default value for this property is false, this means that by default HornetQ clients will not failover to a backup server if the live server is simply shutdown cleanly.

Note

By default, cleanly shutting down the server will not trigger failover on the client.

Using CTRL-C on a HornetQ server or JBoss AS instance causes the server to cleanly shut down, so will not trigger failover on the client.

If you want the client to failover when its server is cleanly shutdown then you must set the property FailoverOnServerShutdown to true

For examples of automatic failover with transacted and non-transacted JMS sessions, please see Section 11.1.55, “Transaction Failover With Data Replication” and Section 11.1.33, “Non-Transaction Failover With Server Data Replication”.

39.2.1.1. A Note on Server Replication

HornetQ does not replicate full server state betwen live and backup servers. When the new session is automatically recreated on the backup it won't have any knowledge of messages already sent or acknowledged in that session. Any in-flight sends or acknowledgements at the time of failover might also be lost.

By replicating full server state, theoretically we could provide a 100% transparent seamless failover, which would avoid any lost messages or acknowledgements, however this comes at a great cost: replicating the full server state (including the queues, session, etc.). This would require replication of the entire server state machine; every operation on the live server would have to replicated on the replica server(s) in the exact same global order to ensure a consistent replica state. This is extremely hard to do in a performant and scalable way, especially when one considers that multiple threads are changing the live server state concurrently.

It is possible to provide full state machine replication using techniques such as virtual synchrony, but this does not scale well and effectively serializes all operations to a single thread, dramatically reducing concurrency.

Other techniques for multi-threaded active replication exist such as replicating lock states or replicating thread scheduling but this is very hard to achieve at a Java level.

Consequently it xas decided it was not worth massively reducing performance and concurrency for the sake of 100% transparent failover. Even without 100% transparent failover, it is simple to guarantee once and only once delivery, even in the case of failure, by using a combination of duplicate detection and retrying of transactions. However this is not 100% transparent to the client code.

39.2.1.2. Handling Blocking Calls During Failover

If the client code is in a blocking call to the server, waiting for a response to continue its execution, when failover occurs, the new session will not have any knowledge of the call that was in progress. This call might otherwise hang for ever, waiting for a response that will never come.

To prevent this, HornetQ will unblock any blocking calls that were in progress at the time of failover by making them throw a javax.jms.JMSException (if using JMS), or a HornetQException with error code HornetQException.UNBLOCKED. It is up to the client code to catch this exception and retry any operations if desired.

If the method being unblocked is a call to commit(), or prepare(), then the transaction will be automatically rolled back and HornetQ will throw a javax.jms.TransactionRolledBackException (if using JMS), or a HornetQException with error code HornetQException.TRANSACTION_ROLLED_BACK if using the core API.

39.2.1.3. Handling Failover With Transactions

If the session is transactional and messages have already been sent or acknowledged in the current transaction, then the server cannot be sure that messages sent or acknowledgements have not been lost during the failover.

Consequently the transaction will be marked as rollback-only, and any subsequent attempt to commit it will throw a javax.jms.TransactionRolledBackException (if using JMS), or a HornetQException with error code HornetQException.TRANSACTION_ROLLED_BACK if using the core API.

It is up to the user to catch the exception, and perform any client side local rollback code as necessary. The user can then just retry the transactional operations again on the same session.

HornetQ ships with a fully functioning example demonstrating how to do this, please see Section 11.1.55, “Transaction Failover With Data Replication”

If failover occurs when a commit call is being executed, the server, as previously described, will unblock the call to prevent a hang, since no response will come back. In this case it is not easy for the client to determine whether the transaction commit was actually processed on the live server before failure occurred.

To remedy this, the client can simply enable duplicate detection (Chapter 37, Duplicate Message Detection) in the transaction, and retry the transaction operations again after the call is unblocked. If the transaction had indeed been committed on the live server successfully before failover, then when the transaction is retried, duplicate detection will ensure that any durable messages resent in the transaction will be ignored on the server to prevent them getting sent more than once.

Note

By catching the rollback exceptions and retrying, catching unblocked calls and enabling duplicate detection, once and only once delivery guarantees for messages can be provided in the case of failure, guaranteeing 100% no loss or duplication of messages.

39.2.1.4. Handling Failover With Non Transactional Sessions

If the session is non transactional, messages or acknowledgements can be lost in the event of failover.

If you wish to provide once and only once delivery guarantees for non transacted sessions too, enabled duplicate detection, and catch unblock exceptions as described in Section 39.2.1.2, “Handling Blocking Calls During Failover”

39.2.2. Getting Notified of Connection Failure

JMS provides a standard mechanism for getting notified asynchronously of connection failure: java.jms.ExceptionListener. Please consult the JMS javadoc or any good JMS tutorial for more information on how to use this.

The HornetQ core API also provides a similar feature in the form of the class org.hornet.core.client.SessionFailureListener

Any ExceptionListener or SessionFailureListener instance will always be called by HornetQ on event of connection failure, irrespective of whether the connection was successfully failed over, reconnected or reattached.

39.2.3. Application-Level Failover

In some cases you may not want automatic client failover, and prefer to handle any connection failure yourself, and code your own manually reconnection logic in your own failure handler. We define this as application-level failover, since the failover is handled at the user application level.

To implement application-level failover, if you're using JMS then you need to set an ExceptionListener class on the JMS connection. The ExceptionListener will be called by HornetQ in the event that connection failure is detected. In your ExceptionListener, you would close your old JMS connections, potentially look up new connection factory instances from JNDI and creating new connections. In this case you may well be using HA-JNDI to ensure that the new connection factory is looked up from a different server.

For a working example of application-level failover, please see Section 11.1.1, “Application-Layer Failover”.

If you are using the core API, then the procedure is very similar: you would set a FailureListener on the core ClientSession instances.