This is a draft of the expected failover behaviour.

Qpid client failover basic principles.

When the connection to a broker is lost due to a network failure, a Qpid client should be able to re-establish a connection to a Qpid broker, unless the failover policy is switched off by specifying "nofailover" as the failover option in the connection URL.

The failover functionality in the Qpid client should be based on the "stop the world" principle. When the connection is lost and failover starts, the Qpid client should not allow invocation of JMS operations that require sending or receiving data over the network (such as producer.send(), connection.createSession(), consumer.receive(), etc.). Such operations should block until failover restores connectivity using one of the supported failover methods ('singlebroker', 'roundrobin', 'failover_exchange').

Rajith: I agree with the stop the world concept. I believe the existing design tried to achieve the same, albeit with less than desirable side effects. Therefore I'd be very interested in seeing the design for this, especially how we coordinate between the JMS layer and the lower layer to achieve this.
Also, this means that there will be some form of synchronization happening in these methods (send(), receive(), createSession(), etc.) and we need to ensure that it doesn't affect performance too badly.
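
To make the "stop the world" idea and the synchronization concern above more concrete, here is a minimal sketch of a gate that network-bound operations could pass through before touching the wire. It is not the Qpid implementation; the class name FailoverGate and its methods are hypothetical, and a timeout is assumed so that callers do not block forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate (not a Qpid class) that network-bound JMS operations pass through. */
final class FailoverGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private boolean failoverInProgress;
    private boolean connectionFailed;

    /** Called by the lower layer when the connection is lost. */
    void beginFailover() {
        lock.lock();
        try {
            failoverInProgress = true;
        } finally {
            lock.unlock();
        }
    }

    /** Called when failover ends; restored == false means every attempt was exhausted. */
    void endFailover(boolean restored) {
        lock.lock();
        try {
            failoverInProgress = false;
            connectionFailed = !restored;
            stateChanged.signalAll(); // wake every blocked operation
        } finally {
            lock.unlock();
        }
    }

    /**
     * Blocks while failover is in progress. Returns true if the caller may proceed,
     * false if the connection was not restored, the timeout elapsed, or the
     * calling thread was interrupted.
     */
    boolean awaitConnected(long timeout, TimeUnit unit) {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (failoverInProgress) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = stateChanged.awaitNanos(nanos);
            }
            return !connectionFailed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

In such a design, send(), receive() and createSession() would call awaitConnected() first and throw a JMSException when it returns false. The lock is only held briefly on the common path, but whether that cost is acceptable would need to be measured.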

The failover reconnect feature should keep trying to reconnect to the broker(s) in the background until the connection is restored, the application calls Connection.close(), or all reconnection attempts of the failover method are exhausted. Each failover method defines its own reconnection options and behaviour.
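
The three termination conditions can be sketched as a simple loop. This is only an illustration under assumptions: the class ReconnectLoop, the cycleCount parameter, and the tryConnect callback are hypothetical, the broker order is a plain round-robin, and delays between attempts are left out.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Predicate;

/** Hypothetical reconnect loop (not a Qpid class) showing the three ways it can end. */
final class ReconnectLoop {
    enum Outcome { RESTORED, CLOSED_BY_APPLICATION, ATTEMPTS_EXHAUSTED }

    private final AtomicBoolean closed = new AtomicBoolean();

    /** Stand-in for the application calling Connection.close(). */
    void close() {
        closed.set(true);
    }

    /**
     * Tries each broker in turn, up to cycleCount passes over the list.
     * tryConnect returns true when a connection to the given broker succeeds.
     */
    Outcome run(List<String> brokers, int cycleCount, Predicate<String> tryConnect) {
        for (int cycle = 0; cycle < cycleCount; cycle++) {
            for (String broker : brokers) {
                if (closed.get()) {
                    return Outcome.CLOSED_BY_APPLICATION;
                }
                if (tryConnect.test(broker)) {
                    return Outcome.RESTORED;
                }
            }
        }
        return Outcome.ATTEMPTS_EXHAUSTED;
    }
}
```

A real failover method would run this on a background thread and wait between attempts according to its own options.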

How to configure failover with a connection URL is described in Connection URL Format, which documents all existing failover methods and their configuration options.

On restoring connectivity, blocked JMS operations should be allowed to finish. If failover cannot re-establish the connection, a JMSException should be thrown from any JMS operation that requires transferring data over the network.
Rajith: I think there are quite a few unanswered questions, and we need to describe more clearly what we are going to support.

If we cannot re-establish the connection,
1. Should we also notify via the exception listener? Or should we only use the listener when we cannot throw a JMSException from a method that was blocked while failover was in progress (e.g. send())?
2. If we choose the latter, what would happen to applications that rely solely on an exception listener?
3. If we choose the latter, this option will be untenable until we provide a consistent error code mechanism that allows applications to identify the various failures, in particular to distinguish a connection-level exception from a session-level exception.
4. If there are multiple sessions in a connection being driven by multiple threads, how will we handle the exception notification? We need to ensure thread safety.

When the client connection is recreated, existing Sessions, Producers, Consumers will be refreshed transparently to allow message processing to continue, with certain caveats described further below.

Failover notification via Exception Listener

If failover happens (or the connection is lost while the NoFailover method is in use) and the Connection has a registered ExceptionListener, a special JMS exception (ConnectionLostException) needs to be passed to the ExceptionListener to notify the user that a network failure has occurred. The JMS client application code can then decide what approach to take: call connection.close(), for example, or take advantage of the configured failover reconnect feature.

If the client has no ExceptionListener, it will not receive this notification. Exceptions indicating that failover occurred will only be thrown from other synchronous JMS methods as required by their functionality, e.g. commit() and acknowledge() (see below for further details).
Rajith: please see my above comments about exception handling

Auto Acknowledge

In Auto Acknowledge mode, by default the last message received by the application may fail to be acknowledged if the connection gets closed during onMessage, or before the call to receive() completes the acknowledgement.

In the receive() case, any such failure should be propagated as a JMSException through the method call. In the onMessage() case, the failure has to be reported through the ExceptionListener, if there is one.

Rajith: As I understand it, the above considers the case where the failure happens while the methods are in progress. We also need to consider the case where failover happens after these methods return but before the ack is sent to the broker. In auto-ack mode, the ack is sent to the broker once the onMessage() method or the receive() method returns. Therefore failover can happen after those methods return, but before the ack is sent to the broker. In the message listener case the answer is the same: we need to notify via the exception listener. In the case where receive() is used, we still have to notify via the exception listener. In other words, if we don't have any blocking operations at the time failover happens, then we have to use the exception listener.

The spec allows for redelivery of this message:

4.4.14 Duplicate Delivery of Messages
...

When a client uses the AUTO_ACKNOWLEDGE mode it is not in direct control of message acknowledgment. Since such clients cannot know for certain if a particular message has been acknowledged, they must be prepared for re-delivery of the last consumed message. This can be caused by the client completing its work just prior to a failure that prevents the message acknowledgment from occurring. Only a session’s last consumed message is subject to this ambiguity.

Any call to recover() performed following failover should be successful, as the failover occurrence was already notified through the ExceptionListener if there was one, and the request to recover() would result in cleaning the Session and resuming message delivery with the first message sent by the new broker.

Rajith: Just to clarify, after failover the session should be cleaned up and message delivery should be started with the oldest unacked message. Therefore calling recover() after failover would essentially do the same thing again, i.e. stop the message flow and restart delivery from the oldest unacked message.

Dups Ok Acknowledge

Duplicates are allowed in this mode, therefore any application using it should be prepared to accept any number of duplicates, and failover can be performed silently (other than the previously mentioned Connection-level ExceptionListener notification that failover has occurred).

Client Acknowledge

A Client Ack Session should be considered 'dirty' if any unacknowledged messages have been received by the application. When the Session is refreshed during failover, if the Session is dirty then any unacknowledged messages previously received on the Session before failover can no longer be acknowledged and must be considered 'stale'.

All messages held by the client prior to failover (unacknowledged messages given to the application, and prefetched messages) should be discarded by the client as they can no longer be acknowledged, and a record should be retained as to whether the Session was dirty when failover occurred. Only messages given to the client after failover will now be available to the application. When the next call to message.acknowledge() is performed, recover() should be called implicitly to clean the Session and an exception should be thrown (which one is TBC, but thrown in addition to the previous notification through the ExceptionListener that failover has occurred) to indicate that we were unable to complete the acknowledgement process. If none of the new messages given to the client by the broker have been received by the application, this recover() call could be a no-op other than marking the Session clean; otherwise it may have to perform a full recover against the broker.

Rajith: Sounds good, however the devil is in the details. During implementation we need some specific test cases to ensure we cover all cases (I believe there are three).

Any call to recover() performed following failover should be successful, as the failover occurrence was already notified through the ExceptionListener if there was one, and the request to recover() would result in cleaning the Session and resuming message delivery with the first message sent by the new broker.
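
The Client Acknowledge rules above can be modelled as a small state machine. This is a sketch under assumptions rather than a proposed implementation: messages are reduced to string ids, the class ClientAckSessionState is hypothetical, and StaleAcknowledgeException is a hypothetical unchecked stand-in for whichever exception is eventually chosen.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical stand-in for the still-undecided (TBC) acknowledgement failure exception. */
final class StaleAcknowledgeException extends RuntimeException {
    StaleAcknowledgeException(String message) {
        super(message);
    }
}

/** Hypothetical model of a Client Ack session across failover; messages are plain ids. */
final class ClientAckSessionState {
    private final Deque<String> unacked = new ArrayDeque<>();    // given to the application, not acknowledged
    private final Deque<String> prefetched = new ArrayDeque<>(); // held by the client, not yet delivered
    private boolean dirtyAtFailover;

    /** The broker pushes a message to the client. */
    void prefetch(String id) {
        prefetched.add(id);
    }

    /** The application receives the next message, or null if none is available. */
    String receive() {
        String id = prefetched.poll();
        if (id != null) {
            unacked.add(id);
        }
        return id;
    }

    /** Failover: remember whether the session was dirty, then discard everything held. */
    void onFailover() {
        dirtyAtFailover = dirtyAtFailover || !unacked.isEmpty();
        unacked.clear();
        prefetched.clear();
    }

    /** First acknowledge after a dirty failover recovers implicitly and fails. */
    void acknowledge() {
        if (dirtyAtFailover) {
            recover();
            throw new StaleAcknowledgeException("messages received before failover cannot be acknowledged");
        }
        unacked.clear();
    }

    /** Marks the session clean and makes unacknowledged messages available for redelivery. */
    void recover() {
        dirtyAtFailover = false;
        Iterator<String> newestFirst = unacked.descendingIterator();
        while (newestFirst.hasNext()) {
            prefetched.addFirst(newestFirst.next()); // keeps the original delivery order
        }
        unacked.clear();
    }
}
```

In this model recover() is effectively a no-op when no post-failover message has been received yet, which matches the optimisation mentioned above; a real client would also have to issue the recover to the broker.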

Session Transacted

A Transacted Session should be considered 'dirty' if any uncommitted send/receive operations exist. When the Session is refreshed during failover, if the Session is dirty then any uncommitted send/receive work previously conducted on the Session before failover must be considered 'stale'. When the next call to commit() is made, the Session will automatically be rolled back and a TransactionRolledBackException thrown to notify the application of the situation, allowing it to simply replay its transaction and continue.

Rajith: We need to ensure when we recreate the new session it is marked transacted (using the appropriate tx and dtx amqp commands). Also we should not hold any messages in the replay buffer if a session is marked transacted.

Any call to rollback() performed following failover should be successful, as the failover occurrence was already notified through the ExceptionListener if there was one, and the request to rollback() would result in cleaning the Session and resuming message delivery with the first message sent by the new broker.
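
From the application's point of view, the transacted behaviour amounts to "replay on rollback". The sketch below models that under assumptions: TransactedSessionState and TransactionReplay are hypothetical, and RolledBackException is an unchecked stand-in for javax.jms.TransactionRolledBackException so that the example needs no JMS library.

```java
/** Unchecked stand-in for javax.jms.TransactionRolledBackException (assumption for this sketch). */
final class RolledBackException extends RuntimeException {
    RolledBackException(String message) {
        super(message);
    }
}

/** Hypothetical model of a transacted session across failover. */
final class TransactedSessionState {
    private int uncommittedOps;
    private boolean stale;

    /** Records one uncommitted send or receive. */
    void recordOperation() {
        uncommittedOps++;
    }

    /** Failover: uncommitted work, if any, becomes stale. */
    void onFailover() {
        if (uncommittedOps > 0) {
            stale = true;
        }
    }

    /** Commits, or rolls back and throws if stale work from before failover exists. */
    void commit() {
        boolean wasStale = stale;
        stale = false;
        uncommittedOps = 0;
        if (wasStale) {
            throw new RolledBackException("transaction rolled back because failover occurred");
        }
    }

    /** Always succeeds and leaves the session clean. */
    void rollback() {
        stale = false;
        uncommittedOps = 0;
    }
}

/** Hypothetical application-side helper that replays the unit of work after a rollback. */
final class TransactionReplay {
    /** Returns the number of attempts used; rethrows once maxAttempts is reached. */
    static int commitWithReplay(TransactedSessionState session, Runnable work, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            work.run();
            try {
                session.commit();
                return attempt;
            } catch (RolledBackException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

Bounding the number of replays is a choice made for this sketch; the text above only requires that the application is able to replay its transaction and continue.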

Queue Browsers

If failover occurs while iterating through a QueueBrowser enumeration, a subclass of NoSuchElementException should be thrown by default.
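
One possible shape for this is sketched below. The names BrowserFailoverException and FailoverAwareEnumeration are hypothetical, and the choice to have hasMoreElements() return true after failover (so that the following nextElement() call surfaces the exception instead of the browse ending silently) is an assumption, not something this draft specifies.

```java
import java.util.Enumeration;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical subclass of NoSuchElementException signalling that failover interrupted a browse. */
final class BrowserFailoverException extends NoSuchElementException {
    BrowserFailoverException(String message) {
        super(message);
    }
}

/** Hypothetical enumeration wrapper that fails once failover has been flagged. */
final class FailoverAwareEnumeration<T> implements Enumeration<T> {
    private final Iterator<T> delegate;
    private volatile boolean failedOver;

    FailoverAwareEnumeration(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    /** Called by the connection when failover occurs. */
    void markFailedOver() {
        failedOver = true;
    }

    @Override
    public boolean hasMoreElements() {
        // Assumption: report true after failover so the next nextElement() call throws.
        return failedOver || delegate.hasNext();
    }

    @Override
    public T nextElement() {
        if (failedOver) {
            throw new BrowserFailoverException("failover occurred while browsing");
        }
        return delegate.next();
    }
}
```

Because the exception extends NoSuchElementException, existing code that already catches NoSuchElementException around nextElement() keeps working.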

Temporary Queues

On successful failover, it is expected that a Qpid client should restore all temporary queues (by redeclaring the queues with the same name+attributes) created before failover.

Rajith: I believe temporary queues here means queues created using the createTemporaryQueue or createTemporaryTopic methods. The spec says the lifetime of these queues is tied to that of the respective JMS connection. Therefore one could argue that if we provide transparent failover at the JMS connection level, then we should support failover for these queues as well, as we cannot simply discard them when the underlying AMQP connection is lost.

So marking these queues as auto-delete may not be correct if we interpret the spec to the letter. In other words, the JMS TemporaryQueue and TemporaryTopic don't have a one-to-one relationship with the AMQP temporary queues.

Or else we make a clear statement saying that we don't support TemporaryQueue/Topic through failover. I'm fine with either approach.

Link Reliability Options

Where a Link Reliability option is specified on an Address, it must conform to the Acknowledge Mode being used by the Session. For example, if at-least-once link behaviour is requested for a destination on a No-Acknowledge Session, an exception should be thrown as this combination is contradictory. The JMS Session Acknowledge Mode set in the code should take precedence.
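
A validation step along these lines might look as follows. This is a sketch under assumptions: the enums and the LinkReliabilityCheck class are hypothetical, only the one contradictory combination named above is rejected, and IllegalArgumentException stands in for whatever JMS exception would actually be used.

```java
/** Hypothetical enumeration of session acknowledge modes, including Qpid's No-Acknowledge. */
enum AckMode { AUTO_ACKNOWLEDGE, CLIENT_ACKNOWLEDGE, DUPS_OK_ACKNOWLEDGE, SESSION_TRANSACTED, NO_ACKNOWLEDGE }

/** Hypothetical subset of link reliability options on an Address. */
enum LinkReliability { UNRELIABLE, AT_LEAST_ONCE }

/** Hypothetical check run when a producer or consumer is created for an Address. */
final class LinkReliabilityCheck {
    /**
     * Rejects the combination called out in the text: at-least-once on a
     * No-Acknowledge session. All other combinations are allowed in this sketch.
     */
    static void validate(AckMode sessionMode, LinkReliability requested) {
        if (sessionMode == AckMode.NO_ACKNOWLEDGE && requested == LinkReliability.AT_LEAST_ONCE) {
            throw new IllegalArgumentException(
                "at-least-once link reliability contradicts a No-Acknowledge session");
        }
    }
}
```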

The work on client failover has been postponed.
