Components in Apache Stratos may go down for various reasons (e.g., network issues or hardware failures). Apache Stratos is designed to handle such faults efficiently so that the instances in a cluster become available again promptly. The fault handling process used by Apache Stratos is as follows:

Fault handling when CA or instance is down

Cartridge Agent (CA) periodically sends health statistics via the Thrift protocol to Complex Event Processor (CEP), and CEP monitors these statistics. If CEP does not receive health statistics from a member for a period of time, CEP concludes that either Cartridge Agent or the instance is not available. CEP then publishes the Member Fault event to the Summarized Health Stats Topic in Message Broker. Because Auto-scaler is subscribed to the Summarized Health Stats Topic, it receives this event and takes the necessary action to rectify the issue.
For example:

If the instance is still running, Auto-scaler will terminate the instance and remove all the in-memory content related to that member. Thereafter, based on the Auto-scaler rules, Auto-scaler will decide whether another instance should be spawned in the cluster.
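The detection and recovery flow described above can be sketched roughly as follows. This is only an illustration, not Stratos code: the class `HealthMonitor`, the function `handle_member_fault`, the 60-second timeout, and the `min_members` rule are all hypothetical stand-ins for the CEP timeout logic and the Auto-scaler rules.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HealthMonitor:
    """Hypothetical stand-in for CEP: tracks when each member last reported."""
    timeout_seconds: float = 60.0  # assumed value; the real timeout is not stated here
    last_seen: dict = field(default_factory=dict)

    def record_stats(self, member_id, now=None):
        # Called whenever health statistics arrive from a Cartridge Agent.
        self.last_seen[member_id] = time.monotonic() if now is None else now

    def find_faulty_members(self, now=None):
        # A member is considered faulty if it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(
            member_id
            for member_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def handle_member_fault(member_id, running, members, min_members):
    """Hypothetical Auto-scaler reaction to a Member Fault event.

    running: set of member ids whose instances are still running
    members: dict of in-memory state keyed by member id
    min_members: assumed scaling rule (minimum cluster size)
    """
    actions = []
    if member_id in running:
        running.discard(member_id)      # terminate the instance if it is still up
        actions.append("terminate")
    members.pop(member_id, None)        # drop all in-memory content for the member
    actions.append("remove_state")
    if len(members) < min_members:      # let the scaling rule decide on a replacement
        actions.append("spawn")
    return actions
```

In this sketch the decision to spawn a replacement depends only on a minimum member count; actual Auto-scaler rules may take other factors into account.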

Fault handling when MB is down

In the pub/sub mechanism, after a component subscribes to a topic in Message Broker, it maintains a passive connection and listens for new events that are published to that topic. The subscriber periodically (every 1 second) publishes a Ping event to the Ping topic in Message Broker to check whether Message Broker is running. If Message Broker goes down, the subscriber waits for 30 seconds and then attempts to subscribe to the respective topic again in order to re-establish the connection. This attempt is repeated every 30 seconds until the subscriber is able to connect to Message Broker.
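The re-subscription behaviour can be illustrated with a minimal sketch. The function name `resubscribe_until_connected` and the `subscribe` callback are hypothetical; only the 30-second interval comes from the description above. The `sleep` parameter is injectable so the loop can be exercised without actually waiting.

```python
import time

RETRY_INTERVAL_SECONDS = 30  # interval taken from the description above


def resubscribe_until_connected(subscribe, sleep=time.sleep,
                                retry_interval=RETRY_INTERVAL_SECONDS):
    """Wait, then try to subscribe again; repeat until it succeeds.

    subscribe: hypothetical callable that raises ConnectionError while
               Message Broker is unreachable.
    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        sleep(retry_interval)   # wait before each attempt
        attempts += 1
        try:
            subscribe()
            return attempts     # connection re-established
        except ConnectionError:
            continue            # broker still down, try again later
```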

In a production environment, it is recommended to use multiple Message Broker instances as a failover mechanism. If one Message Broker instance goes down, the other Message Broker instance takes over. However, if both Message Broker instances go down, the Ping mechanism helps the subscriber detect that Message Broker is down, so that the subscriber can re-establish its connection as soon as one of the Message Broker instances becomes active.
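As a rough illustration of failover from the subscriber's point of view, the sketch below tries each configured broker in turn and reports which one accepted the connection. The function `connect_to_first_available`, the `connect` callback, and the broker URLs in the usage example are hypothetical; how Stratos actually selects a broker is not described here.

```python
def connect_to_first_available(broker_urls, connect):
    """Try each broker in order and return the URL of the first that accepts.

    connect: hypothetical callable that raises ConnectionError if the
             broker at the given URL is unreachable.
    Returns None when every broker is down, in which case the caller
    would fall back to periodic retries.
    """
    for url in broker_urls:
        try:
            connect(url)
            return url
        except ConnectionError:
            continue  # this broker is down, try the next one
    return None
```

For example, with two placeholder brokers `tcp://mb1:61616` and `tcp://mb2:61616`, a subscriber whose first broker is unreachable would end up connected to the second.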

In addition, when Message Broker goes down, the publisher periodically (every 60 seconds) retries publishing events to Message Broker until Message Broker is active again, so that messages are not lost.
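One way to picture the publisher side is a small buffer that keeps an event until the broker has accepted it. This is a sketch under assumptions: the class `BufferingPublisher` and the `send` callback are hypothetical, and only the 60-second retry interval comes from the text above.

```python
import time
from collections import deque

PUBLISH_RETRY_SECONDS = 60  # interval taken from the description above


class BufferingPublisher:
    """Hypothetical publisher that retries until each event is delivered."""

    def __init__(self, send, sleep=time.sleep,
                 retry_interval=PUBLISH_RETRY_SECONDS):
        self._send = send                  # raises ConnectionError while the broker is down
        self._sleep = sleep                # injectable so tests do not really wait
        self._retry_interval = retry_interval
        self._pending = deque()            # events not yet accepted by the broker

    def publish(self, event):
        self._pending.append(event)
        self.flush()

    def flush(self):
        # Deliver events in order; an event leaves the buffer only after
        # the broker has accepted it, so nothing is dropped on failure.
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                self._sleep(self._retry_interval)
                continue
            self._pending.popleft()
```

Note that `publish` blocks while the broker is down in this sketch; a real publisher might retry on a background thread instead.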
