To be Reviewed By: 2 April 2022
Authors: Mario Ivanac
Status: Draft | Discussion | Active | Dropped | Superseded
Superseded by: N/A
Related: N/A
Problem
It has been observed that when a server with a full parallel gateway sender queue is restarted,
it drains its queue much more slowly after coming back up than the other members of the cluster.
The reason is the current logic, which marks all events in the queue as possible duplicates when secondary buckets become primary (or when a bucket is recovered and becomes primary).
As a result, the receiving site processes these events more slowly, because it must check whether each event is a duplicate.
Under the current logic, if the link between sites is temporarily down, a fixed number of events is marked as possible duplicates on each server.
This number is the product of the number of dispatcher threads (default 5) and the batch size (default 100), so for the default configuration it is 500.
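The figure of 500 can be checked with a short calculation. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes, as this RFC does, that each dispatcher thread has at most one batch in flight at a time.

```java
public class PossibleDuplicateEstimate {
    // Defaults quoted in this RFC; both values are configurable on the sender.
    static final int DEFAULT_DISPATCHER_THREADS = 5;
    static final int DEFAULT_BATCH_SIZE = 100;

    // Assumption: one in-flight batch per dispatcher thread, so the number of
    // events that can be marked as possible duplicates is threads * batch size.
    static int maxMarkedEvents(int dispatcherThreads, int batchSize) {
        return dispatcherThreads * batchSize;
    }

    public static void main(String[] args) {
        int marked = maxMarkedEvents(DEFAULT_DISPATCHER_THREADS, DEFAULT_BATCH_SIZE);
        System.out.println(marked); // prints 500
    }
}
```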
Additionally, duplicate events are also possible if the sender is stopped, or the server hosting the sender is gracefully shut down, while events are being dispatched.
Anti-Goals
The solution will not cover Parallel Async Events.
The solution will not cover the case where a server is shut down ungracefully.
Solution
For each dispatcher thread, when events are dispatched unsuccessfully and are marked as possible duplicates, notify the secondary bucket about the same batch of events.
Apply the same logic to batches that are pending at the moment the sender is stopped: when they are marked as possible duplicates, notify the secondary bucket about the same batch of events.
Add a new API, prepareForStop(), which is called before the cache is closed so that the notifications can be sent before shutdown.
Remove the current logic that marks all events in a bucket as possible duplicates when the bucket becomes primary.
PR with proposed solution: https://github.com/apache/geode/pull/7323
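The intended effect can be illustrated with a toy model. This is a minimal sketch, not the code from the PR: Event, SecondaryBucket, notifyPossibleDuplicate, becomePrimary and onDispatchFailed are all hypothetical names, and the real queue, bucket and messaging classes in Geode are considerably more involved. It only shows that when the primary notifies the secondary about a failed batch, a later promotion flags that batch alone instead of the whole queue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PossibleDuplicateSketch {
    /** Hypothetical stand-in for a queued WAN event. */
    static final class Event {
        final long key;
        boolean possibleDuplicate;
        Event(long key) { this.key = key; }
    }

    /** Hypothetical stand-in for the secondary copy of a bucket. */
    static final class SecondaryBucket {
        final List<Event> queue = new ArrayList<>();
        final Set<Long> notifiedKeys = new HashSet<>();

        // Called by the primary for a batch it has marked as possible duplicate.
        void notifyPossibleDuplicate(List<Event> batch) {
            for (Event e : batch) {
                notifiedKeys.add(e.key);
            }
        }

        // On promotion, flag only the events the old primary reported,
        // rather than every event in the queue.
        void becomePrimary() {
            for (Event e : queue) {
                if (notifiedKeys.contains(e.key)) {
                    e.possibleDuplicate = true;
                }
            }
        }
    }

    // Primary side: mark the failed batch and tell the secondary about it.
    static void onDispatchFailed(List<Event> batch, SecondaryBucket secondary) {
        for (Event e : batch) {
            e.possibleDuplicate = true;
        }
        secondary.notifyPossibleDuplicate(batch);
    }

    static long countFlagged(SecondaryBucket bucket) {
        long flagged = 0;
        for (Event e : bucket.queue) {
            if (e.possibleDuplicate) {
                flagged++;
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        SecondaryBucket secondary = new SecondaryBucket();
        for (long k = 0; k < 1000; k++) {
            secondary.queue.add(new Event(k));
        }
        // The primary holds its own copies of the first 100 events as one batch.
        List<Event> failedBatch = new ArrayList<>();
        for (long k = 0; k < 100; k++) {
            failedBatch.add(new Event(k));
        }
        onDispatchFailed(failedBatch, secondary);
        secondary.becomePrimary();
        System.out.println(countFlagged(secondary)); // prints 100
    }
}
```

In this model, 100 of the 1000 queued events end up flagged after promotion, whereas the logic being removed would flag all 1000.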
Changes and Additions to Public Interfaces
Add a prepareForStop() method to the GatewaySender interface.
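The intended call order can be sketched as follows. This is only an illustration under stated assumptions: Sender is a hypothetical, trimmed-down stand-in for the GatewaySender interface, closeCache is a hypothetical stand-in for the cache close sequence, and the no-argument void signature of prepareForStop() is assumed rather than taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

public class PrepareForStopSketch {
    /** Hypothetical, trimmed-down stand-in for the GatewaySender interface. */
    interface Sender {
        // Proposed hook: send possible-duplicate notifications for pending batches.
        void prepareForStop();

        void stop();
    }

    /** Test double that records the order in which its methods are called. */
    static final class RecordingSender implements Sender {
        final List<String> calls = new ArrayList<>();

        @Override
        public void prepareForStop() {
            calls.add("prepareForStop");
        }

        @Override
        public void stop() {
            calls.add("stop");
        }
    }

    // Hypothetical close sequence: every sender is prepared before any is stopped,
    // so the notifications go out while the members are still connected.
    static void closeCache(List<? extends Sender> senders) {
        for (Sender sender : senders) {
            sender.prepareForStop();
        }
        for (Sender sender : senders) {
            sender.stop();
        }
    }

    public static void main(String[] args) {
        RecordingSender sender = new RecordingSender();
        List<RecordingSender> senders = new ArrayList<>();
        senders.add(sender);
        closeCache(senders);
        System.out.println(sender.calls); // prints [prepareForStop, stop]
    }
}
```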
Performance Impact
No impacts.
Backwards Compatibility and Upgrade Path
No impacts.
Prior Art
What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?
FAQ
Answers to questions you’ve commonly been asked after requesting comments for this proposal.
Errata
What are minor adjustments that had to be made to the proposal since it was approved?
4 Comments
Dan Smith
> This number is multiplication of number of dispatcher threads (default 5) and batch size (default 100). So for default configuration this number is 500.
Is this really true? For some reason I was under the impression that the gateway sender was streaming batches of data and receiving acks asynchronously. If that is the case, the number of events that have been sent but not acknowledged could be much greater.
Mario Ivanac
For WAN, if a batch is dispatched unsuccessfully, we retry sending the same batch until it succeeds or the gateway sender is stopped. This is according to implementation in
Mario Ivanac
Currently this RFC only covers the problem that occurs when the link between sites is broken and the queue is filling. The problem is that if any server is now restarted, all events in buckets that become primary are marked as possible duplicates (instead of the 500 that were marked in the restarted server).
Mario Ivanac
Updated solution