Improve possible duplicate logic for secondary buckets in WAN

To be Reviewed By: 2 April 2022

Authors: Mario Ivanac

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

It has been observed, that in case server with full parallel gateway sender queue is restarted,

after it is up, it unqueues events much slower , than other members in the cluster.

Reason for this is current logic, to mark all events in queue as possible duplicate when secondary buckets are becoming primary (or bucket is recovered and becoming primary).

Due to this, events on receiving site are processed slower (checking each event, is it duplicate).

If we look at current logic, if link between sites is temporarily down, then exact number of events will be marked as possible duplicate for each server.

This number is multiplication of number of dispatcher threads (default 5) and batch size (default 100). So for default configuration this number is 500.

Additionally, duplicate events are also possible, while dispatching events, if sender is stopped, or server hosting sender is gracefully shutdown.

Anti-Goals

Solution will not cover Parallel Async Events.

Solution will not cover situation when server is ungracefully shutdown.

Solution

Implement logic, that for each dispatch thread, in case events are unsuccessfully dispatched, when marking them as possible duplicate, for same batch of events notify secondary bucket.

Also add logic for pending batches at moment of stopping sender, when marking them as possible duplicate, for same batch of events notify secondary bucket.

Also add new API prepareForStop(), which will be called prior to closing of cache, so notifications can be sent prior shutdown.

Remove current logic, to mark all events in bucket, when it becomes primary.

PR with proposed solution: https://github.com/apache/geode/pull/7323

Changes and Additions to Public Interfaces

Add prepareForStop() method in GatewaySender interface.

Performance Impact

No impacts.

Backwards Compatibility and Upgrade Path

No impacts.

Prior Art

What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?

Space shortcuts

Page tree

Problem

Anti-Goals

Solution

Changes and Additions to Public Interfaces

Performance Impact

Backwards Compatibility and Upgrade Path

Prior Art

FAQ

Errata

4 Comments

Dan Smith

Mario Ivanac

Mario Ivanac

Mario Ivanac