Authors: Alberto Bustamante Reyes (alberto.bustamante.reyes@est.tech)
Status: Development
Superseded by: N/A
Related: N/A
Problem
There is a problem with Geode WAN replication when GW receivers are configured with the same hostname-for-senders and port on all servers. The reason for such a setup is deploying a Geode cluster on a Kubernetes cluster where all GW receivers are reachable from the outside world on the same VIP and port. Other kinds of configuration (a different hostname and/or a different port for each GW receiver) are expensive from a resources perspective in cloud-native environments and also limit some important use cases (like scaling).
Currently, it is possible to configure GW receivers this way, but some problems derived from this configuration must be solved before it can be stated that this configuration is supported by Geode.
The printouts in this wiki were obtained from a minikube environment, using two Geode clusters, "Cluster-1" and "Cluster-2". The Geode software used was the develop branch, at c8413592e5573f675c538c63ef9ee9f97a349e73.
Each cluster contains a locator and two servers; "Cluster-1" has GW senders and "Cluster-2" has GW receivers.
$ kubectl --namespace=geode-cluster-2 get all
NAME READY STATUS RESTARTS AGE
pod/locator-0 1/1 Running 0 55m
pod/server-0 1/1 Running 0 55m
pod/server-1 1/1 Running 0 55m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/locator-site2-service ClusterIP None <none> 10334/TCP 55m
service/receiver-site2-service NodePort 10.103.196.204 <none> 32000:32000/TCP 55m
service/server-site2-service ClusterIP None <none> 30303/TCP 55m
NAME READY AGE
statefulset.apps/locator 1/1 55m
statefulset.apps/server 2/2 55m
Problem 1: Gw sender failover
The problem experienced is that shutting down one server stops replication to this cluster until the server is up again. This happens because Geode incorrectly assumes there are no more alive servers when just one of them is down: since they share hostname-for-senders and port, they are treated as the same server.
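The root of the failover problem is that Geode identifies receivers by their ServerLocation (host plus port), so two receivers sharing hostname-for-senders and port compare as equal. A minimal sketch of that collapse follows; it assumes the constructor and equality semantics of org.apache.geode.distributed.internal.ServerLocation (host and port only), which should be treated as an assumption rather than a guaranteed API:

import org.apache.geode.distributed.internal.ServerLocation;

public class ServerLocationCollision {
  public static void main(String[] args) {
    // Both receivers advertise the same hostname-for-senders and port.
    ServerLocation receiver0 =
        new ServerLocation("receiver-site2-service.geode-cluster-2.svc.cluster.local", 32000);
    ServerLocation receiver1 =
        new ServerLocation("receiver-site2-service.geode-cluster-2.svc.cluster.local", 32000);

    // equals()/hashCode() only consider host and port, so two distinct members
    // collapse into one logical server. When one of them is stopped, the sender
    // believes its only known receiver is gone and replication stops.
    System.out.println(receiver0.equals(receiver1)); // prints "true"
  }
}

Any map or set keyed by ServerLocation therefore holds a single entry for all the receivers, which is exactly the situation described above.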
Example, using "cluster-1" and "cluster-2", both with one locator and two servers:
Cluster-1 gfsh>list members
Member Count : 3
Name | Id
--------- | ------------------------------------------------------------
server-0 | 172.17.0.4(server-0:65)<v1>:41000
locator-0 | 172.17.0.6(locator-0:25:locator)<ec><v0>:41000 [Coordinator]
server-1 | 172.17.0.8(server-1:47)<v1>:41000
Cluster-1 gfsh>list gateways
GatewaySender Section
GatewaySender Id | Member | Remote Cluster Id | Type | Status | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | --------------------- | ------------- | --------------------------------------------------------------
sender-to-2 | 172.17.0.4(server-0:65)<v1>:41000 | 2 | Parallel | Running and Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2 | 172.17.0.8(server-1:47)<v1>:41000 | 2 | Parallel | Running and Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
Cluster-2 gfsh>list members
Member Count : 3
Name | Id
--------- | ------------------------------------------------------------
server-0 | 172.17.0.5(server-0:65)<v1>:41000
locator-0 | 172.17.0.7(locator-0:25:locator)<ec><v0>:41000 [Coordinator]
server-1 | 172.17.0.9(server-1:46)<v1>:41000
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -----------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6 | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 8 | 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000
If one server is stopped on "cluster-2", both senders in "cluster-1" are disconnected:
Problem 2: Gw sender pings not reaching gw receivers
GW senders internally use a client pool which sends ping messages to the GW receiver they are connected to. In the receivers, the ClientHealthMonitor thread is in charge of handling these ping messages. If no ping is received from a given sender within the allowed interval, that sender is considered down and its connection is closed. When all GW receivers are configured with the same host and port, ping messages do not reach all the receivers, just one of them, so connections are closed.
When the same host and port are used for all GW receivers, pings are not handled correctly. The following examples show the errors we have seen:
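To make the timeouts in the logs below easier to follow, here is a small sketch of the check the receiver side performs: a sender is terminated when the time since its latest heartbeat exceeds the maximum allowed interval (60000 ms in these tests). This is a simplified illustration, not the actual ClientHealthMonitor code, and all class and method names in it are hypothetical:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the stale-client sweep done on the receiver side.
// All names are hypothetical; only the timeout rule mirrors the logs below.
public class HealthSweepSketch {
  // maximum allowed time between pings (60000 ms in the logs below)
  private static final long MAX_INTERVAL_MS = 60_000;

  // latest heartbeat per connected sender, keyed by its member id string
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  void recordPing(String senderMemberId) {
    lastHeartbeat.put(senderMemberId, System.currentTimeMillis());
  }

  void sweep() {
    long now = System.currentTimeMillis();
    lastHeartbeat.forEach((sender, lastSeen) -> {
      if (now - lastSeen > MAX_INTERVAL_MS) {
        // With a shared host and port the ping may have landed on the other
        // receiver, so a perfectly healthy sender gets terminated here.
        System.out.printf("Unregistering %s: %d ms since the latest heartbeat%n",
            sender, now - lastSeen);
        lastHeartbeat.remove(sender);
      }
    });
  }
}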
Example 1:
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | ------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6 | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 8 | 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000
But after some time, connections from one of the senders are closed. Connections from cluster-1/server-1 have disappeared from cluster-2/server-1's list of connected senders:
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | ------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 6 | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.8(server-1:47)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 5 | 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000, 172.17.0.4(server-0:65)<v1>:41000
Looking at the ClientHealthMonitor logs on both servers:
root@server-0:/# grep ClientHealthMonitor server-0/server-0.log
[info 2020/03/10 11:13:38.546 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
root@server-1:/# grep ClientHealthMonitor server-1/server-1.log
[info 2020/03/10 11:13:38.700 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: Unknown reason
[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1. It had been 60595 ms since the latest heartbeat. Max interval is 60000. Terminated client.
And some minutes later, all connections are lost:
Cluster-1 gfsh>list gateways
GatewaySender Section
GatewaySender Id | Member | Remote Cluster Id | Type | Status | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | ---------------------- | ------------- | --------------------------------------------------------------
sender-to-2 | 172.17.0.4(server-0:65)<v1>:41000 | 2 | Parallel | Running, not Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2 | 172.17.0.8(server-1:47)<v1>:41000 | 2 | Parallel | Running, not Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -----------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0 |
172.17.0.9(server-1:46)<v1>:41000 | 32000 | 0 |
Checking the logs again, we can see new logs from the ClientHealthMonitor:
root@server-0:/# grep ClientHealthMonitor server-0/server-0.log
[info 2020/03/10 11:13:38.546 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
[warn 2020/03/10 11:20:12.203 GMT <ServerConnection on port 32000 Thread 3> tid=0x45] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: The connection has been reset while reading the header
[warn 2020/03/10 11:22:22.336 GMT <ServerConnection on port 32000 Thread 6> tid=0x4c] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1 due to: The connection has been reset while reading the header
root@server-1:/# grep ClientHealthMonitor server-1/server-1.log
[info 2020/03/10 11:13:38.700 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1 due to: Unknown reason
[warn 2020/03/10 11:14:52.763 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.8(server-1:47)<v1>:41000,connection=1. It had been 60595 ms since the latest heartbeat. Max interval is 60000. Terminated client.
[warn 2020/03/10 11:22:13.064 GMT <ClientHealthMonitor Thread> tid=0x39] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1 due to: Unknown reason
[warn 2020/03/10 11:22:13.065 GMT <ClientHealthMonitor Thread> tid=0x39] Monitoring client with member id identity(172.17.0.4(server-0:65)<v1>:41000,connection=1. It had been 60747 ms since the latest heartbeat. Max interval is 60000. Terminated client.
Example 2:
Cluster-1 gfsh>list members
Member Count : 3
Name | Id
--------- | ------------------------------------------------------------
server-0 | 172.17.0.4(server-0:69)<v1>:41000
locator-0 | 172.17.0.6(locator-0:26:locator)<ec><v0>:41000 [Coordinator]
server-1 | 172.17.0.8(server-1:46)<v1>:41000
Cluster-1 gfsh>list gateways
GatewaySender Section
GatewaySender Id | Member | Remote Cluster Id | Type | Status | Queued Events | Receiver Location
---------------- | --------------------------------- | ----------------- | -------- | --------------------- | ------------- | --------------------------------------------------------------
sender-to-2 | 172.17.0.4(server-0:69)<v1>:41000 | 2 | Parallel | Running and Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
sender-to-2 | 172.17.0.8(server-1:46)<v1>:41000 | 2 | Parallel | Running and Connected | 0 | receiver-site2-service.geode-cluster-2.svc.cluster.local:32000
Cluster-2 gfsh>list members
Member Count : 3
Name | Id
--------- | ------------------------------------------------------------
server-0 | 172.17.0.5(server-0:65)<v1>:41000
locator-0 | 172.17.0.7(locator-0:24:locator)<ec><v0>:41000 [Coordinator]
server-1 | 172.17.0.9(server-1:51)<v1>:41000
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 7 | 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 7 | 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000
And after some seconds:
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0 |
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 7 | 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.4(server-0:69)<v1>:41000
Logs of the servers. In this test, both senders were considered down by one of the receivers:
root@server-0:/# grep ClientHealthMonitor server-0/server-0.log
[info 2020/03/10 14:02:34.130 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
[warn 2020/03/10 14:03:56.191 GMT <ClientHealthMonitor Thread> tid=0x37] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.8(server-1:46)<v1>:41000,connection=1 due to: Unknown reason
[warn 2020/03/10 14:03:56.192 GMT <ClientHealthMonitor Thread> tid=0x37] Monitoring client with member id identity(172.17.0.8(server-1:46)<v1>:41000,connection=1. It had been 60507 ms since the latest heartbeat. Max interval is 60000. Terminated client.
[warn 2020/03/10 14:03:56.194 GMT <ClientHealthMonitor Thread> tid=0x37] ClientHealthMonitor: Unregistering client with member id identity(172.17.0.4(server-0:69)<v1>:41000,connection=1 due to: Unknown reason
[warn 2020/03/10 14:03:56.194 GMT <ClientHealthMonitor Thread> tid=0x37] Monitoring client with member id identity(172.17.0.4(server-0:69)<v1>:41000,connection=1. It had been 60444 ms since the latest heartbeat. Max interval is 60000. Terminated client.
root@server-1:/# grep ClientHealthMonitor server-1/server-1.log
[info 2020/03/10 14:02:34.275 GMT <main> tid=0x1] ClientHealthMonitorThread maximum allowed time between pings: 60000
And some minutes later:
Cluster-2 gfsh>list gateways
GatewayReceiver Section
Member | Port | Sender Count | Senders Connected
--------------------------------- | ----- | ------------ | -------------------------------------------------------------------------------------------------------
172.17.0.5(server-0:65)<v1>:41000 | 32000 | 0 |
172.17.0.9(server-1:51)<v1>:41000 | 32000 | 3 | 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000, 172.17.0.8(server-1:46)<v1>:41000
Anti-Goals
N/A
Solution
Gw sender failover
The solution consists of refactoring some maps in the LocatorLoadSnapshot class. They use ServerLocation objects as keys; this has to change because the ServerLocation will not be unique for each server. We changed the maps to use InternalDistributedMember objects as keys for the map entries. The ServerLocation information is not lost, as it is contained in the entry value for all the maps.
The same refactoring is done in EndpointManager, as it holds a map of endpoints that also uses ServerLocation objects as keys.
Check this commit for a draft of the proposed solution: https://github.com/apache/geode/pull/4824/commits/b180869c73095e7a810ba2e1c92e243a0220e888
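As a minimal sketch of the keying change, the fragment below contrasts a map keyed by ServerLocation with one keyed by InternalDistributedMember. The field names and the LoadHolder type are simplified stand-ins, not the actual LocatorLoadSnapshot internals; refer to the commit above for the real change:

import java.util.HashMap;
import java.util.Map;

import org.apache.geode.distributed.internal.ServerLocation;
import org.apache.geode.distributed.internal.membership.InternalDistributedMember;

// Simplified stand-in for the load bookkeeping; LoadHolder and the field names
// are hypothetical, only the key types illustrate the refactoring.
class LoadSnapshotSketch {
  static class LoadHolder {
    final ServerLocation location; // the ServerLocation is kept in the value
    float load;

    LoadHolder(ServerLocation location, float load) {
      this.location = location;
      this.load = load;
    }
  }

  // Before: keyed by ServerLocation. With a shared hostname-for-senders and port,
  // all receivers map to the same key, so one entry overwrites another and the
  // loss of a single server looks like the loss of the whole remote site.
  private final Map<ServerLocation, LoadHolder> byLocation = new HashMap<>();

  // After: keyed by the member id, which stays unique per receiver, while the
  // ServerLocation is still available inside the value.
  private final Map<InternalDistributedMember, LoadHolder> byMember = new HashMap<>();
}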
Gw sender pings not reaching gw receivers
When PingTasks are run by the LiveServerPinger, they call PingOp.execute(ExecutablePool pool, ServerLocation server). PingOp only uses hostname and port (ServerLocation) to get the connection to send the ping message. As all receivers share the same host and port, it is not guaranteed that the connection is really pointing to the server we want to connect to.
The solution consists of modifying the ping messages to include information about the server they want to reach. If a message is received by another server, it can be forwarded to the proper server.
Another alternative is the addition of a retry mechanism to PingOp, to be able to discard a connection if the endpoint of that connection is not the server we want to connect to. We have added a new method, PingOp.execute(ExecutablePool pool, Endpoint endpoint), to solve this. In this way, if the connection obtained is not pointing to the required Endpoint, it can be discarded and a new one requested.
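A minimal sketch of the retry idea follows. It assumes a pool that can hand out connections and report which member a connection actually reached; the interfaces and method names are hypothetical stand-ins, not the real PingOp/ExecutablePool API:

// Hypothetical types: a connection that knows which member it reached, and a
// pool that hands out connections to the shared host/port "at random".
interface SketchConnection {
  String memberId();
  void sendPing();
  void destroy(); // discard a connection that reached the wrong receiver
}

interface SketchPool {
  SketchConnection acquireConnection();
}

class PingWithRetrySketch {
  private static final int MAX_RETRIES = 5;

  void ping(SketchPool pool, String targetMemberId) {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
      SketchConnection connection = pool.acquireConnection();
      if (targetMemberId.equals(connection.memberId())) {
        connection.sendPing(); // reached the intended receiver
        return;
      }
      // Wrong receiver behind the shared VIP and port: discard and try again.
      connection.destroy();
    }
    // Give up for this round; the next scheduled ping task will try again.
  }
}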
Other alternatives to the retry mechanism, which we have not explored, could be:
- Add an option for deactivating the ping mechanism for GW sender/GW receiver communication.
- Send the ping using only existing connections, without creating new ones.
Changes and Additions to Public Interfaces
N/A
Performance Impact
When getting the connection to execute the ping, some retries could happen until the right connection is obtained, so this operation will take longer, but we do not think it will impact performance.
Backwards Compatibility and Upgrade Path
N/A
Prior Art
After checking with the dev mailing list, we received the suggestion to configure serverAffinity in Kubernetes to solve the issue with the pings, but that option broke the failover of gw senders when a gw receiver is down.
FAQ
TBD
Errata
N/A
8 Comments
Alberto Bustamante Reyes
Please ignore this comment for the moment.
We have seen that the issue in testExecuteOp described in the Annex is also happening in the develop branch; it is not an issue we have introduced with our changes. Debugging the test case, it can be seen that both endpoints, corresponding to the two servers, have the same memberId. How to check it: debug the test case with a breakpoint in the OpExecutorImpl class, at the following line of the "public Object execute(Op op, int retries)" method. It can be seen that the endpoint object inside the conn object has a different port value in the memberId and in the location. For example (in this case, server ports were 22683 & 22684). If you see the same port, set another breakpoint where the conn object is reassigned and check again.
Bruce J Schuchardt
A lot of the changes you're suggesting stem from using the same hostname-for-clients in the servers but I don't think this is necessary. Instead you should focus on the Pools used to communicate with other servers/locators. Upthewaterspout (Dan Smith) recently introduced a set of Pool changes to use a gateway. Look at his RFC and code changes. His changes are for SSL but the same kind of setting on a Pool can be created for non-SSL. Set a gateway hostname in the Pool and change the Pool and SocketCreator code to connect to your gateway rather than the requested hostname. That will give you the same behavior you currently have without needing to change everything to use the cluster membership identifiers and will be in line with how everyone else is implementing gateway support. The Pool will get hostnames from servers that it may not be able to resolve, but that's okay - it doesn't need to resolve those hostnames if the SocketCreator is changed to always connect to the gateway.
The Ping problem will still remain with this approach. Instead of having the pool retry, causing extra connection processing in the server, have you considered changing the ping protocol to make the pool send the address of the server it's trying to contact? The server receiving the Ping could then forward it to the correct Server. You'd need to implement a new DistributionMessage to do the forwarding.
Alberto Bustamante Reyes
I have modified the description of the solution for the ping problem, including Bruce's suggestion. I'm also extending the date for receiving comments.
Bruce J Schuchardt
I think I'm convinced that we need these changes in order to support this kind of configuration. If we go with an overriding socket factory in the gateway Senders they'll be connecting to Receivers at random and cleanup of Connections in the Sender's Pool will be a problem because the Connection would actually be connected to some random Receiver and not necessarily the one it thinks.
Alberto Bustamante Reyes
Just want to clarify that this solution will not prevent senders from connecting to a random receiver (there is no way to control that), but at least we are providing a way to differentiate which server they are connecting to.
I have been told that this solution is solving an issue when hostname-for-senders is used in locators: servers were not able to know that more than one locator was present, so they were sending load info just to one of them.
Aaron Lindsey
Why not use ingress routing combined with a headless service to solve this problem for Kubernetes?
In your example, your service exposes a VIP (clusterIP). This type of service is typically used for load balancing between replicas without unique identities, such as stateless web services. Headless services, on the other hand, do not create a VIP and allow DNS address resolution of individual pods with unique identities.
As you pointed out, it would be prohibitively expensive to allocate external IP addresses for each GW receiver. This is one of the primary use cases of ingress routing, where you have a single external IP backed by an ingress controller. In this scenario, the ingress controller could act as a reverse proxy to backend GW receivers.
I think I am basically describing what several other people have stated or alluded to either here or in the dev list email thread, so sorry if this is just repeating what has already been discussed. However, I still can't wrap my head around why we would not use ingress for this in Kubernetes.
Ref:
Aaron Lindsey
Dan Smith and I chatted offline and it seems I am confusing ingress (primarily for HTTP(S)) with a proxy/gateway approach like Bruce J Schuchardt described in his comment on Mar 23. I tend to prefer the solution Bruce described, as it feels strange to put the GW receivers behind a VIP load balancer where the sender does not know to which receiver it is going to end up connecting. That said, this solution has the benefit of not needing to install an additional proxy/gateway, and I don't see any issue with doing it this way.
I wish there had been more discussion in this RFC or prior to its creation about alternative approaches to solving the problem of WAN replication across Kubernetes clusters.
Alberto Bustamante Reyes
RFC implemented by GEODE-7565 and GEODE-8004.