Requirements
When using multiple WAN sites with geode users are required to set the distributed-system-id property (DSID). Each site must have its own unique DSID, and ids are limited to being between 1 and 255. This limits the number of WAN sites to 255. It also requires that external coordination before setting up a WAN site to make it that it gets a DSID. Once data is generated within a site, it is not possible to change the DSID without exporting and reimporting the data.
Goals:
- Allow more than 255 WAN sites
- Be able to do a rolling upgrade to a version that supports more that 255 WAN sites
- Nice to have - reduce coordination required when setting up a new WAN site. Do we have to specify and ID at all?
Not in scope:
- Support old clients, peers on WAN sites while also having large DSIDs. Before using larger IDs users will have to upgrade everything.
Background
The distributed-system-id is used for three things: WAN receiver discovery, WAN conflict detection and PDX type ID generation.
WAN receiver discovery
When creating a gateway-sender users specify the remote-distributed-system-id property. That property which indicates which WAN site the sender should send to. Users configure the remote-locators property on their locator to point to locators in other WAN sites. Geode then will discover the DSID and addresses of remote gateway-receivers and connect the sender to those receivers.
WAN Conflict Detection
Because WAN is asynchronous it's possible for two WAN sites to modify the same entry at the same time. For this case geode uses an eventual consistency model where all sites should end up choosing the same final value.
In order to detect that there is a conflict between updates from two different WAN sites geode compares the DSID of the two modifications. If the DSIDS are different geode then performs WAN specific conflict resolution, which involves comparing timestamps or invoking a user conflict resolver.
Geode passes the distributed-system-id along with events when doing a put over the WAN. See VersionTag.toData and GatewaySenderGFEBatchOpImpl for a couple of places where this is sent over the wire.
Pdx Type ID Generation
When serializing an object with PDX geode generates a PDXType which describes how to deserialize the object. Geode generates a unique ID for that type (PDXID) and embeds the ID in the serialized data. The PdxType is stored in the PDX type registry which is available everywhere, so that all members can find the PDXType using the PDXID.
In order to generate unique PDXIDs each WAN site embeds the DSID in PDXIDs generated in that site. That ensures two WAN sites never generate the same PDXID. PDXIDs are 4 bytes. The high byte is the DSID and the remaining 3 bytes are generated under a lock within a given WAN site.
Proposals
Option 1 - Just make the ID bigger
Most of the public APIs treat DSID as an integer. So we could just increase the size of the DSID to be an integer. This would add an extra 3 bytes of overhead to every update message and to every entry in the system, since DSIDs are passed around and stored for WAN Conflict Detection. It would also add an extra 3 bytes for every serialized object. Because values stored in geode are often object graphs and not single objects, that means that a single serialized value may have a much larger overhead increase than 3 bytes.
Option 2 - Change the ratio of DSID to other bytes in the PDXID
The PDXID currently allocates 1 byte for the DSID and 3 bytes for the unique ID of the type. We could give the DSID an extra byte without changing the size of the PDXID. This would allow 64K DSIDs and 64K types.
Since the DSID is larger update messages and entries will need an extra byte of overhead, but serialized values would stay the same size.
Unfortunately there is a lot of existing data out there that has a 4 byte PDXID that was generated with the old format. So this proposal would still require all of the rolling upgrade work to translate between formats that the other options require.
Option 3 - Use string IDs to identify WAN sites, switch to a 16 byte hash for the PDXID
This is the "if space is cheap" option. If we are willing to take on a bunch of extra overhead we could get some usability benefits by getting rid of the integer dsid entirely. Here's one way that might look:
- Change the id used for WAN discovery to a string. Since this is really just a label for the remote WAN site users should be able to put in whatever string they want and change it whenever they want.
- Generate a UUID to identify WAN sites for conflict detection. This would allow us to detect that two changes came from different WAN sites without requiring the user to select an id. This would increase the overhead of update messages and the storage space for every entry by 15 bytes.
- Using a cryptographic hash function, generate the ID of a PdxType by taking a 128 bit hash of the PDX type. With this solution we can generate a unique id for the type without any need to include a DSID or even do all of the complicated locking we do now. By using a hash instead of UUID we get the benefit that we can regenerate the type registry from a users domain classes if the registry gets lost somehow.
Rolling Upgrades
Rolling upgrade concerns with WAN receiver discovery and conflict detection
For receiver discovery and conflict detection, we have to change the size of the DSID that is passed around in various messages - In the VersionTag, in GatewaySenderGFEBatchOpImpl. We also have to change the size of the DSID written to disk. When sending messages to old members or reading from old disk files we would need to translate the ID for all of these messages.
If a newer member is trying to send a DSID larger than a byte to an older member, it should throw an exception.
This means that a user would be able to upgrade all of the WAN sites/clients, etc. while still using the small DSIDs. Once that is done they could start introducing sites with larger DSIDs.
Rolling upgrade concerns with the PDX Type ID
There are special concerns with changing the format of anything that is stored as a region value, because we pass around and persist values as opaque byte arrays. We don't keep track of the version of the member that generated the byte array.
One option for upgrading the type ID would be introduce a new DataSerializer code for PDX values with larger ids. PDX is really a subformat of DataSerializable. When a user stores a PDX object it is serialized using DataSerializable.writeObject. The actual bytes generated look something like this
93 | PDXID | pdx bytes |
The leading 93 is the DataSerializable constant that indicates that the following bytes are PDX (See DSCODE.PDX). So we could introduce a new code like DSCODE.PDX_2 that indicates PDX bytes with a different ID format. New members could use the old code if the PDXID is less than 255, and the new code if it is larger. In that way a user would be able to upgrade all of their sites, clients, etc. before introducing members with larger DSIDs.
If an older member was passed a value that was serialized using DSCODE.PDX_2, it would just fail to deserialize the value.
2 Comments
Galen O'Sullivan
Nice writeup!
Dan Smith
Mike Stolz Brought up the idea that with Option 1 (expand the dsid size) PDX_2 dsids could start from 255. In other words we always add 255 to the dsid we see in PDX_2 messages, since < 255 will be encoded as DSCODE.PDX. If we couple that with using a varint encoding for the dsid, PDX_2 objects might not really add 3 additional bytes of overhead (0 if the id < 255). Seems like a good idea.