RFC Title
To be Reviewed By: Friday, May 8th
Authors: ringles@vmware.com, sabbey@vmware.com
Status: Draft | Discussion | Active | Dropped | Superseded
Superseded by: N/A
Related: N/A
Problem
The current Redis API support in Geode deviates from the Redis command specification in a number of places. Additionally, the underlying data structures have concurrency and high-availability defects, and suffer from significant performance issues.
Redis API Command Implementation Issues
The current Redis API implementation does not fully implement all documented functionality of many commands, does not properly validate input in many cases, and in many error conditions does not return the same error messages that native Redis does.
Redis API Data Structure Issues
The current implementation is not thread-safe and displays many concurrency issues when multiple clients are operating on the same keys or data structures. (Since the Redis API and data structures do not map directly to Geode APIs and data structures, the API implementation must perform a certain amount of translation and bookkeeping, which has not been implemented in a thread-safe manner.) Additionally, the data is not stored in a manner that can provide High Availability - data is not distributed among multiple servers, so a single server failure can lose data.
Furthermore, while the current implementation stores basic string key-value pairs and HyperLogLog entries in dedicated regions, Sets, Hashes, and Lists are stored in a separate region per key - that is to say, each individual Set or Hash is contained in its own region.
While this has some benefit to performance in certain circumstances, there are several drawbacks, including:
- Significant overhead in terms of region creation and destruction
- Implementation difficulties for backups, importing and exporting of data, etc.
- Implementing transactions is impractical when the operations involve region creation and destruction
- WAN replication support becomes very difficult
Anti-Goals
Implementing the Redis clustering API (the "Sentinel" API and "Cluster" commands) is outside the scope of this proposal.
Solution
Redis API Command Implementation
Initially we propose implementing and fully testing the subset of Redis commands that are required for Spring Session Data Redis. Why this subset of commands? Session state caching is one of the most popular use cases for Redis. It is well defined and involves a limited number of Redis commands, which makes it a manageable scope of work. The following commands will be fully implemented according to the Redis specification:
Connection: AUTH, PING, QUIT
Hashes: HSET, HMSET, HGETALL
Keys: DEL, EXISTS, EXPIRE, EXPIREAT, KEYS, PERSIST, PEXPIRE, PEXPIREAT, PTTL, RENAME, TTL, TYPE
Publish/Subscribe: PUBLISH, PSUBSCRIBE, PUNSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE
Sets: SADD, SMEMBERS, SREM
Strings: APPEND, GET, SET
Redis API Region Management
The proposed update is to store Redis data - Sets, Hashes, Lists, etc. - in a single region, and use Geode functions to interact with them. Delta propagation will be used to keep data in sync between Geode servers. This avoids the overhead of region creation and destruction, and limits network traffic, while allowing data to be shared across Geode servers to promote High Availability.
Note that currently, the regions used to implement the Redis API are not “internal” regions, and are therefore visible to the Geode API (gfsh, etc.). It is proposed that the new Redis-specific region be marked as “internal” going forward.
In terms of general implementation, the region will use the String/Set/Hash/List name as the key, and the value will contain all the members of the collection object, which will implement the Geode Delta interface. This will limit the network traffic needed to keep redundant copies in sync, and will also keep the value deserialized when stored in the region; both traits should benefit performance.
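As a rough sketch of what such a Delta-capable value could look like, consider the hypothetical class below (class, field, and method names are illustrative only, not the actual implementation): it holds the full membership of a Set, tracks members added since the last distribution, and serializes only those changes in toDelta(). A real implementation would also need to handle removals and full-value serialization (e.g. DataSerializable), which are omitted here for brevity.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.geode.Delta;
import org.apache.geode.InvalidDeltaException;

// Hypothetical sketch of a Redis Set value stored in the single Redis region.
// The region key is the Redis key name; this object is the region value.
public class RedisSetValue implements Delta {
  private final Set<String> members = new HashSet<>();
  // Members added since the last delta was distributed.
  private final Set<String> pendingAdds = new HashSet<>();

  public boolean sadd(String member) {
    boolean added = members.add(member);
    if (added) {
      pendingAdds.add(member);
    }
    return added;
  }

  public Set<String> smembers() {
    return new HashSet<>(members);
  }

  @Override
  public boolean hasDelta() {
    return !pendingAdds.isEmpty();
  }

  @Override
  public void toDelta(DataOutput out) throws IOException {
    // Ship only the newly added members, not the whole collection.
    out.writeInt(pendingAdds.size());
    for (String member : pendingAdds) {
      out.writeUTF(member);
    }
    pendingAdds.clear();
  }

  @Override
  public void fromDelta(DataInput in) throws IOException, InvalidDeltaException {
    // Apply the shipped members to the copy held on this server.
    int count = in.readInt();
    for (int i = 0; i < count; i++) {
      members.add(in.readUTF());
    }
  }
}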
High Availability
Native Redis supports HA via the “Sentinel” API and Redis Cluster commands, but this requires clients to be HA-aware, placing some burden on application developers to manage redundancy and failover.
Using the Geode mechanism of functions to execute operations on the server holding the primary copy of the key, and then disseminating those changes to the other servers via delta propagation, the Redis data can be efficiently distributed across multiple Geode servers, allowing for transparent redundancy and failover. If an individual server goes down, the client can connect to a live server and access all of its data.
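A minimal sketch of how a write could be routed this way, reusing the hypothetical RedisSetValue type from the sketch above (the function name and argument shape are also illustrative): a function whose optimizeForWrite() returns true executes on the member hosting the primary bucket for the filtered key, performs the mutation there, and the subsequent put() distributes only the delta to the redundant copies.

import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;

// Hypothetical SADD function; runs on the member holding the primary bucket for the key.
public class SaddFunction implements Function<String[]> {
  @Override
  public void execute(FunctionContext<String[]> context) {
    RegionFunctionContext rfc = (RegionFunctionContext) context;
    Region<String, RedisSetValue> localData = PartitionRegionHelper.getLocalDataForContext(rfc);
    String key = (String) rfc.getFilter().iterator().next();
    String member = context.getArguments()[0];

    RedisSetValue value = localData.get(key);
    if (value == null) {
      value = new RedisSetValue();
    }
    boolean added = value.sadd(member);
    // This put distributes only the delta to the redundant copies.
    localData.put(key, value);
    context.getResultSender().lastResult(added);
  }

  @Override
  public boolean optimizeForWrite() {
    // Route execution to the primary bucket for the filtered key.
    return true;
  }
}

// Invocation from the command layer would look something like:
// FunctionService.onRegion(redisDataRegion)
//     .withFilter(java.util.Collections.singleton("mySetKey"))
//     .setArguments(new String[] {"someMember"})
//     .execute(new SaddFunction());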
To further ensure HA, the partition type of the internal Redis region will no longer default to PARTITION, and will no longer be customizable. It will be fixed to the PARTITION_REDUNDANT type, with the default redundancy of 1.
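For reference, a server-side definition of such a fixed region could look roughly like the sketch below (the region name is illustrative, and marking the region as internal would use internal mechanisms not shown here). The PARTITION_REDUNDANT shortcut already configures one redundant copy, matching the proposed redundancy of 1.

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

// Hypothetical server-side creation of the Redis data region.
// PARTITION_REDUNDANT configures a partitioned region with redundant-copies=1.
public final class RedisRegionSetup {
  public static Region<String, Object> createRedisDataRegion(Cache cache) {
    return cache.<String, Object>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT)
        .create("REDIS_DATA");
  }
}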
Note that the complexity of reconnecting to a new server can be minimized with a load balancer or DNS aliases. If a list of healthy servers is kept, clients can be directed to individual servers via the normal DNS lookup process. If the server a client is connected to fails, the client will try to reconnect to the same host, and the DNS alias will automatically direct that client to a healthy server. From the client’s perspective, it will be indistinguishable from a momentary network failure, and no special failover logic is necessary.
Changes and Additions to Public Interfaces
The commands necessary for Spring Session Data Redis will be fully implemented to support all needed features and will return the same error messages as native Redis. This does change the existing behavior of the Geode Redis API, but since the commands will correctly match native Redis behavior, this is not expected to be a problem for users.
Performance Impact
As these changes only affect the internal regions used by the Redis API, there should be no impact to the performance of the core Geode product.
Backwards Compatibility and Upgrade Path
The current subset of Redis APIs is marked as experimental, has not been worked on for 2+ years, and has errors in its implementation. Thus backwards compatibility is not a concern we are taking into consideration.
With this new data structure in place, we enable the potential for supporting rolling upgrades in the future.
Prior Art
The current Redis API implementation has enough issues that usage in production scenarios is problematic. To be useful, some updates are necessary. A few alternative approaches were investigated before presenting the option outlined above:
Multi-region
This approach retained the existing use of an individual region per Set, with some fixes to the command processing.
Performance in benchmarks was approximately the same as the function/delta propagation approach, but each Set imposed approximately 1.5 MB of overhead, even when empty, and each entry in a Set took up ~300 bytes, compared to ~50 bytes for the function/delta propagation approach. For 1,000 Sets with 1,000 members each, that works out to roughly 1.5 GB of region overhead plus ~0.3 GB of entry data; the measured memory usage was ~1.92 GB, versus 180-210 MB for the function/delta propagation approach.
Since this approach also makes implementing transactions difficult and suffers from the performance cost of creating and deleting regions (relatively expensive operations), it was deemed unsuitable.
Shared regions with composite keys
Another possible solution that was investigated was using a single region for Sets and Hashes, but storing the values using a composite key - for example, combining the Set name with the member name and using this as the key.
While this allows very fast lookups of individual values, operations on the whole Set (such as SMEMBERS) become expensive. Returning all the members of the Set, or the count of members, requires iterating across the keyspace of the entire region. Key management also becomes more complex in this approach, requiring at least hashing of keys to prevent collisions when keys are similar (e.g. "moon" and "moonbeam").
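To make the trade-off concrete, a composite-key layout might look roughly like the hypothetical sketch below (a reserved separator character stands in for the key hashing mentioned above, purely for readability): a membership check is a single region lookup, but an SMEMBERS-style operation must scan every key in the shared region.

import java.util.HashSet;
import java.util.Set;
import org.apache.geode.cache.Region;

// Hypothetical sketch of the composite-key layout: each Set member is its own
// region entry, keyed by "setName" + separator + "memberName".
public final class CompositeKeySketch {
  private static final String SEPARATOR = "\0"; // stands in for hashing; avoids "moon" vs "moonbeam" collisions

  // SISMEMBER-style check: a single region lookup.
  public static boolean isMember(Region<String, Boolean> region, String setName, String member) {
    return region.get(setName + SEPARATOR + member) != null;
  }

  // SMEMBERS-style read: must iterate the keyspace of the entire shared region.
  public static Set<String> members(Region<String, Boolean> region, String setName) {
    String prefix = setName + SEPARATOR;
    Set<String> result = new HashSet<>();
    for (String key : region.keySet()) {
      if (key.startsWith(prefix)) {
        result.add(key.substring(prefix.length()));
      }
    }
    return result;
  }
}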
Furthermore, implementing Lists or Sorted Sets in the future would become difficult with this approach.
FAQ
Answers to questions you’ve commonly been asked after requesting comments for this proposal.
Errata
What are minor adjustments that had to be made to the proposal since it was approved?
2 Comments
Anilkumar Gingade
>> The current implementation is not thread-safe and displays many concurrency issues when multiple clients are operating on the same keys or data structures.
This should be handled by the Geode system, right? The API-level implementation should not be dealing with concurrency issues...
The new proposal to address HA is by having a PARTITION_REDUNDANT region - is that not currently the case?
Also, if the region type is fixed, what is the default redundancy?
Ray Ingles
The Redis API and data structures do not map directly onto Geode APIs and data structures, so the API implementation has to do a certain amount of translation, which introduces some concurrency issues. For example, Redis's PUBLISH method returns a count of the subscribers that were notified. This requires tracking and keeping a count of the successful updates, which in a distributed system is an inherently concurrent operation.
That said, by moving to a single region, a great deal of the concurrency issues can in fact be handled by the usual Geode mechanisms. This is one of the key motivations for the change; apparently that's not been made clear in the text.
The default region type for Redis storage used to be simply PARTITION, not PARTITION_REDUNDANT. As noted here, the region type will be locked to PARTITION_REDUNDANT and won't be modifiable by the user. We'll use the default redundancy of 1.