Apache Pegasus 2.3.0 is a feature release. The change-list is summarized here: https://github.com/apache/incubator-pegasus/issues/818.
New Features
Partition split
A Pegasus table's partition count is fixed when the table is created. As the table's total storage grows, a single partition may end up holding too much data, which degrades performance. Partition split makes a Pegasus table scalable by dividing each original partition into two. See https://github.com/apache/incubator-pegasus/issues/754 for more details; the design document is at https://pegasus.apache.org/2020/02/06/partition-split-design.html.
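To illustrate why a split can divide each partition into exactly two, here is a minimal sketch. It assumes keys are placed by taking a hash modulo the partition count; the function names are hypothetical and are not Pegasus APIs. Under that assumption, doubling the count sends every key either to its old partition or to one new sibling.

```python
from typing import Tuple


def partition_index(key_hash: int, partition_count: int) -> int:
    """Map a non-negative key hash to a partition (assumed modulo placement)."""
    return key_hash % partition_count


def split_targets(old_index: int, old_count: int) -> Tuple[int, int]:
    """Return the two partitions that inherit the keys of old_index after a split."""
    return old_index, old_index + old_count


old_count = 4
new_count = old_count * 2  # a split doubles the partition count
for key_hash in range(1000):
    before = partition_index(key_hash, old_count)
    after = partition_index(key_hash, new_count)
    # Every key either stays in place or moves to the sibling partition.
    assert after in split_targets(before, old_count)
```

With this placement rule no key ever moves between unrelated partitions, which is what lets each original partition be split independently.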
User-defined compaction strategy
Pegasus supports updating value TTL at the table level, which is implemented through the RocksDB compaction filter. To make this function more flexible and broaden its use, we provide several compaction rules and compaction operations. Rules select the table data to act on, and operations define the compaction strategy applied to that data. See https://github.com/apache/incubator-pegasus/issues/773 for more details.
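The split between rules and operations can be sketched as follows. This is only an illustration: the `Record` type, the rule names (`hashkey_prefix`, `sortkey_contains`) and the operation names (`update_ttl`, `delete`) are hypothetical and do not reflect the actual Pegasus configuration format, which is described in the linked issue.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Record:
    hash_key: str
    sort_key: str
    expire_ts: int  # 0 means "never expires" in this sketch


Rule = Callable[[Record], bool]                   # selects records
Operation = Callable[[Record], Optional[Record]]  # None means "drop the record"


def hashkey_prefix(prefix: str) -> Rule:
    return lambda record: record.hash_key.startswith(prefix)


def sortkey_contains(text: str) -> Rule:
    return lambda record: text in record.sort_key


def update_ttl(expire_ts: int) -> Operation:
    return lambda record: replace(record, expire_ts=expire_ts)


def delete() -> Operation:
    return lambda record: None


def compact(records: List[Record], rules: List[Rule], operation: Operation) -> List[Record]:
    """Apply the operation to records matching all rules; keep the rest unchanged."""
    kept = []
    for record in records:
        if all(rule(record) for rule in rules):
            record = operation(record)
        if record is not None:
            kept.append(record)
    return kept


records = [
    Record("user_1", "profile", 0),
    Record("user_2", "session", 0),
    Record("order_9", "session", 0),
]
expired = compact(records, [hashkey_prefix("user_")], update_ttl(1700000000))
trimmed = compact(records, [hashkey_prefix("user_"), sortkey_contains("session")], delete())
```

Keeping rules and operations as separate, composable pieces is what allows one filtering mechanism to serve both TTL updates and deletions.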
Cluster load balance
The Pegasus meta server triggers load balancing when replica counts across replica servers are unbalanced. Until now Pegasus provided only a table-level balance strategy: if every table is balanced, the meta server considers the cluster balanced. In some cases, especially when a cluster has many replica servers and tables with small partition counts, the replicas of the cluster as a whole are still unbalanced. Cluster load balance is a new strategy that balances replicas across the whole cluster while keeping every table balanced. See https://github.com/apache/incubator-pegasus/issues/761 for more details.
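A small example shows how every table can be balanced while the cluster is not. It assumes "balanced" means that the replica counts of any two servers differ by at most one; that threshold, the node names and the table names are assumptions for illustration, not the meta server's actual criteria.

```python
from collections import Counter
from typing import Dict, List


def is_balanced(counts: Dict[str, int], nodes: List[str]) -> bool:
    """Assumed criterion: per-node replica counts differ by at most one."""
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node) <= 1


nodes = ["node1", "node2", "node3"]

# Three small tables with four replicas each; each one puts its extra replica on node1.
placement = {
    "table_a": Counter(node1=2, node2=1, node3=1),
    "table_b": Counter(node1=2, node2=1, node3=1),
    "table_c": Counter(node1=2, node2=1, node3=1),
}

cluster = Counter()
for table_counts in placement.values():
    cluster.update(table_counts)

tables_balanced = all(is_balanced(counts, nodes) for counts in placement.values())
cluster_balanced = is_balanced(cluster, nodes)
```

Here `tables_balanced` is true but `cluster_balanced` is false, because node1 holds 6 replicas while the other two hold 3 each. A cluster-level strategy has to move the per-table "extra" replicas onto different servers.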
One time backup
Pegasus used to manage table backups through policies. Users create a backup policy with settings such as start time and interval, then add tables to the policy; those tables are backed up at the policy's start time and again at every interval. Triggering an immediate backup this way is too complex, so we now provide one-time backup, and we plan to remove backup policies in a future release. See https://github.com/apache/incubator-pegasus/issues/755 for more details.
Optimizations and improvements
Backup request throttling
We found that backup requests can sometimes cause read QPS to increase rapidly, so we added a rate limiter that delays or rejects unexpected backup requests. See https://github.com/XiaoMi/rdsn/pull/855 for more details.
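As a rough sketch of how a rate limiter can either delay or reject requests, the token bucket below is one common technique. It is not taken from the linked pull request; the class, its parameters and the explicit `now` argument (used to keep the example deterministic) are assumptions.

```python
class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def _refill(self, now: float) -> None:
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)

    def try_acquire(self, now: float) -> bool:
        """Reject mode: return False when no token is available."""
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self, now: float) -> float:
        """Delay mode: seconds the caller should wait before retrying."""
        self._refill(now)
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.rate


bucket = TokenBucket(rate=2, burst=2)
first = bucket.try_acquire(now=0.0)   # admitted
second = bucket.try_acquire(now=0.0)  # admitted, bucket is now empty
third = bucket.try_acquire(now=0.0)   # rejected
delay = bucket.wait_time(now=0.0)     # time until the next token arrives
later = bucket.try_acquire(now=0.5)   # admitted after waiting
```

A caller that wants to delay rather than reject can sleep for `wait_time` and then call `try_acquire` again.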
Drop timed-out user requests
In the current implementation, Pegasus puts requests received from clients into a queue, and a request is not executed until all requests ahead of it have been executed. Each request carries a timeout option. If a request's time in the queue exceeds its client timeout, there is no need to execute it, because the client has already treated it as timed out. This release adds an option to drop such requests, along with perf-counters for it. See https://github.com/apache/incubator-pegasus/issues/786 for more details. In our production cluster, enabling this option reduced thread queue length, long-tail requests, and the likelihood of rapid memory growth.
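The idea can be sketched with a simulated queue. The `Request` fields, the fixed per-request cost and the `drain` function are hypothetical and only model the behavior described above; they are not Pegasus internals.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Request:
    name: str
    enqueue_ms: int  # when the request entered the queue
    timeout_ms: int  # client-side timeout carried by the request


def drain(queue: Deque[Request], now_ms: int, cost_ms: int) -> Tuple[List[str], List[str]]:
    """Process the queue in order, dropping requests that already timed out.

    Each executed request advances the simulated clock by cost_ms.
    """
    executed, dropped = [], []
    while queue:
        request = queue.popleft()
        if now_ms - request.enqueue_ms > request.timeout_ms:
            dropped.append(request.name)  # the client has given up; skip the work
        else:
            executed.append(request.name)
            now_ms += cost_ms
    return executed, dropped


queue = deque([
    Request("a", enqueue_ms=0, timeout_ms=100),
    Request("b", enqueue_ms=0, timeout_ms=100),
    Request("c", enqueue_ms=0, timeout_ms=100),
])
executed, dropped = drain(queue, now_ms=0, cost_ms=60)
```

Request `c` has waited 120 ms by the time it reaches the head of the queue, so it is dropped instead of consuming another 60 ms of worker time.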
Disk related protection
When all disk space on a replica server is used up, every write request fails and the server instance also generates a core dump. To avoid this situation, we provide an option to configure a disk space threshold: if available disk space falls below that value, all user write requests are rejected. In addition, a replica server now checks for broken disks at startup and supports adding a new disk path dynamically. See https://github.com/apache/incubator-pegasus/issues/787 and https://github.com/apache/incubator-pegasus/issues/788 for more details.
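A threshold check of this kind might look like the sketch below. Expressing the threshold as a percentage of total space, and the names `should_reject_writes` and `min_available_percent`, are assumptions for illustration; the real option name and unit are given in the linked issues.

```python
import shutil


def should_reject_writes(available_bytes: int, total_bytes: int,
                         min_available_percent: float) -> bool:
    """Return True when available space is below the configured threshold."""
    if total_bytes <= 0:
        return True  # treat an unreadable disk as unsafe for writes
    available_percent = available_bytes * 100 / total_bytes
    return available_percent < min_available_percent


# Check the disk that holds the current directory against a 10% threshold.
usage = shutil.disk_usage(".")
reject = should_reject_writes(usage.free, usage.total, min_available_percent=10)
```

Rejecting writes slightly before the disk is full leaves room for compaction and logging, so the process can keep serving reads instead of crashing.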
Read enhancement
In the previous implementation, all read operations were executed in one thread pool. We now run single reads and range reads in separate thread pools, so that slow range reads can no longer block all read requests. We also improved the range read iteration count. See https://github.com/apache/incubator-pegasus/pull/782 and https://github.com/apache/incubator-pegasus/issues/808 for more details.
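The separation can be sketched with two executors. The pool sizes, the request kinds (`"get"`, `"scan"`) and the function names are assumptions chosen for the example, not the thread pool names Pegasus uses.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Separate pools: a burst of slow range reads cannot occupy the single-read workers.
single_read_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="single-read")
range_read_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="range-read")


def handle_read(kind: str) -> str:
    """Stand-in for the real read handler; reports which worker thread ran it."""
    return threading.current_thread().name


def submit_read(kind: str) -> Future:
    """Route range reads ("scan") and single reads ("get") to their own pools."""
    pool = range_read_pool if kind == "scan" else single_read_pool
    return pool.submit(handle_read, kind)


get_thread = submit_read("get").result()
scan_thread = submit_read("scan").result()
```

Because each pool has its own queue and workers, a backlog of scans delays only other scans.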
New perf-counters
This release also adds many new counters for managing and monitoring the system, such as table-level compaction counters, table-level RocksDB read/write amplification and hit rate counters, and replica server session counters. See https://github.com/apache/incubator-pegasus/issues/818 for all counters.
See https://github.com/apache/incubator-pegasus/issues/818 for all enhancements.
Fixed issues
Duplication related bug fix
Duplication shipped in earlier releases and works well in most normal cases, but it still had bugs in some corner cases, especially in error handling. This release fixes several duplication-related bugs to make the function more robust.
Graceful exit related bug fix
Release 2.2.0 has a known issue: shutting down a Pegasus instance that had already connected to HDFS to execute bulk load or backup was very likely to core dump. This release fixes the related bugs.
ASan bug fix
This release fixes many bugs detected by ASan, including potential memory leaks.
See https://github.com/apache/incubator-pegasus/issues/818 for the full list of fixes.
Performance
We used YCSB to obtain the following performance test results. The test cluster had two meta servers, five replica servers and one collector server. The target table had 64 partitions.
| Read:Write | Clients * Threads | Operation | QPS | Avg Latency (us) | P99 Latency (us) | P999 Latency (us) |
| --- | --- | --- | --- | --- | --- | --- |
| 0:1 | 3*15 | read | -- | -- | -- | -- |
| 0:1 | 3*15 | write | 42386 | 1060 | 6628 | 20007 |
| 1:0 | 3*50 | read | 331623 | 585 | 2611 | 16413 |
| 1:0 | 3*50 | write | -- | -- | -- | -- |
| 1:1 | 3*30 | read | 38766 | 1067 | 15521 | 72340 |
| 1:1 | 3*30 | write | 38774 | 1246 | 7791 | 24815 |
| 1:3 | 3*15 | read | 13140 | 819 | 11460 | 58068 |
| 1:3 | 3*15 | write | 39428 | 863 | 4884 | 13793 |
| 1:30 | 3*15 | read | 1552 | 937 | 9524 | 118889 |
| 1:30 | 3*15 | write | 46570 | 930 | 5315 | 15057 |
| 3:1 | 3*30 | read | 93746 | 623 | 6389 | 29988 |
| 3:1 | 3*30 | write | 31246 | 996 | 5543 | 17409 |
| 30:1 | 3*50 | read | 254534 | 560 | 2627 | 17213 |
| 30:1 | 3*50 | write | 8481 | 901 | 3269 | 17968 |
Known issues
Use the option that drops timed-out user requests with care. After enabling it in our heavy-throughput cluster, we found that the replica server occasionally generates a core dump. The same core dump also occurred in previous versions and is recorded in https://github.com/apache/incubator-pegasus/issues/387. So far there is no evidence connecting the core dump to this function, but we note it here based on our observation.
Upgrading Notes
2.3.0 can only be upgraded from 2.x. Servers running 1.x must first upgrade to 2.0.x before upgrading to 2.3.0.
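The upgrade rule can be written down as a small helper, for example in a deployment script. The function name and the version-string format are assumptions; the rule itself is the one stated above.

```python
from typing import List


def upgrade_steps(current_version: str) -> List[str]:
    """Return the releases to install, in order, to reach 2.3.0."""
    parts = tuple(int(part) for part in current_version.split("."))
    if parts >= (2, 3, 0):
        return []                  # already on 2.3.0 or later
    if parts[0] == 2:
        return ["2.3.0"]           # any 2.x can upgrade directly
    if parts[0] == 1:
        return ["2.0.x", "2.3.0"]  # 1.x must go through 2.0.x first
    raise ValueError("unsupported version: " + current_version)


steps_from_1x = upgrade_steps("1.12.3")
steps_from_2x = upgrade_steps("2.1.0")
```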