Apache Pegasus 2.3.0 is a feature release. The full change list is summarized at https://github.com/apache/incubator-pegasus/issues/818.

New Features

Partition split

A Pegasus table's partition count is fixed when the table is created. As the table's total storage grows, a single partition may come to hold too much data, degrading performance. Partition split adds scalability to a Pegasus table: each original partition is divided into two. Please check https://github.com/apache/incubator-pegasus/issues/754 for more details; the design document can be found at https://pegasus.apache.org/2020/02/06/partition-split-design.html.
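
The split follows the hash-based doubling scheme described in the design document: a key lives in partition hash(key) % partition_count, so when the count grows from N to 2*N, a key in partition i can only move to i or i + N, and each old partition splits cleanly into exactly two new ones. A minimal Python sketch of that invariant, using a stand-in hash function (Pegasus uses its own hash of the hash key):

    import hashlib

    def partition_index(key: bytes, partition_count: int) -> int:
        # Stand-in hash for illustration only.
        h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
        return h % partition_count

    N = 8
    for key in (b"user1", b"user2", b"user3"):
        before = partition_index(key, N)
        after = partition_index(key, 2 * N)
        # The invariant behind partition split: a key stays in partition i
        # or moves to its split child i + N, never anywhere else.
        assert after in (before, before + N)
        print(key, before, "->", after)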

User-defined compaction strategy

Pegasus supports updating value TTL at the table level, a function implemented with the RocksDB compaction filter. To make this function more flexible and expand its usage, we provide several compaction rules and compaction operations: users select table data through rules and apply a compaction strategy through operations. Please check https://github.com/apache/incubator-pegasus/issues/773 for more details.
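
As a rough illustration of the rule/operation model (the class and field names below are hypothetical, not the Pegasus API): a rule decides which records a strategy applies to, and an operation rewrites or drops the matching records as the compaction filter visits them.

    from dataclasses import dataclass

    @dataclass
    class HashkeyPrefixRule:
        # Hypothetical rule: match records whose hash key has a given prefix.
        prefix: bytes
        def match(self, hash_key):
            return hash_key.startswith(self.prefix)

    @dataclass
    class UpdateTTLOperation:
        # Hypothetical operation: rewrite the record's TTL during compaction.
        new_ttl_seconds: int
        def apply(self, record):
            record["ttl"] = self.new_ttl_seconds
            return record  # returning None would mean "drop the record"

    def compaction_filter(record, rules, operation):
        # Invoked per record, mirroring how a RocksDB compaction filter works:
        # only records matching every rule are handed to the operation.
        if all(r.match(record["hash_key"]) for r in rules):
            return operation.apply(record)
        return record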

Cluster load balance

The Pegasus meta server triggers load balance when the replica counts on replica servers are not balanced. Previously, Pegasus only provided a table-level balance strategy: if every table is balanced, the meta server considers the cluster balanced. In some cases, especially when a cluster has many replica servers and many tables with small partition counts, the cluster's replicas can still be unbalanced overall. Cluster load balance is a new strategy that balances replicas across the whole cluster while keeping each table's replicas balanced. Please check https://github.com/apache/incubator-pegasus/issues/761 for more details.
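
A small made-up example of why table-level balance is not sufficient: below, each table is balanced on its own (per-node counts differ by at most one), yet node C carries twice as many replicas as node A, which is exactly the skew cluster load balance removes.

    placement = {
        "table1": {"A": 1, "B": 1, "C": 2},  # 4 partitions over 3 nodes: balanced
        "table2": {"A": 1, "B": 1, "C": 2},  # also balanced, skewed the same way
    }

    totals = {}
    for table in placement.values():
        for node, n in table.items():
            totals[node] = totals.get(node, 0) + n
    print(totals)  # {'A': 2, 'B': 2, 'C': 4} -> cluster-level imbalance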

One-time backup

Pegasus used to manage table backup through policies: users create a backup policy with information such as start time and interval, then add tables to the policy; those tables are backed up at the policy's start time and re-backed up at each interval. This makes it cumbersome to trigger a backup immediately, so we now provide one-time backup, and we plan to remove the backup policy in future releases. Please check https://github.com/apache/incubator-pegasus/issues/755 for more details.

Optimizations and improvements

Backup request throttling

We found that backup requests could sometimes cause read QPS to rise rapidly, so we added a rate limiter to delay or reject the excess requests. Please check https://github.com/XiaoMi/rdsn/pull/855 for more details.
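
A minimal token-bucket sketch of the idea, not the actual rDSN implementation (the rate and burst numbers are illustrative): backup requests beyond the configured rate are delayed or rejected instead of being allowed to spike read QPS.

    import time

    class TokenBucket:
        def __init__(self, rate, burst):
            self.rate, self.burst = rate, burst
            self.tokens, self.last = burst, time.monotonic()

        def try_acquire(self):
            # Refill tokens according to elapsed time, capped at the burst size.
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller delays or rejects this backup request

    limiter = TokenBucket(rate=1000, burst=100)  # illustrative numbers
    if not limiter.try_acquire():
        print("delay or reject this backup request")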

Drop timed-out user requests

In the current implementation, Pegasus puts each request received from a client into a queue; a request is executed only after all requests ahead of it have been executed. Each request carries a timeout option: if its queueing time exceeds the client timeout, there is no need to execute it, because the client has already treated it as timed out. In this release, we support dropping such requests and add perf-counters for it. Please check https://github.com/apache/incubator-pegasus/issues/786 for more details. In our production cluster, we observed that enabling this option can reduce thread queue length, long-tail requests, and the likelihood of rapid memory growth.
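
A sketch of the queueing policy in Python (illustrative, not the server code): a request whose queueing time already exceeds its client timeout is skipped rather than executed, since the client has given up on it anyway.

    import time
    from collections import deque

    queue = deque()  # items: (enqueue_time, client_timeout_seconds, handler)

    def dispatch():
        dropped = 0
        while queue:
            enqueue_time, timeout, handler = queue.popleft()
            if time.monotonic() - enqueue_time > timeout:
                dropped += 1  # would bump the new perf-counter
                continue      # skip: the client already treated it as timed out
            handler()
        return dropped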

Disk-related protection

When a replica server's disks are completely full, all write requests fail, and the server instance may even generate a coredump. To avoid this, we provide an option to configure a disk space threshold: if the available disk space falls below this value, all user write requests are rejected. In addition, we check for broken disks when starting a replica server and support adding a new disk path dynamically. Please check https://github.com/apache/incubator-pegasus/issues/787 and https://github.com/apache/incubator-pegasus/issues/788 for more details.
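
A sketch of the disk-space guard, assuming a hypothetical threshold of 5% free space (in Pegasus the threshold is a configurable option):

    import shutil

    MIN_AVAILABLE_RATIO = 0.05  # illustrative threshold: 5% free

    def allow_write(data_dir):
        # Reject user writes once available space drops below the threshold,
        # rather than letting the disk fill up and the process crash.
        usage = shutil.disk_usage(data_dir)
        return usage.free / usage.total >= MIN_AVAILABLE_RATIO

    if not allow_write("/data/pegasus"):
        print("reject user write: available disk space below threshold")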

Read enhancement

In the previous implementation, all read operations were executed in a single thread pool. We now separate single reads and range reads into different thread pools, so that a slow range read can no longer block all other read requests. We also improved how the iteration count is handled for range reads. Please check https://github.com/apache/incubator-pegasus/pull/782 and https://github.com/apache/incubator-pegasus/issues/808 for more details.
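
A sketch of the separation, with hypothetical pool names and sizes: point reads and scans go to different executors, so an expensive scan queues only behind other scans.

    from concurrent.futures import ThreadPoolExecutor

    single_read_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="get")
    range_read_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="scan")

    def submit_read(request):
        # Route by request type so slow scans cannot starve cheap point reads.
        pool = range_read_pool if request["type"] == "scan" else single_read_pool
        return pool.submit(request["handler"])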

New perf-counters

In this release, we also added many new perf-counters to help manage and monitor the system, such as table-level compaction counters, table-level RocksDB read/write amplification and hit-rate counters, and replica server session counters. Please refer to https://github.com/apache/incubator-pegasus/issues/818 for all counters.

Please refer to https://github.com/apache/incubator-pegasus/issues/818 for all enhancements.

Fixed issues

Duplication-related bug fixes

Duplication was released in previous versions and works well in most normal cases, but in some corner cases, especially around error handling, it still had bugs. In this release, we fix several duplication-related bugs to make the function more robust.

Graceful exit bug fixes

Release 2.2.0 has a known issue: shutting down a Pegasus instance that had connected to HDFS to execute bulk load or backup was highly likely to produce a coredump. In this release, we fix the related bugs.

ASan bug fixes

In this release, we fix many bugs detected by ASan (AddressSanitizer), including potential memory leaks.

Please refer to https://github.com/apache/incubator-pegasus/issues/818 for the full list of fixes.

Performance

We used YCSB to obtain the following performance results. The test cluster consists of two meta servers, five replica servers, and one collector server; the target table has 64 partitions. In the table below, Read:Write is the read/write ratio and Client*Thread is the number of client processes times the threads per client.

| Read:Write | Client*Thread | Op    | QPS    | Avg Latency (us) | P99 Latency (us) | P999 Latency (us) |
|------------|---------------|-------|--------|------------------|------------------|-------------------|
| 0:1        | 3*15          | read  | --     | --               | --               | --                |
| 0:1        | 3*15          | write | 42386  | 1060             | 6628             | 20007             |
| 1:0        | 3*50          | read  | 331623 | 585              | 2611             | 16413             |
| 1:0        | 3*50          | write | --     | --               | --               | --                |
| 1:1        | 3*30          | read  | 38766  | 1067             | 15521            | 72340             |
| 1:1        | 3*30          | write | 38774  | 1246             | 7791             | 24815             |
| 1:3        | 3*15          | read  | 13140  | 819              | 11460            | 58068             |
| 1:3        | 3*15          | write | 39428  | 863              | 4884             | 13793             |
| 1:30       | 3*15          | read  | 1552   | 937              | 9524             | 118889            |
| 1:30       | 3*15          | write | 46570  | 930              | 5315             | 15057             |
| 3:1        | 3*30          | read  | 93746  | 623              | 6389             | 29988             |
| 3:1        | 3*30          | write | 31246  | 996              | 5543             | 17409             |
| 30:1       | 3*50          | read  | 254534 | 560              | 2627             | 17213             |
| 30:1       | 3*50          | write | 8481   | 901              | 3269             | 17968             |


Known issues

Use the drop-timed-out-user-requests option with care. In our heavy-throughput cluster, after enabling this option, we found that the replica server might occasionally generate a coredump. The same coredump had also occurred in previous versions and is recorded in https://github.com/apache/incubator-pegasus/issues/387. There is currently no evidence connecting the coredump to this feature, but we note it here based on our observations.

Upgrading Notes

2.3.0 can only be upgraded from 2.x. Servers running 1.x should first upgrade to 2.0.x before upgrading to 2.3.0.
