Apache Pegasus 2.4.0 Release Note

Apache Pegasus 2.4.0 is a feature release. The change list is summarized here: https://github.com/apache/incubator-pegasus/issues/1032

New module

Apache Pegasus has many ecosystem projects. In the past, they were maintained in separate repositories. Starting from this version, they will be gradually donated to the official Apache Pegasus repository. So far, the following projects have been donated to Apache Pegasus:

  • rDSN: in the past, rDSN was linked to Apache Pegasus as a Git submodule. It has now officially become the core module of Apache Pegasus.
  • PegasusClient: Pegasus supports multiple clients. The following client projects are being donated to Apache Pegasus: Pegasus-Java-Client, Pegasus-Scala-Client, Pegasus-Golang-Client, Pegasus-Python-Client, and Pegasus-NodeJS-Client.
  • PegasusDocker: Pegasus supports building with Docker. In the current version, the project provides sample Dockerfiles for various build environments and uses GitHub Actions to build the corresponding Docker images and upload them to Docker Hub.
  • PegasusShell: the Pegasus shell tools were previously built in C++. In the latest version, we have built new shell tools in Go: AdminCli for administrators and Pegic for users.

New architecture

In testing, we found that the shared-log engine, which uses a single queue, causes a throughput bottleneck. Because SSDs handle concurrent random writes well, we removed the sequentially written shared-log and kept only the private-log as the WAL. In our tests, this improves performance by about 15-20%.

New feature

Dynamic replica count update

In the past, once a table was created, its replica count could not be changed. The new version supports changing a table's replica count dynamically. Users can increase or decrease the replica count of a serving table, and the change is transparent to foreground traffic.

New batchGet interface

The old `batchGet` interface was only a thin wrapper around the `get` interface and had no real batching capability. The new interface optimizes batch operations: it aggregates multiple requests according to the partition-hash rules, then sends the aggregated requests for each partition to the corresponding server node atomically. This improves the throughput of online requests.
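The aggregation step can be illustrated with a minimal sketch. This is not the Pegasus client code: the function names `partition_index` and `group_by_partition`, the use of CRC32 as the hash, and the partition count of 8 are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict


def partition_index(hash_key: bytes, partition_count: int) -> int:
    # Stand-in hash for illustration only; the real client may hash differently.
    return zlib.crc32(hash_key) % partition_count


def group_by_partition(hash_keys, partition_count):
    """Group keys so that each partition receives one aggregated request."""
    groups = defaultdict(list)
    for key in hash_keys:
        groups[partition_index(key, partition_count)].append(key)
    return dict(groups)


keys = [f"user_{i}".encode() for i in range(20)]
batches = group_by_partition(keys, partition_count=8)
for index, batch in sorted(batches.items()):
    print(f"partition {index}: {len(batch)} keys in one request")
```

With grouping like this, the number of requests sent is bounded by the number of partitions touched rather than by the number of keys.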

Client request limiter

Burst requests from clients can pile up in the task queue. To avoid this, we added a queue-controller that limits task traffic in extreme scenarios.

In the past, Pegasus only throttled write traffic. The new version also supports throttling read traffic, which improves cluster stability in emergencies.
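The idea of bounding a task queue can be sketched as follows. The class `QueueController` and its `max_pending` parameter are hypothetical names for this illustration and do not reflect the actual server implementation or its configuration options.

```python
from collections import deque


class QueueController:
    """Rejects new tasks once the pending queue reaches a fixed limit."""

    def __init__(self, max_pending: int):
        self.max_pending = max_pending
        self.pending = deque()
        self.rejected = 0

    def submit(self, task) -> bool:
        # Refuse the task instead of letting the queue grow without bound.
        if len(self.pending) >= self.max_pending:
            self.rejected += 1
            return False
        self.pending.append(task)
        return True

    def run_one(self):
        # Execute the oldest pending task, if any, and return its result.
        if self.pending:
            return self.pending.popleft()()
        return None


controller = QueueController(max_pending=3)
accepted = [controller.submit(lambda i=i: i * 2) for i in range(5)]
print(accepted, controller.rejected)
```

Rejecting early keeps latency bounded for the requests that are admitted, at the cost of failing some requests during a burst.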

Jemalloc memory management

Jemalloc is an excellent memory management library. In the past, we only used tcmalloc for memory management. The new version also supports jemalloc. For detailed benchmark results, see Jemalloc Performance.

Multi-architecture support

We have added support for macOS and aarch64 systems, which improves Pegasus' cross-platform capability.

Client Feature

The Java client adds table creation and deletion interfaces, and supports `batchGetbypartition` to match the batchGet interface of the server.

The Go client adapts to server-side RPC interfaces such as bulkload, compact, and disk-add.

AdminCli supports commands such as node-migrator, node-capacity-balance, table-migrator, and table-partition-split, among other functions.

Feature enhancement

Bulkload

Bulkload includes many performance optimizations, such as using direct-io for data download, fixing duplicate-check, and optimizing the ingest-task strategy, to reduce the impact of IO load on request latency during bulkload.

Bulkload supports concurrent tasks for multiple tables. Note, however, that because of the low-level rate limit, concurrency only lets multiple tables queue their tasks; it does not improve overall task efficiency under the same rate.
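A small worked example shows why concurrency does not shorten the total time under one shared rate limit. The model is an assumption made for illustration: a single limit in MB/s split equally among active tables, with hypothetical table sizes.

```python
def sequential_time(sizes_mb, limit_mb_per_s):
    """Total time when tables are loaded one after another at the full limit."""
    return sum(size / limit_mb_per_s for size in sizes_mb)


def concurrent_time(sizes_mb, limit_mb_per_s):
    """Total time when all tables start together and share the limit equally."""
    elapsed = 0.0
    transferred_each = 0.0  # MB already transferred by every still-active table
    active = len(sizes_mb)
    for size in sorted(sizes_mb):
        share = limit_mb_per_s / active
        elapsed += (size - transferred_each) / share
        transferred_each = size
        active -= 1
    return elapsed


sizes = [100, 200, 300]  # hypothetical table sizes in MB
limit = 50               # hypothetical shared limit in MB/s
print(sequential_time(sizes, limit), concurrent_time(sizes, limit))
```

In this model both schedules finish at essentially the same time, because the total bytes and the shared limit are unchanged; only the order in which individual tables complete differs.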

Duplication

Duplication removes the dependence on remote file systems and supports checkpoint file transmission between clusters

Duplication supports batch-sending of log files to improve the synchronization efficiency of incremental data

With the new duplication, when a user creates a task, the server first copies the checkpoint files across clusters and then automatically synchronizes the incremental logs, which greatly simplifies the previous process

Other important changes

PerfCounter: In the monitoring system, we resolved CPU cache performance problems caused by false sharing, and rebuilt the monitoring-point system

ManualCompaction: We have added a control interface for ManualCompaction so that users can easily trigger a ManualCompaction task and query its progress in real time

NFS in Learn: NFS is the module for checkpoint transfer between nodes. In the past, checkpoint transfer could affect the system. The new version provides fine-grained, disk-level rate control to reduce that impact.
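Disk-level rate control can be sketched with one token bucket per disk. The classes `TokenBucket` and `DiskRateLimiter`, the disk names, and the rate values are hypothetical; the actual limiter in the server may work differently. Time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Allows up to `rate` bytes per second with a burst of `capacity` bytes."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def try_consume(self, nbytes: float, now: float) -> bool:
        # Refill according to the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


class DiskRateLimiter:
    """Keeps an independent bucket per disk so one busy disk does not starve others."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, disk: str, nbytes: float, now: float) -> bool:
        if disk not in self.buckets:
            self.buckets[disk] = TokenBucket(self.rate, self.capacity)
        return self.buckets[disk].try_consume(nbytes, now)


limiter = DiskRateLimiter(rate=100, capacity=100)
results = [
    limiter.allow("ssd1", 80, now=0.0),  # fits within the initial burst
    limiter.allow("ssd1", 80, now=0.0),  # same disk, bucket nearly empty
    limiter.allow("ssd2", 80, now=0.0),  # a different disk has its own bucket
    limiter.allow("ssd1", 80, now=1.0),  # refilled after one second
]
print(results)
```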

Link-tracking: The new link tracker supports uploading data to monitoring systems so that link latency statistics can be collected

Environment Variables: We changed the `deny_write` environment variable. It can now also enable read-deny at the same time and return different response information to the client

Cold backup: backup speed affects request latency. The new version provides dynamic configuration of the HDFS upload speed during backup

RocksDB log size limit: RocksDB logs sometimes take up too much space; the new version limits their size

MetaServer: supports host domain name configuration

Bug fix

In the latest version, we focused on fixing the following problems:

Server

  • Node crash caused by ASIO's thread safety problem
  • IO amplification caused by improper handling of RPC body
  • Data overflow caused by unreasonable type declaration in AIO module
  • Unexpected error when replica is closed

Client

  • The batchMultiGet interface of the Java client could not retrieve complete data
  • The Go client could not access the server when the server enabled the request-drop configuration
  • The Go client could not recover after encountering errors such as `ERR_INVALID_STATE`

Performance

For this benchmark we used new machines. To make the comparison fair, we also re-ran Pegasus Server 2.3 on them:

  • Machine parameters: DDR4 16G * 8 | Intel Silver4210*2 2.20GHz/3.20GHz | SSD 480G * 8 SATA
  • Cluster Server: 3 * MetaServerNode 5 * ReplicaServerNode
  • YCSB Client: 3 * ClientNode
  • Request Length: 1KB(set/get)

Pegasus Server 2.3

| Case | Clients and threads | R:W | R-QPS | R-Avg | R-P99 | W-QPS | W-Avg | W-P99 |
|---|---|---|---|---|---|---|---|---|
| Write Only | 3 clients * 15 threads | 0:1 | - | - | - | 48805 | 919 | 2124 |
| Read Only | 3 clients * 50 threads | 1:0 | 370068 | 402 | 988 | - | - | - |
| Read Write | 3 clients * 30 threads | 1:1 | 50762 | 532 | 5859 | 50759 | 1233 | 4162 |
| Read Write | 3 clients * 15 threads | 1:3 | 14471 | 443 | 3869 | 43425 | 884 | 1899 |
| Read Write | 3 clients * 15 threads | 1:30 | 1583 | 473 | 3432 | 47551 | 928 | 2066 |
| Read Write | 3 clients * 30 threads | 3:1 | 119093 | 406 | 3367 | 39693 | 1035 | 2581 |
| Read Write | 3 clients * 50 threads | 30:1 | 322904 | 435 | 1034 | 10762 | 882 | 1392 |

Pegasus Server 2.4

| Case | Clients and threads | R:W | R-QPS | R-Avg | R-P99 | W-QPS | W-Avg | W-P99 |
|---|---|---|---|---|---|---|---|---|
| Write Only | 3 clients * 15 threads | 0:1 | - | - | - | 56953 | 787 | 1786 |
| Read Only | 3 clients * 50 threads | 1:0 | 360642 | 413 | 984 | - | - | - |
| Read Write | 3 clients * 30 threads | 1:1 | 62572 | 464 | 5274 | 62561 | 985 | 3764 |
| Read Write | 3 clients * 15 threads | 1:3 | 16844 | 372 | 3980 | 50527 | 762 | 1551 |
| Read Write | 3 clients * 15 threads | 1:30 | 1861 | 381 | 3557 | 55816 | 790 | 1688 |
| Read Write | 3 clients * 30 threads | 3:1 | 140484 | 351 | 3277 | 46822 | 856 | 2044 |
| Read Write | 3 clients * 50 threads | 30:1 | 336106 | 419 | 1221 | 11203 | 763 | 1276 |
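
The relative change between the two runs can be computed directly from the QPS figures listed above. The snippet below is only a convenience for reading the results; the dictionary and function names are made up for this example, and the numbers are copied from the 2.3 and 2.4 results.

```python
# QPS per R:W ratio as (read QPS, write QPS); None marks a "-" cell.
QPS_2_3 = {
    "0:1": (None, 48805),
    "1:0": (370068, None),
    "1:1": (50762, 50759),
    "1:3": (14471, 43425),
    "1:30": (1583, 47551),
    "3:1": (119093, 39693),
    "30:1": (322904, 10762),
}
QPS_2_4 = {
    "0:1": (None, 56953),
    "1:0": (360642, None),
    "1:1": (62572, 62561),
    "1:3": (16844, 50527),
    "1:30": (1861, 55816),
    "3:1": (140484, 46822),
    "30:1": (336106, 11203),
}


def pct_change(old, new):
    """Relative change from old to new in percent; None if a value is missing."""
    if old is None or new is None:
        return None
    return (new - old) / old * 100.0


def fmt(value):
    return "-" if value is None else f"{value:+.1f}%"


for ratio, (read_old, write_old) in QPS_2_3.items():
    read_new, write_new = QPS_2_4[ratio]
    read_delta = fmt(pct_change(read_old, read_new))
    write_delta = fmt(pct_change(write_old, write_new))
    print(f"R:W {ratio:>4}  read {read_delta:>7}  write {write_delta:>7}")
```

For the write-only case this gives an increase of about 16.7% in write QPS, while the read-only case is slightly lower in 2.4 than in 2.3.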


Known issues

We have upgraded the ZK client to version 3.7. If the server's ZK version is older than this, connections may time out.

When periodic manual-compaction tasks are configured through environment variables, a calculation error may cause a task to start immediately.

