Motivation

For now, Ignite has no built-in profiling tool for user operations and internal processes. Such a tool would be able to collect performance statistics and create a human-readable report. It would help users analyze the workload and tune the configuration and applications.

Examples of similar tools in other products: AWR [1] [2] [3] (Oracle); pgbadger [4], pgmetrics [5], powa [6] (PostgreSQL).

Description

We should provide a way to run cluster profiling. Consider the following scenario:

  • Enable profiling mode.
  • Execute some arbitrary workload.
  • Collect the profiling info.
  • Run the tool that builds the report containing workload statistics.

Performance report

The performance report will be in a human-readable format (html page) and should contain:

  • Ignite and plugins versions, topology changes, profiling start/end time
  • Queries (SQL, scan, ..) timings, resources:
    • Queries that took up the most time
    • Slowest queries
    • Most frequent queries
    • Failing queries
    • Query counts by type
    • Queries that took up the most CPU/IO/Disk
  • User tasks timings, resources
    • Jobs of slowest tasks
  • Caches and cache operations statistics:
    • Get/Put/Remove
    • Transactions
    • Invoke
    • Lock
    • Create/destroy caches
  • Workload by nodes
    • CPU/IO/Disk resources
  • Checkpoints statistics
  • WAL statistics
  • PME statistics
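To illustrate the kind of aggregation the report builder would perform for views such as "slowest queries" and "most frequent queries", here is a minimal Java sketch. The record shape and class names are hypothetical, not the actual statistics format:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical per-query record; the real statistics format is richer. */
record QueryStat(String text, long durationMs) {}

class ReportAggregation {
    /** Queries with the largest total execution time ("took up the most time"). */
    static List<String> topByTotalTime(List<QueryStat> stats, int n) {
        return stats.stream()
            .collect(Collectors.groupingBy(QueryStat::text,
                Collectors.summingLong(QueryStat::durationMs)))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    /** Most frequently executed queries. */
    static List<String> mostFrequent(List<QueryStat> stats, int n) {
        return stats.stream()
            .collect(Collectors.groupingBy(QueryStat::text, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The same grouping approach extends naturally to the other views (failing queries, counts by type, per-node resources), with different keys and accumulators.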

Additional investigation is required to gather the following statistics:

  • Query parse time
  • Lock waiting time
  • User time
  • Message processing timings

Proposed Changes

Ignite will log additional internal performance statistics to profiling files. The logging format is similar to WAL logging.

A single disk-writer thread and an off-heap memory buffer will be used to minimize the effect on performance. The maximum file size and the buffer size can be configured on start.
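A minimal sketch of the single-writer mechanics described above, assuming a simplified line-based record format. The class and its API are illustrative only, not the actual Ignite implementation: user threads enqueue records and never touch the disk, while one background thread stages records in an off-heap (direct) buffer and writes them to the statistics file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative single-writer statistics logger, not the actual implementation. */
class StatWriterSketch implements AutoCloseable {
    private final FileChannel ch;
    private final BlockingQueue<byte[]> q = new ArrayBlockingQueue<>(1024);
    private final Thread writer;
    private volatile boolean stop;

    StatWriterSketch(Path file) throws IOException {
        ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);

        writer = new Thread(() -> {
            // Off-heap staging buffer, reused for every flush.
            // For brevity, records larger than the buffer are not handled.
            ByteBuffer buf = ByteBuffer.allocateDirect(4096);

            try {
                while (!stop || !q.isEmpty()) {
                    byte[] rec = q.poll(10, TimeUnit.MILLISECONDS);

                    if (rec == null)
                        continue;

                    buf.clear();
                    buf.put(rec);
                    buf.flip();

                    while (buf.hasRemaining())
                        ch.write(buf);
                }
            }
            catch (IOException | InterruptedException ignored) {
                // Sketch only: real code would report the error.
            }
        }, "stat-writer");

        writer.start();
    }

    /** Called from user operations; only enqueues, never blocks on disk I/O. */
    void record(String rec) {
        if (!stop)
            q.offer((rec + "\n").getBytes(StandardCharsets.UTF_8)); // Drop if full.
    }

    @Override public void close() throws Exception {
        stop = true;
        writer.join(); // Drains the remaining records before exiting.
        ch.close();
    }
}
```

A real implementation would serialize records directly into the off-heap buffer and flush in batches (see the flush-size property below), rather than going through an intermediate queue.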

A new extension module, performance-statistics-ext, will be introduced. It will contain the tool to build the report: build-report.sh(bat). The JSON format is used to store aggregated statistics, which are then rendered in the report.

The report is based on the Bootstrap library and can be viewed in a browser offline.

Management

1) JMX: 

PerformanceStatisticsMBean
  • void start() // Start collecting performance statistics in the cluster.
  • void stop() // Stop collecting performance statistics in the cluster.
  • boolean enabled() // True if collecting performance statistics is enabled.
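The lifecycle above can be sketched with a toy implementation of the proposed MBean interface. The implementation below only flips a local flag; the real bean would broadcast start/stop across the cluster:

```java
/** The management interface as proposed above. */
interface PerformanceStatisticsMBean {
    void start();      // Start collecting performance statistics in the cluster.
    void stop();       // Stop collecting performance statistics in the cluster.
    boolean enabled(); // True if collecting performance statistics is enabled.
}

/** Toy implementation: flips a local flag only; the real bean is cluster-wide. */
class PerformanceStatistics implements PerformanceStatisticsMBean {
    private volatile boolean enabled;

    @Override public void start() { enabled = true; }
    @Override public void stop()  { enabled = false; }
    @Override public boolean enabled() { return enabled; }
}
```

Following the standard MBean naming convention (implementation class name plus the MBean suffix), such a bean can be registered on the platform MBean server and driven from any JMX client; the ObjectName the real bean is registered under is not specified here.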

2) Control.sh utility. The functionality is similar to the JMX bean.

3) System properties:

  • IGNITE_PERF_STAT_FILE_MAX_SIZE - Maximum performance statistics file size in bytes. Collection will be stopped when the size is exceeded.
  • IGNITE_PERF_STAT_BUFFER_SIZE - Performance statistics off-heap buffer size in bytes.
  • IGNITE_PERF_STAT_FLUSH_SIZE - Minimal batch size, in bytes, to flush to the performance statistics file.
  • IGNITE_PERF_STAT_CACHED_STRINGS_THRESHOLD - Maximum number of cached strings. String caching will stop when the threshold is exceeded.
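Such tuning properties are typically resolved from system properties with a fallback default. A hedged sketch of that resolution; the helper class and the default values in the test are illustrative, not Ignite's actual defaults:

```java
/** Illustrative system-property resolution with a fallback default. */
class PerfStatProperties {
    static long longProperty(String name, long dflt) {
        String val = System.getProperty(name);

        if (val == null)
            return dflt;

        try {
            return Long.parseLong(val);
        }
        catch (NumberFormatException ignored) {
            return dflt; // Fall back to the default on a malformed value.
        }
    }
}
```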

Risks and Assumptions

Enabling profiling mode will cause some performance degradation.

Discussion Links

Dev-list discussion.

Report example





Reference Links


  1. https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF94176
  2. http://www.dba-oracle.com/t_sample_awr_report.htm
  3. http://expertoracle.com/2018/02/06/performance-tuning-basics-15-awr-report-analysis/
  4. https://github.com/darold/pgbadger
  5. https://pgmetrics.io/docs/index.html#example
  6. https://powa.readthedocs.io/en/latest/

Tickets

IGNITE-12666







3 Comments

  1. Nikita Amelchev I took a look at the sample reports and noticed that only the number of operations is put on the timeline. Do you think it may be possible to collect histograms for each n-seconds interval so that we can build a heatmap for operations durations?

  2. Nikita Amelchev Also, can you also add the following details to the IEP?

    • How is the profiling data collected? Is the data kept in memory (on heap, off heap?) or is it being dumped to disk? If yes, what is the format?
    • Did you consider using JFR extensions for dumping the profiling events?
    • What is the expected performance impact when this feature is enabled? Will it have configurable profiles like JFR does?
    • Why do we need a separate command-line tool? Does the tool need to have a cluster online to work? Should we add this to control.sh?
    • If the data is written in binary format, can I build a report later (similar to the perf tool)?
    • What is the CLI for the tool (flags, input parameters, security permissions)?
  3. Alexey Goncharuk Sorry for the delay,

    > Do you think it may be possible to collect histograms for each n-seconds interval so that we can build a heatmap for operations durations?
    Yes, it's possible. All the required statistics are collected and can be properly aggregated and drawn in the report. I think I can add this view during review or shortly after.


    > How is the profiling data collected? Is the data kept in memory (on heap, off heap?) or is it being dumped to disk? If yes, what is the format?
    I have updated the IEP. Yes, the data is collected into profiling files on disk. An off-heap buffer is used before flushing to disk. The mechanics are similar to WAL logging.

    > Did you consider using JFR extensions for dumping the profiling events?
    I think this method may not take into account the effect of small queries.

    > What is expected performance impact when this feature is enabled? Will it have configurable profiles like JFR does?
    Locally I get less than a 5% impact. I'll benchmark it on a real cluster soon. The off-heap buffer size can be configured. The performance impact itself can't be configured.

    > Why do we need a separate command-line tool? Does the tool need to have a cluster online to work? Should we add this to control.sh?

    > If the data is written in binary format, can I build a report later (similar to perf tool?)

    The tool does not need a cluster. Profiling files can be parsed locally, outside of a cluster, without starting a grid client node. This is why a separate script is used.

    > What is the CLI for the tool (flags, input parameters, security permissions)

    I have updated the PR. See the Profiling management section.