ID | IEP-35 |
Author | |
Sponsor | Nikolay Izhikov |
Created | |
Status | IN PROGRESS - Phase 1,2 implemented. |
Motivation
Ignite currently has no complete monitoring API; the existing APIs are fragmented. They use different protocols, such as JMX, Java API, SQL system views, text logs, etc.
From an administrator's point of view, it is impossible to understand what is going on in a running cluster:
- Which user tasks are executed?
- What resources are used by each task?
Because of this variety of protocols, it is very hard to export all existing metrics to an arbitrary monitoring system.
The goal of this IEP is to provide a way to answer three questions:
- What is running inside the Ignite cluster?
- An administrator should be able to list each user object that was created (or run) inside a cluster via every monitoring interface we will support (JMX, SQL, CLI, etc.).
- An administrator should be able to identify the source of each user object via some ID or other user-provided info.
- What is running slow?
- If the execution of some user code violates configured thresholds, a handler for such events should be executed. By default, the handler should print a WARN log message with all available information about the slow piece of user-provided code.
- What will be running slow?
- We should provide a way to profile the cluster. Consider the following scenario:
- Enable profiling mode.
- Execute some arbitrary workload.
- Collect profiling info.
- Run an Ignite-provided tool that creates a report containing statistics of the workload. Examples of such tools are:
- Oracle AWR
- PostgreSQL pgBadger
Description:

Phase 1: What is running inside the Ignite cluster? + What is running slow?
1. We should add some entities in Ignite:
- MetricRegistry - Ignite subsystem that provides some set of sensors and lists.
- Cache,
- Compute,
- ServiceGrid,
- etc.
- Metric - a named number with a well-defined algorithm for calculating its value at any given moment in time.
class Metric {
    String name; // EntryCount, MemoryAvailable, etc.
    long value; // or double
    Collection<Tuple2<String, String>> labels; // hostName, cacheName, etc.
}
class LongMetric extends Metric {
    long ts; // Timestamp of the last value update.
}
- SystemView - a named list that contains info about Ignite objects. Examples: list of caches, list of transactions, list of nodes, list of running queries, last N queries, etc.
- MonitoringEvent - generated when some user-defined code violates a threshold.
class MonitoringEvent<T> {
    MonitoringEventType type; // Event type.
    T info; // Event info. The type of info differs for different types of events.
}
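As an illustration only, the threshold check that produces a MonitoringEvent could be sketched as follows. The names ThresholdWatcher, SLOW_USER_CODE and the String payload are assumptions made for this example and are not part of the IEP; the default handler simply prints a WARN line, as the Motivation section suggests.

```java
import java.util.function.Consumer;

/** Hypothetical event types; the IEP does not enumerate them. */
enum MonitoringEventType { SLOW_USER_CODE }

/** Event shape from the text above, with the info type as a generic parameter. */
final class MonitoringEvent<T> {
    final MonitoringEventType type;
    final T info;

    MonitoringEvent(MonitoringEventType type, T info) {
        this.type = type;
        this.info = info;
    }
}

/** Hypothetical helper that times user code and reports threshold violations. */
final class ThresholdWatcher {
    private final long thresholdNanos;
    private final Consumer<MonitoringEvent<String>> handler;

    ThresholdWatcher(long thresholdNanos, Consumer<MonitoringEvent<String>> handler) {
        this.thresholdNanos = thresholdNanos;
        this.handler = handler;
    }

    /** Default handler: prints a WARN line with the available info. */
    static void warn(MonitoringEvent<String> evt) {
        System.err.println("WARN " + evt.type + ": " + evt.info);
    }

    /** Runs the user code and fires an event if it took longer than the threshold. */
    void run(String taskName, Runnable userCode) {
        long start = System.nanoTime();
        try {
            userCode.run();
        }
        finally {
            long elapsed = System.nanoTime() - start;
            if (elapsed > thresholdNanos)
                handler.accept(new MonitoringEvent<>(
                    MonitoringEventType.SLOW_USER_CODE,
                    taskName + " took " + elapsed + " ns"));
        }
    }
}

final class ThresholdDemo {
    public static void main(String[] args) {
        // 1 ms threshold; the task sleeps for 5 ms, so a WARN line is expected.
        ThresholdWatcher watcher = new ThresholdWatcher(1_000_000L, ThresholdWatcher::warn);
        watcher.run("my-task", () -> {
            try {
                Thread.sleep(5);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }
}
```

The handler is passed in as a plain Consumer so that the default WARN logging can be replaced without touching the timing logic.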
2. GridMetricManager, GridSystemViewManager:
- GridMetricManager - should be able to store and query Ignite metrics.
- GridSystemViewManager - should be able to store and export SystemViews.
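To make the "store and export SystemViews" responsibility more concrete, here is a minimal sketch. SimpleSystemView and SimpleSystemViewManager are hypothetical names, and representing each row as a String is an assumption chosen to keep the example short; the actual GridSystemViewManager may look quite different.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical named view: a live collection plus a function that renders each object as a row. */
final class SimpleSystemView<R> {
    final String name;
    private final Collection<R> source;
    private final Function<R, String> toRow;

    SimpleSystemView(String name, Collection<R> source, Function<R, String> toRow) {
        this.name = name;
        this.source = source;
        this.toRow = toRow;
    }

    /** Walks the backing collection at call time, so the rows reflect its current content. */
    List<String> rows() {
        List<String> res = new ArrayList<>();
        for (R obj : source)
            res.add(toRow.apply(obj));
        return res;
    }
}

/** Hypothetical stand-in for GridSystemViewManager: stores views by name and exports their rows. */
final class SimpleSystemViewManager {
    private final Map<String, SimpleSystemView<?>> views = new ConcurrentHashMap<>();

    void register(SimpleSystemView<?> view) {
        views.put(view.name, view);
    }

    List<String> export(String viewName) {
        SimpleSystemView<?> view = views.get(viewName);
        if (view == null)
            throw new IllegalArgumentException("Unknown view: " + viewName);
        return view.rows();
    }
}

final class SystemViewDemo {
    public static void main(String[] args) {
        List<String> caches = new ArrayList<>();
        caches.add("orders");

        SimpleSystemViewManager mgr = new SimpleSystemViewManager();
        mgr.register(new SimpleSystemView<String>("caches", caches, n -> "cacheName=" + n));

        for (String row : mgr.export("caches"))
            System.out.println(row);
    }
}
```

Because the view keeps a reference to the live collection rather than a copy, objects created later show up in the next export without any extra bookkeeping.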
3. Exporters:
Specific interfaces will be supported through exporters.
Exporters should work only with a read-only version of GridMetricManager and should not rely on any other knowledge of Ignite internals.
Examples of exporters:
- JMX
- HTTP
- SQL System View
- Log
- etc.
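The separation between the metric manager and its exporters might look roughly like the sketch below. All type names here (ReadOnlyMetrics, SimpleMetricManager, LogExporter) are hypothetical and stand in for whatever the real GridMetricManager API provides; the point is only that the exporter sees a read-only snapshot and nothing else.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical read-only view that is the only thing an exporter can see. */
interface ReadOnlyMetrics {
    Map<String, Long> snapshot();
}

/** Hypothetical stand-in for GridMetricManager: stores named long metrics. */
final class SimpleMetricManager implements ReadOnlyMetrics {
    private final Map<String, AtomicLong> metrics = new ConcurrentHashMap<>();

    /** Write side, used by Ignite subsystems only. */
    void add(String name, long delta) {
        metrics.computeIfAbsent(name, k -> new AtomicLong()).addAndGet(delta);
    }

    /** Read side: an unmodifiable copy sorted by metric name. */
    @Override public Map<String, Long> snapshot() {
        Map<String, Long> copy = new TreeMap<>();
        metrics.forEach((name, val) -> copy.put(name, val.get()));
        return Collections.unmodifiableMap(copy);
    }
}

/** Hypothetical log-style exporter: depends on ReadOnlyMetrics and nothing else. */
final class LogExporter {
    String render(ReadOnlyMetrics metrics) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : metrics.snapshot().entrySet())
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        return sb.toString();
    }
}

final class ExporterDemo {
    public static void main(String[] args) {
        SimpleMetricManager mgr = new SimpleMetricManager();
        mgr.add("cache.EntryCount", 5);
        System.out.print(new LogExporter().render(mgr));
    }
}
```

A JMX, HTTP or SQL exporter would take the same ReadOnlyMetrics argument and differ only in how it presents the snapshot.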
Ignite objects/entities that should be listed in Phase 2
- A list of compute tasks:
- Closures
- Map-reduce jobs
- ComputeJob
- Scheduled tasks
- Service grid:
- A list of services with deployment status
- Caches
- Cache groups
- Cluster nodes
- SQL objects
- Schemas
- Tables
- Views
- Table columns
- View columns
- Indexes
- Queries:
- SQL
- Scan
- Text
- ContinuousQuery
- IgniteCache#invoke
- put, get, remove, replace, clear operations
- Transactions with lock list
- DataStreamers
- Explicit locks (IgniteCache#lock)
- DataStructures
- Queue
- Set
- AtomicLong
- AtomicReference
- CountDownLatch
- Sequence
- Semaphore
- Message topics (IgniteMessaging)
- Thin client connections.
- Machine Learning - ???
Internal Data Structures and Processes we should provide info for
- PME queue
- Service exchange queue
- Security events
Risks and Assumptions
These changes put backward compatibility at risk.
We should consider implementing this IEP as Ignite 3.
Discussion Links
http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Proof-of-concept-td41904.html
http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Current-API-Analysis-td41823.html
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Metrics-configuration-td42478.html
http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-GridJobProcessorMetrics-migration-td42415.html#a42441
http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Replace-RunningQueryManager-with-GridSystemViewManager-td43794.html
Gap analysis
Current availability of monitoring APIs:
Monitoring completely unavailable:
- Compute Grid
- Some basic numbers are available in ClusterMetrics (getMaximumActiveJobs, getCurrentActiveJobs, etc.)
- Service Grid
- Data streamers
- Distributed Data Structures
- Ignite messaging (Ignite#message)
- 3rd-party storage
- ContinuousQuery
- MVCC transactions
- ML - What should be available?
- Explicit locks
Monitoring API available:
- Cache
- PDS + offheap memory
- Ignite#dataRegionMetrics
- Ignite#dataStorageMetrics
- Ignite#persistentStoreMetrics
- Queries
- IgniteCache#queryMetrics
- IgniteCache#queryDetailMetrics
- QueryHistoryMetrics
- IgniteCache#mxBean
- IgniteCache#localMxBean
- SQL
- LOCAL_SQL_RUNNING_QUERIES
- INDEXES
- Transactions
- JMX - TransactionMetricsMxBean
- JMX - TransactionMXBean
- ThinClients
- JMX - ClientProcessorMXBean
- IoStatisticsManager, IoStatisticsHolder
- GridJobMetricsProcessor
- IgniteMBeansManager
- IgniteSpiManagementMBean
Design Principles
- Sensors should contain only raw values. No aggregation of numeric metrics on the Ignite side. Min, max, avg and other functions are a matter for the external monitoring system.
- Every user task should have an ID or name, provided by the user at start time, that allows association between monitoring info and user code. Users should be able to find their code reflected in monitoring.
- Every user task should have a "connectionID" ("sessionID", "clientID") or similar. Users should be able to know that a specific task was triggered by a specific connection (session, client).
- No computation to get current values. We should change sensor and list values when specific events occur. When a sensor is queried, we should only read its value from internal storage. No additional computation is involved.
- Users should be able to enable/disable any Sensor Group or List at runtime. Ignite should provide some administrator interface(s) to enable/disable each Sensor Group or List separately. No performance penalty for disabled sensors and lists.
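Several of these principles can be shown together in a small sketch. SensorGroup and its method names are hypothetical; the example assumes a single counter per group. Note that the disabled path still costs one volatile read here, so a real implementation aiming for strictly zero overhead would need a different mechanism.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sensor group that can be switched on and off at runtime. */
final class SensorGroup {
    private volatile boolean enabled;
    private final AtomicLong startedTasks = new AtomicLong();

    void enable() {
        enabled = true;
    }

    void disable() {
        enabled = false;
    }

    /** Called from the event site: the stored value changes when the event occurs. */
    void onTaskStarted() {
        if (enabled)
            startedTasks.incrementAndGet();
    }

    /** Reading returns the stored raw value; no min/max/avg is computed here. */
    long startedTaskCount() {
        return startedTasks.get();
    }
}

final class SensorGroupDemo {
    public static void main(String[] args) {
        SensorGroup compute = new SensorGroup();

        compute.onTaskStarted(); // Ignored: the group is disabled by default.
        compute.enable();
        compute.onTaskStarted(); // Counted.

        System.out.println("startedTasks=" + compute.startedTaskCount());
    }
}
```

Aggregations such as rates or averages are left to the external monitoring system, which can derive them from successive raw readings.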
Reference Links
https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF027
https://www.oracle.com/technetwork/database/manageability/diag-pack-ow09-133950.pdf
https://github.com/darold/pgbadger
Tickets
Key | Summary | T | Created | Updated | Due | Assignee | Reporter | P | Status | Resolution |