## Overview

Tajo metrics come in four types: gauge, meter, counter, and histogram. Tajo internally maintains various metrics on its components and exposes them to external monitoring applications in several ways, such as Ganglia, files, log4j, and JMX.
## Metric Name and Metric Hierarchy

Each metric name consists of three parts: a group name, a context name, and an item name. Group and context names are categories; in particular, a group name denotes a system component or the topmost logical category. The hierarchy is as follows:

```
MASTER (= TajoMaster metrics)
 |- CLUSTER (aggregated metrics about cluster stats and cluster resources)
 |- QUERY (aggregated metrics about submitted queries and the scheduler)
NODE (= Node metrics)
 |- TASKS (metrics about the TaskManager and task executions in each node)
 |- QUERYMASTER (metrics about the QueryMaster and its manager in each node)
${COMPONENT}-JVM (= each component's JVM metrics in each node)
 |- MEMORY (JVM heap, direct memory, ...)
 |- FILE (files opened)
 |- GC (garbage collection)
 |- THREAD (threads)
 |- LOG (logging events)
```
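To make the composition concrete, the sketch below (illustrative Python, not part of Tajo) joins the three parts with dots, which is how full metric names such as `MASTER.CLUSTER.UPTIME` are formed:

```python
def metric_name(group: str, context: str, item: str) -> str:
    """Compose a full metric name from the three-level hierarchy:
    group -> context -> item, joined with dots."""
    return ".".join((group, context, item))

# Names drawn from the hierarchy above
print(metric_name("MASTER", "CLUSTER", "UPTIME"))        # MASTER.CLUSTER.UPTIME
print(metric_name("NODE", "TASKS", "RUNNING_TASKS"))     # NODE.TASKS.RUNNING_TASKS
print(metric_name("MASTER-JVM", "MEMORY", "heap.used"))  # MASTER-JVM.MEMORY.heap.used
```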
## Metric List
### MASTER.CLUSTER

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
MASTER.CLUSTER | UPTIME | long | Milliseconds | Elapsed time since the Tajo cluster started up
MASTER.CLUSTER | TOTAL_NODES | int | Number of nodes | Total number of cluster nodes
MASTER.CLUSTER | ACTIVE_NODES | int | Number of nodes | Number of active cluster nodes
MASTER.CLUSTER | LOST_NODES | int | Number of nodes | Number of lost cluster nodes
MASTER.CLUSTER | TOTAL_MEMORY | int | Megabytes | Total resource memory of cluster nodes
MASTER.CLUSTER | FREE_MEMORY | int | Megabytes | Available resource memory of cluster nodes
MASTER.CLUSTER | TOTAL_VCPU | int | Number of virtual CPU cores | Total virtual CPU cores of cluster nodes
MASTER.CLUSTER | FREE_VCPU | int | Number of virtual CPU cores | Available virtual CPU cores of cluster nodes
### MASTER.QUERY

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
MASTER.QUERY | SUBMITTED | int | Number of queries | Number of submitted queries
MASTER.QUERY | COMPLETED | int | Number of queries | Number of completed queries
MASTER.QUERY | RUNNING | int | Number of queries | Number of running queries
MASTER.QUERY | ERROR | int | Number of queries | Number of queries canceled due to errors
MASTER.QUERY | FAILED | int | Number of queries | Number of queries that failed after running
MASTER.QUERY | KILLED | int | Number of queries | Number of queries killed by users
MASTER.QUERY | MAX_IO_THROUGHPUT | int | Megabytes | Maximum aggregated I/O throughput per query in the cluster
MASTER.QUERY | AVG_IO_THROUGHPUT | int | Megabytes | Average aggregated I/O throughput per query in the cluster
### NODE.QUERYMASTER

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
NODE.QUERYMASTER | RUNNING_QM | int | Number of query masters | Number of query masters running in the node
### NODE.TASKS

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
NODE.TASKS | RUNNING_TASKS | int | Number of tasks | Number of tasks running in the node
### `<COMPONENT>`-JVM

All Tajo components, such as Master (TajoMaster) and Node (TajoWorker), expose a number of JVM metrics. These metrics belong to a group named `<component name>-JVM`; for example, TajoMaster has the `MASTER-JVM` group and TajoWorker has the `NODE-JVM` group. The contexts and items are the same for all JVM metric groups.
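As a small illustration (plain Python, not Tajo code), the JVM metrics group name is simply the component name with a `-JVM` suffix, and full metric names follow the same dotted hierarchy as the other groups:

```python
def jvm_metrics_group(component: str) -> str:
    """JVM metrics group name for a Tajo component: '<component>-JVM'."""
    return component + "-JVM"

# Full JVM metric names for the two main components
for component in ("MASTER", "NODE"):
    group = jvm_metrics_group(component)
    print(".".join((group, "GC", "PS-MarkSweep.count")))
    print(".".join((group, "MEMORY", "heap.used")))
```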
Context Name | Item Name | Data Type
---|---|---
GC | PS-MarkSweep.time | int
GC | PS-MarkSweep.count | int
GC | PS-Scavenge.time | int
GC | PS-Scavenge.count | int
MEMORY | pools.Code-Cache.usage |
MEMORY | pools.PS-Survivor-Space.usage |
MEMORY | pools.PS-Eden-Space.usage |
MEMORY | pools.PS-Perm-Gen.usage |
MEMORY | pools.PS-Old-Gen.usage |
MEMORY | heap.init |
MEMORY | heap.usage |
MEMORY | heap.used |
MEMORY | heap.committed |
MEMORY | heap.max |
MEMORY | non-heap.init |
MEMORY | non-heap.usage |
MEMORY | non-heap.used |
MEMORY | non-heap.committed |
MEMORY | non-heap.max |
MEMORY | total.init |
MEMORY | total.used |
MEMORY | total.committed |
MEMORY | total.max |
LOG | Info |
LOG | Fatal |
LOG | Error |
LOG | Warning |
THREAD | terminated.count |
THREAD | timed_waiting.count |
THREAD | count |
THREAD | blocked.count |
THREAD | deadlock.count |
THREAD | new.count |
THREAD | deadlocks |
THREAD | runnable.count |
THREAD | daemon.count |
THREAD | waiting.count |
## Configuration

Put `tajo-metrics.properties` in `<tajo install dir>/conf`. An example configuration is as follows:

```properties
reporter.ganglia=org.apache.tajo.util.metrics.reporter.GangliaReporter
reporter.file=org.apache.tajo.util.metrics.reporter.MetricsFileScheduledReporter

MASTER.reporters=ganglia,file
MASTER.ganglia.server=localhost
MASTER.ganglia.port=8649
MASTER.ganglia.period=10
MASTER.file.filename=/Users/hyunsik/master-metrics.log
MASTER.file.period=10

MASTER-JVM.reporters=ganglia,file
MASTER-JVM.ganglia.server=localhost
MASTER-JVM.ganglia.port=8650
MASTER-JVM.ganglia.period=60
MASTER-JVM.file.filename=/Users/hyunsik/master-jvm-metrics.log
MASTER-JVM.file.period=60

NODE.reporters=ganglia,file
NODE.ganglia.server=localhost
NODE.ganglia.port=8653
NODE.ganglia.period=10
NODE.file.filename=/Users/hyunsik/node-metrics.log
NODE.file.period=5

NODE-JVM.reporters=ganglia,file
NODE-JVM.ganglia.server=localhost
NODE-JVM.ganglia.port=8654
NODE-JVM.ganglia.period=60
NODE-JVM.file.filename=/Users/hyunsik/node-jvm-metrics.log
NODE-JVM.file.period=60
```
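The file is a flat Java properties file: `reporter.<alias>` maps a reporter alias to its implementation class, `<GROUP>.reporters` selects aliases per metrics group, and `<GROUP>.<alias>.<option>` configures each reporter for that group. The sketch below (illustrative Python only, not Tajo's actual parser) shows how these keys relate:

```python
def parse_properties(text: str) -> dict:
    """Parse simple 'key=value' properties lines into a dict,
    skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

SAMPLE = """
reporter.file=org.apache.tajo.util.metrics.reporter.MetricsFileScheduledReporter
MASTER.reporters=file
MASTER.file.filename=/tmp/master-metrics.log
MASTER.file.period=10
"""

conf = parse_properties(SAMPLE)
for alias in conf["MASTER.reporters"].split(","):
    cls = conf["reporter." + alias]          # reporter implementation class
    period = conf["MASTER.%s.period" % alias]  # report period for this group
    print(alias, cls, period)
```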