## Overview

Tajo metrics come in four types: gauge, meter, counter, and histogram. Tajo internally maintains various metrics on its components and exposes them to external monitoring applications in several ways, such as Ganglia, files, log4j, and JMX.
## Metric Name and Metric Hierarchy

Each metric name consists of three parts: a group name, a context name, and an item name. Group and context names are categories; in particular, a group name denotes a system component or the topmost logical category. The hierarchy is as follows:

```
MASTER (= TajoMaster metrics)
 |- CLUSTER (aggregated metrics about cluster stats and cluster resources)
 |- QUERY (aggregated metrics about submitted queries and the scheduler)
NODE (= Node metrics)
 |- TASKS (metrics about the TaskManager and task executions in each node)
 |- QUERYMASTER (metrics about the QueryMaster and its manager in each node)
${COMPONENT}-JVM (= each component's JVM metrics in each node)
 |- MEMORY (JVM heap, direct memory, ...)
 |- FILE (files opened)
 |- GC (garbage collection)
 |- THREAD (threads)
 |- LOG (logging events)
```
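To make the composition concrete, the sketch below (illustrative Python, not part of Tajo) joins the three parts with dots, which is how full metric names such as `MASTER.CLUSTER.UPTIME` are formed:

```python
def metric_name(group: str, context: str, item: str) -> str:
    """Compose a full metric name from the three-level hierarchy:
    group -> context -> item, joined with dots."""
    return ".".join((group, context, item))

# Names drawn from the hierarchy above
print(metric_name("MASTER", "CLUSTER", "UPTIME"))        # MASTER.CLUSTER.UPTIME
print(metric_name("NODE", "TASKS", "RUNNING_TASKS"))     # NODE.TASKS.RUNNING_TASKS
print(metric_name("MASTER-JVM", "MEMORY", "heap.used"))  # MASTER-JVM.MEMORY.heap.used
```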
## Metric List
### MASTER.CLUSTER

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
MASTER.CLUSTER | UPTIME | long | Milliseconds | Elapsed time since the Tajo cluster started up
MASTER.CLUSTER | TOTAL_NODES | int | Number of nodes | Total number of cluster nodes
MASTER.CLUSTER | ACTIVE_NODES | int | Number of nodes | Number of active cluster nodes
MASTER.CLUSTER | LOST_NODES | int | Number of nodes | Number of lost cluster nodes
MASTER.CLUSTER | TOTAL_MEMORY | int | Megabytes | Total resource memory of cluster nodes
MASTER.CLUSTER | FREE_MEMORY | int | Megabytes | Available resource memory of cluster nodes
MASTER.CLUSTER | TOTAL_VCPU | int | Number of virtual CPU cores | Total virtual CPU cores of cluster nodes
MASTER.CLUSTER | FREE_VCPU | int | Number of virtual CPU cores | Available virtual CPU cores of cluster nodes
### MASTER.QUERY

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
MASTER.QUERY | SUBMITTED | int | Number of queries | Number of submitted queries
MASTER.QUERY | COMPLETED | int | Number of queries | Number of completed queries
MASTER.QUERY | RUNNING | int | Number of queries | Number of running queries
MASTER.QUERY | ERROR | int | Number of queries | Number of queries canceled due to errors
MASTER.QUERY | FAILED | int | Number of queries | Number of queries that failed after running
MASTER.QUERY | KILLED | int | Number of queries | Number of queries killed by users
MASTER.QUERY | MAX_IO_THROUGHPUT | int | Megabytes | Maximum aggregated I/O throughput per query in the cluster
MASTER.QUERY | AVG_IO_THROUGHPUT | int | Megabytes | Average aggregated I/O throughput per query in the cluster
### NODE.QUERYMASTER

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
NODE.QUERYMASTER | RUNNING_QM | int | Number of query masters | Number of query masters running in the node
### NODE.TASKS

Full Context Name | Item Name | Data Type | Unit | Description
---|---|---|---|---
NODE.TASKS | RUNNING_TASKS | int | Number of tasks | Number of tasks running in the node
### `<COMPONENT>`-JVM

All Tajo components, such as Master (TajoMaster) and Node (TajoWorker), expose a number of JVM metrics. These metrics belong to a group named `<component name>-JVM`; for example, TajoMaster has the `MASTER-JVM` group and TajoWorker has the `NODE-JVM` group. The contexts and items are the same for all JVM metric groups.
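As a small illustration (plain Python, not Tajo code), the JVM metrics group name is simply the component name with a `-JVM` suffix, and full metric names follow the same dotted hierarchy as the other groups:

```python
def jvm_metrics_group(component: str) -> str:
    """JVM metrics group name for a Tajo component: '<component>-JVM'."""
    return component + "-JVM"

# Full JVM metric names for the two main components
for component in ("MASTER", "NODE"):
    group = jvm_metrics_group(component)
    print(".".join((group, "GC", "PS-MarkSweep.count")))
    print(".".join((group, "MEMORY", "heap.used")))
```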
Context Name | Item Name | Data Type
---|---|---
GC | PS-MarkSweep.time | int
GC | PS-MarkSweep.count | int
GC | PS-Scavenge.time | int
GC | PS-Scavenge.count | int
MEMORY | pools.Code-Cache.usage |
MEMORY | pools.PS-Survivor-Space.usage |
MEMORY | pools.PS-Eden-Space.usage |
MEMORY | pools.PS-Perm-Gen.usage |
MEMORY | pools.PS-Old-Gen.usage |
MEMORY | heap.init |
MEMORY | heap.usage |
MEMORY | heap.used |
MEMORY | heap.committed |
MEMORY | heap.max |
MEMORY | non-heap.init |
MEMORY | non-heap.usage |
MEMORY | non-heap.used |
MEMORY | non-heap.committed |
MEMORY | non-heap.max |
MEMORY | total.init |
MEMORY | total.used |
MEMORY | total.committed |
MEMORY | total.max |
LOG | Info |
LOG | Fatal |
LOG | Error |
LOG | Warning |
THREAD | terminated.count |
THREAD | timed_waiting.count |
THREAD | count |
THREAD | blocked.count |
THREAD | deadlock.count |
THREAD | new.count |
THREAD | deadlocks |
THREAD | runnable.count |
THREAD | daemon.count |
THREAD | waiting.count |
## Configuration

Put `tajo-metrics.properties` in `<tajo install dir>/conf`. An example configuration is as follows:

```properties
reporter.ganglia=org.apache.tajo.util.metrics.reporter.GangliaReporter
reporter.file=org.apache.tajo.util.metrics.reporter.MetricsFileScheduledReporter

MASTER.reporters=ganglia,file
MASTER.ganglia.server=localhost
MASTER.ganglia.port=8649
MASTER.ganglia.period=10
MASTER.file.filename=/Users/hyunsik/master-metrics.log
MASTER.file.period=10

MASTER-JVM.reporters=ganglia,file
MASTER-JVM.ganglia.server=localhost
MASTER-JVM.ganglia.port=8650
MASTER-JVM.ganglia.period=60
MASTER-JVM.file.filename=/Users/hyunsik/master-jvm-metrics.log
MASTER-JVM.file.period=60

NODE.reporters=ganglia,file
NODE.ganglia.server=localhost
NODE.ganglia.port=8653
NODE.ganglia.period=10
NODE.file.filename=/Users/hyunsik/node-metrics.log
NODE.file.period=5

NODE-JVM.reporters=ganglia,file
NODE-JVM.ganglia.server=localhost
NODE-JVM.ganglia.port=8654
NODE-JVM.ganglia.period=60
NODE-JVM.file.filename=/Users/hyunsik/node-jvm-metrics.log
NODE-JVM.file.period=60
```
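The file is a flat Java properties file: `reporter.<alias>` maps a reporter alias to its implementation class, `<GROUP>.reporters` selects aliases per metrics group, and `<GROUP>.<alias>.<option>` configures each reporter for that group. The sketch below (illustrative Python only, not Tajo's actual parser) shows how these keys relate:

```python
def parse_properties(text: str) -> dict:
    """Parse simple 'key=value' properties lines into a dict,
    skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

SAMPLE = """
reporter.file=org.apache.tajo.util.metrics.reporter.MetricsFileScheduledReporter
MASTER.reporters=file
MASTER.file.filename=/tmp/master-metrics.log
MASTER.file.period=10
"""

conf = parse_properties(SAMPLE)
for alias in conf["MASTER.reporters"].split(","):
    cls = conf["reporter." + alias]          # reporter implementation class
    period = conf["MASTER.%s.period" % alias]  # report period for this group
    print(alias, cls, period)
```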