Introduction
Hadoop slave nodes regularly emit metrics that reflect service health. The service team watches these metrics to understand whether the service is in a healthy state, and traces back through the history to understand past behavior. Some typical use cases are:
- Early warning for an unhealthy HBase RegionServer: heap usage, RPC handling metrics, region aliveness, etc.
- Troubleshooting through the metrics history dashboard
- When NameNode RPC traffic from clients is very high, identify the source clients and grep the users from the audit log as well
- Users can flexibly set a threshold for each monitored metric and get alert notifications without rewriting or creating policies from scratch
- Notification about HDFS clients that generate abnormal RPC traffic
- Extract the list of DataNodes/RegionServers with abnormal RPC processing time
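The flexible-threshold alerting described above can be sketched as a small policy evaluator. This is only a minimal illustration: the policy names, metric names, operators, and threshold values below are assumptions for the example, not part of the design.

```python
import operator

# Hypothetical sketch of flexible threshold alerting: each policy is
# (policy name, metric name, comparison operator, threshold). All names
# and thresholds here are illustrative assumptions.
def evaluate_policies(sample, policies):
    """Return the names of policies violated by a metric sample.

    sample   -- dict: metric name -> latest observed value
    policies -- list of (policy_name, metric_name, op, threshold)
    """
    ops = {">": operator.gt, "<": operator.lt,
           ">=": operator.ge, "<=": operator.le}
    alerts = []
    for name, metric, op, threshold in policies:
        value = sample.get(metric)
        if value is not None and ops[op](value, threshold):
            alerts.append(name)
    return alerts

policies = [
    ("rs_heap_high", "hadoop.memory.heapmemoryusage.used", ">", 8 * 1024 ** 3),
    ("nn_callqueue_backlog", "hadoop.namenode.rpc.callqueuelength", ">", 100),
]
sample = {
    "hadoop.memory.heapmemoryusage.used": 9 * 1024 ** 3,
    "hadoop.namenode.rpc.callqueuelength": 5,
}
print(evaluate_policies(sample, policies))  # ['rs_heap_high']
```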
High-Level Monitoring/Alert Flow
Metrics Collector
The operations team constantly struggles with metrics monitoring for the HBase cluster, e.g. RegionServer heap usage, RegionServer RPC handling metrics, and region aliveness on each RegionServer, so we need a solution that collects all of those metrics. One option is to deploy a standalone JMX client on each node; another is to add a JMX sink to Hadoop’s metrics system.
A JMX client is expected to be deployed on each RegionServer slave node, or on a small collection of nodes that retrieve the JMX information remotely. Since we have thousands of slave nodes, it is not feasible to run all of those clients from a single server, which would put a heavy load on that machine.
A JMX sink would be developed against Hadoop’s metrics sink interface and plugged into the Hadoop runtime environment.
We prefer to write the data into Kafka as a “distributed caching layer” to decouple the JMX client from the back-end storage system, and to shield the collection path from the storage latency of the JMX data.
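Either collection path ultimately produces (metric name, value) records to publish to Kafka. As a hedged sketch, the snippet below flattens the JSON that Hadoop daemons expose on their standard `/jmx` servlet into such records; the `wanted` mapping and sample payload are illustrative, and the actual publish to a Kafka topic is omitted.

```python
import json

# Hedged sketch: flatten a Hadoop /jmx JSON payload ({"beans": [...]})
# into (metric name, value) pairs. The `wanted` mapping from
# (bean name, attribute) to metric name mirrors the tables in this
# document; the sample payload below is illustrative.
def jmx_to_metrics(jmx_json, wanted):
    pairs = []
    for bean in json.loads(jmx_json).get("beans", []):
        bean_name = bean.get("name")
        for attr, value in bean.items():
            metric = wanted.get((bean_name, attr))
            if metric is not None:
                pairs.append((metric, value))
    return pairs

payload = json.dumps({"beans": [{
    "name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
    "CallQueueLength": 7,
    "NumOpenConnections": 52,
}]})
wanted = {
    ("Hadoop:service=NameNode,name=RpcActivityForPort8020", "CallQueueLength"):
        "hadoop.namenode.rpc.callqueuelength",
}
print(jmx_to_metrics(payload, wanted))
# [('hadoop.namenode.rpc.callqueuelength', 7)]
```

Each resulting pair, stamped with host and timestamp, would then be sent to a Kafka topic by whichever producer client the collector uses.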
Agent
If we build a standalone JMX client to collect JMX metrics, we had better have an agent that monitors whether each JMX client is working well; otherwise JMX data may be lost when some clients stop working.
If we use a JMX sink to collect the data, no agent is required: the data collection lifecycle is the same as the daemon lifecycle.
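One minimal form of such an agent is a heartbeat check: each JMX client periodically reports a timestamp, and the agent flags clients that have gone silent so stalled collectors can be restarted before data is lost. A sketch, with hypothetical node names and staleness limit:

```python
import time

# Hedged agent sketch: `heartbeats` maps each JMX client node to the last
# heartbeat timestamp the agent saw from it. Node names and the 120 s
# staleness limit are illustrative assumptions.
def stale_clients(heartbeats, now, max_age_secs=120):
    """Return (sorted) node names whose last heartbeat is too old."""
    return sorted(node for node, ts in heartbeats.items()
                  if now - ts > max_age_secs)

now = time.time()
heartbeats = {"rs-node-01": now - 30, "rs-node-02": now - 600}
print(stale_clients(heartbeats, now))  # ['rs-node-02']
```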
Metrics Storage
We need a scalable backend for large-scale metrics storage, together with a query engine for time-series data that supports min/max/average aggregation semantics.
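The min/max/average aggregation semantics can be illustrated with a toy reducer over (timestamp, value) samples. The function name and signature below are assumptions for illustration, not a concrete storage API.

```python
# Hedged sketch of min/max/average aggregation over a time series of
# (timestamp, value) samples. A real query engine would apply this per
# time bucket and per metric; names here are illustrative.
def aggregate(points, agg):
    values = [v for _, v in points]
    if not values:
        return None
    if agg == "min":
        return min(values)
    if agg == "max":
        return max(values)
    if agg == "avg":
        return sum(values) / len(values)
    raise ValueError("unknown aggregator: %s" % agg)

series = [(0, 10.0), (60, 30.0), (120, 20.0)]
print(aggregate(series, "max"), aggregate(series, "avg"))  # 30.0 20.0
```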
NameNode Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
 | | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
Name System State | Hadoop:service=NameNode,name=FSNamesystem | CapacityTotal | | hadoop.namenode.fsnamesystemstate.capacitytotal |
 | | CapacityUsed | | hadoop.namenode.dfs.capacityused |
 | | CapacityRemaining | | hadoop.namenode.dfs.capacityremaining |
 | | BlocksTotal | | hadoop.namenode.dfs.blockstotal |
 | | FilesTotal | | hadoop.namenode.dfs.filestotal |
 | | UnderReplicatedBlocks | | hadoop.namenode.dfs.underreplicatedblocks |
 | | MissingBlocks | | hadoop.namenode.dfs.missingblocks |
 | | CorruptBlocks | | hadoop.namenode.dfs.corruptblocks |
 | | LastCheckpointTime | | hadoop.namenode.dfs.lastcheckpointtime |
 | | TransactionsSinceLastCheckpoint | | hadoop.namenode.dfs.transactionssincelastcheckpoint |
 | | LastWrittenTransactionId | | hadoop.namenode.dfs.lastwrittentransactionid |
 | | SnapshottableDirectories | | hadoop.namenode.dfs.snapshottabledirectories |
 | | Snapshots | | hadoop.namenode.dfs.snapshots |
RPC | Hadoop:service=NameNode,name=RpcActivityForPort8020 | RpcQueueTimeAvgTime | | hadoop.namenode.rpc.rpcqueuetimeavgtime |
 | | RpcProcessingTimeAvgTime | | hadoop.namenode.rpc.rpcprocessingtimeavgtime |
 | | NumOpenConnections | | hadoop.namenode.rpc.numopenconnections |
 | | CallQueueLength | | hadoop.namenode.rpc.callqueuelength |
DataNode Metrics
Bean Category | Bean Name | Property | Metric Name |
---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | hadoop.memory.heapmemoryusage.used |
General | Hadoop:service=DataNode,name=FSDatasetState-bb8ac17a-d75b-4aab-9f9e-0ec1ef2d58f4 | Capacity | hadoop.datanode.fsdatasetstate.capacity |
 | | DfsUsed | hadoop.datanode.fsdatasetstate.dfsused |
 | Hadoop:service=DataNode,name=DataNodeInfo | XceiverCount | hadoop.datanode.datanodeinfo.xceivercount |
RPC | Hadoop:service=DataNode,name=RpcActivityForPort50020 | RpcQueueTimeAvgTime | hadoop.datanode.rpc.rpcqueuetimeavgtime |
 | | RpcProcessingTimeAvgTime | hadoop.datanode.rpc.rpcprocessingtimeavgtime |
 | | NumOpenConnections | hadoop.datanode.rpc.numopenconnections |
 | | CallQueueLength | hadoop.datanode.rpc.callqueuelength |
HBase Master Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
General | Hadoop:service=HBase,name=Master,sub=Server | averageLoad | | hadoop.hbase.master.server.averageload |
 | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCount | Counts the number of regions in transition | hadoop.hbase.master.assignmentmanger.ritcount |
 | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCountOverThreshold | Counts the number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time | hadoop.hbase.master.assignmentmanger.ritcountoverthreshold |
Region Assignment | Hadoop:service=HBase,name=Master,sub=AssignmentManger | Assign_num_ops | | hadoop.hbase.master.assignmentmanger.assign_num_ops |
 | | Assign_min | | hadoop.hbase.master.assignmentmanger.assign_min |
 | | Assign_max | | hadoop.hbase.master.assignmentmanger.assign_max |
 | | Assign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.assign_75th_percentile hadoop.hbase.master.assignmentmanger.assign_95th_percentile hadoop.hbase.master.assignmentmanger.assign_99th_percentile |
 | | BulkAssign_num_ops | | hadoop.hbase.master.assignmentmanger.bulkassign_num_ops |
 | | BulkAssign_min | | hadoop.hbase.master.assignmentmanger.bulkassign_min |
 | | BulkAssign_max | | hadoop.hbase.master.assignmentmanger.bulkassign_max |
 | | BulkAssign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.bulkassign_75th_percentile hadoop.hbase.master.assignmentmanger.bulkassign_95th_percentile hadoop.hbase.master.assignmentmanger.bulkassign_99th_percentile |
Balancer | Hadoop:service=HBase,name=Master,sub=Balancer | BalancerCluster_num_ops | | hadoop.hbase.master.balancer.balancercluster_num_ops |
 | | BalancerCluster_min | | hadoop.hbase.master.balancer.balancercluster_min |
 | | BalancerCluster_max | | hadoop.hbase.master.balancer.balancercluster_max |
 | | BalancerCluster_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.balancer.balancercluster_75th_percentile hadoop.hbase.master.balancer.balancercluster_95th_percentile hadoop.hbase.master.balancer.balancercluster_99th_percentile |
Split | Hadoop:service=HBase,name=Master,sub=FileSystem | HlogSplitTime_min | | hadoop.hbase.master.filesystem.hlogsplittime_min |
 | | HlogSplitTime_max | | hadoop.hbase.master.filesystem.hlogsplittime_max |
 | | HlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.hlogsplittime_75th_percentile hadoop.hbase.master.filesystem.hlogsplittime_95th_percentile hadoop.hbase.master.filesystem.hlogsplittime_99th_percentile |
 | | HlogSplitSize_min/max | | hadoop.hbase.master.filesystem.hlogsplitsize_min hadoop.hbase.master.filesystem.hlogsplitsize_max |
 | | MetaHlogSplitTime_min/max | | hadoop.hbase.master.filesystem.metahlogsplittime_min hadoop.hbase.master.filesystem.metahlogsplittime_max |
 | | MetaHlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.metahlogsplittime_75th_percentile hadoop.hbase.master.filesystem.metahlogsplittime_95th_percentile hadoop.hbase.master.filesystem.metahlogsplittime_99th_percentile |
 | | MetaHlogSplitSize_min/max | | hadoop.hbase.master.filesystem.metahlogsplitsize_min hadoop.hbase.master.filesystem.metahlogsplitsize_max |
RegionServer Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
Java Direct Memory | java.nio:type=BufferPool,name=direct | MemoryUsed | Java Direct Memory Used | hadoop.bufferpool.direct.memoryused |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcCount | | hadoop.hbase.jvm.gccount |
 | | GcTimeMillis | | hadoop.hbase.jvm.gctimemillis |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | queueSize | | hadoop.hbase.regionserver.ipc.queuesize |
 | | NumCallsInGeneralQueue | | hadoop.hbase.regionserver.ipc.numcallsingeneralqueue |
 | | NumActiveHandler | | hadoop.hbase.regionserver.ipc.numactivehandler |
 | | QueueCallTime_99th_percentile | IPC Queue Time (99th) | hadoop.hbase.regionserver.ipc.queuecalltime_99th_percentile |
 | | ProcessCallTime_99th_percentile | IPC Process Time (99th) | hadoop.hbase.regionserver.ipc.processcalltime_99th_percentile |
 | | QueueCallTime_num_ops | | hadoop.hbase.regionserver.ipc.queuecalltime_num_ops |
 | | ProcessCallTime_num_ops | | hadoop.hbase.regionserver.ipc.processcalltime_num_ops |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | regionCount | | hadoop.hbase.regionserver.server.regioncount |
 | | storeCount | | hadoop.hbase.regionserver.server.storecount |
 | | memStoreSize | | hadoop.hbase.regionserver.server.memstoresize |
 | | storeFileSize | | hadoop.hbase.regionserver.server.storefilesize |
 | | totalRequestCount | | hadoop.hbase.regionserver.server.totalrequestcount |
 | | ReadRequestCount | | hadoop.hbase.regionserver.server.readrequestcount |
 | | WriteRequestCount | | hadoop.hbase.regionserver.server.writerequestcount |
 | | splitQueueLength | | hadoop.hbase.regionserver.server.splitqueuelength |
 | | compactionQueueLength | | hadoop.hbase.regionserver.server.compactionqueuelength |
 | | flushQueueLength | | hadoop.hbase.regionserver.server.flushqueuelength |
 | | blockCacheSize | | hadoop.hbase.regionserver.server.blockcachesize |
 | | blockCacheHitCount | | hadoop.hbase.regionserver.server.blockcachehitcount |
 | | blockCacheCountHitPercent | | hadoop.hbase.regionserver.server.blockcachecounthitpercent |
Data Retention
Metrics should be collected at an interval of at most 1 minute (Hadoop emits metrics at a 10-second interval). For data older than 30 days, aggregate to 5-minute resolution and retain it for half a year.
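The retention rollup described above can be sketched as a simple bucketed average: collapse 1-minute samples into 5-minute (bucket start, average) points. The bucket size and names below are illustrative, and a real rollup would also carry min/max per bucket.

```python
# Hedged sketch of the 5-minute retention rollup: group (timestamp, value)
# samples by 300-second bucket and keep the per-bucket average.
# Timestamps are assumed to be epoch seconds; names are illustrative.
def downsample(points, bucket_secs=300):
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_secs, []).append(value)
    return [(start, sum(vs) / len(vs))
            for start, vs in sorted(buckets.items())]

minute_points = [(0, 1.0), (60, 3.0), (300, 10.0), (360, 20.0)]
print(downsample(minute_points))  # [(0, 2.0), (300, 15.0)]
```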
Monitoring Dashboard & Alerting
Metrics Dashboard Overview
Dashboard Chart
Generally, we will follow the UI layout of Ambari; within that, the service health check application will also be included in the service status and summary information.
Metrics Query Pattern:
- Flexibly change the time range from 1 hour to