Introduction

Hadoop slave nodes regularly emit metrics that reflect service health. The service team reviews these metrics to determine whether the service is healthy and to trace back its historical behavior. Some typical use cases are:

  • Early warning for an unhealthy HBase RegionServer, based on heap usage, RPC handling metrics, region aliveness, etc.

  • Troubleshooting through the metrics history dashboard

  • When NameNode RPC traffic from clients is very high, identify the source client and grep the corresponding user from the audit log

  • Users can flexibly set a threshold for each monitored metric and receive alert notifications without rewriting policies or creating them from scratch

  • Notification about HDFS clients that generate abnormal RPC traffic

  • Extract the list of DataNodes/RegionServers (DN/RS) with abnormal RPC processing time
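
As an illustration of the per-metric threshold and "abnormal RPC processing time" use cases, the sketch below evaluates user-defined thresholds against the latest metric samples. The `Threshold` class, the `evaluate` function, the host names, and the sample values are hypothetical and chosen only for this example; the metric name follows the tables later on this page.

```python
from dataclasses import dataclass
import operator

# Comparison operators a user may pick for a threshold (an assumed set).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Threshold:
    """Hypothetical user-defined alert rule for one metric."""
    metric: str
    op: str
    limit: float


def evaluate(thresholds, samples):
    """Return (host, metric, value) for every sample that violates a rule.

    `samples` is an iterable of (host, metric, value) tuples.
    """
    by_metric = {}
    for t in thresholds:
        by_metric.setdefault(t.metric, []).append(t)
    alerts = []
    for host, metric, value in samples:
        for t in by_metric.get(metric, []):
            if _OPS[t.op](value, t.limit):
                alerts.append((host, metric, value))
                break  # one alert per sample is enough
    return alerts


rules = [Threshold("hadoop.datanode.rpc.rpcprocessingtimeavgtime", ">", 50.0)]
samples = [
    ("dn-01", "hadoop.datanode.rpc.rpcprocessingtimeavgtime", 12.5),
    ("dn-02", "hadoop.datanode.rpc.rpcprocessingtimeavgtime", 87.0),
]
abnormal_hosts = [host for host, _, _ in evaluate(rules, samples)]
print(abnormal_hosts)
```

Keeping the rules as plain data means a threshold can be changed without touching the evaluation logic, which is the flexibility the use case asks for.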

High-Level Monitoring/Alert Flow

 

Metrics Collector

The operations team struggles with metrics monitoring for HBase clusters, e.g. RegionServer heap usage, RPC handling metrics, and region aliveness on each RegionServer. We therefore need a way to collect all of these metrics. One option is to deploy a standalone JMX client on each node; another is to add a JMX sink to Hadoop’s metrics system.

  • A JMX client would be deployed on each RegionServer slave node, or on a group of nodes, to retrieve the JMX information. Since we have thousands of slave nodes, the clients should not all run on a single server, which would put a heavy load on that machine.

  • A JMX sink would be developed against Hadoop’s metrics interface and plugged into the Hadoop runtime environment.

We prefer to write the data into Kafka, which acts as a “distributed caching layer” that decouples the JMX client from the back-end storage system and keeps storage write latency from delaying JMX data collection.
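
A minimal sketch of the collector side, assuming the JMX data has already been fetched as the JSON document that Hadoop daemons serve from their `/jmx` HTTP endpoint (a top-level `beans` list with one object per MBean). The record layout, the `to_records` and `publish` helpers, and the `send` callback are assumptions for illustration; in a real deployment `send` would wrap a Kafka producer.

```python
import json


def to_records(jmx_doc, host, timestamp):
    """Flatten a /jmx JSON document into one record per numeric attribute.

    Composite attributes such as HeapMemoryUsage (a dict with `used`,
    `max`, ...) are expanded to `<Attribute>.<key>`.
    """
    records = []
    for bean in jmx_doc.get("beans", []):
        bean_name = bean.get("name", "")
        for attr, value in bean.items():
            if attr in ("name", "modelerType"):
                continue
            items = value.items() if isinstance(value, dict) else [(None, value)]
            for key, v in items:
                # bool is a subclass of int; skip it along with strings, lists, etc.
                if isinstance(v, bool) or not isinstance(v, (int, float)):
                    continue
                records.append({
                    "host": host,
                    "timestamp": timestamp,
                    "bean": bean_name,
                    "attribute": attr if key is None else f"{attr}.{key}",
                    "value": v,
                })
    return records


def publish(records, send):
    """Serialize each record as JSON bytes and hand it to `send`."""
    for record in records:
        send(json.dumps(record, sort_keys=True).encode("utf-8"))
    return len(records)


sample = {"beans": [
    {"name": "java.lang:type=Memory", "modelerType": "sun.management.MemoryImpl",
     "HeapMemoryUsage": {"committed": 200, "init": 100, "max": 400, "used": 150}},
    {"name": "Hadoop:service=NameNode,name=FSNamesystem", "CapacityTotal": 1000,
     "tag.HAState": "active"},
]}
outbox = []  # stands in for the Kafka topic in this sketch
sent = publish(to_records(sample, "nn-01", 1700000000), outbox.append)
```

Because `publish` only depends on a callable, the collector never talks to the storage backend directly, which mirrors the decoupling described above.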

Agent

If we build a JMX client to collect JMX metrics, we should also have an agent that monitors whether each JMX client is working; otherwise JMX data may be lost when some clients stop working.

If we use a JMX sink to collect data, no agent is required, because the data collection lifecycle is the same as the daemon lifecycle.
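
One way such an agent could detect a stopped JMX client is to track when each client last reported and flag those that have been silent too long. The sketch below assumes a hypothetical `last_seen` map of host to last report time (in epoch seconds) and an assumed silence limit of 180 seconds; neither is specified by this design.

```python
def stale_clients(last_seen, now, max_silence_seconds=180):
    """Return the hosts whose JMX client has not reported within the allowed window."""
    return sorted(
        host for host, ts in last_seen.items()
        if now - ts > max_silence_seconds
    )


# Hypothetical last-report times for two RegionServer hosts.
last_seen = {"rs-01": 1000, "rs-02": 700}
silent = stale_clients(last_seen, now=1030)
print(silent)
```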

Metrics Storage

A scalable backend for storing large-scale metrics, together with a query engine for time-series data that supports min/max/average aggregation semantics.


NameNode Metrics

| Bean Category | Bean Name | Property | Description | Metric Name |
| Memory | java.lang:type=Memory | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
| | | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
| Name System State | Hadoop:service=NameNode,name=FSNamesystem | CapacityTotal | | hadoop.namenode.fsnamesystemstate.capacitytotal |
| | | CapacityUsed | | hadoop.namenode.dfs.capacityused |
| | | CapacityRemaining | | hadoop.namenode.dfs.capacityremaining |
| | | BlocksTotal | | hadoop.namenode.dfs.blockstotal |
| | | FilesTotal | | hadoop.namenode.dfs.filestotal |
| | | UnderReplicatedBlocks | | hadoop.namenode.dfs.underreplicatedblocks |
| | | MissingBlocks | | hadoop.namenode.dfs.missingblocks |
| | | CorruptBlocks | | hadoop.namenode.dfs.corruptblocks |
| | | LastCheckpointTime | | hadoop.namenode.dfs.lastcheckpointtime |
| | | TransactionsSinceLastCheckpoint | | hadoop.namenode.dfs.transactionssincelastcheckpoint |
| | | LastWrittenTransactionId | | hadoop.namenode.dfs.lastwrittentransactionid |
| | | SnapshottableDirectories | | hadoop.namenode.dfs.snapshottabledirectories |
| | | Snapshots | | hadoop.namenode.dfs.snapshots |
| RPC | Hadoop:service=NameNode,name=RpcActivityForPort8020 | RpcQueueTimeAvgTime | | hadoop.namenode.rpc.rpcqueuetimeavgtime |
| | | RpcProcessingTimeAvgTime | | hadoop.namenode.rpc.rpcprocessingtimeavgtime |
| | | NumOpenConnections | | hadoop.namenode.rpc.numopenconnections |
| | | CallQueueLength | | hadoop.namenode.rpc.callqueuelength |

 

DataNode Metrics

| Bean Category | Bean Name | Property | Description | Metric Name |
| Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
| | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
| General | Hadoop:service=DataNode,name=FSDatasetState-bb8ac17a-d75b-4aab-9f9e-0ec1ef2d58f4 | Capacity | | hadoop.datanode.fsdatasetstate.capacity |
| | | DfsUsed | | hadoop.datanode.fsdatasetstate.dfsused |
| | Hadoop:service=DataNode,name=DataNodeInfo | XceiverCount | | hadoop.datanode.datanodeinfo.xceivercount |
| RPC | Hadoop:service=DataNode,name=RpcActivityForPort50020 | RpcQueueTimeAvgTime | | hadoop.datanode.rpc.rpcqueuetimeavgtime |
| | | RpcProcessingTimeAvgTime | | hadoop.datanode.rpc.rpcprocessingtimeavgtime |
| | | NumOpenConnections | | hadoop.datanode.rpc.numopenconnections |
| | | CallQueueLength | | hadoop.datanode.rpc.callqueuelength |


HBase Master Metrics

| Bean Category | Bean Name | Property | Description | Metric Name |
| Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
| | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
| General | Hadoop:service=HBase,name=Master,sub=Server | averageLoad | | hadoop.hbase.master.server.averageload |
| | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCount | Counts the number of regions in transition | hadoop.hbase.master.assignmentmanger.ritcount |
| | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCountOverThreshold | Counts the number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time | hadoop.hbase.master.assignmentmanger.ritcountoverthreshold |
| Region Assignment | Hadoop:service=HBase,name=Master,sub=AssignmentManger | Assign_num_ops | | hadoop.hbase.master.assignmentmanger.assign_num_ops |
| | | Assign_min | | hadoop.hbase.master.assignmentmanger.assign_min |
| | | Assign_max | | hadoop.hbase.master.assignmentmanger.assign_max |
| | | Assign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.assign_75th_percentile, hadoop.hbase.master.assignmentmanger.assign_95th_percentile, hadoop.hbase.master.assignmentmanger.assign_99th_percentile |
| | | BulkAssign_num_ops | | hadoop.hbase.master.assignmentmanger.bulkassign_num_ops |
| | | BulkAssign_min | | hadoop.hbase.master.assignmentmanger.bulkassign_min |
| | | BulkAssign_max | | hadoop.hbase.master.assignmentmanger.bulkassign_max |
| | | BulkAssign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.bulkassign_75th_percentile, hadoop.hbase.master.assignmentmanger.bulkassign_95th_percentile, hadoop.hbase.master.assignmentmanger.bulkassign_99th_percentile |
| Balancer | Hadoop:service=HBase,name=Master,sub=Balancer | BalancerCluster_num_ops | | hadoop.hbase.master.balancer.balancercluster_num_ops |
| | | BalancerCluster_min | | hadoop.hbase.master.balancer.balancercluster_min |
| | | BalancerCluster_max | | hadoop.hbase.master.balancer.balancercluster_max |
| | | BalancerCluster_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.balancer.balancercluster_75th_percentile, hadoop.hbase.master.balancer.balancercluster_95th_percentile, hadoop.hbase.master.balancer.balancercluster_99th_percentile |
| Split | Hadoop:service=HBase,name=Master,sub=FileSystem | HlogSplitTime_min | | hadoop.hbase.master.filesystem.hlogsplittime_min |
| | | HlogSplitTime_max | | hadoop.hbase.master.filesystem.hlogsplittime_max |
| | | HlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.hlogsplittime_75th_percentile, hadoop.hbase.master.filesystem.hlogsplittime_95th_percentile, hadoop.hbase.master.filesystem.hlogsplittime_99th_percentile |
| | | HlogSplitSize_min/max | | hadoop.hbase.master.filesystem.hlogsplitsize_min, hadoop.hbase.master.filesystem.hlogsplitsize_max |
| | | MetaHlogSplitTime_min/max | | hadoop.hbase.master.filesystem.metahlogsplittime_min, hadoop.hbase.master.filesystem.metahlogsplittime_max |
| | | MetaHlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.metahlogsplittime_75th_percentile, hadoop.hbase.master.filesystem.metahlogsplittime_95th_percentile, hadoop.hbase.master.filesystem.metahlogsplittime_99th_percentile |
| | | MetaHlogSplitSize_min/max | | hadoop.hbase.master.filesystem.metahlogsplitsize_min, hadoop.hbase.master.filesystem.metahlogsplitsize_max |

 

RegionServer Metrics

 

| Bean Category | Bean Name | Property | Description | Metric Name |
| Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
| | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
| Java Direct Memory | java.nio:type=BufferPool,name=direct | MemoryUsed | Java Direct Memory Used | hadoop.bufferpool.direct.memoryused |
| JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcCount | | hadoop.hbase.jvm.gccount |
| | | GcTimeMillis | | hadoop.hbase.jvm.gctimemillis |
| IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | queueSize | | hadoop.hbase.regionserver.ipc.queuesize |
| | | NumCallsInGeneralQueue | | hadoop.hbase.regionserver.ipc.numcallsingeneralqueue |
| | | NumActiveHandler | | hadoop.hbase.regionserver.ipc.numactivehandler |
| | | QueueCallTime_99th_percentile | IPC Queue Time (99th) | hadoop.hbase.regionserver.ipc.queuecalltime_99th_percentile |
| | | ProcessCallTime_99th_percentile | IPC Process Time (99th) | hadoop.hbase.regionserver.ipc.processcalltime_99th_percentile |
| | | QueueCallTime_num_ops | | hadoop.hbase.regionserver.ipc.queuecalltime_num_ops |
| | | ProcessCallTime_num_ops | | hadoop.hbase.regionserver.ipc.processcalltime_num_ops |
| Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | regionCount | | hadoop.hbase.regionserver.server.regioncount |
| | | storeCount | | hadoop.hbase.regionserver.server.storecount |
| | | memStoreSize | | hadoop.hbase.regionserver.server.memstoresize |
| | | storeFileSize | | hadoop.hbase.regionserver.server.storefilesize |
| | | totalRequestCount | | hadoop.hbase.regionserver.server.totalrequestcount |
| | | ReadRequestCount | | hadoop.hbase.regionserver.server.readrequestcount |
| | | WriteRequestCount | | hadoop.hbase.regionserver.server.writerequestcount |
| | | splitQueueLength | | hadoop.hbase.regionserver.server.splitqueuelength |
| | | compactionQueueLength | | hadoop.hbase.regionserver.server.compactionqueuelength |
| | | flushQueueLength | | hadoop.hbase.regionserver.server.flushqueuelength |
| | | blockCacheSize | | hadoop.hbase.regionserver.server.blockcachesize |
| | | blockCacheHitCount | | hadoop.hbase.regionserver.server.blockcachehitcount |
| | | blockCacheCountHitPercent | | hadoop.hbase.regionserver.server.blockcachecounthitpercent |
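
Many metric names in the tables above appear to be derived mechanically from the bean name and property: `hadoop.<service>.<name>[.<sub>].<property>`, all lowercase. The sketch below implements that inferred rule; it is an assumption read off the table rows, not a documented convention, and the `metric_name` function is hypothetical. Rows that do not follow the rule (for example the `JvmMetrics`, `RpcActivityForPort…`, and `FSNamesystem` beans, or the `java.lang` and `java.nio` beans) would need an explicit mapping instead.

```python
def metric_name(bean_name, attribute):
    """Derive a dotted, lowercase metric name from a Hadoop MBean name and attribute.

    Inferred rule (an assumption): hadoop.<service>.<name>[.<sub>].<attribute>
    """
    domain, _, props = bean_name.partition(":")
    if domain != "Hadoop":
        raise ValueError(f"not a Hadoop bean: {bean_name}")
    # "service=HBase,name=Master,sub=Server" -> {"service": "HBase", ...}
    keys = dict(p.split("=", 1) for p in props.split(","))
    parts = ["hadoop", keys["service"], keys["name"]]
    if "sub" in keys:
        parts.append(keys["sub"])
    parts.append(attribute)
    return ".".join(parts).lower()


print(metric_name("Hadoop:service=HBase,name=RegionServer,sub=IPC", "queueSize"))
print(metric_name("Hadoop:service=HBase,name=Master,sub=Server", "averageLoad"))
print(metric_name("Hadoop:service=DataNode,name=DataNodeInfo", "XceiverCount"))
```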

 

Data Retention

Metrics should be collected at least once per minute (Hadoop emits metrics at a 10-second interval). Data older than 30 days is aggregated to 5-minute granularity and kept for half a year.
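
The retention policy can be sketched as a roll-up step plus an age check. The code below is only an illustration: the function names are hypothetical, "half a year" is approximated as 182 days, and the half-year limit is assumed to apply to the total age of a data point. The roll-up keeps min/max/average per bucket, matching the aggregation semantics expected of the storage layer.

```python
from collections import defaultdict

BUCKET_SECONDS = 5 * 60      # 5-minute roll-up level
RAW_RETENTION_DAYS = 30      # raw samples are kept for 30 days
ROLLUP_RETENTION_DAYS = 182  # "half a year", approximated (assumption)
DAY = 24 * 60 * 60


def rollup(points, bucket_seconds=BUCKET_SECONDS):
    """Aggregate (timestamp, value) points into fixed buckets with min/max/avg."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs), "count": len(vs)}
        for start, vs in sorted(buckets.items())
    }


def retention_tier(ts, now):
    """Classify a timestamp as 'raw', 'rollup', or 'expired' under the policy."""
    age_days = (now - ts) / DAY
    if age_days <= RAW_RETENTION_DAYS:
        return "raw"
    if age_days <= ROLLUP_RETENTION_DAYS:
        return "rollup"
    return "expired"


# Five 1-minute samples fall into the first bucket, one into the next.
points = [(0, 10.0), (60, 20.0), (120, 30.0), (180, 40.0), (240, 50.0), (300, 60.0)]
summary = rollup(points)
```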

Monitoring Dashboard & Alerting

Metrics Dashboard Overview

 

 

Dashboard Chart

Generally, we will follow the Ambari UI layout. Within it, the service health check application will also be included in the service status and summary information.

Metrics Query Pattern:

  1. Flexibly change the time range from 1 hour to 