Introduction
Hadoop slave nodes regularly emit metrics that reflect service health. The service team watches these metrics to understand whether the service is in a healthy state, and traces back through the history to understand past behavior. Some typical use cases are:
- Early warning for an unhealthy HBase RegionServer: heap usage, RPC handling metrics, region aliveness, etc.
- Troubleshooting through the metrics history dashboard
- When NameNode RPC traffic from clients is very high, identify the source clients and grep the users from the audit log as well
- Users can flexibly set a threshold for each monitored metric and get alert notifications without rewriting or creating policies from scratch
- Notification about HDFS clients that generate abnormal RPC traffic
- Extract the list of DataNodes/RegionServers with abnormal RPC processing time
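The flexible-threshold alerting described above can be sketched as a small policy evaluator. This is only a minimal illustration: the policy names, metric names, operators, and threshold values below are assumptions for the example, not part of the design.

```python
import operator

# Hypothetical sketch of flexible threshold alerting: each policy is
# (policy name, metric name, comparison operator, threshold). All names
# and thresholds here are illustrative assumptions.
def evaluate_policies(sample, policies):
    """Return the names of policies violated by a metric sample.

    sample   -- dict: metric name -> latest observed value
    policies -- list of (policy_name, metric_name, op, threshold)
    """
    ops = {">": operator.gt, "<": operator.lt,
           ">=": operator.ge, "<=": operator.le}
    alerts = []
    for name, metric, op, threshold in policies:
        value = sample.get(metric)
        if value is not None and ops[op](value, threshold):
            alerts.append(name)
    return alerts

policies = [
    ("rs_heap_high", "hadoop.memory.heapmemoryusage.used", ">", 8 * 1024 ** 3),
    ("nn_callqueue_backlog", "hadoop.namenode.rpc.callqueuelength", ">", 100),
]
sample = {
    "hadoop.memory.heapmemoryusage.used": 9 * 1024 ** 3,
    "hadoop.namenode.rpc.callqueuelength": 5,
}
print(evaluate_policies(sample, policies))  # ['rs_heap_high']
```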
High-Level Monitoring/Alert Flow
Metrics Collector
The operations team constantly struggles with metrics monitoring for the HBase cluster, e.g. RegionServer heap usage, RegionServer RPC handling metrics, and region aliveness on each RegionServer, so we need a solution that collects all of those metrics. One option is to deploy a standalone JMX client on each node; another is to add a JMX sink to Hadoop’s metrics system.
A JMX client is expected to be deployed on each RegionServer slave node, or on a small collection of nodes that retrieve the JMX information remotely. Since we have thousands of slave nodes, it is not feasible to run all of those clients from a single server, which would put a heavy load on that machine.
A JMX sink would be developed against Hadoop’s metrics sink interface and plugged into the Hadoop runtime environment.
We prefer to write the data into Kafka as a “distributed caching layer” to decouple the JMX client from the back-end storage system, and to shield the collection path from the storage latency of the JMX data.
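Either collection path ultimately produces (metric name, value) records to publish to Kafka. As a hedged sketch, the snippet below flattens the JSON that Hadoop daemons expose on their standard `/jmx` servlet into such records; the `wanted` mapping and sample payload are illustrative, and the actual publish to a Kafka topic is omitted.

```python
import json

# Hedged sketch: flatten a Hadoop /jmx JSON payload ({"beans": [...]})
# into (metric name, value) pairs. The `wanted` mapping from
# (bean name, attribute) to metric name mirrors the tables in this
# document; the sample payload below is illustrative.
def jmx_to_metrics(jmx_json, wanted):
    pairs = []
    for bean in json.loads(jmx_json).get("beans", []):
        bean_name = bean.get("name")
        for attr, value in bean.items():
            metric = wanted.get((bean_name, attr))
            if metric is not None:
                pairs.append((metric, value))
    return pairs

payload = json.dumps({"beans": [{
    "name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
    "CallQueueLength": 7,
    "NumOpenConnections": 52,
}]})
wanted = {
    ("Hadoop:service=NameNode,name=RpcActivityForPort8020", "CallQueueLength"):
        "hadoop.namenode.rpc.callqueuelength",
}
print(jmx_to_metrics(payload, wanted))
# [('hadoop.namenode.rpc.callqueuelength', 7)]
```

Each resulting pair, stamped with host and timestamp, would then be sent to a Kafka topic by whichever producer client the collector uses.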
Agent
If we build a standalone JMX client to collect JMX metrics, we had better have an agent that monitors whether each JMX client is working well; otherwise JMX data may be lost when some clients stop working.
If we use a JMX sink to collect the data, no agent is required: the data collection lifecycle is the same as the daemon lifecycle.
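One minimal form of such an agent is a heartbeat check: each JMX client periodically reports a timestamp, and the agent flags clients that have gone silent so stalled collectors can be restarted before data is lost. A sketch, with hypothetical node names and staleness limit:

```python
import time

# Hedged agent sketch: `heartbeats` maps each JMX client node to the last
# heartbeat timestamp the agent saw from it. Node names and the 120 s
# staleness limit are illustrative assumptions.
def stale_clients(heartbeats, now, max_age_secs=120):
    """Return (sorted) node names whose last heartbeat is too old."""
    return sorted(node for node, ts in heartbeats.items()
                  if now - ts > max_age_secs)

now = time.time()
heartbeats = {"rs-node-01": now - 30, "rs-node-02": now - 600}
print(stale_clients(heartbeats, now))  # ['rs-node-02']
```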
Metrics Storage
We need a scalable backend for large-scale metrics storage, together with a query engine for time-series data that supports min/max/average aggregation semantics.
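The min/max/average aggregation semantics can be illustrated with a toy reducer over (timestamp, value) samples. The function name and signature below are assumptions for illustration, not a concrete storage API.

```python
# Hedged sketch of min/max/average aggregation over a time series of
# (timestamp, value) samples. A real query engine would apply this per
# time bucket and per metric; names here are illustrative.
def aggregate(points, agg):
    values = [v for _, v in points]
    if not values:
        return None
    if agg == "min":
        return min(values)
    if agg == "max":
        return max(values)
    if agg == "avg":
        return sum(values) / len(values)
    raise ValueError("unknown aggregator: %s" % agg)

series = [(0, 10.0), (60, 30.0), (120, 20.0)]
print(aggregate(series, "max"), aggregate(series, "avg"))  # 30.0 20.0
```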
NameNode Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
 | | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
Name System State | Hadoop:service=NameNode,name=FSNamesystem | CapacityTotal | | hadoop.namenode.fsnamesystemstate.capacitytotal |
 | | CapacityUsed | | hadoop.namenode.dfs.capacityused |
 | | CapacityRemaining | | hadoop.namenode.dfs.capacityremaining |
 | | BlocksTotal | | hadoop.namenode.dfs.blockstotal |
 | | FilesTotal | | hadoop.namenode.dfs.filestotal |
 | | UnderReplicatedBlocks | | hadoop.namenode.dfs.underreplicatedblocks |
 | | MissingBlocks | | hadoop.namenode.dfs.missingblocks |
 | | CorruptBlocks | | hadoop.namenode.dfs.corruptblocks |
 | | LastCheckpointTime | | hadoop.namenode.dfs.lastcheckpointtime |
 | | TransactionsSinceLastCheckpoint | | hadoop.namenode.dfs.transactionssincelastcheckpoint |
 | | LastWrittenTransactionId | | hadoop.namenode.dfs.lastwrittentransactionid |
 | | SnapshottableDirectories | | hadoop.namenode.dfs.snapshottabledirectories |
 | | Snapshots | | hadoop.namenode.dfs.snapshots |
RPC | Hadoop:service=NameNode,name=RpcActivityForPort8020 | RpcQueueTimeAvgTime | | hadoop.namenode.rpc.rpcqueuetimeavgtime |
 | | RpcProcessingTimeAvgTime | | hadoop.namenode.rpc.rpcprocessingtimeavgtime |
 | | NumOpenConnections | | hadoop.namenode.rpc.numopenconnections |
 | | CallQueueLength | | hadoop.namenode.rpc.callqueuelength |
DataNode Metrics
Bean Category | Bean Name | Property | Metric Name |
---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | hadoop.memory.heapmemoryusage.used |
General | Hadoop:service=DataNode,name=FSDatasetState-bb8ac17a-d75b-4aab-9f9e-0ec1ef2d58f4 | Capacity | hadoop.datanode.fsdatasetstate.capacity |
 | | DfsUsed | hadoop.datanode.fsdatasetstate.dfsused |
 | Hadoop:service=DataNode,name=DataNodeInfo | XceiverCount | hadoop.datanode.datanodeinfo.xceivercount |
RPC | Hadoop:service=DataNode,name=RpcActivityForPort50020 | RpcQueueTimeAvgTime | hadoop.datanode.rpc.rpcqueuetimeavgtime |
 | | RpcProcessingTimeAvgTime | hadoop.datanode.rpc.rpcprocessingtimeavgtime |
 | | NumOpenConnections | hadoop.datanode.rpc.numopenconnections |
 | | CallQueueLength | hadoop.datanode.rpc.callqueuelength |
HBase Master Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
General | Hadoop:service=HBase,name=Master,sub=Server | averageLoad | | hadoop.hbase.master.server.averageload |
 | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCount | Counts the number of regions in transition | hadoop.hbase.master.assignmentmanger.ritcount |
 | Hadoop:service=HBase,name=Master,sub=AssignmentManger | ritCountOverThreshold | Counts the number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time | hadoop.hbase.master.assignmentmanger.ritcountoverthreshold |
Region Assignment | Hadoop:service=HBase,name=Master,sub=AssignmentManger | Assign_num_ops | | hadoop.hbase.master.assignmentmanger.assign_num_ops |
 | | Assign_min | | hadoop.hbase.master.assignmentmanger.assign_min |
 | | Assign_max | | hadoop.hbase.master.assignmentmanger.assign_max |
 | | Assign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.assign_75th_percentile hadoop.hbase.master.assignmentmanger.assign_95th_percentile hadoop.hbase.master.assignmentmanger.assign_99th_percentile |
 | | BulkAssign_num_ops | | hadoop.hbase.master.assignmentmanger.bulkassign_num_ops |
 | | BulkAssign_min | | hadoop.hbase.master.assignmentmanger.bulkassign_min |
 | | BulkAssign_max | | hadoop.hbase.master.assignmentmanger.bulkassign_max |
 | | BulkAssign_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.assignmentmanger.bulkassign_75th_percentile hadoop.hbase.master.assignmentmanger.bulkassign_95th_percentile hadoop.hbase.master.assignmentmanger.bulkassign_99th_percentile |
Balancer | Hadoop:service=HBase,name=Master,sub=Balancer | BalancerCluster_num_ops | | hadoop.hbase.master.balancer.balancercluster_num_ops |
 | | BalancerCluster_min | | hadoop.hbase.master.balancer.balancercluster_min |
 | | BalancerCluster_max | | hadoop.hbase.master.balancer.balancercluster_max |
 | | BalancerCluster_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.balancer.balancercluster_75th_percentile hadoop.hbase.master.balancer.balancercluster_95th_percentile hadoop.hbase.master.balancer.balancercluster_99th_percentile |
Split | Hadoop:service=HBase,name=Master,sub=FileSystem | HlogSplitTime_min | | hadoop.hbase.master.filesystem.hlogsplittime_min |
 | | HlogSplitTime_max | | hadoop.hbase.master.filesystem.hlogsplittime_max |
 | | HlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.hlogsplittime_75th_percentile hadoop.hbase.master.filesystem.hlogsplittime_95th_percentile hadoop.hbase.master.filesystem.hlogsplittime_99th_percentile |
 | | HlogSplitSize_min/max | | hadoop.hbase.master.filesystem.hlogsplitsize_min hadoop.hbase.master.filesystem.hlogsplitsize_max |
 | | MetaHlogSplitTime_min/max | | hadoop.hbase.master.filesystem.metahlogsplittime_min hadoop.hbase.master.filesystem.metahlogsplittime_max |
 | | MetaHlogSplitTime_75th/95th/99th/99.9th_percentile | | hadoop.hbase.master.filesystem.metahlogsplittime_75th_percentile hadoop.hbase.master.filesystem.metahlogsplittime_95th_percentile hadoop.hbase.master.filesystem.metahlogsplittime_99th_percentile |
 | | MetaHlogSplitSize_min/max | | hadoop.hbase.master.filesystem.metahlogsplitsize_min hadoop.hbase.master.filesystem.metahlogsplitsize_max |
RegionServer Metrics
Bean Category | Bean Name | Property | Description | Metric Name |
---|---|---|---|---|
Memory | java.lang:type=Memory | NonHeapMemoryUsage - used | | hadoop.memory.nonheapmemoryusage.used |
 | | HeapMemoryUsage - used | | hadoop.memory.heapmemoryusage.used |
Java Direct Memory | java.nio:type=BufferPool,name=direct | MemoryUsed | Java Direct Memory Used | hadoop.bufferpool.direct.memoryused |
JVM Metrics | Hadoop:service=HBase,name=JvmMetrics | GcCount | | hadoop.hbase.jvm.gccount |
 | | GcTimeMillis | | hadoop.hbase.jvm.gctimemillis |
IPC | Hadoop:service=HBase,name=RegionServer,sub=IPC | queueSize | | hadoop.hbase.regionserver.ipc.queuesize |
 | | NumCallsInGeneralQueue | | hadoop.hbase.regionserver.ipc.numcallsingeneralqueue |
 | | NumActiveHandler | | hadoop.hbase.regionserver.ipc.numactivehandler |
 | | QueueCallTime_99th_percentile | IPC Queue Time (99th) | hadoop.hbase.regionserver.ipc.queuecalltime_99th_percentile |
 | | ProcessCallTime_99th_percentile | IPC Process Time (99th) | hadoop.hbase.regionserver.ipc.processcalltime_99th_percentile |
 | | QueueCallTime_num_ops | | hadoop.hbase.regionserver.ipc.queuecalltime_num_ops |
 | | ProcessCallTime_num_ops | | hadoop.hbase.regionserver.ipc.processcalltime_num_ops |
Regions | Hadoop:service=HBase,name=RegionServer,sub=Server | regionCount | | hadoop.hbase.regionserver.server.regioncount |
 | | storeCount | | hadoop.hbase.regionserver.server.storecount |
 | | memStoreSize | | hadoop.hbase.regionserver.server.memstoresize |
 | | storeFileSize | | hadoop.hbase.regionserver.server.storefilesize |
 | | totalRequestCount | | hadoop.hbase.regionserver.server.totalrequestcount |
 | | ReadRequestCount | | hadoop.hbase.regionserver.server.readrequestcount |
 | | WriteRequestCount | | hadoop.hbase.regionserver.server.writerequestcount |
 | | splitQueueLength | | hadoop.hbase.regionserver.server.splitqueuelength |
 | | compactionQueueLength | | hadoop.hbase.regionserver.server.compactionqueuelength |
 | | flushQueueLength | | hadoop.hbase.regionserver.server.flushqueuelength |
 | | blockCacheSize | | hadoop.hbase.regionserver.server.blockcachesize |
 | | blockCacheHitCount | | hadoop.hbase.regionserver.server.blockcachehitcount |
 | | blockCacheCountHitPercent | | hadoop.hbase.regionserver.server.blockcachecounthitpercent |
Data Retention
Metrics should be collected at an interval of at most 1 minute (Hadoop emits metrics at a 10-second interval). For data older than 30 days, aggregate to 5-minute resolution and retain it for half a year.
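The retention rollup described above can be sketched as a simple bucketed average: collapse 1-minute samples into 5-minute (bucket start, average) points. The bucket size and names below are illustrative, and a real rollup would also carry min/max per bucket.

```python
# Hedged sketch of the 5-minute retention rollup: group (timestamp, value)
# samples by 300-second bucket and keep the per-bucket average.
# Timestamps are assumed to be epoch seconds; names are illustrative.
def downsample(points, bucket_secs=300):
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_secs, []).append(value)
    return [(start, sum(vs) / len(vs))
            for start, vs in sorted(buckets.items())]

minute_points = [(0, 1.0), (60, 3.0), (300, 10.0), (360, 20.0)]
print(downsample(minute_points))  # [(0, 2.0), (300, 15.0)]
```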
Monitoring Dashboard & Alerting
Metrics Dashboard Overview
Dashboard Chart
Generally, we will follow the UI layout of Ambari; within that, the service health check application will also be included in the service status and summary information.
Metrics Query Pattern:
- Flexibly change the time range from 1 hour to