Cluster events consist of the event list and event policies. The table below lists the supported cluster events and their default configurations.
Type | Event Name | Description | Recommended Measure | Default Value | Severity | Disableable | Enabled by Default |
---|---|---|---|---|---|---|---|
Node | The CPU utilization exceeds the threshold continuously | The server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| | The average CPU utilization exceeds the threshold | The average server CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the node capacity or upgrade the node | m=85, t=1,800 | Moderate | Yes | No |
| | The average CPU iowait utilization exceeds the threshold | The average CPU iowait utilization of the server in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=60, t=1,800 | Severe | Yes | Yes |
| | The 1-second CPU load exceeds the threshold continuously | The 1-second CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Moderate | Yes | No |
| | The 5-second CPU load exceeds the threshold continuously | The 5-second CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Severe | Yes | No |
| | The memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| | The swap space exceeds the threshold continuously | The server swap memory has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.1, t=300 | Moderate | Yes | No |
| | The total number of system processes exceeds the threshold continuously | The total number of system processes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=10,000, t=1,800 | Severe | Yes | Yes |
| | The average total number of fork subprocesses exceeds the threshold | The average total number of fork subprocesses in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=5,000, t=1,800 | Moderate | Yes | No |
| | The process is missing due to OOM | An OOM error occurred in the process | Adjust the process heap memory size | - | Severe | No | Yes |
| | A disk I/O error occurred (this event is currently not supported) | A disk I/O error occurred | Replace the disk | - | Fatal | Yes | Yes |
| | The disk space utilization exceeds the threshold continuously | The disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| | The disk I/O device utilization exceeds the threshold continuously | The disk I/O device utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
| | The node file handle utilization exceeds the threshold continuously | The node file handle utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=85, t=1,800 | Moderate | Yes | No |
| | The number of TCP connections to the node exceeds the threshold continuously | The number of TCP connections to the node has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check whether there are connection leaks | m=10,000, t=1,800 | Moderate | Yes | No |
| | The node process is missing | The node service process is missing | View the service logs to find out why the service failed to be pulled | - | Moderate | No | Yes |
| | The node heartbeat is missing | The node heartbeat failed to be reported regularly | Troubleshoot | - | Fatal | No | Yes |
| | Incorrect hostname | The node's hostname is incorrect | Troubleshoot | - | Fatal | No | Yes |
| | Failed to ping the metadatabase | The TencentDB instance heartbeat failed to be reported regularly | - | - | - | - | - |
HDFS | The total number of HDFS files exceeds the threshold continuously | The total number of files in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
| | The average total number of HDFS files exceeds the threshold | The average total number of files in the cluster in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
| | The total number of HDFS blocks exceeds the threshold continuously | The total number of blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
| | The average total number of HDFS blocks exceeds the threshold | The average total number of HDFS blocks in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
| | The number of HDFS data nodes marked as dead exceeds the threshold continuously | The number of data nodes marked as dead has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=1,800 | Moderate | Yes | No |
| | The HDFS storage space utilization exceeds the threshold continuously | The HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
| | The average HDFS storage space utilization exceeds the threshold | The average HDFS storage space utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
| | Active/Standby NameNodes were switched | Active/Standby NameNodes were switched | Locate the cause of the NameNode switch | - | Severe | Yes | Yes |
| | The NameNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Severe | Yes | No |
| | The number of current NameNode connections exceeds the threshold continuously | The number of current NameNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
| | A full GC event occurred on a NameNode | A full GC event occurred on a NameNode | Fine-tune the parameter settings | - | Severe | Yes | Yes |
| | The NameNode JVM memory utilization exceeds the threshold continuously | The NameNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NameNode heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| | The DataNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Moderate | Yes | No |
| | The number of current DataNode connections exceeds the threshold continuously | The number of current DataNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
| | A full GC event occurred on a DataNode | A full GC event occurred on a DataNode | Fine-tune the parameter settings | - | Moderate | Yes | No |
| | The DataNode JVM memory utilization exceeds the threshold continuously | The DataNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the DataNode heap memory size | m=85, t=1,800 | Moderate | Yes | Yes |
YARN | The number of currently missing NodeManagers in the cluster exceeds the threshold continuously | The number of currently missing NodeManagers in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check the NodeManager process status and check whether the network connection is smooth | m=1, t=1,800 | Moderate | Yes | No |
| | The number of pending containers exceeds the threshold continuously | The number of pending containers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Appropriately allocate the resources that YARN jobs can use | m=90, t=1,800 | Moderate | Yes | No |
| | The cluster memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
| | The average cluster memory utilization exceeds the threshold | The average memory utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
| | The cluster CPU utilization exceeds the threshold continuously | The CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
| | The average cluster CPU utilization exceeds the threshold | The average CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
| | The number of available CPU cores in each queue is below the threshold continuously | The number of available CPU cores in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1, t=1,800 | Moderate | Yes | No |
| | The available memory in each queue is below the threshold continuously | The available memory in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1,024, t=1,800 | Moderate | Yes | No |
| | Active/Standby ResourceManagers were switched | Active/Standby ResourceManagers were switched | Check the ResourceManager process status and view the standby ResourceManager logs to locate the cause of the active/standby switch | - | Severe | Yes | Yes |
| | A full GC event occurred in a ResourceManager | A full GC event occurred in a ResourceManager | Fine-tune the parameter settings | - | Severe | Yes | Yes |
| | The ResourceManager JVM memory utilization exceeds the threshold continuously | The ResourceManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the ResourceManager heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| | A full GC event occurred in a NodeManager | A full GC event occurred in a NodeManager | Fine-tune the parameter settings | - | Moderate | Yes | No |
| | The available memory in NodeManager is below the threshold continuously | The available memory in a single NodeManager has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=1, t=1,800 | Moderate | Yes | No |
| | The NodeManager JVM memory utilization exceeds the threshold continuously | The NodeManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=85, t=1,800 | Moderate | Yes | No |
HBase | The number of regions in RIT status in the cluster exceeds the threshold continuously | The number of regions in RIT status in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | If the HBase version is below 2.0, run `hbase hbck -fixAssignments` | m=1, t=60 | Severe | Yes | Yes |
| | The number of dead RegionServers exceeds the threshold continuously | The number of dead RegionServers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=300 | Moderate | Yes | Yes |
| | The average number of regions in each RegionServer in the cluster exceeds the threshold continuously | The average number of regions in each RegionServer in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=300, t=1,800 | Moderate | Yes | Yes |
| | A full GC event occurred on HMaster | A full GC event occurred on HMaster | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
| | The HMaster JVM memory utilization exceeds the threshold continuously | The HMaster JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HMaster heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| | The number of current HMaster connections exceeds the threshold continuously | The number of current HMaster connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1,000, t=1,800 | Moderate | Yes | No |
| | A full GC event occurred in RegionServer | A full GC event occurred in RegionServer | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
| | The RegionServer JVM memory utilization exceeds the threshold continuously | The RegionServer JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the RegionServer heap memory size | m=85, t=1,800 | Moderate | Yes | No |
| | The number of current RPC connections to RegionServer exceeds the threshold continuously | The number of current RPC connections to RegionServer has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1,000, t=1,800 | Moderate | Yes | No |
| | The number of RegionServer StoreFiles exceeds the threshold continuously | The number of RegionServer StoreFiles has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Run the major compaction | m=50,000, t=1,800 | Moderate | Yes | No |
| | A full GC event occurred in HBase Thrift | A full GC event occurred in HBase Thrift | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
| | The HBase Thrift JVM memory utilization exceeds the threshold continuously | The HBase Thrift JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HBase Thrift heap memory size | m=85, t=1,800 | Moderate | Yes | No |
Hive | A full GC event occurred in HiveServer2 | A full GC event occurred in HiveServer2 | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | Yes |
| | The HiveServer2 JVM memory utilization exceeds the threshold continuously | The HiveServer2 JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HiveServer2 heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
| | A full GC event occurred in HiveMetaStore | A full GC event occurred in HiveMetaStore | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
| | A full GC event occurred in HiveWebHcat | A full GC event occurred in HiveWebHcat | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
ZooKeeper | The number of ZooKeeper connections exceeds the threshold continuously | The number of ZooKeeper connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=65,535, t=1,800 | Moderate | Yes | No |
| | The number of ZNodes exceeds the threshold continuously | The number of ZNodes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
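
Most threshold events in the table share one trigger pattern: a metric stays at or above a threshold m for t consecutive seconds, where t must fall within [300, 2,592,000]. The sketch below illustrates that rule only; the function name, sample format, and sampling interval are hypothetical and not part of any EMR API.

```python
# Minimal sketch of the continuous-threshold rule: the event fires only when
# every sample observed in the last t seconds is >= m.
# Sample timestamps/values and all names here are illustrative assumptions.

def breaches_continuously(samples, m, t, now):
    """samples: list of (timestamp_sec, value) pairs, ascending by timestamp."""
    if not 300 <= t <= 2_592_000:  # allowed range for t per the table
        raise ValueError("t must be within [300, 2592000] seconds")
    window = [v for ts, v in samples if now - t <= ts <= now]
    # Fire only if the window is non-empty and the metric never
    # dipped below m during the whole window.
    return bool(window) and all(v >= m for v in window)

# Example: CPU utilization sampled every 300 s, with the table's
# default thresholds m=85 and t=1800.
now = 3600
samples = [(ts, 90) for ts in range(0, 3601, 300)]
print(breaches_continuously(samples, m=85, t=1800, now=now))  # True

samples[-2] = (3300, 60)  # a single dip below the threshold resets the event
print(breaches_continuously(samples, m=85, t=1800, now=now))  # False
```

The "average … exceeds the threshold" events differ only in comparing the window's mean against m rather than requiring every sample to exceed it.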