Cluster events include the event list and event policies. The following table describes each event, its severity, default thresholds, and the recommended measure.
Type | Event | Description | Recommended Measure | Default Value | Severity | Disableable | Enabled by Default |
---|---|---|---|---|---|---|---|
Node | The CPU utilization exceeds the threshold continuously | The server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
Node | The average CPU utilization exceeds the threshold | The average server CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the node capacity or upgrade the node | m=85, t=1800 | Moderate | Yes | No |
Node | The average CPU iowait utilization exceeds the threshold | The average CPU iowait utilization of the server in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=60, t=1800 | Severe | Yes | Yes |
Node | The 1-minute CPU load exceeds the threshold continuously | The 1-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1800 | Moderate | Yes | No |
Node | The 5-minute CPU load exceeds the threshold continuously | The 5-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1800 | Severe | Yes | No |
Node | The memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
Node | The swap space usage exceeds the threshold continuously | The server swap memory usage has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.1, t=300 | Moderate | Yes | No |
Node | The total number of system processes exceeds the threshold continuously | The total number of system processes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=10000, t=1800 | Severe | Yes | Yes |
Node | The average total number of forked subprocesses exceeds the threshold | The average total number of forked subprocesses in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=5000, t=1800 | Moderate | Yes | No |
Node | The process exited due to OOM | An OOM error occurred in the process | Adjust the process heap memory size | - | Severe | No | Yes |
Node | A disk I/O error occurred (this event is not currently supported) | A disk I/O error occurred | Replace the disk | - | Fatal | Yes | Yes |
Node | The disk space utilization exceeds the threshold continuously | The disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
Node | The disk I/O device utilization exceeds the threshold continuously | The disk I/O device utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
Node | The node file handle utilization exceeds the threshold continuously | The node file handle utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=85, t=1800 | Moderate | Yes | No |
Node | The number of TCP connections to the node exceeds the threshold continuously | The number of TCP connections to the node has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check whether there are connection leaks | m=10000, t=1800 | Moderate | Yes | No |
Node | The configured node memory utilization exceeds the threshold | The memory utilization configured for all roles on the node exceeds the node's physical memory threshold | Adjust the heap memory allocated to node processes | 90% | Severe | Yes | No |
Node | The node process is unavailable | The node service process is unavailable | Check the service logs to find out why the process failed to start | - | Moderate | Yes | Yes |
Node | The node heartbeat is missing | The node failed to report its heartbeat on time | Troubleshoot | - | Fatal | No | Yes |
Node | Incorrect hostname | The node's hostname is incorrect | Troubleshoot | - | Fatal | No | Yes |
Node | Failed to ping the metadatabase | The TencentDB instance failed to report its heartbeat on time | - | - | - | - | - |
HDFS | The total number of HDFS files exceeds the threshold continuously | The total number of files in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory | m=50,000,000, t=1800 | Severe | Yes | No |
HDFS | The average total number of HDFS files exceeds the threshold | The average total number of files in the cluster in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory | m=50,000,000, t=1800 | Severe | Yes | No |
HDFS | The total number of HDFS blocks exceeds the threshold continuously | The total number of blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory or the block size | m=50,000,000, t=1800 | Severe | Yes | No |
HDFS | The average total number of HDFS blocks exceeds the threshold | The average total number of HDFS blocks in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory or the block size | m=50,000,000, t=1800 | Severe | Yes | No |
HDFS | The number of HDFS DataNodes marked as dead exceeds the threshold continuously | The number of DataNodes marked as dead has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=1800 | Moderate | Yes | No |
HDFS | The HDFS storage space utilization exceeds the threshold continuously | The HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
HDFS | The average HDFS storage space utilization exceeds the threshold | The average HDFS storage space utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Clear files in HDFS or expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
HDFS | Active/Standby NameNodes were switched | Active/Standby NameNodes were switched | Locate the cause of the NameNode switch | - | Severe | Yes | Yes |
HDFS | The NameNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Severe | Yes | No |
HDFS | The number of current NameNode connections exceeds the threshold continuously | The number of current NameNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
HDFS | A full GC event occurred on a NameNode | A full GC event occurred on a NameNode | Fine-tune the parameter settings | - | Severe | Yes | Yes |
HDFS | The NameNode JVM memory utilization exceeds the threshold continuously | The NameNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NameNode heap memory size | m=85, t=1800 | Severe | Yes | Yes |
HDFS | The DataNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Moderate | Yes | No |
HDFS | The number of current DataNode connections exceeds the threshold continuously | The number of current DataNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
HDFS | A full GC event occurred on a DataNode | A full GC event occurred on a DataNode | Fine-tune the parameter settings | - | Moderate | Yes | No |
HDFS | The DataNode JVM memory utilization exceeds the threshold continuously | The DataNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the DataNode heap memory size | m=85, t=1800 | Moderate | Yes | Yes |
YARN | The number of currently missing NodeManagers in the cluster exceeds the threshold continuously | The number of currently missing NodeManagers in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check the NodeManager process status and whether the network connection is normal | m=1, t=1800 | Moderate | Yes | No |
YARN | The number of pending containers exceeds the threshold continuously | The number of pending containers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Set appropriate resource amounts for YARN jobs | m=90, t=1800 | Moderate | Yes | No |
YARN | The cluster memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
YARN | The average cluster memory utilization exceeds the threshold | The average memory utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
YARN | The cluster CPU utilization exceeds the threshold continuously | The CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
YARN | The average cluster CPU utilization exceeds the threshold | The average CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
YARN | The number of available CPU cores in each queue is below the threshold continuously | The number of available CPU cores in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1, t=1800 | Moderate | Yes | No |
YARN | The available memory in each queue is below the threshold continuously | The available memory in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1024, t=1800 | Moderate | Yes | No |
YARN | Active/Standby ResourceManagers were switched | Active/Standby ResourceManagers were switched | Check the ResourceManager process status and view the standby ResourceManager logs to locate the cause of the active/standby switch | - | Severe | Yes | Yes |
YARN | A full GC event occurred in a ResourceManager | A full GC event occurred in a ResourceManager | Fine-tune the parameter settings | - | Severe | Yes | Yes |
YARN | The ResourceManager JVM memory utilization exceeds the threshold continuously | The ResourceManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the ResourceManager heap memory size | m=85, t=1800 | Severe | Yes | Yes |
YARN | A full GC event occurred in a NodeManager | A full GC event occurred in a NodeManager | Fine-tune the parameter settings | - | Moderate | Yes | No |
YARN | The available memory in a NodeManager is below the threshold continuously | The available memory in a single NodeManager has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=1, t=1800 | Moderate | Yes | No |
YARN | The NodeManager JVM memory utilization exceeds the threshold continuously | The NodeManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=85, t=1800 | Moderate | Yes | No |
HBase | The number of regions in RIT status in the cluster exceeds the threshold continuously | The number of regions in RIT (region-in-transition) status in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | If the HBase version is below 2.0, run `hbase hbck -fixAssignments` | m=1, t=60 | Severe | Yes | Yes |
HBase | The number of dead RegionServers exceeds the threshold continuously | The number of dead RegionServers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=300 | Moderate | Yes | Yes |
HBase | The average number of regions in each RegionServer in the cluster exceeds the threshold continuously | The average number of regions in each RegionServer in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=300, t=1800 | Moderate | Yes | Yes |
HBase | A full GC event occurred on HMaster | A full GC event occurred on HMaster | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
HBase | The HMaster JVM memory utilization exceeds the threshold continuously | The HMaster JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HMaster heap memory size | m=85, t=1800 | Severe | Yes | Yes |
HBase | The number of current HMaster connections exceeds the threshold continuously | The number of current HMaster connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1000, t=1800 | Moderate | Yes | No |
HBase | A full GC event occurred in a RegionServer | A full GC event occurred in a RegionServer | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
HBase | The RegionServer JVM memory utilization exceeds the threshold continuously | The RegionServer JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the RegionServer heap memory size | m=85, t=1800 | Moderate | Yes | No |
HBase | The number of current RPC connections to a RegionServer exceeds the threshold continuously | The number of current RPC connections to a RegionServer has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1000, t=1800 | Moderate | Yes | No |
HBase | The number of RegionServer StoreFiles exceeds the threshold continuously | The number of RegionServer StoreFiles has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Run a major compaction (see the example after this table) | m=50000, t=1800 | Moderate | Yes | No |
HBase | A full GC event occurred in HBase Thrift | A full GC event occurred in HBase Thrift | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
HBase | The HBase Thrift JVM memory utilization exceeds the threshold continuously | The HBase Thrift JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HBase Thrift heap memory size | m=85, t=1800 | Moderate | Yes | No |
Hive | A full GC event occurred in HiveServer2 | A full GC event occurred in HiveServer2 | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | Yes |
Hive | The HiveServer2 JVM memory utilization exceeds the threshold continuously | The HiveServer2 JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HiveServer2 heap memory size | m=85, t=1800 | Severe | Yes | Yes |
Hive | A full GC event occurred in HiveMetaStore | A full GC event occurred in HiveMetaStore | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
Hive | A full GC event occurred in HiveWebHcat | A full GC event occurred in HiveWebHcat | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
ZooKeeper | The number of ZooKeeper connections exceeds the threshold continuously | The number of ZooKeeper connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=65535, t=1800 | Moderate | Yes | No |
ZooKeeper | The number of ZNodes exceeds the threshold continuously | The number of ZNodes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
Impala | The ImpalaCatalog JVM memory utilization exceeds the threshold continuously | The ImpalaCatalog JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the ImpalaCatalog heap memory size | m=0.85, t=1800 | Moderate | Yes | No |
Impala | The Impala daemon JVM memory utilization exceeds the threshold continuously | The Impala daemon JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Impala daemon heap memory size | m=0.85, t=1800 | Moderate | Yes | No |
Impala | The number of Impala Beeswax API client connections exceeds the threshold | The number of Impala Beeswax API client connections has been greater than or equal to m | Adjust the value of `fe_service_threads` in the `impalad.flags` configuration in the console | m=64, t=120 | Severe | Yes | Yes |
Impala | The number of Impala HiveServer2 client connections exceeds the threshold | The number of Impala HiveServer2 client connections has been greater than or equal to m | Adjust the value of `fe_service_threads` in the `impalad.flags` configuration in the console | m=64, t=120 | Severe | Yes | Yes |
PrestoSQL | The current number of failed PrestoSQL nodes exceeds the threshold continuously | The current number of failed PrestoSQL nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | Yes |
PrestoSQL | The number of queued tasks in the current PrestoSQL resource group exceeds the threshold continuously | The number of queued tasks in the PrestoSQL resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5000, t=1800 | Severe | Yes | Yes |
PrestoSQL | The number of failed PrestoSQL queries exceeds the threshold | The number of failed PrestoSQL queries is greater than or equal to m | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
PrestoSQL | A full GC event occurred on a PrestoSQL coordinator | A full GC event occurred on a PrestoSQL coordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
PrestoSQL | The PrestoSQL coordinator JVM memory utilization exceeds the threshold continuously | The PrestoSQL coordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQL coordinator heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
PrestoSQL | A full GC event occurred on a PrestoSQL worker | A full GC event occurred on a PrestoSQL worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
PrestoSQL | The PrestoSQL worker JVM memory utilization exceeds the threshold continuously | The PrestoSQL worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQL worker heap memory size | m=0.85, t=1800 | Severe | Yes | No |
Presto | The current number of failed Presto nodes exceeds the threshold continuously | The current number of failed Presto nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | Yes |
Presto | The number of queued tasks in the current Presto resource group exceeds the threshold continuously | The number of queued tasks in the Presto resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5000, t=1800 | Severe | Yes | Yes |
Presto | The number of failed Presto queries exceeds the threshold | The number of failed Presto queries is greater than or equal to m | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
Presto | A full GC event occurred on a Presto coordinator | A full GC event occurred on a Presto coordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
Presto | The Presto coordinator JVM memory utilization exceeds the threshold continuously | The Presto coordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto coordinator heap memory size | m=0.85, t=1800 | Moderate | Yes | Yes |
Presto | A full GC event occurred on a Presto worker | A full GC event occurred on a Presto worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
Presto | The Presto worker JVM memory utilization exceeds the threshold continuously | The Presto worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto worker heap memory size | m=0.85, t=1800 | Severe | Yes | No |
Alluxio | The current total number of Alluxio workers is below the threshold continuously | The current total number of Alluxio workers has been less than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
Alluxio | The utilization of the capacity on all tiers of the current Alluxio worker exceeds the threshold | The utilization of the capacity on all tiers of the current Alluxio worker has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=0.85, t=1800 | Severe | Yes | No |
Alluxio | A full GC event occurred on an Alluxio master | A full GC event occurred on an Alluxio master | Troubleshoot | - | Moderate | Yes | No |
Alluxio | The Alluxio master JVM memory utilization exceeds the threshold continuously | The Alluxio master JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio master heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
Alluxio | A full GC event occurred on an Alluxio worker | A full GC event occurred on an Alluxio worker | Troubleshoot | - | Moderate | Yes | No |
Alluxio | The Alluxio worker JVM memory utilization exceeds the threshold continuously | The Alluxio worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio worker heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
Kudu | The cluster replica skew exceeds the threshold | The cluster replica skew has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Run the `kudu cluster rebalance` command to balance the replicas | m=100, t=300 | Moderate | Yes | Yes |
Kudu | The hybrid clock error exceeds the threshold | The hybrid clock error has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Make sure that the NTP daemon is running and that network communication with the NTP server is normal | m=5000000, t=300 | Moderate | Yes | Yes |
Kudu | The number of running tablets exceeds the threshold | The number of running tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Too many tablets on a node degrade performance. Clear unnecessary tables and partitions, or expand the capacity as needed | m=1000, t=300 | Moderate | Yes | Yes |
Kudu | The number of failed tablets exceeds the threshold | The number of failed tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether any disk is unavailable or any data file is corrupted | m=1, t=300 | Moderate | Yes | Yes |
Kudu | The number of failed data directories exceeds the threshold | The number of failed data directories has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the paths configured in the `fs_data_dirs` parameter are available | m=1, t=300 | Severe | Yes | Yes |
Kudu | The number of full data directories exceeds the threshold | The number of full data directories has been greater than or equal to m for t (120 ≤ t ≤ 3,600) seconds continuously | Clear unnecessary data files or expand the capacity as needed | m=1, t=120 | Severe | Yes | Yes |
Kudu | The number of write requests rejected due to queue overload exceeds the threshold | The number of write requests rejected due to queue overload has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check for write hotspots and whether the number of worker threads is too small | m=10, t=300 | Moderate | Yes | No |
Kudu | The number of expired scanners exceeds the threshold | The number of expired scanners has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Close each scanner after reading data (see the example after this table) | m=100, t=300 | Moderate | Yes | Yes |
Kudu | The number of error logs exceeds the threshold | The number of error logs has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Troubleshoot | m=10, t=300 | Moderate | Yes | Yes |
Kudu | The number of RPC requests that timed out while waiting in the queue exceeds the threshold | The number of RPC requests that timed out while waiting in the queue has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the system load is too high | m=100, t=300 | Moderate | Yes | Yes |
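
For the HBase StoreFile event, a major compaction can be triggered from the HBase shell (for example, `major_compact 'my_table'`) or through the Java Admin API. The sketch below uses the HBase Java client; `my_table` is a placeholder table name, and the connection settings are assumed to come from the cluster's `hbase-site.xml` on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MajorCompactExample {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath (assumes the code runs on a cluster node).
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Asynchronously requests a major compaction, which merges StoreFiles
            // and reduces the per-RegionServer StoreFile count over time.
            // "my_table" is a placeholder for the affected table.
            admin.majorCompact(TableName.valueOf("my_table"));
        }
    }
}
```

Major compactions are I/O-intensive, so schedule them during off-peak hours.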
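For the Kudu expired-scanner event, a common cause is client code that opens a `KuduScanner` and never closes it, leaving the scanner to time out on the server. The sketch below uses the Kudu Java client and closes the scanner in a `finally` block; the master address `kudu-master:7051` and table name `my_table` are placeholders.

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class ScannerCloseExample {
    public static void main(String[] args) throws KuduException {
        // "kudu-master:7051" is a placeholder master address.
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("my_table"); // placeholder table name
            KuduScanner scanner = client.newScannerBuilder(table).build();
            try {
                while (scanner.hasMoreRows()) {
                    RowResultIterator rows = scanner.nextRows();
                    for (RowResult row : rows) {
                        System.out.println(row.rowToString());
                    }
                }
            } finally {
                // Releases the server-side scanner immediately; scanners left
                // open count toward the expired-scanner metric once they time out.
                scanner.close();
            }
        }
    }
}
```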