Cluster Event

Last updated: 2022-05-16 12:45:11

    Overview

    Cluster events consist of the event list and event policies.

    • Event list: Records key change events and abnormal events occurring in the cluster.
    • Event policy: Defines event monitoring trigger policies, which can be customized based on actual business conditions. Events with monitoring enabled can be set as cluster inspection items.

    Viewing Event List

    1. Log in to the EMR console and click the ID/Name of the target cluster in the cluster list to enter the cluster details page.
    2. On the cluster details page, select Cluster Monitoring > Cluster Events > Event List to view all operation events in the current cluster.

      Severity is classified into the following levels:
    • Fatal: Exception events of a node or service that require manual intervention and will cause service interruption if left unattended. Such events may last for a period of time.
    • Severe: Alert events that currently have not caused service or node interruption but will cause fatal events if left unattended.
    • Moderate: Regular events occurring in the cluster that generally do not require special processing.

    Setting Event Policy

    1. Log in to the EMR console and click the ID/Name of the target cluster in the cluster list to enter the cluster details page.
    2. On the cluster details page, select Cluster Monitoring > Cluster Events > Event Policy and you can customize the event monitoring trigger policies.
    3. The event configuration list contains the event name, event trigger policy, severity (fatal, severe, or moderate), and an option to enable/disable monitoring. These settings can be modified and saved.
    4. Event trigger policies cover two types of events: fixed system policy events, which cannot be modified, and custom events, which can be configured based on your business standards.
    5. You can choose whether to enable monitoring for each event in an event policy. Only events with monitoring enabled can be selected as cluster inspection items. Monitoring is enabled by default for some events; for certain other events, it is enabled by default and cannot be disabled. The specific rules are as follows:
      | Type | Event | Description | Recommended Measure | Default Value | Severity | Can Be Disabled | Enabled by Default |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Node | The CPU utilization exceeds the threshold continuously | The server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
      | | The average CPU utilization exceeds the threshold | The average server CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the node capacity or upgrade the node | m=85, t=1800 | Moderate | Yes | No |
      | | The average CPU iowait utilization exceeds the threshold | The average CPU iowait utilization of the server in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=60, t=1800 | Severe | Yes | Yes |
      | | The 1-minute CPU load exceeds the threshold continuously | The 1-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1800 | Moderate | Yes | No |
      | | The 5-minute CPU load exceeds the threshold continuously | The 5-minute CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1800 | Severe | Yes | No |
      | | The memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
      | | The swap space exceeds the threshold continuously | The server swap memory has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.1, t=300 | Moderate | Yes | No |
      | | The total number of system processes exceeds the threshold continuously | The total number of system processes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=10000, t=1800 | Severe | Yes | Yes |
      | | The average total number of fork subprocesses exceeds the threshold | The average total number of fork subprocesses in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=5000, t=1800 | Moderate | Yes | No |
      | | The process does not exist due to OOM | An OOM error occurred in the process | Adjust the process heap memory size | - | Severe | No | Yes |
      | | A disk I/O error occurred (this event is not supported currently) | A disk I/O error occurred | Replace the disk | - | Fatal | Yes | Yes |
      | | The disk space utilization exceeds the threshold continuously | The disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
      | | The disk I/O device utilization exceeds the threshold continuously | The disk I/O device utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1800 | Severe | Yes | Yes |
      | | The node file handle utilization exceeds the threshold continuously | The node file handle utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=85, t=1800 | Moderate | Yes | No |
      | | The number of TCP connections to the node exceeds the threshold continuously | The number of TCP connections to the node has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check whether there are connection leaks | m=10000, t=1800 | Moderate | Yes | No |
      | | The configured node memory utilization exceeds the threshold | The memory utilization configured for all roles on the node exceeds the node's physical memory threshold | Adjust the allocated node process heap memory | 90% | Severe | Yes | No |
      | | The node process is unavailable | The node service process is unavailable | View the service logs to find out why the service failed to be pulled | - | Moderate | Yes | Yes |
      | | The node heartbeat is missing | The node heartbeat failed to be reported regularly | Troubleshoot | - | Fatal | No | Yes |
      | | Incorrect hostname | The node's hostname is incorrect | Troubleshoot | - | Fatal | No | Yes |
      | | Failed to ping the metadatabase | The TencentDB instance heartbeat failed to be reported regularly | - | - | - | - | - |
      | HDFS | The total number of HDFS files exceeds the threshold continuously | The total number of files in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory | m=50,000,000, t=1800 | Severe | Yes | No |
      | | The average total number of HDFS files exceeds the threshold | The average total number of files in the cluster in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory | m=50,000,000, t=1800 | Severe | Yes | No |
      | | The total number of HDFS blocks exceeds the threshold continuously | The total number of blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory or the block size | m=50,000,000, t=1800 | Severe | Yes | No |
      | | The average total number of HDFS blocks exceeds the threshold | The average total number of HDFS blocks in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory or the block size | m=50,000,000, t=1800 | Severe | Yes | No |
      | | The number of HDFS data nodes marked as dead exceeds the threshold continuously | The number of data nodes marked as dead has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=1800 | Moderate | Yes | No |
      | | The HDFS storage space utilization exceeds the threshold continuously | The HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
      | | The average HDFS storage space utilization exceeds the threshold | The average HDFS storage space utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Clear files in HDFS or expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
      | | Active/Standby NameNodes were switched | Active/Standby NameNodes were switched | Locate the cause of the NameNode switch | - | Severe | Yes | Yes |
      | | The NameNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Severe | Yes | No |
      | | The number of current NameNode connections exceeds the threshold continuously | The number of current NameNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
      | | A full GC event occurred on a NameNode | A full GC event occurred on a NameNode | Fine-tune the parameter settings | - | Severe | Yes | Yes |
      | | The NameNode JVM memory utilization exceeds the threshold continuously | The NameNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NameNode heap memory size | m=85, t=1800 | Severe | Yes | Yes |
      | | The DataNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Moderate | Yes | No |
      | | The number of current DataNode connections exceeds the threshold continuously | The number of current DataNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
      | | A full GC event occurred on a DataNode | A full GC event occurred on a DataNode | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The DataNode JVM memory utilization exceeds the threshold continuously | The DataNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the DataNode heap memory size | m=85, t=1800 | Moderate | Yes | Yes |
      | YARN | The number of currently missing NodeManagers in the cluster exceeds the threshold continuously | The number of currently missing NodeManagers in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check the NodeManager process status and check whether the network connection is smooth | m=1, t=1800 | Moderate | Yes | No |
      | | The number of pending containers exceeds the threshold continuously | The number of pending containers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Reasonably specify resources that can be used by YARN jobs | m=90, t=1800 | Moderate | Yes | No |
      | | The cluster memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
      | | The average cluster memory utilization exceeds the threshold | The average memory utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
      | | The cluster CPU utilization exceeds the threshold continuously | The CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | Yes |
      | | The average cluster CPU utilization exceeds the threshold | The average CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1800 | Severe | Yes | No |
      | | The number of available CPU cores in each queue is below the threshold continuously | The number of available CPU cores in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1, t=1800 | Moderate | Yes | No |
      | | The available memory in each queue is below the threshold continuously | The available memory in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1024, t=1800 | Moderate | Yes | No |
      | | Active/Standby ResourceManagers were switched | Active/Standby ResourceManagers were switched | Check the ResourceManager process status and view the standby ResourceManager logs to locate the cause of active/standby switch | - | Severe | Yes | Yes |
      | | A full GC event occurred in a ResourceManager | A full GC event occurred in a ResourceManager | Fine-tune the parameter settings | - | Severe | Yes | Yes |
      | | The ResourceManager JVM memory utilization exceeds the threshold continuously | The ResourceManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the ResourceManager heap memory size | m=85, t=1800 | Severe | Yes | Yes |
      | | A full GC event occurred in a NodeManager | A full GC event occurred in a NodeManager | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The available memory in NodeManager is below the threshold continuously | The available memory in a single NodeManager has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=1, t=1800 | Moderate | Yes | No |
      | | The NodeManager JVM memory utilization exceeds the threshold continuously | The NodeManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=85, t=1800 | Moderate | Yes | No |
      | HBase | The number of regions in RIT status in the cluster exceeds the threshold continuously | The number of regions in RIT status in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | If the HBase version is below 2.0, run `hbase hbck -fixAssignments` (see the example after this table) | m=1, t=60 | Severe | Yes | Yes |
      | | The number of dead RegionServers exceeds the threshold continuously | The number of dead RegionServers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=300 | Moderate | Yes | Yes |
      | | The average number of regions in each RegionServer in the cluster exceeds the threshold continuously | The average number of regions in each RegionServer in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=300, t=1800 | Moderate | Yes | Yes |
      | | A full GC event occurred on HMaster | A full GC event occurred on HMaster | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | | The HMaster JVM memory utilization exceeds the threshold continuously | The HMaster JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HMaster heap memory size | m=85, t=1800 | Severe | Yes | Yes |
      | | The number of current HMaster connections exceeds the threshold continuously | The number of current HMaster connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1000, t=1800 | Moderate | Yes | No |
      | | A full GC event occurred in RegionServer | A full GC event occurred in RegionServer | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
      | | The RegionServer JVM memory utilization exceeds the threshold continuously | The RegionServer JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the RegionServer heap memory size | m=85, t=1800 | Moderate | Yes | No |
      | | The number of current RPC connections to RegionServer exceeds the threshold continuously | The number of current RPC connections to RegionServer has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1000, t=1800 | Moderate | Yes | No |
      | | The number of RegionServer StoreFiles exceeds the threshold continuously | The number of RegionServer StoreFiles has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Run a major compaction (see the example after this table) | m=50000, t=1800 | Moderate | Yes | No |
      | | A full GC event occurred in HBase Thrift | A full GC event occurred in HBase Thrift | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
      | | The HBase Thrift JVM memory utilization exceeds the threshold continuously | The HBase Thrift JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HBase Thrift heap memory size | m=85, t=1800 | Moderate | Yes | No |
      | Hive | A full GC event occurred in HiveServer2 | A full GC event occurred in HiveServer2 | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | Yes |
      | | The HiveServer2 JVM memory utilization exceeds the threshold continuously | The HiveServer2 JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HiveServer2 heap memory size | m=85, t=1800 | Severe | Yes | Yes |
      | | A full GC event occurred in HiveMetaStore | A full GC event occurred in HiveMetaStore | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | | A full GC event occurred in HiveWebHcat | A full GC event occurred in HiveWebHcat | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | ZooKeeper | The number of ZooKeeper connections exceeds the threshold continuously | The number of ZooKeeper connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=65535, t=1800 | Moderate | Yes | No |
      | | The number of ZNodes exceeds the threshold continuously | The number of ZNodes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2000, t=1800 | Moderate | Yes | No |
      | Impala | The ImpalaCatalog JVM memory utilization exceeds the threshold continuously | The ImpalaCatalog JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the ImpalaCatalog heap memory size | m=0.85, t=1800 | Moderate | Yes | No |
      | | The Impala daemon JVM memory utilization exceeds the threshold continuously | The Impala daemon JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the ImpalaDaemon heap memory size | m=0.85, t=1800 | Moderate | Yes | No |
      | | The number of Impala Beeswax API client connections exceeds the threshold | The number of Impala Beeswax API client connections has been greater than or equal to m | Adjust the value of `fe_service_threads` in the `impalad.flgs` configuration in the console | m=64, t=120 | Severe | Yes | Yes |
      | | The number of Impala HiveServer2 client connections exceeds the threshold | The number of Impala HiveServer2 client connections has been greater than or equal to m | Adjust the value of `fe_service_threads` in the `impalad.flgs` configuration in the console | m=64, t=120 | Severe | Yes | Yes |
      | PrestoSQL | The current number of failed PrestoSQL nodes exceeds the threshold continuously | The current number of failed PrestoSQL nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | Yes |
      | | The number of queuing resources in the current PrestoSQL resource group exceeds the threshold continuously | The number of queuing tasks in the PrestoSQL resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5000, t=1800 | Severe | Yes | Yes |
      | | The number of failed PrestoSQL queries exceeds the threshold | The number of failed PrestoSQL queries is greater than or equal to m | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
      | | A full GC event occurred in a PrestoSQL coordinator | A full GC event occurred in a PrestoSQL coordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The PrestoSQL coordinator JVM memory utilization exceeds the threshold continuously | The PrestoSQL coordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQL coordinator heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
      | | A full GC event occurred on a PrestoSQL worker | A full GC event occurred on a PrestoSQL worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The PrestoSQL worker JVM memory utilization exceeds the threshold continuously | The PrestoSQL worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the PrestoSQL worker heap memory size | m=0.85, t=1800 | Severe | Yes | No |
      | Presto | The current number of failed Presto nodes exceeds the threshold continuously | The current number of failed Presto nodes has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | Yes |
      | | The number of queuing resources in the current Presto resource group exceeds the threshold continuously | The number of queuing tasks in the Presto resource group has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=5000, t=1800 | Severe | Yes | Yes |
      | | The number of failed Presto queries exceeds the threshold | The number of failed Presto queries is greater than or equal to m | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
      | | A full GC event occurred on a Presto coordinator | A full GC event occurred on a Presto coordinator | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The Presto coordinator JVM memory utilization exceeds the threshold continuously | The Presto coordinator JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto coordinator heap memory size | m=0.85, t=1800 | Moderate | Yes | Yes |
      | | A full GC event occurred on a Presto worker | A full GC event occurred on a Presto worker | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The Presto worker JVM memory utilization exceeds the threshold continuously | The Presto worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Presto worker heap memory size | m=0.85, t=1800 | Severe | Yes | No |
      | Alluxio | The current total number of Alluxio workers is below the threshold continuously | The current total number of Alluxio workers has been less than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Troubleshoot | m=1, t=1800 | Severe | Yes | No |
      | | The utilization of the capacity on all tiers of the current Alluxio worker exceeds the threshold | The utilization of the capacity on all tiers of the current Alluxio worker has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Fine-tune the parameter settings | m=0.85, t=1800 | Severe | Yes | No |
      | | A full GC event occurred on an Alluxio master | A full GC event occurred on an Alluxio master | Troubleshoot | - | Moderate | Yes | No |
      | | The Alluxio master JVM memory utilization exceeds the threshold continuously | The Alluxio master JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio master heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
      | | A full GC event occurred on an Alluxio worker | A full GC event occurred on an Alluxio worker | Troubleshoot | - | Moderate | Yes | No |
      | | The Alluxio worker JVM memory utilization exceeds the threshold continuously | The Alluxio worker JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 604,800) seconds continuously | Adjust the Alluxio worker heap memory size | m=0.85, t=1800 | Severe | Yes | Yes |
      | Kudu | The cluster replica skew exceeds the threshold | The cluster replica skew has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Run the `rebalance` command to balance the replicas (see the example after this table) | m=100, t=300 | Moderate | Yes | Yes |
      | | The hybrid clock error exceeds the threshold | The hybrid clock error has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Make sure that the NTP daemon is running and the network communication with the NTP server is normal (see the example after this table) | m=5000000, t=300 | Moderate | Yes | Yes |
      | | The number of running tablets exceeds the threshold | The number of running tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Too many tablets on a node can affect performance. We recommend clearing unnecessary tables and partitions or expanding the capacity as needed | m=1000, t=300 | Moderate | Yes | Yes |
      | | The number of failed tablets exceeds the threshold | The number of failed tablets has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether any disk is unavailable or any data file is corrupted | m=1, t=300 | Moderate | Yes | Yes |
      | | The number of failed data directories exceeds the threshold | The number of failed data directories has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the path configured in the `fs_data_dirs` parameter is available | m=1, t=300 | Severe | Yes | Yes |
      | | The number of full data directories exceeds the threshold | The number of full data directories has been greater than or equal to m for t (120 ≤ t ≤ 3,600) seconds continuously | Clear unnecessary data files or expand the capacity as needed | m=1, t=120 | Severe | Yes | Yes |
      | | The number of write requests rejected due to queue overloading exceeds the threshold | The number of write requests rejected due to queue overloading has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether there are write hotspots or whether the number of worker threads is too small | m=10, t=300 | Moderate | Yes | No |
      | | The number of expired scanners exceeds the threshold | The number of expired scanners has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Be sure to close the scanner after reading data | m=100, t=300 | Moderate | Yes | Yes |
      | | The number of error logs exceeds the threshold | The number of error logs has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Troubleshoot | m=10, t=300 | Moderate | Yes | Yes |
      | | The number of RPC requests that timed out while waiting in the queue exceeds the threshold | The number of RPC requests that timed out while waiting in the queue has been greater than or equal to m for t (300 ≤ t ≤ 3,600) seconds continuously | Check whether the system load is too high | m=100, t=300 | Moderate | Yes | Yes |
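
      For reference, the HBase measures referenced above can be run from any node where the HBase client is deployed. This is a minimal sketch: `your_table` is a placeholder table name, and the legacy `hbck -fixAssignments` repair applies only to HBase versions below 2.0, as noted in the table.

      ```
      # Reassign regions stuck in RIT status (HBase versions below 2.0 only).
      hbase hbck -fixAssignments

      # Trigger a major compaction to reduce the number of StoreFiles for a table.
      # Replace 'your_table' with the actual table name.
      echo "major_compact 'your_table'" | hbase shell
      ```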
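
      The Kudu measures above (checking for failed tablets or failed data directories and rebalancing replicas) map roughly to the `kudu` CLI as sketched below. The master addresses are placeholders for your cluster's Kudu master RPC endpoints (7051 is the default port), and the availability of these subcommands depends on the Kudu version shipped with your cluster.

      ```
      # Check cluster health, including failed tablets and failed data directories.
      kudu cluster ksck master1:7051,master2:7051,master3:7051

      # Rebalance tablet replicas across tablet servers to reduce replica skew.
      kudu cluster rebalance master1:7051,master2:7051,master3:7051
      ```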
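
      For the hybrid clock error event, the checks below are one way to confirm that time synchronization is healthy on a node, assuming the node runs ntpd; if your nodes use chronyd instead, substitute the equivalent chrony commands.

      ```
      # Confirm that the NTP daemon is running.
      systemctl status ntpd

      # Check synchronization status and the reachability of the configured NTP servers.
      ntpstat
      ntpq -p
      ```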