Cluster Event

Last updated: 2020-11-12 10:55:48

    Feature Overview

    Cluster events include event lists and event policies.

    • Event list: records key change events and exception events that occur in the cluster.
    • Event policy: event monitoring trigger policies can be customized based on actual business conditions. Events with monitoring enabled can be set as cluster inspection items.

    Viewing Event List

    1. Log in to the EMR Console and click the ID/name of the target cluster in Cluster List to enter the cluster details page.
    2. On the cluster details page, select Cluster Monitoring > Cluster Events > Event List to view all operation events in the current cluster.

      Event severity is divided into the following levels (see the sketch after this list):
      • Fatal: node or service exceptions that may persist for some time and can make the service unavailable if no one intervenes.
      • Severe: alerts that have not yet caused node or service unavailability but will escalate to fatal events if left unattended for a long time.
      • Moderate: regular events occurring in the cluster that generally require no special handling.
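
      When you record or post-process entries from the event list, each one can be treated as a simple record of event name, severity, and occurrence time. The sketch below is purely illustrative and is not an EMR API schema; the type and helper names are hypothetical and only show one way such records could be modeled and the fatal ones filtered out.

      ```python
      from dataclasses import dataclass
      from datetime import datetime
      from enum import Enum


      class Severity(Enum):
          """Severity levels shown in the event list."""
          FATAL = "fatal"
          SEVERE = "severe"
          MODERATE = "moderate"


      @dataclass
      class ClusterEvent:
          """Illustrative record for one row of the event list (hypothetical type)."""
          name: str
          severity: Severity
          occurred_at: datetime


      def fatal_events(events):
          """Return only the events that call for immediate human intervention."""
          return [e for e in events if e.severity is Severity.FATAL]
      ```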

    Setting Event Policy

    1. Log in to the EMR Console and click the ID/name of the target cluster in Cluster List to enter the cluster details page.
    2. On the cluster details page, select Cluster Monitoring > Cluster Events > Event Policy to customize the event monitoring trigger policies.
    3. The event configuration list contains the event name, event trigger policy, severity (fatal, severe, or moderate), and an option to enable or disable monitoring; these settings can be modified and saved.
    4. Event trigger policies cover two types of events: fixed system policy events (which cannot be modified) and custom events (which can be configured to match your business requirements).
    5. You can choose whether to enable monitoring for each event in an event policy. Only events with monitoring enabled can be selected as cluster inspection items. For some events, monitoring is enabled by default; for others, it is enabled by default and cannot be disabled. The specific rules are as follows:
      | Type | Event Name | Description | Recommended Measure | Default Value | Severity | Can Be Disabled | Enabled by Default |
      | ---- | ---------- | ----------- | ------------------- | ------------- | -------- | --------------- | ------------------ |
      | Node | The CPU utilization exceeds the threshold continuously | The server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
      | | The average CPU utilization exceeds the threshold | The average server CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Moderate | Yes | No |
      | | The average CPU iowait utilization exceeds the threshold | The average CPU iowait utilization of the server in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=60, t=1,800 | Severe | Yes | Yes |
      | | The 1-second CPU load exceeds the threshold continuously | The 1-second CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Moderate | Yes | No |
      | | The 5-second CPU load exceeds the threshold continuously | The 5-second CPU load has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=8, t=1,800 | Severe | Yes | No |
      | | The memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
      | | The swap space exceeds the threshold continuously | The server swap memory has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=0.1, t=300 | Moderate | Yes | No |
      | | The total number of system processes exceeds the threshold continuously | The total number of system processes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=10,000, t=1,800 | Severe | Yes | Yes |
      | | The average total number of fork subprocesses exceeds the threshold | The average total number of fork subprocesses in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Troubleshoot | m=5,000, t=1,800 | Moderate | Yes | No |
      | | The process does not exist due to OOM | An OOM error occurred in the process | Adjust the process heap memory size | - | Severe | No | Yes |
      | | A disk I/O error occurred (this event is not supported currently) | A disk I/O error occurred | Replace the disk | - | Fatal | Yes | Yes |
      | | The disk space utilization exceeds the threshold continuously | The disk space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
      | | The disk I/O device utilization exceeds the threshold continuously | The disk I/O device utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=85, t=1,800 | Severe | Yes | Yes |
      | | The node file handle utilization exceeds the threshold continuously | The node file handle utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=85, t=1,800 | Moderate | Yes | No |
      | | The number of TCP connections to the node exceeds the threshold continuously | The number of TCP connections to the node has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check whether there are connection leaks | m=10,000, t=1,800 | Moderate | Yes | No |
      | | The node process is missing | The node service process is missing | View the service logs to find out why the service failed to be pulled | - | Moderate | No | Yes |
      | | The node heartbeat is missing | The node heartbeat failed to be reported regularly | Troubleshoot | - | Fatal | No | Yes |
      | | Incorrect hostname | The node's hostname is incorrect | Troubleshoot | - | Fatal | No | Yes |
      | | Failed to ping the metadatabase | The TencentDB instance heartbeat failed to be reported regularly | - | - | - | - | - |
      | HDFS | The total number of HDFS files exceeds the threshold continuously | The total number of files in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
      | | The average total number of HDFS files exceeds the threshold | The average total number of files in the cluster in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory | m=50,000,000, t=1,800 | Severe | Yes | No |
      | | The total number of HDFS blocks exceeds the threshold continuously | The total number of blocks in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
      | | The average total number of HDFS blocks exceeds the threshold | The average total number of HDFS blocks in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Increase the NameNode memory or the block size | m=50,000,000, t=1,800 | Severe | Yes | No |
      | | The number of HDFS data nodes marked as dead exceeds the threshold continuously | The number of data nodes marked as dead has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=1,800 | Moderate | Yes | No |
      | | The HDFS storage space utilization exceeds the threshold continuously | The HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
      | | The average HDFS storage space utilization exceeds the threshold | The average HDFS storage space utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Clear files in HDFS or expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
      | | Active/Standby NameNodes were switched | Active/Standby NameNodes were switched | Locate the cause of the NameNode switch | - | Severe | Yes | Yes |
      | | The NameNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Severe | Yes | No |
      | | The number of current NameNode connections exceeds the threshold continuously | The number of current NameNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
      | | A full GC event occurred on a NameNode | A full GC event occurred on a NameNode | Fine-tune the parameter settings | - | Severe | Yes | Yes |
      | | The NameNode JVM memory utilization exceeds the threshold continuously | The NameNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NameNode heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
      | | The DataNode RPC request processing latency exceeds the threshold continuously | The RPC request processing latency has been greater than or equal to m milliseconds for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=300, t=300 | Moderate | Yes | No |
      | | The number of current DataNode connections exceeds the threshold continuously | The number of current DataNode connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
      | | A full GC event occurred on a DataNode | A full GC event occurred on a DataNode | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The DataNode JVM memory utilization exceeds the threshold continuously | The DataNode JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the DataNode heap memory size | m=85, t=1,800 | Moderate | Yes | Yes |
      | YARN | The number of currently missing NodeManagers in the cluster exceeds the threshold continuously | The number of currently missing NodeManagers in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Check the NodeManager process status and check whether the network connection is smooth | m=1, t=1,800 | Moderate | Yes | No |
      | | The number of pending containers exceeds the threshold continuously | The number of pending containers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Reasonably specify resources that can be used by YARN jobs | m=90, t=1,800 | Moderate | Yes | No |
      | | The cluster memory utilization exceeds the threshold continuously | The memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
      | | The average cluster memory utilization exceeds the threshold | The average memory utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
      | | The cluster CPU utilization exceeds the threshold continuously | The CPU utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | Yes |
      | | The average cluster CPU utilization exceeds the threshold | The average CPU utilization in the last t (300 ≤ t ≤ 2,592,000) seconds has been greater than or equal to m | Expand the cluster capacity | m=85, t=1,800 | Severe | Yes | No |
      | | The number of available CPU cores in each queue is below the threshold continuously | The number of available CPU cores in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1, t=1,800 | Moderate | Yes | No |
      | | The available memory in each queue is below the threshold continuously | The available memory in each queue has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Assign more resources to the queue | m=1,024, t=1,800 | Moderate | Yes | No |
      | | Active/Standby ResourceManagers were switched | Active/Standby ResourceManagers were switched | Check the ResourceManager process status and view the standby ResourceManager logs to locate the cause of the active/standby switch | - | Severe | Yes | Yes |
      | | A full GC event occurred in a ResourceManager | A full GC event occurred in a ResourceManager | Fine-tune the parameter settings | - | Severe | Yes | Yes |
      | | The ResourceManager JVM memory utilization exceeds the threshold continuously | The ResourceManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the ResourceManager heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
      | | A full GC event occurred in a NodeManager | A full GC event occurred in a NodeManager | Fine-tune the parameter settings | - | Moderate | Yes | No |
      | | The available memory in NodeManager is below the threshold continuously | The available memory in a single NodeManager has been less than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=1, t=1,800 | Moderate | Yes | No |
      | | The NodeManager JVM memory utilization exceeds the threshold continuously | The NodeManager JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the NodeManager heap memory size | m=85, t=1,800 | Moderate | Yes | No |
      | HBase | The number of regions in RIT status in the cluster exceeds the threshold continuously | The number of regions in RIT status in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | If the HBase version is below 2.0, run `hbase hbck -fixAssignments` | m=1, t=60 | Severe | Yes | Yes |
      | | The number of dead RegionServers exceeds the threshold continuously | The number of dead RegionServers has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1, t=300 | Moderate | Yes | Yes |
      | | The average number of regions in each RegionServer in the cluster exceeds the threshold continuously | The average number of regions in each RegionServer in the cluster has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Expand the node capacity or upgrade the node | m=300, t=1,800 | Moderate | Yes | Yes |
      | | A full GC event occurred on HMaster | A full GC event occurred on HMaster | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | | The HMaster JVM memory utilization exceeds the threshold continuously | The HMaster JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HMaster heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
      | | The number of current HMaster connections exceeds the threshold continuously | The number of current HMaster connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1,000, t=1,800 | Moderate | Yes | No |
      | | A full GC event occurred in RegionServer | A full GC event occurred in RegionServer | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
      | | The RegionServer JVM memory utilization exceeds the threshold continuously | The RegionServer JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the RegionServer heap memory size | m=85, t=1,800 | Moderate | Yes | No |
      | | The number of current RPC connections to RegionServer exceeds the threshold continuously | The number of current RPC connections to RegionServer has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=1,000, t=1,800 | Moderate | Yes | No |
      | | The number of RegionServer StoreFiles exceeds the threshold continuously | The number of RegionServer StoreFiles has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Run the major compaction | m=50,000, t=1,800 | Moderate | Yes | No |
      | | A full GC event occurred in HBase Thrift | A full GC event occurred in HBase Thrift | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | No |
      | | The HBase Thrift JVM memory utilization exceeds the threshold continuously | The HBase Thrift JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HBase Thrift heap memory size | m=85, t=1,800 | Moderate | Yes | No |
      | Hive | A full GC event occurred in HiveServer2 | A full GC event occurred in HiveServer2 | Fine-tune the parameter settings | m=5, t=300 | Severe | Yes | Yes |
      | | The HiveServer2 JVM memory utilization exceeds the threshold continuously | The HiveServer2 JVM memory utilization has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Adjust the HiveServer2 heap memory size | m=85, t=1,800 | Severe | Yes | Yes |
      | | A full GC event occurred in HiveMetaStore | A full GC event occurred in HiveMetaStore | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | | A full GC event occurred in HiveWebHcat | A full GC event occurred in HiveWebHcat | Fine-tune the parameter settings | m=5, t=300 | Moderate | Yes | Yes |
      | ZooKeeper | The number of ZooKeeper connections exceeds the threshold continuously | The number of ZooKeeper connections has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=65,535, t=1,800 | Moderate | Yes | No |
      | | The number of ZNodes exceeds the threshold continuously | The number of ZNodes has been greater than or equal to m for t (300 ≤ t ≤ 2,592,000) seconds continuously | Troubleshoot | m=2,000, t=1,800 | Moderate | Yes | No |
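
      The trigger policies in the table follow two recurring patterns: a metric staying at or above the threshold m for t seconds without interruption ("exceeds the threshold continuously"), and the average of a metric over the last t seconds reaching m ("the average ... exceeds the threshold"). The sketch below is not EMR's implementation; it is only a minimal model of those two semantics, assuming metric samples arrive as (timestamp, value) pairs.

      ```python
      from collections import deque


      class ContinuousThresholdTrigger:
          """Fires when the metric has been >= m for at least t seconds
          without interruption (the "exceeds the threshold continuously" rows)."""

          def __init__(self, m: float, t_seconds: int):
              self.m = m
              self.t = t_seconds
              self.breach_start = None  # timestamp when the metric first reached m

          def add_sample(self, ts: float, value: float) -> bool:
              if value >= self.m:
                  if self.breach_start is None:
                      self.breach_start = ts
                  return ts - self.breach_start >= self.t
              self.breach_start = None  # any dip below m resets the timer
              return False


      class AverageThresholdTrigger:
          """Fires when the average of the samples in the last t seconds is >= m
          (the "the average ... exceeds the threshold" rows)."""

          def __init__(self, m: float, t_seconds: int):
              self.m = m
              self.t = t_seconds
              self.window = deque()  # (timestamp, value) pairs within the window

          def add_sample(self, ts: float, value: float) -> bool:
              self.window.append((ts, value))
              # Drop samples that have fallen out of the evaluation window.
              while self.window and self.window[0][0] < ts - self.t:
                  self.window.popleft()
              avg = sum(v for _, v in self.window) / len(self.window)
              return avg >= self.m


      # Example: a CPU-utilization event with the default policy m=85, t=1,800.
      cpu_trigger = ContinuousThresholdTrigger(m=85, t_seconds=1800)
      for ts, cpu in [(0, 90), (900, 92), (1800, 88)]:
          if cpu_trigger.add_sample(ts, cpu):
              print(f"event would fire at t={ts}s")
      ```

      The practical difference between the two patterns is the reset behavior: a single sample that dips below m restarts the clock for a "continuously" event, whereas an "average" event can still fire as long as the windowed mean stays at or above m.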