
Suggestions for Configuring Monitors and Alarms

Last updated: 2019-11-12 17:40:54


ES provides a number of monitoring metrics that reflect the health of a running cluster, and it also lets you configure alarms on key metrics so that you can identify and address cluster problems promptly. For more information, see Viewing Monitoring Metrics and Configuring Alarms.
This document describes the metrics that deserve special attention when you use an ES cluster, along with recommended alarm configurations:

Cluster health status
Suggested alarm configuration: The statistical period is 1 minute. If the value is >= 1 in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: Value range:
  • 0: Green. All primary and replica shards are available, and the cluster is in its healthiest state.
  • 1: Yellow. All primary shards are available, but some replica shards are not. Search results are still complete, but the cluster's high availability is reduced and there is a higher risk of data loss.
  • 2: Red. At least one primary shard and all of its replicas are unavailable. Some data has become unavailable, searches can return only partial results, and requests routed to a lost shard return an exception.
The cluster health status is the most direct indication of the cluster's current condition. If it turns yellow or red, troubleshoot and fix the problem promptly to prevent data loss or service unavailability.

Avg disk utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 80% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the disk utilization values across all nodes in the cluster. If a node's disk utilization is too high, the node no longer has enough disk capacity for the shards allocated to it, causing basic operations such as index creation and document writes to fail. We recommend clearing data or scaling out the cluster promptly once this value exceeds 75%.

Avg JVM memory utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 85% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the JVM memory utilization values across all nodes in the cluster. Excessively high JVM memory utilization can cause read and write requests to be rejected, frequent GC, or even OOM. When this value exceeds the threshold, we recommend upgrading the node specification through vertical scaling.

Avg CPU utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 90% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the CPU utilization values across all nodes in the cluster. Excessively high CPU utilization degrades the processing capability of the cluster nodes and can even cause downtime. If this value stays too high, upgrade the node specification or reduce the request volume based on your cluster's current node configuration and your business needs.

Bulk rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The percentage of rejected bulk operations among all bulk operations performed by the cluster during a single period. A value greater than 0% means one or more bulk rejections have occurred: the cluster has reached the limit of its capacity to process bulk operations, or an exception has occurred. Troubleshoot and fix the problem promptly; otherwise, bulk operations will be affected and data may be lost.

Query rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The percentage of rejected query operations among all query operations performed by the cluster during a single period. A value greater than 0% means one or more query rejections have occurred: the cluster has reached the limit of its capacity to process query operations, or an exception has occurred. Troubleshoot and fix the problem promptly; otherwise, query operations will be affected.
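The alarm rules above all share the same shape: sample a metric once per statistical period, fire when the threshold is breached for N consecutive periods, and then suppress repeat alarms for a cooldown window. If you want to reproduce this logic outside the console (for example, in your own monitoring script), a minimal sketch looks like the following. The class name and the numeric health mapping (0 = green, 1 = yellow, 2 = red, as in the table above) are illustrative; the console's internal implementation may differ.

```python
from collections import deque
from dataclasses import dataclass, field

# Numeric values for cluster health status, as used in the table above.
HEALTH_VALUE = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class ConsecutivePeriodAlarm:
    """Fire when a metric breaches its threshold in N consecutive periods.

    Mirrors rules such as "if this value is >= 1 in 5 consecutive periods,
    an alarm is triggered once every 30 minutes" (with 1-minute periods,
    the cooldown is 30 periods).
    """
    threshold: float
    periods: int = 5
    cooldown_periods: int = 30
    _recent: deque = field(default_factory=deque)
    _cooldown_left: int = 0

    def observe(self, value: float) -> bool:
        """Feed one period's sample; return True if an alarm fires now."""
        self._recent.append(value >= self.threshold)
        if len(self._recent) > self.periods:
            self._recent.popleft()
        if self._cooldown_left > 0:
            # Still inside the 30-minute suppression window.
            self._cooldown_left -= 1
            return False
        breached = len(self._recent) == self.periods and all(self._recent)
        if breached:
            self._cooldown_left = self.cooldown_periods
        return breached


# Example: the cluster-health rule (alarm when status >= 1, i.e. yellow or red).
health_alarm = ConsecutivePeriodAlarm(threshold=HEALTH_VALUE["yellow"])
samples = ["green", "yellow", "yellow", "yellow", "yellow", "yellow"]
fired = [health_alarm.observe(HEALTH_VALUE[s]) for s in samples]
# The alarm fires only once five consecutive yellow/red samples accumulate:
# fired == [False, False, False, False, False, True]
```

The same class covers the utilization rules by changing the threshold (e.g. `ConsecutivePeriodAlarm(threshold=80.0)` for avg disk utilization) and the rejection-rate rules by setting `periods=1`.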
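The console reports the bulk and query rejection rates directly, but you can also approximate them yourself by diffing the cumulative `rejected` and `completed` counters that Elasticsearch exposes per thread pool via `GET _nodes/stats/thread_pool` (for the `write`/`bulk` and `search` pools, depending on your ES version). A minimal sketch follows; the counter names come from the Elasticsearch API, but the exact formula the console uses may differ.

```python
def rejection_rate(prev: dict, curr: dict) -> float:
    """Percentage of rejected operations between two cumulative snapshots.

    `prev` and `curr` are assumed to be dicts holding the cumulative
    "completed" and "rejected" counters for one thread pool, taken at the
    start and end of a statistical period.
    """
    rejected = curr["rejected"] - prev["rejected"]
    completed = curr["completed"] - prev["completed"]
    total = rejected + completed
    if total == 0:
        return 0.0
    return 100.0 * rejected / total


# Example with hypothetical snapshots taken one minute apart:
prev = {"completed": 1000, "rejected": 0}
curr = {"completed": 1190, "rejected": 10}
rate = rejection_rate(prev, curr)  # 10 rejections out of 200 operations -> 5.0
```

Per the alarm configuration above, any value greater than 0% in a single period warrants an alarm, since even one rejection means the cluster has hit its processing limit or encountered an exception.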