
Suggestions for Configuring Monitors and Alarms

Last updated: 2019-11-12 17:40:54


ES provides a number of monitoring metrics that reflect the health of a running cluster, and it also lets you configure alarms on key metrics so that you can identify and address cluster problems promptly. For more information, see Viewing Monitoring Metrics and Configuring Alarms.
This document describes the metrics that deserve special attention when you use an ES cluster, along with recommended alarm configurations:

Cluster health status
Suggested alarm configuration: The statistical period is 1 minute. If the value is >= 1 in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: Value range:
  • 0: Green. All primary and replica shards are available, and the cluster is in its healthiest state.
  • 1: Yellow. All primary shards are available, but some replica shards are not. Search results are still complete, but the cluster's high availability is reduced and there is a higher risk of data loss.
  • 2: Red. At least one primary shard and all of its replicas are unavailable. Some data has become unavailable, searches can return only partial results, and requests routed to a lost shard return an exception.
The cluster health status is the most direct indication of the cluster's current condition. If it turns yellow or red, troubleshoot and fix the problem promptly to prevent data loss or service unavailability.

Avg disk utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 80% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the disk utilization values across all nodes in the cluster. If a node's disk utilization is too high, the node no longer has enough disk capacity for the shards allocated to it, causing basic operations such as index creation and document writes to fail. We recommend clearing data or scaling out the cluster promptly once this value exceeds 75%.

Avg JVM memory utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 85% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the JVM memory utilization values across all nodes in the cluster. Excessively high JVM memory utilization can cause read and write requests to be rejected, frequent GC, or even OOM. When this value exceeds the threshold, we recommend upgrading the node specification through vertical scaling.

Avg CPU utilization
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 90% in 5 consecutive periods, an alarm is triggered once every 30 minutes.
Description: The average of the CPU utilization values across all nodes in the cluster. Excessively high CPU utilization degrades the processing capability of the cluster nodes and can even cause downtime. If this value stays too high, upgrade the node specification or reduce the request volume based on your cluster's current node configuration and your business needs.

Bulk rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The percentage of rejected bulk operations among all bulk operations performed by the cluster during a single period. A value greater than 0% means one or more bulk rejections have occurred: the cluster has reached the limit of its capacity to process bulk operations, or an exception has occurred. Troubleshoot and fix the problem promptly; otherwise, bulk operations will be affected and data may be lost.

Query rejection rate
Suggested alarm configuration: The statistical period is 1 minute. If the value is > 0% in one period, an alarm is triggered once every 30 minutes.
Description: The percentage of rejected query operations among all query operations performed by the cluster during a single period. A value greater than 0% means one or more query rejections have occurred: the cluster has reached the limit of its capacity to process query operations, or an exception has occurred. Troubleshoot and fix the problem promptly; otherwise, query operations will be affected.
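The alarm rules above all share the same shape: sample a metric once per statistical period, fire when the threshold is breached for N consecutive periods, and then suppress repeat alarms for a cooldown window. If you want to reproduce this logic outside the console (for example, in your own monitoring script), a minimal sketch looks like the following. The class name and the numeric health mapping (0 = green, 1 = yellow, 2 = red, as in the table above) are illustrative; the console's internal implementation may differ.

```python
from collections import deque
from dataclasses import dataclass, field

# Numeric values for cluster health status, as used in the table above.
HEALTH_VALUE = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class ConsecutivePeriodAlarm:
    """Fire when a metric breaches its threshold in N consecutive periods.

    Mirrors rules such as "if this value is >= 1 in 5 consecutive periods,
    an alarm is triggered once every 30 minutes" (with 1-minute periods,
    the cooldown is 30 periods).
    """
    threshold: float
    periods: int = 5
    cooldown_periods: int = 30
    _recent: deque = field(default_factory=deque)
    _cooldown_left: int = 0

    def observe(self, value: float) -> bool:
        """Feed one period's sample; return True if an alarm fires now."""
        self._recent.append(value >= self.threshold)
        if len(self._recent) > self.periods:
            self._recent.popleft()
        if self._cooldown_left > 0:
            # Still inside the 30-minute suppression window.
            self._cooldown_left -= 1
            return False
        breached = len(self._recent) == self.periods and all(self._recent)
        if breached:
            self._cooldown_left = self.cooldown_periods
        return breached


# Example: the cluster-health rule (alarm when status >= 1, i.e. yellow or red).
health_alarm = ConsecutivePeriodAlarm(threshold=HEALTH_VALUE["yellow"])
samples = ["green", "yellow", "yellow", "yellow", "yellow", "yellow"]
fired = [health_alarm.observe(HEALTH_VALUE[s]) for s in samples]
# The alarm fires only once five consecutive yellow/red samples accumulate:
# fired == [False, False, False, False, False, True]
```

The same class covers the utilization rules by changing the threshold (e.g. `ConsecutivePeriodAlarm(threshold=80.0)` for avg disk utilization) and the rejection-rate rules by setting `periods=1`.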
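The console reports the bulk and query rejection rates directly, but you can also approximate them yourself by diffing the cumulative `rejected` and `completed` counters that Elasticsearch exposes per thread pool via `GET _nodes/stats/thread_pool` (for the `write`/`bulk` and `search` pools, depending on your ES version). A minimal sketch follows; the counter names come from the Elasticsearch API, but the exact formula the console uses may differ.

```python
def rejection_rate(prev: dict, curr: dict) -> float:
    """Percentage of rejected operations between two cumulative snapshots.

    `prev` and `curr` are assumed to be dicts holding the cumulative
    "completed" and "rejected" counters for one thread pool, taken at the
    start and end of a statistical period.
    """
    rejected = curr["rejected"] - prev["rejected"]
    completed = curr["completed"] - prev["completed"]
    total = rejected + completed
    if total == 0:
        return 0.0
    return 100.0 * rejected / total


# Example with hypothetical snapshots taken one minute apart:
prev = {"completed": 1000, "rejected": 0}
curr = {"completed": 1190, "rejected": 10}
rate = rejection_rate(prev, curr)  # 10 rejections out of 200 operations -> 5.0
```

Per the alarm configuration above, any value greater than 0% in a single period warrants an alarm, since even one rejection means the cluster has hit its processing limit or encountered an exception.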