Instance Events

Last updated: 2026-04-30 15:20:58

Feature Overview
Instance events include event lists and event policies.
Event list: Records key change events or abnormal events that occur in instances.
Event policy: Supports customizing event monitoring trigger policies based on business scenarios.
Viewing the Event List
1. Log in to the EMR Serverless TCBase console, find the target instance in the instance list, and click Monitor to go to the Monitoring and Alarms page.
2. On the Monitoring and Alarms page, select Instance Events to directly view all operation events of the current instance.
The descriptions of severity levels are as follows:
Fatal: Abnormal node or service events that require manual intervention; without it, services become unavailable. Such events may persist for a period of time.
Critical: Warning-level events that have not yet made a service or node unavailable. If left unaddressed, they may lead to fatal events.
Normal: Records routine events occurring in clusters, which generally require no special handling.
3. Click the value in the Daily Number of Triggers column to view event trigger records, as well as associated metrics, logs, or snapshots.
Configuring Event Policies
1. Log in to the EMR Serverless TCBase console, find the instance you want to configure in the instance list, and click Monitor to go to the Monitoring and Alarms page.
2. On the Monitoring and Alarms page, select Event Policy to customize event monitoring trigger policies.
3. The event configuration list shows the event name, event detection policy, severity (fatal/critical/normal), and monitoring status (whether monitoring is enabled). These settings can be modified and saved.
4. Event detection policies are categorized into two types: one type consists of system-fixed policy events that cannot be modified by users; the other type varies based on customer business standards and supports user configuration.
5. Event policies let you choose whether to enable monitoring for each event. Only events with monitoring enabled can be selected as items in cluster inspections. Some events are enabled by default and can be disabled, some are enabled by default and cannot be disabled, and the rest are disabled by default. Specific rules are as follows:
Category	Event Name	Event Description	Suggestion and Measure	Default Value	Severity Level	Allow Disable	Enabled by Default
Node	CPU utilization continuously exceeding the threshold	The machine CPU utilization has been greater than or equal to m for a duration of t seconds (where t is between 300 and 2,592,000).	Scale out the node or upgrade the configuration.	m=85, t=1800	Critical	Yes	Yes
Instance	Node role process restarted	The node role process restarted.	Manually troubleshoot the issue.	-	Normal	No	Yes
Instance	Process killed by OOMKiller	The process was killed by OOMKiller.	1. Check the system resource usage by running the top or htop command to view the system CPU, memory, and disk usage. Confirm whether there are memory leaks or resource contention issues. 2. Analyze Java heap memory usage and adjust JVM parameters. 3. Increase the node memory.	-	Critical	Yes	Yes
TCBase	Database access unavailable	The PostgreSQL database has failed the liveness probe n consecutive times.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Fatal	Yes	Yes
TCBase	API Gateway access unavailable	API Gateway (Kong) has failed the liveness probe continuously.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Fatal	Yes	Yes
TCBase	Database HA primary/replica switchover	A primary/replica switchover has occurred in PostgreSQL.	Usually self-healing; if the issue persists, submit a ticket for consultation.	-	Critical	Yes	Yes
TCBase	HA cluster having no Leader	There has been no PostgreSQL Leader node continuously within the detection cycle, making the database unwritable.	Usually self-healing; if the issue persists, submit a ticket for consultation.	no_leader_count=2	Fatal	Yes	Yes
TCBase	Primary/replica replication delay being too high	The primary/replica replication delay in PostgreSQL has continuously exceeded the threshold, posing a risk to replica database data consistency.	Check the write pressure on the primary database.	lag_threshold_sec=30, sample_count=2	Critical	Yes	Yes
TCBase	WAL Receiver stream interruption	The WAL Receiver on the replica database is not in the streaming status, and primary/replica replication is interrupted.	Usually self-healing; if the issue persists, submit a ticket for consultation.	sample_count=2	Critical	Yes	Yes
TCBase	Abnormal Patroni node status	The Patroni node status is abnormal, which may affect the PostgreSQL HA feature.	Usually self-healing; if the issue persists, submit a ticket for consultation.	sample_count=2	Critical	Yes	Yes
TCBase	etcd unavailable	The etcd cluster has experienced continuous liveness probe anomalies, which may affect the PostgreSQL HA feature.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Fatal	Yes	Yes
TCBase	Number of database connections being too high	The PostgreSQL connection utilization has continuously exceeded the threshold, which may lead to new connections being rejected.	Check for connection leaks and increase max_connections.	usage_pct=80, sample_count=2	Critical	Yes	Yes
TCBase	Frequent deadlocks	The increment of PostgreSQL deadlocks within the detection cycle has exceeded the threshold, indicating concurrent transaction conflicts.	Analyze query patterns, check lock sequences, and optimize transaction isolation levels.	deadlock_count=5	Normal	Yes	No
TCBase	Cache hit ratio being too low	The PostgreSQL cache hit ratio has been continuously lower than the threshold, leading to heavy disk reads and degraded performance.	Increase shared_buffers, analyze query patterns, and add indexes.	hit_ratio_threshold=90, sample_count=2	Normal	Yes	No
TCBase	Authentication service unavailable	The TCBase authentication service has failed the liveness probe continuously, affecting user authentication, registration, and JWT issuance features.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Critical	Yes	Yes
TCBase	REST API service unavailable	The PostgREST component has failed the liveness probe continuously, and REST API-related requests may be affected.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Critical	Yes	Yes
TCBase	Realtime service unavailable	The Realtime service has failed the liveness probe continuously, and WebSocket subscriptions and real-time push notifications may be affected.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Critical	Yes	Yes
TCBase	Storage service unavailable	The Storage service has failed the liveness probe continuously, and file upload, file download, and S3 protocol access may be affected.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Critical	Yes	Yes
TCBase	Abnormal component running status	A TCBase component has failed the liveness probe continuously.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Normal	Yes	Yes
TCBase	Database ping latency being too high	The database ping latency has continuously exceeded the threshold, potentially due to I/O bottlenecks or high load.	Check whether the database request pressure is too high.	latency_threshold=100ms, sample_count=2	Normal	Yes	No
TCBase	Database capacity alert	The machine disk usage has exceeded the threshold. Storage capacity requires attention.	Delete unnecessary data.	size_threshold=10737418240 (10 GB)	Normal	Yes	No
TCBase	HAProxy unavailable	The HAProxy proxy has failed the liveness probe continuously, and database access via HAProxy may be affected.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Fatal	Yes	Yes
TCBase	Studio management panel unavailable	The Studio management panel has failed the liveness probe continuously, and the web management interface may be affected.	Usually self-healing; if the issue persists, submit a ticket for consultation.	failure_count=3	Normal	Yes	Yes
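
The default values in the rules above suggest two detection patterns: a counter of consecutive liveness probe failures (failure_count=3) and a threshold that must hold for a duration (m=85, t=1800 for CPU utilization). The console does not document how these checks are implemented, so the following Python is only a minimal sketch of one plausible reading. The class name `ConsecutiveFailureRule`, the function `sustained_at_or_above`, and the sample format are hypothetical and are not part of any TCBase API.

```python
from dataclasses import dataclass, field


@dataclass
class ConsecutiveFailureRule:
    """Hypothetical model: fire after `failure_count` failed probes in a row."""
    failure_count: int = 3
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, probe_ok: bool) -> bool:
        # A successful probe resets the streak; a failed probe extends it.
        self._streak = 0 if probe_ok else self._streak + 1
        return self._streak >= self.failure_count


def sustained_at_or_above(samples, m=85.0, t=1800):
    """Return True if the latest run of samples stays >= m for at least t seconds.

    `samples` is a time-sorted list of (timestamp_seconds, value) pairs.
    This sample format is an assumption made for illustration.
    """
    if not 300 <= t <= 2_592_000:
        raise ValueError("t must be between 300 and 2,592,000 seconds")
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample while the value stays at or above m.
    for ts, value in reversed(samples):
        if value < m:
            break
        run_start = ts
    return run_start is not None and latest_ts - run_start >= t


if __name__ == "__main__":
    rule = ConsecutiveFailureRule(failure_count=3)
    for ok in (False, False, False):
        fired = rule.observe(ok)
    print("probe rule fired:", fired)

    cpu = [(i * 60, 90.0) for i in range(31)]  # one sample per minute, 0..1800 s
    print("cpu rule fired:", sustained_at_or_above(cpu, m=85, t=1800))
```

In this sketch a single successful probe or a single sample below m resets the detection. That matches the "consecutive" and "continuously" wording in the event descriptions, but the service may apply a different sampling interval or tolerance.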
