Instance Events

Last updated: 2026-04-30 15:20:58

Feature Overview

Instance events include event lists and event policies.
Event list: Records key change events or abnormal events that occur in instances.
Event policy: Supports customizing event monitoring trigger policies based on business scenarios.

Viewing the Event List

1. Log in to the EMR Serverless TCBase console, find the instance you need to view from the instance list, and click Monitor to go to the Monitoring and Alarms page.
2. On the Monitoring and Alarms page, select Instance Events to directly view all operation events of the current instance.
The descriptions of severity levels are as follows:
Fatal: Abnormal node or service events that require manual intervention; without it, services become unavailable. Such events may persist for a period of time.
Critical: Warning-level events that have not yet made a service or node unavailable. If left unaddressed, they may escalate to fatal events.
Normal: Records routine events occurring in clusters, which generally require no special handling.
3. Click the value in the Daily Number of Triggers column to view event trigger records, as well as associated metrics, logs, or snapshots.
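
The three severity levels above form a natural ordering, which can help when triaging events outside the console. The following is a minimal sketch only: the `EventRecord` shape and its field names are hypothetical and are not an export format documented here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the event list, ordered by urgency."""
    NORMAL = 1
    CRITICAL = 2
    FATAL = 3


@dataclass(frozen=True)
class EventRecord:
    # Hypothetical record shape; field names are illustrative only.
    name: str
    severity: Severity
    daily_triggers: int


def needs_attention(events, minimum=Severity.CRITICAL):
    """Return events at or above `minimum`, most severe first."""
    selected = [e for e in events if e.severity >= minimum]
    return sorted(selected, key=lambda e: (-e.severity, -e.daily_triggers))


events = [
    EventRecord("Node role process restarted", Severity.NORMAL, 4),
    EventRecord("etcd unavailable", Severity.FATAL, 1),
    EventRecord("Process killed by OOMKiller", Severity.CRITICAL, 2),
]
for event in needs_attention(events):
    print(event.severity.name, event.name)
```

Using an ordered enumeration keeps the comparison explicit, so routine (Normal) events can be filtered out with a single threshold.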

Configuring Event Policies

1. Log in to the EMR Serverless TCBase console, find the instance you need to view from the instance list, and click Monitor to go to the Monitoring and Alarms page.
2. On the Monitoring and Alarms page, select Event Policy to customize event monitoring trigger policies.
3. The event configuration list includes the event name, event detection policy, severity (fatal/critical/normal), and monitoring status (enabled/disabled), which can be modified and saved.
4. Event detection policies fall into two types: system-fixed policies, which users cannot modify, and policies that depend on the customer's business requirements, which users can configure.
5. Event policies let you choose whether to enable monitoring for each event. Only events with monitoring enabled can be selected as items in cluster inspections. Some events are enabled by default, some are disabled by default, and some are enabled by default and cannot be disabled. Specific rules are as follows:
| Category | Event Name | Event Description | Suggestion and Measure | Default Value | Severity Level | Allow Disable | Enabled by Default |
|---|---|---|---|---|---|---|---|
| Node | CPU utilization continuously exceeding the threshold | The machine CPU utilization has been greater than or equal to m for a duration of t seconds (where t is between 300 and 2,592,000). | Scale out the node or upgrade the configuration. | m=85, t=1800 | Critical | Yes | Yes |
| Instance | Node role process restarted | The node role process restarted. | Manually troubleshoot the issue. | - | Normal | No | Yes |
| Instance | Process killed by OOMKiller | The process was killed by OOMKiller. | 1. Check the system resource usage by running the top or htop command to view the system CPU, memory, and disk usage. Confirm whether there are memory leaks or resource contention issues. 2. Analyze Java heap memory usage and adjust JVM parameters. 3. Increase the node memory. | - | Critical | Yes | Yes |
| TCBase | Database access unavailable | The PostgreSQL database has failed the liveness probe for n consecutive times. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| TCBase | API Gateway access unavailable | API Gateway (Kong) has failed the liveness probe continuously. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| TCBase | Database HA primary/replica switchover | A primary/replica switchover has occurred in PostgreSQL. | Usually self-healing; if the issue persists, submit a ticket for consultation. | - | Critical | Yes | Yes |
| TCBase | HA cluster having no Leader | There has been no PostgreSQL Leader node continuously within the detection cycle, making the database unwritable. | Usually self-healing; if the issue persists, submit a ticket for consultation. | no_leader_count=2 | Fatal | Yes | Yes |
| TCBase | Primary/replica replication delay being too high | The primary/replica replication delay in PostgreSQL has continuously exceeded the threshold, posing a risk to replica database data consistency. | Check the write pressure on the primary database. | lag_threshold_sec=30, sample_count=2 | Critical | Yes | Yes |
| TCBase | WAL Receiver stream interruption | The WAL Receiver on the replica database is not in the streaming status, and primary/replica replication is interrupted. | Usually self-healing; if the issue persists, submit a ticket for consultation. | sample_count=2 | Critical | Yes | Yes |
| TCBase | Abnormal Patroni node status | The Patroni node status is abnormal, which may affect the PostgreSQL HA feature. | Usually self-healing; if the issue persists, submit a ticket for consultation. | sample_count=2 | Critical | Yes | Yes |
| TCBase | etcd unavailable | The etcd cluster has experienced continuous liveness probe anomalies, which may affect the PostgreSQL HA feature. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| TCBase | Number of database connections being too high | The PostgreSQL connection utilization has continuously exceeded the threshold, which may lead to new connections being rejected. | Check for connection leaks and increase max_connections. | usage_pct=80, sample_count=2 | Critical | Yes | Yes |
| TCBase | Frequent deadlocks | The increment of PostgreSQL deadlocks within the detection cycle has exceeded the threshold, indicating concurrent transaction conflicts. | Analyze query patterns, check lock sequences, and optimize transaction isolation levels. | deadlock_count=5 | Normal | Yes | No |
| TCBase | Cache hit ratio being too low | The PostgreSQL cache hit ratio has been continuously lower than the threshold, leading to heavy disk reads and degraded performance. | Increase shared_buffers, analyze query patterns, and add indexes. | hit_ratio_threshold=90, sample_count=2 | Normal | Yes | No |
| TCBase | Authentication service unavailable | The TCBase authentication service has failed the liveness probe continuously, affecting user authentication, registration, and JWT issuance features. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| TCBase | REST API service unavailable | The PostgREST component has failed the liveness probe continuously, and REST API-related requests may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| TCBase | Realtime service unavailable | The Realtime service has failed the liveness probe continuously, and WebSocket subscriptions and real-time push notifications may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| TCBase | Storage service unavailable | The Storage service has failed the liveness probe continuously, and file upload, file download, and S3 protocol may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| TCBase | Abnormal component running status | A certain TCBase component has failed the liveness probe continuously. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Normal | Yes | Yes |
| TCBase | Database ping latency being too high | The database ping latency has continuously exceeded the threshold, potentially due to I/O bottlenecks or high load. | Check whether the database request pressure is too high. | latency_threshold=100ms, sample_count=2 | Normal | Yes | No |
| TCBase | Database capacity alert | The machine disk usage has exceeded the threshold. Storage capacity requires attention. | Delete unnecessary data. | size_threshold=10737418240 (10GB) | Normal | Yes | No |
| TCBase | HAProxy unavailable | The HAProxy proxy has failed the liveness probe continuously, and database access via HAProxy may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| TCBase | Studio management panel unavailable | The Studio management panel has failed the liveness probe continuously, and the web management interface may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Normal | Yes | Yes |
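
Many of the default values above pair a threshold with a count such as `sample_count` or `failure_count`, and the descriptions say the condition must hold "continuously". The sketch below illustrates one plausible reading of that rule: an event fires only after the required number of consecutive breaching samples. It is not the service's actual detection logic, which is not documented here. The `parse_defaults` helper and `ConsecutiveDetector` class are hypothetical names, and the strict `>` comparison is an assumption (the CPU event, for example, is described as "greater than or equal to").

```python
from dataclasses import dataclass


def parse_defaults(text):
    """Parse a default-value cell such as 'usage_pct=80, sample_count=2'.

    Values are kept as strings because some carry units (e.g. '100ms').
    A cell containing only '-' yields an empty dict.
    """
    params = {}
    for part in text.split(","):
        key, _, value = part.strip().partition("=")
        if key and value:
            params[key] = value
    return params


@dataclass
class ConsecutiveDetector:
    """Fires once `required` consecutive samples exceed `threshold`."""
    threshold: float
    required: int
    streak: int = 0

    def observe(self, value):
        # Assumption: a sample breaches when strictly above the threshold.
        # Any non-breaching sample resets the streak.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required


# Defaults of the "Number of database connections being too high" event.
params = parse_defaults("usage_pct=80, sample_count=2")
detector = ConsecutiveDetector(
    threshold=float(params["usage_pct"]),
    required=int(params["sample_count"]),
)
results = [detector.observe(v) for v in [85, 70, 82, 91]]
print(results)
```

In this run the single spike to 85 does not fire because the next sample drops to 70; only the two consecutive samples 82 and 91 do. Requiring consecutive samples is a common way to avoid alerting on brief spikes.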
