| Category | Event Name | Event Description | Suggestion and Measure | Default Value | Severity Level | Allow Disable | Enabled by Default |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Node | CPU utilization continuously exceeding the threshold | The machine CPU utilization has been greater than or equal to m% for a duration of t seconds (where t is between 300 and 2,592,000). | Scale out the node or upgrade its configuration. | m=85, t=1800 | Critical | Yes | Yes |
| Instance | Node role process restarted | The node role process restarted. | Manually troubleshoot the issue. | - | Normal | No | Yes |
| | Process killed by OOMKiller | The process was killed by OOMKiller. | 1. Check system resource usage by running top or htop to view CPU, memory, and disk usage, and confirm whether there are memory leaks or resource contention issues. 2. Analyze Java heap memory usage and adjust JVM parameters. 3. Increase the node memory. | - | Critical | Yes | Yes |
| TCBase | Database access unavailable | The PostgreSQL database has failed the liveness probe n consecutive times. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| | API Gateway access unavailable | API Gateway (Kong) has failed the liveness probe continuously. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| | Database HA primary/replica switchover | A primary/replica switchover has occurred in PostgreSQL. | Usually self-healing; if the issue persists, submit a ticket for consultation. | - | Critical | Yes | Yes |
| | HA cluster having no Leader | No PostgreSQL Leader node has been present throughout the detection cycle, leaving the database unwritable. | Usually self-healing; if the issue persists, submit a ticket for consultation. | no_leader_count=2 | Fatal | Yes | Yes |
| | Primary/replica replication delay being too high | The primary/replica replication delay in PostgreSQL has continuously exceeded the threshold, posing a risk to replica data consistency. | Check the write pressure on the primary database (see the replication-lag queries after this table). | lag_threshold_sec=30, sample_count=2 | Critical | Yes | Yes |
| | WAL Receiver stream interruption | The WAL Receiver on the replica is not in the streaming state, and primary/replica replication is interrupted. | Usually self-healing; if the issue persists, submit a ticket for consultation. | sample_count=2 | Critical | Yes | Yes |
| | Abnormal Patroni node status | The Patroni node status is abnormal, which may affect the PostgreSQL HA feature. | Usually self-healing; if the issue persists, submit a ticket for consultation. | sample_count=2 | Critical | Yes | Yes |
| | etcd unavailable | The etcd cluster has experienced continuous liveness probe anomalies, which may affect the PostgreSQL HA feature. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| | Number of database connections being too high | The PostgreSQL connection utilization has continuously exceeded the threshold, which may lead to new connections being rejected. | Check for connection leaks and, if needed, increase max_connections (see the connection-usage queries after this table). | usage_pct=80, sample_count=2 | Critical | Yes | Yes |
| | Frequent deadlocks | The increase in PostgreSQL deadlocks within the detection cycle has exceeded the threshold, indicating concurrent transaction conflicts. | Analyze query patterns, check lock acquisition order, and optimize transaction isolation levels (see the deadlock query after this table). | deadlock_count=5 | Normal | Yes | No |
| | Cache hit ratio being too low | The PostgreSQL cache hit ratio has been continuously below the threshold, causing heavy disk reads and degraded performance. | Increase shared_buffers, analyze query patterns, and add indexes (see the cache-hit query after this table). | hit_ratio_threshold=90, sample_count=2 | Normal | Yes | No |
| | Authentication service unavailable | The TCBase authentication service has failed the liveness probe continuously, affecting user authentication, registration, and JWT issuance. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| | REST API service unavailable | The PostgREST component has failed the liveness probe continuously, and REST API requests may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| | Realtime service unavailable | The Realtime service has failed the liveness probe continuously, and WebSocket subscriptions and real-time push notifications may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| | Storage service unavailable | The Storage service has failed the liveness probe continuously, and file upload, file download, and S3 protocol access may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Critical | Yes | Yes |
| | Abnormal component running status | A TCBase component has failed the liveness probe continuously. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Normal | Yes | Yes |
| | Database ping latency being too high | The database ping latency has continuously exceeded the threshold, potentially due to I/O bottlenecks or high load. | Check whether the database request pressure is too high. | latency_threshold=100ms, sample_count=2 | Normal | Yes | No |
| | Database capacity alert | Disk usage on the machine has exceeded the threshold, and storage capacity requires attention. | Delete unnecessary data (see the database-size queries after this table). | size_threshold=10737418240 (10 GB) | Normal | Yes | No |
| | HAProxy unavailable | HAProxy has failed the liveness probe continuously, and database access via HAProxy may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Fatal | Yes | Yes |
| | Studio management panel unavailable | The Studio management panel has failed the liveness probe continuously, and the web management interface may be affected. | Usually self-healing; if the issue persists, submit a ticket for consultation. | failure_count=3 | Normal | Yes | Yes |
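
Several of the suggestions above can be checked by hand with PostgreSQL's built-in statistics views. The queries below are minimal sketches using only standard system views and functions (PostgreSQL 10 or later); they are illustrative, not a TCBase-specific interface.

For the replication-delay and WAL Receiver events, replication lag can be inspected as follows:

```sql
-- On the primary: replication state and replay lag in bytes per replica.
-- The state column should read 'streaming' for a healthy WAL Receiver.
SELECT client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a replica: approximate replay delay in seconds, comparable to the
-- lag_threshold_sec=30 default above.
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
```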
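
For the connection-count event, connection utilization and a per-state breakdown (useful for spotting leaks) can be read from pg_stat_activity:

```sql
-- Connection usage relative to max_connections; compare the result with
-- the alert's usage_pct=80 default.
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS usage_pct
FROM pg_stat_activity;

-- Breakdown by state; many long-lived 'idle' sessions often point to a
-- connection pool that is not releasing connections.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```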
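
For the deadlock and cache-hit-ratio events, pg_stat_database exposes both counters. Note that the deadlocks column is cumulative, while the alert fires on its increase within one detection cycle:

```sql
-- Cumulative deadlock count for the current database.
SELECT deadlocks
FROM pg_stat_database
WHERE datname = current_database();

-- Overall buffer-cache hit ratio in percent, comparable to the
-- hit_ratio_threshold=90 default.
SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
       AS hit_ratio_pct
FROM pg_stat_database;
```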
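
For the database capacity alert, per-database and per-table sizes help identify what to clean up:

```sql
-- On-disk size of each database, for comparison against the
-- size_threshold=10737418240 (10 GB) default.
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Ten largest tables in the current database: candidates for cleanup.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```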