TDMQ for CKafka (CKafka) provides a comprehensive observability system, including monitoring and alarms, event recording, and one-click diagnosis. It helps customers quickly detect, locate, and resolve issues to ensure stable business operations.
Monitoring and Alarms
Monitoring Capabilities
CKafka builds its monitoring on Tencent Cloud Observability Platform (TCOP), which monitors in real time the resources created under your account, such as instances, topics, and consumer groups. The metrics show cluster resource usage, connection counts, and message backlogs, helping you gauge how much of the cluster's capacity is in use and identify risks in advance.
The monitoring capabilities available to you depend on the instance edition you purchased:
| Monitoring Type | Supported Editions | Description | Use Cases |
| --- | --- | --- | --- |
| Basic monitoring | All editions | View monitoring metrics across three dimensions: instances, topics, and consumer groups. | Cluster-level metrics for basic Ops scenarios, such as helping identify issues and planning cluster capacity. |
| Advanced monitoring | Pro Edition | View node-level monitoring metrics of the instance, such as core services, production, consumption, instance resources, and broker GC. | Node-level metrics for business troubleshooting scenarios, such as locating issues, analyzing the causes of traffic throttling, and analyzing durations. |
| Dashboard | Pro Edition | View the number of all TCP connections on the broker, details of out-of-sync replicas (OSRs), node distribution for topics, and top rankings for key metrics such as topic traffic, disk usage, and the consumption speed of consumer groups. | Top rankings of key metrics for business optimization scenarios, such as production/consumption hot spot analysis and disk usage analysis. |
| Prometheus monitoring | Pro Edition | Access through the open-source standard Prometheus Exporter, which exposes a series of monitorable Apache Kafka metrics, including instance-level and node-level metrics. | An open-source-compatible monitoring integration that can be connected to users' self-built Ops platforms. |
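
As a rough illustration of what consuming Exporter output might look like on a self-built Ops platform, the following Python sketch parses a sample in the Prometheus text exposition format and sums the values per metric. The metric names, label names, and sample values are hypothetical and are not CKafka's actual metrics. The parser is deliberately simplified: it skips lines that carry a timestamp and does not handle escaped quotes in label values.

```python
import re

# Hypothetical sample of Prometheus text exposition output. The metric and
# label names are illustrative only, not CKafka's actual metric names.
SAMPLE = """\
# HELP example_broker_bytes_in Bytes received per second (hypothetical).
# TYPE example_broker_bytes_in gauge
example_broker_bytes_in{broker="0"} 1024
example_broker_bytes_in{broker="1"} 2048.5
example_instance_connections 37
"""

# metric_name, optional {labels}, then a single value token.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_by_metric(samples):
    """Sum sample values per metric name, ignoring labels."""
    totals = {}
    for name, _labels, value in samples:
        totals[name] = totals.get(name, 0.0) + value
    return totals


if __name__ == "__main__":
    for metric, total in total_by_metric(parse_metrics(SAMPLE)).items():
        print(metric, total)
```

In practice a Prometheus server or an existing client library would usually do the scraping and parsing; the sketch only shows the shape of the data that such an integration works with.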
Alarm Capabilities
CKafka's alarm capabilities are also built on TCOP. You can configure alarm rules for monitoring metrics in TCOP. When a metric reaches the configured alarm threshold, you are notified by email, Short Message Service (SMS), WeChat, or phone call, so that you can take preventive or remedial action promptly. Well-configured alarm rules help improve the robustness and reliability of your applications.
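
The threshold behavior described above can be pictured with a small Python sketch. It is a conceptual model only: the `AlarmRule` fields, the "consecutive periods" condition, and the `disk_usage_percent` metric name are assumptions made for illustration, and real alarm rules are configured in TCOP rather than written as code.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use (illustrative set).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AlarmRule:
    metric: str        # hypothetical metric name
    comparison: str    # one of the keys in OPS
    threshold: float
    periods: int = 1   # assumption: consecutive datapoints that must breach


def should_alarm(rule, datapoints):
    """Return True if the last `rule.periods` datapoints all breach the threshold."""
    if len(datapoints) < rule.periods:
        return False  # not enough data to decide
    compare = OPS[rule.comparison]
    recent = datapoints[-rule.periods:]
    return all(compare(value, rule.threshold) for value in recent)


if __name__ == "__main__":
    rule = AlarmRule("disk_usage_percent", ">=", 80.0, periods=3)
    print(should_alarm(rule, [70, 82, 85, 90]))
    print(should_alarm(rule, [82, 85, 79]))
```

Requiring several consecutive breaching datapoints is one common way to avoid notifications for short spikes; whether it suits a given metric depends on how quickly you need to react.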
Event Records
The Event Center in CKafka centrally manages, stores, analyzes, and visualizes the Ops events, diagnosis events, and broker change events that occur while an instance is running, so that they can be queried, audited, and traced later. It also supports event alarms: you can configure alarm rules in TCOP for key events (such as a node going offline or a disk expansion failure) so that Ops personnel can handle them promptly.
One-Click Diagnosis
CKafka Pro Edition supports one-click diagnosis, which proactively checks the cluster for risks and potential hazards. Drawing on accumulated Tencent Cloud expert experience, it extracts key information, locates issues, and provides solutions and suggestions. It also automatically summarizes the health check results and generates a diagnosis report, giving users a closed-loop Ops experience.