Querying Advanced Monitoring (Pro Edition)

Last updated: 2021-08-23 16:16:58

    Overview

    CKafka Pro Edition supports advanced monitoring. You can view metrics for core services, production, consumption, and broker GC in the console, making it easier to troubleshoot CKafka issues.

    This document describes how to view advanced monitoring metrics in the console and explains their meanings.

    Directions

    1. Log in to the CKafka console.
    2. In the instance list, click the ID/Name of the target instance to enter the instance details page.
    3. At the top of the instance details page, click Monitoring > Advanced Monitoring, select the metric to be viewed, and set the time range to view the monitoring data.
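
    The steps above cover the console. If you also want to pull the same monitoring data programmatically, the following is a minimal sketch using the Tencent Cloud Monitoring API (GetMonitorData) through the Python SDK (tencentcloud-sdk-python). The namespace "QCE/CKAFKA", the dimension name "instanceId", and the metric name "RequestQueueSize" are assumptions for illustration; confirm the exact values against the CKafka monitoring metric documentation for your region.

      # Minimal sketch: pull one CKafka monitoring metric via GetMonitorData.
      # The namespace, dimension name, and metric name below are assumptions.
      from tencentcloud.common import credential
      from tencentcloud.monitor.v20180724 import monitor_client, models

      cred = credential.Credential("YOUR_SECRET_ID", "YOUR_SECRET_KEY")
      client = monitor_client.MonitorClient(cred, "ap-guangzhou")  # region of your instance

      req = models.GetMonitorDataRequest()
      req.Namespace = "QCE/CKAFKA"           # assumed CKafka namespace
      req.MetricName = "RequestQueueSize"    # hypothetical metric name, for illustration only
      req.Period = 60                        # statistical period, in seconds
      req.StartTime = "2021-08-23T15:00:00+08:00"
      req.EndTime = "2021-08-23T16:00:00+08:00"

      dimension = models.Dimension()
      dimension.Name = "instanceId"          # assumed dimension key for a CKafka instance
      dimension.Value = "ckafka-xxxxxxxx"    # placeholder instance ID
      instance = models.Instance()
      instance.Dimensions = [dimension]
      req.Instances = [instance]

      resp = client.GetMonitorData(req)
      for dp in resp.DataPoints:
          # Each data point carries parallel lists of timestamps and values.
          for ts, value in zip(dp.Timestamps, dp.Values):
              print(ts, value)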

    Monitoring metric description

    Note:

    The monitoring metrics are grouped into core service, production, consumption, instance resource, and broker GC categories; you can click the corresponding tab in the console to view their detailed descriptions.

    Network busyness
    Description: This value measures the I/O resources currently left for processing concurrent requests. The closer it is to 1, the idler the instance is.
    Normal range: This value generally fluctuates between 0.5 and 1. If it falls below 0.3, the load is high.

    Request queue depth
    Description: This value indicates the number of production requests that have not yet been processed. If it is too large, the number of concurrent requests may be high, the CPU load may be high, or the disk I/O may have hit a bottleneck.
    Normal range:
  • If this value stays at 2000, the cluster load is high.
  • If it is below 2000, it can be ignored.

    Number of unsynced replicas
    Description: This value indicates the number of unsynced replicas in the cluster. When an instance has unsynced replicas, the cluster may have a health problem (see the sketch after this table for a client-side cross-check).
    Normal range:
  • If this value stays above 5, the cluster needs to be fixed (a value of 5 or below may be because some built-in topic partitions of Tencent Cloud are offline, which has nothing to do with the business).
  • If the broker occasionally fluctuates, this value may surge and then stabilize, which is normal.

    Number of ZK reconnections
    Description: This value indicates the number of reconnections of the persistent connection between the broker and ZooKeeper. Network fluctuations and high cluster loads may cause disconnections and reconnections, leading to leader switching.
    Normal range:
  • There is no normal range for this metric.
  • The number of ZK reconnections is cumulative, so a large value does not necessarily mean that there is a problem with the cluster. This metric is for reference only.

    Number of ISR expansions
    Description: This value is the number of Kafka ISR expansions. It increases by 1 each time an unsynced replica catches up with the leader's data and rejoins the ISR.
    Normal range:
  • There is no normal range for this metric. Expansions occur when the cluster fluctuates.
  • No attention is required unless this value stays above 0.

    Number of ISR shrinks
    Description: This value is the number of Kafka ISR shrinks. It is counted when a broker goes down or ZooKeeper reconnects.
    Normal range:
  • There is no normal range for this metric.
  • Shrinks occur when the cluster fluctuates. No attention is required unless this value stays above 0.
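
    The "number of unsynced replicas" metric and the ISR expansion/shrink counters above all reflect partitions whose in-sync replica set is smaller than their replica set. As a rough client-side cross-check, the following sketch (assuming the confluent-kafka Python client and a placeholder broker address) lists partitions whose ISR is smaller than their replica list.

      # Minimal sketch: list under-replicated partitions from the client side.
      # The broker address is a placeholder; authentication options are omitted.
      from confluent_kafka.admin import AdminClient

      admin = AdminClient({"bootstrap.servers": "ckafka-broker-ip:9092"})
      metadata = admin.list_topics(timeout=10)

      for topic_name, topic in metadata.topics.items():
          for pid, partition in topic.partitions.items():
              # A partition is "unsynced" when its ISR is smaller than its replica list.
              if len(partition.isrs) < len(partition.replicas):
                  print(f"{topic_name}[{pid}]: replicas={partition.replicas}, isr={partition.isrs}")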

    Causes of monitoring metric exceptions

    The following describes causes of certain monitoring metric exceptions.

    CPU utilization (%)
    Exception cause: If this value stays above 90% for more than 5 consecutive statistical periods, first check whether message compression or message format conversion is in use. If the client machine has sufficient CPU resources, we recommend enabling Snappy compression (see the producer sketch after this table). You can also observe the request queue depth: if it is too large, the request volume may be too high, which can also cause a high CPU load.

    Unsynced replicas (count)
    Exception cause: When this value is above 0, there are unsynced replicas in the cluster, usually due to broker node exceptions or network issues. You can troubleshoot through the broker logs.

    Full GC count (count)
    Exception cause: If this problem occurs occasionally, it may be caused by disk I/O related to the underlying CVM instances. Check whether it keeps recurring on the instance with the same IP, and if so, submit a ticket for assistance.

    Request queue depth (count)
    Exception cause: If the client's production and consumption time out but the CVM load is normal, the request queue depth of the CVM instance has reached its upper limit, which is 500 by default. You can submit a ticket to have it adjusted appropriately according to the purchased resource configuration.
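
    For the CPU utilization case above, the suggested mitigation is to compress messages on the producer side when the client machine has spare CPU. The following is a minimal sketch with the confluent-kafka Python client; the broker address and topic name are placeholders, and the batching value is illustrative only.

      # Minimal sketch: enable Snappy compression on the producer.
      # Broker address and topic name are placeholders.
      from confluent_kafka import Producer

      producer = Producer({
          "bootstrap.servers": "ckafka-broker-ip:9092",
          "compression.type": "snappy",   # compress batches on the client to reduce broker CPU and bandwidth
          "linger.ms": 100,               # illustrative: wait briefly so more messages share one batch
      })

      def on_delivery(err, msg):
          # Report failed deliveries; successful ones are left silent here.
          if err is not None:
              print(f"delivery failed: {err}")

      producer.produce("test-topic", value=b"hello ckafka", callback=on_delivery)
      producer.flush()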