Using Cluster Audit for Troubleshooting

Last updated: 2020-11-23 10:13:52

    Overview

    Cluster resources may be deleted or modified as a result of operator error, application bugs, or malicious programs calling the apiserver API. The cluster audit feature keeps logs of apiserver API calls, which you can search and analyze to find the cause of a problem. This document describes how to use the cluster audit feature for troubleshooting.

    Note:

    This document only applies to TKE clusters.

    Prerequisites

    You have enabled the cluster audit feature in the TKE console. For more information, see Enabling Cluster Audit.

    Use Case

    Obtaining the analysis result

    1. Log in to the CLS console and select Search and Analysis in the left sidebar.
    2. On the Search and Analysis page, select the logset and log topic to search, and specify a time range.
    3. Enter the analysis statement and click Search and Analysis to obtain the analysis result.

    Example 1: querying the operator who cordoned a node

    To query the operator who cordoned a node, use the following search statement:

    objectRef.resource:nodes AND requestObject:unschedulable

    On the Search and Analysis page, click Layouts. You can see the user.username, requestObject, and objectRef.name fields, which indicate the operator, request content, and node name, respectively. See the figure below:

    As shown in the figure, the sub-account 10001****958 cordoned the main.63u5qua9.0 node at 2020-10-09 16:13:22. For more information on the sub-account, choose Access Management > User List and click the account ID.
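To make the matching logic concrete, the following Python sketch applies the same condition to audit events held as dictionaries. It assumes each event follows the Kubernetes audit event layout (objectRef, requestObject, user). The sample events and the function name find_cordon_operators are hypothetical, and CLS evaluates the search statement against indexed fields rather than running code like this.

```python
import json

def find_cordon_operators(events):
    """Return (operator, node, request body) for node requests that mention 'unschedulable'."""
    matches = []
    for event in events:
        ref = event.get("objectRef", {})
        # requestObject may be a nested object or an already-serialized string.
        body = event.get("requestObject", "")
        body_text = body if isinstance(body, str) else json.dumps(body)
        # Same two conditions as the search statement: resource is nodes
        # AND the request body contains the word "unschedulable".
        if ref.get("resource") == "nodes" and "unschedulable" in body_text:
            operator = event.get("user", {}).get("username")
            matches.append((operator, ref.get("name"), body_text))
    return matches

# Hypothetical sample events, for illustration only.
sample_events = [
    {
        "verb": "patch",
        "user": {"username": "10001****958"},
        "objectRef": {"resource": "nodes", "name": "main.63u5qua9.0"},
        "requestObject": {"spec": {"unschedulable": True}},
    },
    {
        "verb": "get",
        "user": {"username": "system:kube-scheduler"},
        "objectRef": {"resource": "pods", "name": "nginx-abc"},
    },
]

for operator, node, body in find_cordon_operators(sample_events):
    print(operator, node, body)
```

Because the condition only checks that the request body mentions unschedulable, a request that clears the field would match as well, so inspect requestObject in each result to confirm what was changed.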

    Example 2: querying the operator who deleted a workload

    To query the operator who deleted a workload, use the following search statement:

    objectRef.resource:deployments AND objectRef.name:"nginx" AND verb:"delete" 

    Based on the search result, you can look up detailed information about the sub-account, as described in Example 1.
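The same statement pattern can be adapted to other resource types, object names, and verbs. As a convenience, the hypothetical helper below assembles such a statement in Python. The function name build_audit_query and the choice to quote the name and verb (mirroring the statement above) are assumptions, not part of CLS, and the helper does not escape special characters in its arguments.

```python
def build_audit_query(resource, name=None, verb=None):
    """Assemble a search statement in the style of the example above."""
    clauses = [f"objectRef.resource:{resource}"]
    if name is not None:
        clauses.append(f'objectRef.name:"{name}"')
    if verb is not None:
        clauses.append(f'verb:"{verb}"')
    # Clauses are combined with AND, so every condition must match.
    return " AND ".join(clauses)

print(build_audit_query("deployments", name="nginx", verb="delete"))
```

Called with the values from this example, the helper reproduces the statement shown above. Omitting name returns a statement that matches deletions of any object of that resource type.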

    Example 3: locating the causes of apiserver access limitation

    To prevent apiserver/etcd from being overloaded due to frequent apiserver access caused by malicious programs or bugs, apiserver enables an access limit mechanism by default. If the access limit is reached, you can identify the clients that have sent large numbers of requests through audit logs.

    1. To analyze requesting clients by userAgent, modify the log topic in the Key Index window so that statistics are collected on the userAgent field, as shown in the figure below.
    2. Use the following statement to collect per-client QPS statistics for requests to the apiserver:
    * | SELECT CAST((__TIMESTAMP_US__ /1000-__TIMESTAMP_US__ /1000%1000) as TIMESTAMP) AS time, COUNT(1) AS qps,userAgent GROUP BY time,userAgent ORDER BY time
    3. Switch to chart analysis and select line chart as the chart type. Select time as the X-axis, QPS as the Y-axis, and userAgent for the aggregation column, as shown in the figure below.

      After obtaining the data, click the data to add it to the dashboard for display, as shown in the figure below.

      The figure shows that the frequency of requests from the kube-state-metrics client to the apiserver is much higher than that of other clients.
      The kube-state-metrics log shows that its list requests are rejected because of an RBAC permission issue, so it keeps sending requests to the apiserver and triggers the apiserver access limit. The log is as follows:
      I1009 13:13:09.760767       1 request.go:538] Throttling request took 1.393921018s, request: GET:https://172.16.252.1:443/api/v1/endpoints?limit=500&resourceVersion=1029843735
      E1009 13:13:09.766106       1 reflector.go:156] pkg/mod/k8s.io/client-go@v0.0.0-20191109102209-3c0d1af94be5/tools/cache/reflector.go:108: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:kube-state-metrics" cannot list resource "endpoints" in API group "" at the cluster scope
      To distinguish clients by another field, such as user.username, modify the SQL statement accordingly. For example:
      * | SELECT CAST((__TIMESTAMP_US__ /1000-__TIMESTAMP_US__ /1000%1000) as TIMESTAMP) AS time, COUNT(1) AS qps,user.username GROUP BY time,user.username ORDER BY time
      The following figure shows the display result.
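The time expression in the SQL statements above converts the microsecond timestamp to milliseconds and then subtracts the remainder modulo 1000, which rounds each event down to the start of its second. Counting events per (second, client) pair therefore yields requests per second. The Python sketch below reproduces that arithmetic on a few hypothetical events. The function names, sample timestamps, and client names are illustrative, and integer division is assumed to match how the SQL expression is evaluated.

```python
from collections import Counter

def bucket_to_second_ms(timestamp_us):
    """Mirror the SQL time expression: microseconds to milliseconds, truncated to a whole second."""
    ms = timestamp_us // 1000   # __TIMESTAMP_US__ / 1000 (integer division assumed)
    return ms - ms % 1000       # drop the sub-second part, result stays in milliseconds

def qps_by_client(events):
    """Count events per (second bucket, client), like GROUP BY time, userAgent ORDER BY time."""
    counts = Counter()
    for timestamp_us, client in events:
        counts[(bucket_to_second_ms(timestamp_us), client)] += 1
    return sorted(counts.items())

# Hypothetical (timestamp in microseconds, client) pairs, for illustration only.
sample_events = [
    (1602220389100000, "kube-state-metrics"),
    (1602220389900000, "kube-state-metrics"),
    (1602220389500000, "kubectl"),
    (1602220390200000, "kube-state-metrics"),
]

for (bucket_ms, client), qps in qps_by_client(sample_events):
    print(bucket_ms, client, qps)
```

Replacing the client value with another field, such as the username, corresponds to swapping the grouping column in the SQL statement.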

    References

    • For more information on the TKE cluster audit feature and basic operations, see Cluster Audit.
    • Cluster audit data is stored in Cloud Log Service (CLS). To search and analyze the audit results in the CLS console, see Syntax and Rules for the search syntax.
    • To analyze audit data, an SQL statement supported by CLS is required. For more information, see Overview.
