Cluster resources may be deleted or modified because of operator errors, application bugs, or malicious programs calling the apiserver API. The cluster audit feature records apiserver API calls, so you can search and analyze the audit logs to find the cause of a problem. This document describes how to use the cluster audit feature for troubleshooting.
This document only applies to TKE clusters.
You have enabled the cluster audit feature in the TKE console. For more information, see Enabling Cluster Audit.
To find the operator who cordoned a node, use the following search statement:
objectRef.resource:nodes AND requestObject:unschedulable
On the Search and Analysis page, click Layouts. You can see the user.username, requestObject, and objectRef.name fields, which indicate the operator, request content, and node name, respectively. See the figure below:
As shown in the figure, the sub-account 10001****958 cordoned the main.63u5qua9.0 node at 2020-10-09 16:13:22. For more information on the sub-account, choose Access Management > User List and click the account ID.
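The same filter can also be applied offline to exported audit events. The following is a minimal sketch, assuming the events are available as dictionaries in the Kubernetes audit event layout (user.username, objectRef, requestObject, requestReceivedTimestamp). The function name find_cordon_events and the sample events are hypothetical and only serve as an illustration.

```python
import json


def find_cordon_events(events):
    """Return (operator, node, timestamp) for node requests that mention 'unschedulable'."""
    hits = []
    for event in events:
        ref = event.get("objectRef") or {}
        user = event.get("user") or {}
        # requestObject may be a dict or a JSON string depending on the export;
        # serialize it so a plain substring check works in both cases.
        request_text = json.dumps(event.get("requestObject", ""))
        if ref.get("resource") == "nodes" and "unschedulable" in request_text:
            hits.append((
                user.get("username"),
                ref.get("name"),
                event.get("requestReceivedTimestamp"),
            ))
    return hits


# Hypothetical sample events, not taken from a real cluster
sample_events = [
    {
        "verb": "patch",
        "user": {"username": "example-user"},
        "objectRef": {"resource": "nodes", "name": "node-1"},
        "requestObject": {"spec": {"unschedulable": True}},
        "requestReceivedTimestamp": "2020-10-09T08:13:22.000000Z",
    },
    {
        "verb": "get",
        "user": {"username": "another-user"},
        "objectRef": {"resource": "pods", "name": "nginx-abc"},
        "requestReceivedTimestamp": "2020-10-09T08:13:23.000000Z",
    },
]

print(find_cordon_events(sample_events))
```

The substring check only approximates the search statement; the exact matching rules of the log search service may differ.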
To find the operator who deleted a workload, use the following search statement:
objectRef.resource:deployments AND objectRef.name:"nginx" AND verb:"delete"
You can obtain detailed information about the sub-account from the search result.
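The dotted field names in the search statement correspond to nested keys in each audit event. As a rough illustration, the sketch below resolves dotted paths against an event dictionary and combines several conditions with AND. The helpers get_field and matches_all and the sample events are hypothetical, and exact equality is an assumption that may not match the search service's own matching rules.

```python
def get_field(event, dotted_path):
    """Resolve a dotted field name such as 'objectRef.name' against a nested dict."""
    value = event
    for key in dotted_path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def matches_all(event, conditions):
    """True when every dotted field equals its expected value (the AND in the query)."""
    return all(get_field(event, path) == expected
               for path, expected in conditions.items())


# Conditions mirroring the search statement above
conditions = {
    "objectRef.resource": "deployments",
    "objectRef.name": "nginx",
    "verb": "delete",
}

# Hypothetical sample events, not taken from a real cluster
sample_events = [
    {"verb": "delete", "user": {"username": "example-user"},
     "objectRef": {"resource": "deployments", "name": "nginx"}},
    {"verb": "get", "user": {"username": "another-user"},
     "objectRef": {"resource": "deployments", "name": "nginx"}},
]

deleted_by = [get_field(e, "user.username")
              for e in sample_events if matches_all(e, conditions)]
print(deleted_by)
```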
To prevent apiserver and etcd from being overloaded by frequent requests from malicious programs or buggy applications, apiserver enables an access limit mechanism by default. If the access limit is reached, you can use audit logs to identify the clients that sent large numbers of requests. For example, the following SQL statement counts the requests per second for each userAgent:
* | SELECT CAST((__TIMESTAMP_US__ /1000-__TIMESTAMP_US__ /1000%1000) as TIMESTAMP) AS time, COUNT(1) AS qps,userAgent GROUP BY time,userAgent ORDER BY time
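The time expression in the SQL statement rounds each timestamp down to a whole second so that requests can be grouped per second. The sketch below reproduces that arithmetic, assuming __TIMESTAMP_US__ is an integer epoch timestamp in microseconds and that the division is integer division. The function name and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def truncate_to_second_ms(timestamp_us):
    """Mirror (__TIMESTAMP_US__/1000 - __TIMESTAMP_US__/1000 % 1000) with integer division."""
    timestamp_ms = timestamp_us // 1000        # microseconds -> milliseconds
    return timestamp_ms - timestamp_ms % 1000  # drop the sub-second remainder


# Hypothetical sample value, not taken from a real log
sample_us = 1602231202760767
bucket_ms = truncate_to_second_ms(sample_us)

print(bucket_ms)
print(datetime.fromtimestamp(bucket_ms / 1000, tz=timezone.utc).isoformat())
```

All events that fall within the same second produce the same bucket value, which is what lets COUNT(1) act as a per-second request count.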
The logs of a client that sends too many requests may contain entries similar to the following:
I1009 13:13:09.760767 1 request.go:538] Throttling request took 1.393921018s, request: GET:https://172.16.252.1:443/api/v1/endpoints?limit=500&resourceVersion=1029843735
E1009 13:13:09.766106 1 reflector.go:156] firstname.lastname@example.org/tools/cache/reflector.go:108: Failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:monitoring:kube-state-metrics" cannot list resource "endpoints" in API group "" at the cluster scope
To distinguish clients by another field, such as user.username, modify the SQL statement as required. An example SQL statement is as follows:
* | SELECT CAST((__TIMESTAMP_US__ /1000-__TIMESTAMP_US__ /1000%1000) as TIMESTAMP) AS time, COUNT(1) AS qps,user.username GROUP BY time,user.username ORDER BY time
The following figure shows the display result.
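The grouping performed by these SQL statements can be approximated offline on exported audit events. The following is a minimal sketch, assuming each event carries requestReceivedTimestamp as an RFC 3339 UTC string with microseconds, as in the Kubernetes audit event format. The function requests_per_second and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime


def requests_per_second(events, client_field="userAgent"):
    """Count requests per (second, client), like GROUP BY time, <client_field>."""
    counts = Counter()
    for event in events:
        received = datetime.strptime(
            event["requestReceivedTimestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        second = received.replace(microsecond=0)  # bucket by whole second
        if client_field == "user.username":
            client = (event.get("user") or {}).get("username")
        else:
            client = event.get(client_field)
        counts[(second, client)] += 1
    return dict(counts)


# Hypothetical sample events, not taken from a real cluster
sample_events = [
    {"requestReceivedTimestamp": "2020-10-09T05:13:09.100000Z",
     "userAgent": "client-a", "user": {"username": "user-a"}},
    {"requestReceivedTimestamp": "2020-10-09T05:13:09.900000Z",
     "userAgent": "client-a", "user": {"username": "user-a"}},
    {"requestReceivedTimestamp": "2020-10-09T05:13:10.200000Z",
     "userAgent": "client-b", "user": {"username": "user-b"}},
]

by_agent = requests_per_second(sample_events)
by_user = requests_per_second(sample_events, client_field="user.username")
for (second, client), qps in by_agent.items():
    print(second.isoformat(), client, qps)
```

Passing client_field="user.username" switches the grouping key in the same way the second SQL statement replaces userAgent.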