DeScheduler is an add-on provided by TKE, built on the Kubernetes community's native Descheduler, that implements rescheduling based on actual node loads. After it is installed in a TKE cluster, the add-on works together with kube-scheduler to monitor high-load nodes in the cluster in real time and drain low-priority pods. We recommend using it together with the TKE Dynamic Scheduler add-on to ensure cluster load balancing in multiple dimensions.
This add-on relies on the Prometheus monitoring component and related rule configurations. We recommend that you read Dependency Deployment carefully before installing this add-on to prevent operation failures.
This add-on deploys the following Kubernetes objects in the cluster:

Kubernetes Object Name | Type | Requested Resources | Namespace
---|---|---|---
descheduler | Deployment | 1 instance; per instance: CPU 200m, memory 200Mi | kube-system
descheduler | ClusterRole | - | kube-system
descheduler | ClusterRoleBinding | - | kube-system
descheduler | ServiceAccount | - | kube-system
descheduler-policy | ConfigMap | - | kube-system
probe-prometheus | ConfigMap | - | kube-system
DeScheduler performs rescheduling to correct improper pod placement on the existing nodes in your cluster. The policies adopted by the community version of Descheduler are based on data in the APIServer, not on actual node loads. Therefore, TKE adds node monitoring so that rescheduling can be performed based on actual loads.
TKE's self-developed ReduceHighLoadNode policy relies on the monitoring data of Prometheus and node_exporter, and performs pod draining and rescheduling based on metrics such as node CPU utilization, memory utilization, network I/O, and system loadavg, preventing extremely high node loads. The ReduceHighLoadNode policy of DeScheduler needs to be used together with the load-based scheduling policy of TKE's self-developed Dynamic Scheduler.
Following the rescheduling approach of the community version of Descheduler, DeScheduler regularly scans the pods running on each node and, upon discovering pods that do not comply with its policies, drains and reschedules them. The community version provides some policies based on data in the APIServer. For example, the LowNodeUtilization policy relies on the request and limit data of pods; this data can be used to effectively balance cluster resource distribution and prevent resource fragmentation. However, the community policies lack support for actual node resource usage. For example, even if node A and node B are allocated the same amount of resources, their loads will differ because the pods actually running on them differ in CPU consumption, memory consumption, and load during peak periods.
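For reference, the community Descheduler configures LowNodeUtilization through a policy file along these lines (the threshold values here are illustrative, not TKE defaults):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes below all of these values are considered underutilized.
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        # Nodes above any of these values are considered overutilized;
        # pods are evicted from them so they can land on underutilized nodes.
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
```

Note that these thresholds are computed from pod requests, which is exactly why the community policy cannot react to actual load differences between nodes.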
Therefore, Tencent Cloud TKE launched DeScheduler, which performs rescheduling based on the monitoring of actual node loads. It obtains the load statistics of cluster nodes from Prometheus, and based on the user-set load threshold, regularly executes the check rule in the policy to drain pods from high-load nodes.
Note: To ensure that the add-on can pull the required monitoring data and that the scheduling policy can take effect, configure the monitoring data collection rules as instructed in Dependency Deployment > Prometheus File Configuration.
Note: Default values are provided for the load threshold parameters. If you have no special requirements, you can use them directly.
If the average CPU utilization or average memory utilization of a node over the past 5 minutes exceeds the set threshold, DeScheduler regards the node as a high-load node and executes its pod draining logic, trying to bring the node load back below the target utilization level through pod rescheduling.
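Purely to illustrate the two knobs described above, a hypothetical policy fragment might look like the sketch below. The key names here are illustrative assumptions, not the actual schema, which is defined by the add-on's descheduler-policy ConfigMap:

```yaml
# Hypothetical illustration only; the real keys are defined by the
# TKE descheduler-policy ConfigMap, not by this sketch.
ReduceHighLoadNode:
  # A node whose 5-minute average exceeds either value is treated
  # as high-load and becomes a candidate for pod draining.
  cpuHighThresholdPercent: 80
  memHighThresholdPercent: 80
  # Draining stops once the node load falls back below the targets.
  cpuTargetThresholdPercent: 60
  memTargetThresholdPercent: 60
```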
The DeScheduler add-on relies on the actual load of nodes at the current moment and over a past period to make scheduling decisions. It requires monitoring components such as Prometheus to obtain actual node load information from the system. Before you use the DeScheduler add-on, we recommend that you adopt self-built Prometheus monitoring or TKE cloud native monitoring.
We use node-exporter to monitor node metrics. You can deploy node-exporter and Prometheus based on your own requirements.
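For example, assuming node-exporter runs on each node with its default port 9100 (the job name and target addresses below are placeholders to adapt to your environment), a minimal Prometheus scrape configuration might look like this:

```yaml
scrape_configs:
# node-exporter listens on port 9100 by default; replace the example
# targets with your node addresses or your own service-discovery setup.
- job_name: 'node-exporter'
  static_configs:
  - targets: ['10.0.0.1:9100', '10.0.0.2:9100']
```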
After node-exporter collects node monitoring data, Prometheus must aggregate the raw node-exporter metrics. To obtain the metrics required by DeScheduler, such as cpu_usage_avg_5m and mem_usage_avg_5m, you need to configure recording rules in Prometheus. See the sample below:
```yaml
groups:
- name: cpu_mem_usage_active
  interval: 30s
  rules:
  - record: mem_usage_active
    expr: 100*(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
- name: cpu-usage-1m
  interval: 1m
  rules:
  - record: cpu_usage_avg_5m
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- name: mem-usage-1m
  interval: 1m
  rules:
  - record: mem_usage_avg_5m
    expr: avg_over_time(mem_usage_active[5m])
```
Note: When you use the Dynamic Scheduler provided by TKE, you also need to configure aggregation rules in Prometheus for obtaining node monitoring data. The aggregation rules of Dynamic Scheduler partly overlap with those of DeScheduler, but they are not identical, so the two sets of rules must not overwrite each other. When you use Dynamic Scheduler and DeScheduler together, configure the following rules:
```yaml
groups:
- name: cpu_mem_usage_active
  interval: 30s
  rules:
  - record: mem_usage_active
    expr: 100*(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
- name: mem-usage-1m
  interval: 1m
  rules:
  - record: mem_usage_avg_5m
    expr: avg_over_time(mem_usage_active[5m])
- name: mem-usage-5m
  interval: 5m
  rules:
  - record: mem_usage_max_avg_1h
    expr: max_over_time(mem_usage_avg_5m[1h])
  - record: mem_usage_max_avg_1d
    expr: max_over_time(mem_usage_avg_5m[1d])
- name: cpu-usage-1m
  interval: 1m
  rules:
  - record: cpu_usage_avg_5m
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- name: cpu-usage-5m
  interval: 5m
  rules:
  - record: cpu_usage_max_avg_1h
    expr: max_over_time(cpu_usage_avg_5m[1h])
  - record: cpu_usage_max_avg_1d
    expr: max_over_time(cpu_usage_avg_5m[1d])
```
After defining the recording rules, reference the rule file path in the Prometheus configuration file. For example:

```yaml
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
rule_files:
- /etc/prometheus/rules/*.yml   # /etc/prometheus/rules/*.yml is the path of the files that define the rules.
```
Then place the rule configuration files above in the /etc/prometheus/rules/ directory of the Prometheus container. Note: Normally, the Prometheus configuration file and the rule configuration files are stored in ConfigMaps and mounted into the Prometheus server container, so you only need to modify the relevant ConfigMaps.
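For illustration, assuming the rule files are stored in a ConfigMap named prometheus-rules and the server container is named prometheus (both names are assumptions; adapt them to your deployment), the mount could look like this:

```yaml
# Illustrative fragment of the Prometheus server pod spec.
spec:
  containers:
  - name: prometheus            # assumed container name
    volumeMounts:
    - name: rules-volume
      mountPath: /etc/prometheus/rules   # must match rule_files above
  volumes:
  - name: rules-volume
    configMap:
      name: prometheus-rules    # assumed ConfigMap holding the *.yml rule files
```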
To mark a pod as evictable by DeScheduler, add the following annotation to it:

```yaml
descheduler.alpha.kubernetes.io/evictable: 'true'
```
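For example, for a Deployment-managed workload the annotation goes on the pod template (the workload name and image below are illustrative assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo   # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
      annotations:
        # Allow DeScheduler to evict pods of this workload.
        descheduler.alpha.kubernetes.io/evictable: 'true'
    spec:
      containers:
      - name: nginx
        image: nginx:latest
```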