Dynamic Scheduler is an add-on provided by TKE that performs pre-selection and preferential selection based on actual node loads. It is implemented with the native Kubernetes Kube-scheduler Extender mechanism. After you install it in a TKE cluster, this add-on works with Kube-scheduler to effectively prevent the node load imbalance that the native scheduler can cause by scheduling only on pod requests and limits.
This add-on relies on the Prometheus monitoring component and configuration of relevant rules. Before installing this add-on, we recommend you carefully read Dependency Deployment to prevent problems with the add-on.
Kubernetes Object Name | Type | Requested Resources | Namespace |
---|---|---|---|
node-annotator | Deployment | 1 instance, each with CPU: 100m, memory: 100Mi | kube-system |
dynamic-scheduler | Deployment | 3 instances, each with CPU: 400m, memory: 200Mi | kube-system |
dynamic-scheduler | Service | - | kube-system |
node-annotator | ClusterRole | - | kube-system |
node-annotator | ClusterRoleBinding | - | kube-system |
node-annotator | ServiceAccount | - | kube-system |
dynamic-scheduler-policy | ConfigMap | - | kube-system |
restart-kube-scheduler | ConfigMap | - | kube-system |
probe-prometheus | ConfigMap | - | kube-system |
In most cases, the Kubernetes native scheduler performs scheduling based on the pod request resources, without considering the actual load of nodes at the current time and over a previous period. Consequently, the following problem may occur:
Some nodes in the cluster may have a large amount of remaining schedulable resources (calculated from the requests and limits of the pods running on them) but a high actual load, while other nodes have little remaining schedulable resources but a low actual load. In this situation, Kube-scheduler still preferentially schedules pods to the nodes with more remaining resources (according to the LeastRequestedPriority policy).
As shown in the figure below, Kube-scheduler will schedule pods to Node2, even though Node1 (with a lower load) is clearly the better choice.
To prevent large numbers of pods from being continuously scheduled to low-load nodes, Dynamic Scheduler sets the scheduling hotspot-prevention policy, which calculates the number of pods scheduled to each node within the past few minutes and deducts points from the preferential selection scores of nodes accordingly.
The current policy is as follows:
Based on the scheduler extender mechanism, Dynamic Scheduler obtains node load data from Prometheus monitoring data, applies scheduling policies based on the actual load of nodes, and intervenes in pre-selection and preferential selection so that pods are preferentially scheduled to low-load nodes. This add-on consists of node-annotator and dynamic-scheduler.
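For reference, a scheduler extender is registered with kube-scheduler through the scheduler's configuration. The sketch below is a generic, minimal example of such a registration; the service URL, verbs, port, and weight are illustrative assumptions, not the add-on's actual configuration, which TKE manages for you.
```yaml
# Minimal sketch of registering an HTTP scheduler extender with kube-scheduler.
# All values below are illustrative assumptions; TKE configures the real extender automatically.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://dynamic-scheduler.kube-system.svc.cluster.local:80/scheduler"  # hypothetical endpoint
    filterVerb: "filter"          # called during pre-selection
    prioritizeVerb: "prioritize"  # called during preferential selection
    weight: 1
    enableHTTPS: false
    nodeCacheCapable: false
```
During scheduling, kube-scheduler sends the candidate nodes to the extender's filter and prioritize endpoints and merges the extender's responses into its own results.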
node-annotator regularly pulls node load metrics from monitoring data and synchronizes them to the node annotation, as shown in the figure below:
Note: After the add-on is deleted, the annotations generated by node-annotator are not automatically deleted. You can manually delete them as needed.
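Conceptually, the synced load data ends up as plain key/value annotations on each Node object, along the lines of the sketch below. The annotation keys here simply mirror the metric names used by this add-on and are illustrative; the actual keys and values written by node-annotator may differ.
```yaml
# Illustrative Node annotations derived from the load metrics; key names and values are assumptions.
apiVersion: v1
kind: Node
metadata:
  name: node1
  annotations:
    cpu_usage_avg_5m: "32.5"      # average CPU utilization (%) over the past 5 minutes
    cpu_usage_max_avg_1h: "56.1"  # max CPU utilization (%) over the past hour
    mem_usage_avg_5m: "41.8"      # average memory utilization (%) over the past 5 minutes
    mem_usage_max_avg_1h: "63.0"  # max memory utilization (%) over the past hour
```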
dynamic-scheduler is a scheduler extender that filters nodes and calculates scores during pre-selection and preferential selection, based on the load data in the node annotations.
To prevent pods from being scheduled to high-load nodes, dynamic-scheduler filters out some high-load nodes during pre-selection (the filtering policy and proportion can be dynamically configured; for more information, see Add-on Parameter Description).
As shown in the figure below, the load of Node2 in the past 5 minutes and the load of Node3 in the past hour both exceed the corresponding thresholds, so both nodes will be excluded from subsequent preferential selection.
At the same time, to balance the loads among nodes, dynamic-scheduler scores each node based on its load data. The lower the load, the higher the score.
As shown in the figure below, the score of Node1 is the highest, so pods will be preferentially scheduled to Node1 (the scoring policy and weights can be dynamically configured; for more information, see Add-on Parameter Description).
Note:
- To ensure that the add-on can pull the required monitoring data and the scheduling policy can take effect, please configure the rules for collecting monitoring data in accordance with Dependency Deployment -> Prometheus rule configuration.
- The pre-selection and preferential selection parameters have been set to default values. You do not need to modify them unless you have additional requirements.
Pre-selection Parameter | Description |
---|---|
Threshold for average CPU utilization in the past 5 minutes | If the average CPU utilization of a node in the past 5 minutes exceeds the specified threshold, pods will not be scheduled to this node. |
Threshold for max CPU utilization in the past hour | If the max CPU utilization of a node in the past hour exceeds the specified threshold, pods will not be scheduled to this node. |
Threshold for average memory utilization in the past 5 minutes | If the average memory utilization of a node in the past 5 minutes exceeds the specified threshold, pods will not be scheduled to this node. |
Threshold for max memory utilization in the past hour | If the max memory utilization of a node in the past hour exceeds the specified threshold, pods will not be scheduled to this node. |
Preferential Selection Parameter | Description |
---|---|
Weight of average CPU utilization in the past 5 minutes | The greater the weight, the greater the influence of the node's average CPU utilization in the past 5 minutes on the node score. |
Weight of max CPU utilization in the past hour | The greater the weight, the greater the influence of the node's max CPU utilization in the past hour on the node score. |
Weight of max CPU utilization in the past day | The greater the weight, the greater the influence of the node's max CPU utilization in the past day on the node score. |
Weight of average memory utilization in the past 5 minutes | The greater the weight, the greater the influence of the node's average memory utilization in the past 5 minutes on the node score. |
Weight of max memory utilization in the past hour | The greater the weight, the greater the influence of the node's max memory utilization in the past hour on the node score. |
Weight of max memory utilization in the past day | The greater the weight, the greater the influence of the node's max memory utilization in the past day on the node score. |
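Taken together, the pre-selection thresholds and preferential selection weights form one scheduling policy. The sketch below is only a hypothetical way of writing such a policy down; the field names, structure, and numbers are illustrative assumptions and do not reflect the actual schema of the dynamic-scheduler-policy ConfigMap.
```yaml
# Hypothetical policy layout -- field names and values are illustrative assumptions.
predicate:                     # pre-selection: filter out nodes whose utilization (%) exceeds these thresholds
  cpu_usage_avg_5m: 65
  cpu_usage_max_avg_1h: 75
  mem_usage_avg_5m: 65
  mem_usage_max_avg_1h: 75
priority:                      # preferential selection: a higher weight gives the metric more influence on the node score
  cpu_usage_avg_5m: 0.2
  cpu_usage_max_avg_1h: 0.3
  cpu_usage_max_avg_1d: 0.5
  mem_usage_avg_5m: 0.2
  mem_usage_max_avg_1h: 0.3
  mem_usage_max_avg_1d: 0.5
```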
Dynamic Scheduler makes scheduling decisions based on the actual load of nodes at the current time and over a previous period. This requires a monitoring component, such as Prometheus, to obtain the actual load information of nodes. Before using Dynamic Scheduler, you need to deploy such a monitoring component. In TKE, you can use a self-built Prometheus monitoring service or the cloud-native monitoring service provided by TKE.
Dynamic Scheduler uses node-exporter to collect node metrics. You can deploy node-exporter and Prometheus based on your own requirements.
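If node-exporter is not already running in your cluster, a minimal DaemonSet like the sketch below is one common way to deploy it; the namespace, image tag, and labels are assumptions, so adapt them to your environment.
```yaml
# Minimal node-exporter DaemonSet sketch; namespace, image tag, and labels are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true                    # expose metrics on each node's own IP so Prometheus can scrape every node
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.3.1
          ports:
            - containerPort: 9100          # default node-exporter metrics port
              name: metrics
```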
After node-exporter collects the node monitoring data, Prometheus needs to aggregate the raw node-exporter metrics into the metrics required by Dynamic Scheduler: `cpu_usage_avg_5m`, `cpu_usage_max_avg_1h`, `cpu_usage_max_avg_1d`, `mem_usage_avg_5m`, `mem_usage_max_avg_1h`, and `mem_usage_max_avg_1d`. To produce these metrics, configure the following recording rules in Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-record
spec:
  groups:
    - name: cpu_mem_usage_active
      interval: 30s
      rules:
        - record: cpu_usage_active
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[30s])) * 100)
        - record: mem_usage_active
          expr: 100*(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
    - name: cpu-usage-5m
      interval: 5m
      rules:
        - record: cpu_usage_max_avg_1h
          expr: max_over_time(cpu_usage_avg_5m[1h])
        - record: cpu_usage_max_avg_1d
          expr: max_over_time(cpu_usage_avg_5m[1d])
    - name: cpu-usage-1m
      interval: 1m
      rules:
        - record: cpu_usage_avg_5m
          expr: avg_over_time(cpu_usage_active[5m])
    - name: mem-usage-5m
      interval: 5m
      rules:
        - record: mem_usage_max_avg_1h
          expr: max_over_time(mem_usage_avg_5m[1h])
        - record: mem_usage_max_avg_1d
          expr: max_over_time(mem_usage_avg_5m[1d])
    - name: mem-usage-1m
      interval: 1m
      rules:
        - record: mem_usage_avg_5m
          expr: avg_over_time(mem_usage_active[5m])
```
In addition, add the rules file to the `rule_files` section of the main Prometheus configuration:
```yaml
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
rule_files:
  - /etc/prometheus/rules/*.yml    # /etc/prometheus/rules/*.yml is the file that defines the rules.
```
Place the rules file in the `/etc/prometheus/rules/` directory of the above Prometheus container.
Note:
Normally, the above Prometheus configuration file and rules configuration file are stored in ConfigMaps and then mounted to the Prometheus server container, so you only need to modify the relevant ConfigMaps.