High-risk Operations of Container Service

Last updated: 2020-10-10 12:07:57

    When deploying or running workloads, you may trigger high-risk operations at different levels, causing service failures of varying severity. To help you estimate and avoid operational risks, this document describes the consequences of these high-risk operations and their corresponding solutions. Below you can find the high-risk operations you may trigger when dealing with clusters, networking and load balancing, logs, and cloud disks.

    Clusters

| Category | High-risk Operation | Consequence | Solution |
|---|---|---|---|
| Master and etcd nodes | Modifying the security groups of nodes in the cluster | The master node may become unavailable | Configure security groups as recommended by Tencent Cloud |
| Master and etcd nodes | Node expires or is terminated | The master node becomes unavailable | Unrecoverable |
| Master and etcd nodes | Reinstalling the operating system | Master components are deleted | Unrecoverable |
| Master and etcd nodes | Upgrading the master or etcd component version by yourself | The cluster may become unavailable | Roll back to the original version |
| Master and etcd nodes | Deleting or formatting core directory data such as /etc/kubernetes on the node | The master node becomes unavailable | Unrecoverable |
| Master and etcd nodes | Changing the node IP | The master node becomes unavailable | Change back to the old IP |
| Master and etcd nodes | Modifying parameters of core components (etcd, kube-apiserver, docker, etc.) by yourself | The master node may become unavailable | Configure parameters as recommended by Tencent Cloud |
| Master and etcd nodes | Changing the master or etcd certificate by yourself | The cluster may become unavailable | Unrecoverable |
| Worker node | Modifying the security groups of nodes in the cluster | Nodes may become unavailable | Configure security groups as recommended by Tencent Cloud |
| Worker node | Node expires or is terminated | The node becomes unavailable | Unrecoverable |
| Worker node | Reinstalling the operating system | Node components are deleted | Remove the node and add it back to the cluster |
| Worker node | Upgrading the node component version by yourself | The node may become unavailable | Roll back to the original version |
| Worker node | Changing the node IP | The node becomes unavailable | Change back to the old IP |
| Worker node | Modifying parameters of core components (etcd, kube-apiserver, docker, etc.) by yourself | The node may become unavailable | Configure parameters as recommended by Tencent Cloud |
| Worker node | Modifying the operating system configuration | The node may become unavailable | Try to restore the configuration, or delete the node and purchase a new one |
| Others | Modifying permissions in CAM | Some cluster resources, such as cloud load balancers, may fail to be created | Restore the permissions |
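
    For a worker node whose components were deleted (for example, after reinstalling the operating system), the solution listed above is to remove the node and add it back to the cluster. With standard kubectl, the removal step can be sketched roughly as follows; this is a sketch only, `<node-name>` is a placeholder, and re-adding the machine is done through the TKE console:

```shell
# Sketch only — assumes kubectl access to the cluster; <node-name> is a placeholder.

# 1. Cordon and drain the broken node so its workloads are rescheduled elsewhere.
kubectl drain <node-name> --ignore-daemonsets

# 2. Remove the node object from the cluster.
kubectl delete node <node-name>

# 3. Re-add the machine to the cluster (e.g. through the TKE console),
#    which reinstalls the node components.
```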

    Networking and Load Balancing

| High-risk Operation | Consequence | Solution |
|---|---|---|
| Setting the kernel parameter net.ipv4.ip_forward = 0 | Network connectivity is lost | Set the kernel parameter to net.ipv4.ip_forward = 1 |
| Setting the kernel parameter net.ipv4.tcp_tw_recycle = 1 | NAT failures occur | Set the kernel parameter to net.ipv4.tcp_tw_recycle = 0 |
| UDP port 53 of the container CIDR is not opened to the Internet in the security group configuration of the node | In-cluster DNS cannot work normally | Configure security groups as recommended by Tencent Cloud |
| Modifying or deleting the LB tags added by TKE | A new LB is purchased | Restore the LB tags |
| Creating custom listeners on a TKE-managed LB through the LB console | The modification is reset by TKE | Create listeners automatically through the Service YAML |
| Binding custom backend real servers (rs) to a TKE-managed LB through the LB console | The modification is reset by TKE | Do not manually bind backend rs |
| Modifying the certificate of a TKE-managed LB through the LB console | The modification is reset by TKE | Manage the certificate automatically through the Ingress YAML |
| Modifying the listener name of a TKE-managed LB through the LB console | The modification is reset by TKE | Do not modify the listener name of a TKE-managed LB |
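
    The two kernel-parameter rows above correspond to the following sysctl settings. A minimal sketch of the recommended values is shown below (the exact file location may vary by distribution; apply the values with `sysctl -p`, or at runtime with `sysctl -w`):

```
# /etc/sysctl.conf (sketch) — values recommended in the table above
net.ipv4.ip_forward = 1        # required for container network traffic forwarding
net.ipv4.tcp_tw_recycle = 0    # avoid NAT failures; this parameter was removed in Linux 4.12+
```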

    Logs

| High-risk Operation | Consequence | Solution | Notes |
|---|---|---|---|
| Deleting the /tmp/ccs-log-collector/pos directory on the host | Logs are collected again | None | The files in pos record how far each log file has been collected |
| Deleting the /tmp/ccs-log-collector/buffer directory on the host | Logs are lost | None | The buffer directory contains the log cache files |

    Cloud Disks

| High-risk Operation | Consequence | Solution |
|---|---|---|
| Manually unmounting a cloud disk through the console | Writes from the Pod report I/O errors | Delete the mount directory on the node and reschedule the Pod |
| Unmounting the disk's mount path on the node | Pod data is written to the local disk | Remount the corresponding directory to the Pod |
| Directly operating the CBS block device on the node | Pod data is written to the local disk | None |
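
    For the first row above, the recovery can be sketched roughly as follows. This is a sketch only: the mount path is illustrative (use the volume's actual path on the node), and `<pod-uid>`, `<pv-name>`, `<pod-name>`, and `<namespace>` are placeholders:

```shell
# Sketch only — assumes root access on the node and kubectl access to the cluster.

# On the node: remove the stale mount directory left behind by the unmounted disk.
# (Path shown is illustrative; use the actual mount path of the volume.)
rm -rf /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount

# From anywhere with kubectl access: delete the Pod so its controller reschedules it,
# which re-attaches and re-mounts the cloud disk.
kubectl delete pod <pod-name> -n <namespace>
```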
