High Workload

Last updated: 2020-05-25 09:43:05

    This article describes how to troubleshoot TKE cluster issues caused by high loads.

    Error Description

    High loads prevent node processes from getting the CPU time they need to function properly, which can lead to network timeouts, health check failures, and service unavailability.

    Troubleshooting

    At times, a node's load average rises even though CPU us (user) is low and id (idle) is high. This is usually caused by a file I/O bottleneck that results in excessive I/O wait, which in turn drives up the load and impacts the performance of other processes.
    This article uses top, atop, and iotop to determine whether the performance issue is caused by a disk I/O bottleneck.

    Query average load and wait time

    1. Log in to your node and use top to query the current load. The following results are displayed:

      A high load average means the node is handling a large number of tasks. Use the Cpu(s) and Mem summary lines together with the %CPU and %MEM columns to identify the processes that are consuming the most resources.

       top - 19:42:06 up 23:59,  2 users,  load average: 34.64, 35.80, 35.76
       Tasks: 679 total,   1 running, 678 sleeping,   0 stopped,   0 zombie
       Cpu(s): 15.6%us,  1.7%sy,  0.0%ni, 74.7%id,  7.9%wa,  0.0%hi,  0.1%si,  0.0%st
       Mem:  32865032k total, 30989168k used,  1875864k free,   370748k buffers
       Swap:  8388604k total,     5440k used,  8383164k free,  7982424k cached
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        9783 mysql     20   0 17.3g  16g 8104 S 186.9 52.3   3752:33 mysqld
        5700 nginx     20   0 1330m  66m 9496 S  8.9  0.2   0:20.82 php-fpm
        6424 nginx     20   0 1330m  65m 8372 S  8.3  0.2   0:04.97 php-fpm
        6573 nginx     20   0 1330m  64m 7368 S  8.3  0.2   0:01.49 php-fpm
        5927 nginx     20   0 1320m  56m 9272 S  7.6  0.2   0:12.54 php-fpm
        5956 nginx     20   0 1330m  65m 8500 S  7.6  0.2   0:12.70 php-fpm
        6126 nginx     20   0 1321m  57m 8964 S  7.3  0.2   0:09.72 php-fpm
        6127 nginx     20   0 1319m  54m 9520 S  6.6  0.2   0:08.73 php-fpm
        6131 nginx     20   0 1320m  56m 9404 S  6.6  0.2   0:09.43 php-fpm
        6174 nginx     20   0 1321m  56m 8444 S  6.3  0.2   0:08.92 php-fpm
        5790 nginx     20   0 1319m  54m 9468 S  5.6  0.2   0:17.33 php-fpm
        6575 nginx     20   0 1320m  55m 8212 S  5.6  0.2   0:02.11 php-fpm
        6160 nginx     20   0 1310m  44m 8296 S  4.0  0.1   0:10.05 php-fpm
        5597 nginx     20   0 1310m  46m 9556 S  3.6  0.1   0:21.03 php-fpm
        5786 nginx     20   0 1310m  45m 8528 S  3.6  0.1   0:15.53 php-fpm
        5797 nginx     20   0 1310m  46m 9444 S  3.6  0.1   0:14.02 php-fpm
        6158 nginx     20   0 1310m  45m 8324 S  3.6  0.1   0:10.20 php-fpm
        5698 nginx     20   0 1310m  46m 9184 S  3.3  0.1   0:20.62 php-fpm
        5779 nginx     20   0 1309m  44m 8336 S  3.3  0.1   0:15.34 php-fpm
        6540 nginx     20   0 1306m  40m 7884 S  3.3  0.1   0:02.46 php-fpm
        5553 nginx     20   0 1300m  36m 9568 S  3.0  0.1   0:21.58 php-fpm
        5722 nginx     20   0 1310m  45m 8552 S  3.0  0.1   0:17.25 php-fpm
        5920 nginx     20   0 1302m  36m 8208 S  3.0  0.1   0:14.23 php-fpm
        6432 nginx     20   0 1310m  45m 8420 S  3.0  0.1   0:05.86 php-fpm
        5285 nginx     20   0 1302m  38m 9696 S  2.7  0.1   0:23.41 php-fpm
    2. Check the CPU wa value in the output. wa (wait) is the percentage of CPU time spent waiting for I/O to complete. By default, top shows the average across all cores. Press 1 to view the wa value of each core, as shown below:

      wa is usually close to 0%. If it consistently stays above 1%, the storage has become a bottleneck and cannot keep up with the CPU.

       top - 19:42:08 up 23:59,  2 users,  load average: 34.64, 35.80, 35.76
       Tasks: 679 total,   1 running, 678 sleeping,   0 stopped,   0 zombie
       Cpu0  : 29.5%us,  3.7%sy,  0.0%ni, 48.7%id, 17.9%wa,  0.0%hi,  0.1%si,  0.0%st
       Cpu1  : 29.3%us,  3.7%sy,  0.0%ni, 48.9%id, 17.9%wa,  0.0%hi,  0.1%si,  0.0%st
       Cpu2  : 26.1%us,  3.1%sy,  0.0%ni, 64.4%id,  6.0%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu3  : 25.9%us,  3.1%sy,  0.0%ni, 65.5%id,  5.4%wa,  0.0%hi,  0.1%si,  0.0%st
       Cpu4  : 24.9%us,  3.0%sy,  0.0%ni, 66.8%id,  5.0%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu5  : 24.9%us,  2.9%sy,  0.0%ni, 67.0%id,  4.8%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu6  : 24.2%us,  2.7%sy,  0.0%ni, 68.3%id,  4.5%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu7  : 24.3%us,  2.6%sy,  0.0%ni, 68.5%id,  4.2%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu8  : 23.8%us,  2.6%sy,  0.0%ni, 69.2%id,  4.1%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu9  : 23.9%us,  2.5%sy,  0.0%ni, 69.3%id,  4.0%wa,  0.0%hi,  0.3%si,  0.0%st
       Cpu10 : 23.3%us,  2.4%sy,  0.0%ni, 68.7%id,  5.6%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu11 : 23.3%us,  2.4%sy,  0.0%ni, 69.2%id,  5.1%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu12 : 21.8%us,  2.4%sy,  0.0%ni, 60.2%id, 15.5%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu13 : 21.9%us,  2.4%sy,  0.0%ni, 60.6%id, 15.2%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu14 : 21.4%us,  2.3%sy,  0.0%ni, 72.6%id,  3.7%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu15 : 21.5%us,  2.2%sy,  0.0%ni, 73.2%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu16 : 21.2%us,  2.2%sy,  0.0%ni, 73.6%id,  3.0%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu17 : 21.2%us,  2.1%sy,  0.0%ni, 73.8%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu18 : 20.9%us,  2.1%sy,  0.0%ni, 74.1%id,  2.9%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu19 : 21.0%us,  2.1%sy,  0.0%ni, 74.4%id,  2.5%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu20 : 20.7%us,  2.0%sy,  0.0%ni, 73.8%id,  3.4%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu21 : 20.8%us,  2.0%sy,  0.0%ni, 73.9%id,  3.2%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu22 : 20.8%us,  2.0%sy,  0.0%ni, 74.4%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
       Cpu23 : 20.8%us,  1.9%sy,  0.0%ni, 74.4%id,  2.8%wa,  0.0%hi,  0.0%si,  0.0%st
       Mem:  32865032k total, 30209248k used,  2655784k free,   370748k buffers
       Swap:  8388604k total,     5440k used,  8383164k free,  7986552k cached
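
    If you need to capture the same load and iowait figures non-interactively (for example, from a script), the following is a minimal sketch using standard tools; it assumes top and the sysstat package (which provides mpstat) are installed on the node:

       # Load averages for the last 1, 5, and 15 minutes, plus the core count,
       # so you can judge whether the load is high relative to the machine size.
       cat /proc/loadavg
       grep -c ^processor /proc/cpuinfo

       # One batch-mode snapshot of the top summary area (Cpu(s), Mem, and Swap lines).
       top -b -n 1 | head -n 5

       # Per-core iowait (%iowait column), sampled every 2 seconds, 5 samples.
       mpstat -P ALL 2 5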

    Monitoring Disk I/O Statistics

    1. Run atop to query disk I/O. In the following example, disk sda shows busy 100%, indicating that it has become the bottleneck.

       ATOP - lemp              2017/01/23  19:42:32              ---------                10s elapsed
       PRC | sys    3.18s | user  33.24s | #proc    679 | #tslpu    28 | #zombie    0 | #exit      0 |
       CPU | sys      29% | user    330% | irq       1% | idle   1857% | wait    182% | curscal  69% |
       CPL | avg1   33.00 | avg5   35.29 | avg15  35.59 | csw    62610 | intr   76926 | numcpu    24 |
       MEM | tot    31.3G | free    2.1G | cache   7.6G | dirty  41.0M | buff  362.1M | slab    1.2G |
       SWP | tot     8.0G | free    8.0G |              |              | vmcom  23.9G | vmlim  23.7G |
       DSK |          sda | busy    100% | read       4 | write   1789 | MBw/s   2.84 | avio 5.58 ms |
       NET | transport    | tcpi   10357 | tcpo    9065 | udpi       0 | udpo       0 | tcpao    174 |
       NET | network      | ipi    10360 | ipo     9065 | ipfrw      0 | deliv  10359 | icmpo      0 |
       NET | eth0      4% | pcki    6649 | pcko    6136 | si 1478 Kbps | so 4115 Kbps | erro       0 |
       NET | lo      ---- | pcki    4082 | pcko    4082 | si 8967 Kbps | so 8967 Kbps | erro       0 |
      
       PID   TID  THR  SYSCPU  USRCPU  VGROW  RGROW  RDDSK  WRDSK ST EXC S CPUNR  CPU CMD       1/12
        9783     -  156   0.21s  19.44s     0K  -788K     4K  1344K --   - S     4 197% mysqld
        5596     -    1   0.10s   0.62s 47204K 47004K     0K   220K --   - S    18   7% php-fpm
        6429     -    1   0.06s   0.34s 19840K 19968K     0K     0K --   - S    21   4% php-fpm
        6210     -    1   0.03s   0.30s -5216K -5204K     0K     0K --   - S    19   3% php-fpm
        5757     -    1   0.05s   0.27s 26072K 26012K     0K     4K --   - S    13   3% php-fpm
        6433     -    1   0.04s   0.28s -2816K -2816K     0K     0K --   - S    11   3% php-fpm
        5846     -    1   0.06s   0.22s -2560K -2660K     0K     0K --   - S     7   3% php-fpm
        5791     -    1   0.05s   0.21s  5764K  5692K     0K     0K --   - S    22   3% php-fpm
        5860     -    1   0.04s   0.21s 48088K 47724K     0K     0K --   - S     1   3% php-fpm
        6231     -    1   0.04s   0.20s  -256K    -4K     0K     0K --   - S     1   2% php-fpm
        6154     -    1   0.03s   0.21s -3004K -3184K     0K     0K --   - S    21   2% php-fpm
        6573     -    1   0.04s   0.20s  -512K  -168K     0K     0K --   - S     4   2% php-fpm
        6435     -    1   0.04s   0.19s -3216K -2980K     0K     0K --   - S    15   2% php-fpm
        5954     -    1   0.03s   0.20s     0K   164K     0K     4K --   - S     0   2% php-fpm
        6133     -    1   0.03s   0.19s 41056K 40432K     0K     0K --   - S    18   2% php-fpm
        6132     -    1   0.02s   0.20s 37836K 37440K     0K     0K --   - S    11   2% php-fpm
        6242     -    1   0.03s   0.19s -12.2M -12.3M     0K     4K --   - S    12   2% php-fpm
        6285     -    1   0.02s   0.19s 39516K 39420K     0K     0K --   - S     3   2% php-fpm
        6455     -    1   0.05s   0.16s 29008K 28560K     0K     0K --   - S    14   2% php-fpm
    2. Use one of the following methods to view process disk I/O usage:

      • Press d to view process disk I/O usage, as shown below:

           ATOP - lemp               2017/01/23  19:42:46               ---------               2s elapsed
           PRC | sys    0.24s | user   1.99s | #proc    679 | #tslpu    54 | #zombie    0 | #exit      0 |
           CPU | sys      11% | user    101% | irq       1% | idle   2089% | wait    208% | curscal  63% |
           CPL | avg1   38.49 | avg5   36.48 | avg15  35.98 | csw     4654 | intr    6876 | numcpu    24 |
           MEM | tot    31.3G | free    2.2G | cache   7.6G | dirty  48.7M | buff  362.1M | slab    1.2G |
           SWP | tot     8.0G | free    8.0G |              |              | vmcom  23.9G | vmlim  23.7G |
           DSK |          sda | busy    100% | read       2 | write    362 | MBw/s   2.28 | avio 5.49 ms |
           NET | transport    | tcpi    1031 | tcpo     968 | udpi       0 | udpo       0 | tcpao     45 |
           NET | network      | ipi     1031 | ipo      968 | ipfrw      0 | deliv   1031 | icmpo      0 |
           NET | eth0      1% | pcki     558 | pcko     508 | si  762 Kbps | so 1077 Kbps | erro       0 |
           NET | lo      ---- | pcki     406 | pcko     406 | si 2273 Kbps | so 2273 Kbps | erro       0 |
        
             PID          TID         RDDSK         WRDSK        WCANCL         DSK        CMD         1/5
            9783            -            0K          468K           16K         40%        mysqld
            1930            -            0K          212K            0K         18%        flush-8:0
            5896            -            0K          152K            0K         13%        nginx
             880            -            0K          148K            0K         13%        jbd2/sda5-8
            5909            -            0K           60K            0K          5%        nginx
            5906            -            0K           36K            0K          3%        nginx
            5907            -           16K            8K            0K          2%        nginx
            5903            -           20K            0K            0K          2%        nginx
            5901            -            0K           12K            0K          1%        nginx
            5908            -            0K            8K            0K          1%        nginx
            5894            -            0K            8K            0K          1%        nginx
            5911            -            0K            8K            0K          1%        nginx
            5900            -            0K            4K            4K          0%        nginx
            5551            -            0K            4K            0K          0%        php-fpm
            5913            -            0K            4K            0K          0%        nginx
            5895            -            0K            4K            0K          0%        nginx
            6133            -            0K            0K            0K          0%        php-fpm
            5780            -            0K            0K            0K          0%        php-fpm
            6675            -            0K            0K            0K          0%        atop
      • You can also use iotop -oPa to view process disk I/O usage, as shown below:

           Total DISK READ: 15.02 K/s | Total DISK WRITE: 3.82 M/s
             PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
            1930 be/4 root          0.00 B   1956.00 K  0.00 % 83.34 % [flush-8:0]
            5914 be/4 nginx         0.00 B      0.00 B  0.00 % 36.56 % nginx: cache manager process
             880 be/3 root          0.00 B     21.27 M  0.00 % 35.03 % [jbd2/sda5-8]
            5913 be/2 nginx        36.00 K   1000.00 K  0.00 %  8.94 % nginx: worker process
            5910 be/2 nginx         0.00 B   1048.00 K  0.00 %  8.43 % nginx: worker process
            5896 be/2 nginx        56.00 K    452.00 K  0.00 %  6.91 % nginx: worker process
            5909 be/2 nginx        20.00 K   1144.00 K  0.00 %  6.24 % nginx: worker process
            5890 be/2 nginx        48.00 K    692.00 K  0.00 %  6.07 % nginx: worker process
            5892 be/2 nginx        84.00 K    736.00 K  0.00 %  5.71 % nginx: worker process
            5901 be/2 nginx        20.00 K    504.00 K  0.00 %  5.46 % nginx: worker process
            5899 be/2 nginx         0.00 B    596.00 K  0.00 %  5.14 % nginx: worker process
            5897 be/2 nginx        28.00 K   1388.00 K  0.00 %  4.90 % nginx: worker process
            5908 be/2 nginx        48.00 K    700.00 K  0.00 %  4.43 % nginx: worker process
            5905 be/2 nginx        32.00 K   1140.00 K  0.00 %  4.36 % nginx: worker process
            5900 be/2 nginx         0.00 B   1208.00 K  0.00 %  4.31 % nginx: worker process
            5904 be/2 nginx        36.00 K   1244.00 K  0.00 %  2.80 % nginx: worker process
            5895 be/2 nginx        16.00 K    780.00 K  0.00 %  2.50 % nginx: worker process
            5907 be/2 nginx         0.00 B   1548.00 K  0.00 %  2.43 % nginx: worker process
            5903 be/2 nginx        36.00 K   1032.00 K  0.00 %  2.34 % nginx: worker process
            6130 be/4 nginx         0.00 B     72.00 K  0.00 %  2.18 % php-fpm: pool www
            5906 be/2 nginx        12.00 K    844.00 K  0.00 %  2.10 % nginx: worker process
            5889 be/2 nginx        40.00 K   1164.00 K  0.00 %  2.00 % nginx: worker process
            5894 be/2 nginx        44.00 K    760.00 K  0.00 %  1.61 % nginx: worker process
            5902 be/2 nginx        52.00 K    992.00 K  0.00 %  1.55 % nginx: worker process
            5893 be/2 nginx        64.00 K    972.00 K  0.00 %  1.22 % nginx: worker process
            5814 be/4 nginx        36.00 K     44.00 K  0.00 %  1.06 % php-fpm: pool www
            6159 be/4 nginx         4.00 K      4.00 K  0.00 %  1.00 % php-fpm: pool www
            5693 be/4 nginx         0.00 B      4.00 K  0.00 %  0.86 % php-fpm: pool www
            5912 be/2 nginx        68.00 K    300.00 K  0.00 %  0.72 % nginx: worker process
            5911 be/2 nginx        20.00 K    788.00 K  0.00 %  0.72 % nginx: worker process

        Use man iotop to view the descriptions of the following parameters:

           -o, --only
                  Only show processes or threads actually doing I/O, instead of showing all processes or threads. This can be dynamically toggled by pressing o.
           -P, --processes
                  Only show processes. Normally iotop shows all threads.
        
           -a, --accumulated
                  Show accumulated I/O instead of bandwidth. In this mode, iotop shows the amount of I/O processes have done since iotop started.
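
    As a non-interactive alternative to the atop and iotop views above, the following sketch collects similar disk I/O statistics; it assumes the sysstat package (which provides iostat and pidstat) and iotop are installed on the node:

       # Per-device utilization (%util), average I/O wait (await), and throughput in MB,
       # sampled every 2 seconds, 5 samples. A %util close to 100% matches the
       # "busy 100%" reported by atop and points to the disk as the bottleneck.
       iostat -dxm 2 5

       # Per-process read/write rates (kB_rd/s, kB_wr/s), 5 samples at 2-second intervals.
       pidstat -d 2 5

       # iotop in batch mode: three samples showing only processes actually doing I/O.
       iotop -b -o -P -n 3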

    Other Reasons

    Deploying non-Kubernetes services, such as databases, on the node may also cause high loads.
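
    To check for such processes, a quick sketch is to list the node's top resource consumers and look for anything that is not a Kubernetes component or a containerized workload:

       # Top CPU consumers on the node; processes other than kubelet, the container
       # runtime, and containerized workloads are candidates for relocation.
       ps -eo pid,user,%cpu,%mem,comm --sort=-%cpu | head -n 15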
