Network Packet Loss

Last updated: 2021-11-16 17:26:00

This document describes the common causes of CVM network packet loss and the corresponding troubleshooting procedures.

Common Causes

The common causes of CVM network packet loss are as follows:

  • TCP or UDP packet loss due to the bandwidth or packet rate limit of the instance type
  • Packet loss due to soft interrupt
  • A full UDP send or receive buffer
  • A full TCP accept queue
  • TCP request overflow
  • Connections exceeding the upper limit

Prerequisites

To troubleshoot a problem, first log in to your CVM instance. For detailed directions, see Logging into Linux Instance and Logging into Windows Instance.

Troubleshooting

TCP packet loss due to the limit setting

Tencent Cloud provides various types of CVM instances, and each type has its own network performance specifications. When an instance reaches its maximum bandwidth or packet rate, packet loss may occur. The troubleshooting procedure is as follows:

  1. Check the bandwidth and packet rate of the instance (see the sketch after this list).
    For a Linux instance, run the sar -n DEV 2 command to check its bandwidth and packet rate. The rxpck/s and txpck/s metrics indicate the packets received and sent per second, respectively. The rxkB/s and txkB/s metrics indicate the inbound and outbound bandwidth, respectively.
  2. Compare the results with the performance specifications of the instance type and check whether the bottleneck has been reached.
    • If yes, upgrade the instance or adjust your business volume.
    • If no, please submit a ticket for assistance.
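
For reference, the check might look like the following sketch, which assumes the sysstat package providing sar is installed; interpret the columns as described above.

  # Sample the bandwidth and packet rate of each interface every 2 seconds
  sar -n DEV 2
  # rxpck/s + txpck/s gives the total packet rate in packets per second (pps);
  # (rxkB/s + txkB/s) * 8 / 1000 approximates the total bandwidth in Mbit/s.
  # Compare both values against the specifications of your instance type.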

UDP packet loss due to the limit setting

Follow the troubleshooting procedure for TCP packet loss due to the limit setting and determine whether the cause is an instance performance bottleneck.

  • If yes, upgrade the instance or adjust your business volume.
  • If no, the cause may be the frequency limit on DNS requests: once the overall bandwidth or packet rate hits the performance bottleneck of the instance, DNS requests may be rate-limited, which causes packet loss. In this case, please submit a ticket for assistance (a supplementary check is sketched after this list).
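
As a supplementary check, UDP datagram rates and input errors can be sampled with sar; this is a sketch that again assumes the sysstat package is installed.

  # Sample UDP traffic every 2 seconds
  sar -n UDP 2
  # idgm/s and odgm/s are the UDP datagrams received and sent per second;
  # a rising idgmerr/s while traffic approaches the instance limit suggests
  # that UDP datagrams are being dropped.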

Packet loss due to soft interrupt

If the second column of the /proc/net/softnet_stat statistics keeps increasing, packets are being dropped during soft interrupt processing because the per-CPU backlog queue overflows. The troubleshooting procedure is as follows (see the sketch after this list):
Check whether RPS (Receive Packet Steering) is enabled:

  • If yes, increase the value of the kernel parameter net.core.netdev_max_backlog. For more information on how to modify a kernel parameter, see Introduction to Linux Kernel Parameters.
  • If no, check whether high soft interrupt load on a single CPU core delays packet receiving and sending. In this case, you can:
    • Enable RPS to distribute soft interrupts across CPU cores more evenly.
    • Check whether the business program causes an uneven distribution of soft interrupts.
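
The following sketch shows one way to perform these checks; eth0 and its rx-0 queue are assumed names, and the backlog value is only an example.

  # The second column of each row counts packets dropped because the
  # per-CPU backlog queue was full (one row per CPU core, values in hex)
  watch -d 'cat /proc/net/softnet_stat'

  # RPS is enabled for a receive queue if its rps_cpus mask is non-zero
  cat /sys/class/net/eth0/queues/rx-0/rps_cpus

  # If RPS is enabled, enlarge the backlog queue (example value)
  sysctl -w net.core.netdev_max_backlog=2000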

Full UDP send buffer

If your instance loses packets because the UDP send buffer is full, the troubleshooting procedure is as follows (see the sketch after this list):

  1. Run the ss -nump command to check whether the UDP send buffer is full.
  2. If the buffer is full, increase the values of the kernel parameters net.core.wmem_max and net.core.wmem_default, and restart the UDP program for the configuration to take effect. For more information about kernel parameters, see Introduction to Linux Kernel Parameters.
  3. If the packet loss problem persists, run the ss -nump command again. If the send buffer size has not increased as expected, check whether SO_SNDBUF is configured through the setsockopt function in the code. If so, modify the code to pass a larger value.
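
A minimal sketch of these steps follows; the buffer size of 26214400 bytes (25 MB) is only an illustrative value.

  # In the skmem field of the output, t<bytes> is the send buffer memory in
  # use and tb<bytes> is its limit; the UdpSndbufErrors counter increments
  # when a datagram is dropped because the send buffer was full
  ss -nump
  nstat -az UdpSndbufErrors

  # Raise the default and maximum send buffer sizes (example values), then
  # restart the UDP program so that it picks up the new defaults
  sysctl -w net.core.wmem_max=26214400
  sysctl -w net.core.wmem_default=26214400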

Full UDP receive buffer

If your instance loses packets because the UDP receive buffer is full, the troubleshooting procedure is as follows (see the sketch after this list):

  1. Run the ss -nump command to check whether the UDP receive buffer is full.
  2. If the buffer is full, increase the values of the kernel parameters net.core.rmem_max and net.core.rmem_default, and restart the UDP program for the configuration to take effect. For more information about kernel parameters, see Introduction to Linux Kernel Parameters.
  3. If the packet loss problem persists, run the ss -nump command again. If the receive buffer size has not increased as expected, check whether SO_RCVBUF is configured through the setsockopt function in the code. If so, modify the code to pass a larger value.
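
The receive-side checks mirror the send-side ones; a sketch with the same illustrative buffer size follows.

  # In the skmem field, r<bytes> is the receive buffer memory in use and
  # rb<bytes> is its limit; the UdpRcvbufErrors counter increments when a
  # datagram is dropped because the receive buffer was full
  ss -nump
  nstat -az UdpRcvbufErrors

  # Raise the default and maximum receive buffer sizes (example values),
  # then restart the UDP program
  sysctl -w net.core.rmem_max=26214400
  sysctl -w net.core.rmem_default=26214400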

Full TCP accept queue

The TCP accept queue length is the smaller of the net.core.somaxconn kernel parameter value and the backlog value that the business process passes to the listen() system call. If your instance loses packets due to a full TCP accept queue, the troubleshooting procedure is as follows (see the sketch after this list):

  1. Increase the value of the kernel parameter net.core.somaxconn. For more information about kernel parameters, see Introduction to Linux Kernel Parameters.
  2. Check whether the business process passes a backlog parameter to listen(), and increase its value accordingly.
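
The sketch below shows one way to inspect the accept queue; port 80 and the somaxconn value are example assumptions.

  # For a listening socket, Recv-Q is the current accept queue length and
  # Send-Q is its maximum, i.e. min(net.core.somaxconn, listen backlog)
  ss -lnt 'sport = :80'

  # Cumulative counters of drops caused by a full accept queue
  nstat -az TcpExtListenOverflows TcpExtListenDrops

  # Enlarge the system-wide cap (example value); the business process must
  # also pass a backlog at least this large to listen()
  sysctl -w net.core.somaxconn=1024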

TCP request overflow

When TCP receives data while the socket is locked by a user process, the kernel places the data in the backlog queue. If this step fails, packet loss occurs due to TCP request overflow. Assuming the business program itself performs well, troubleshoot the packet loss problem at the system level as follows (see the sketch after this list).

Check whether the business program sets the buffer size through the setsockopt function.

  • If yes, modify the business program to specify a larger value or remove the setting.
    Note

    The buffer size set through setsockopt is capped by the kernel parameters net.core.rmem_max and net.core.wmem_max. You can also increase the values of these two kernel parameters and then restart the business program for the configuration to take effect.

  • If not, increase the values of the kernel parameters net.ipv4.tcp_mem, net.ipv4.tcp_rmem, and net.ipv4.tcp_wmem to raise the socket buffer limits at the system level.
    For kernel parameter modifications, see Introduction to Linux Kernel Parameters.
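
If the source code is not at hand, you can verify whether the program calls setsockopt by tracing its system calls. In the sketch below, ./your_program is a placeholder for the business program, and the buffer values are only examples (min, default, and max in bytes).

  # Trace setsockopt calls and look for SO_RCVBUF / SO_SNDBUF arguments
  strace -f -e trace=setsockopt ./your_program

  # If the program does not set buffer sizes itself, raise the TCP buffer
  # limits at the system level instead (example values)
  sysctl -w net.ipv4.tcp_rmem='4096 87380 26214400'
  sysctl -w net.ipv4.tcp_wmem='4096 65536 26214400'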

Connections exceeding the upper limit

Tencent Cloud provides various types of CVM instances, and each type has its own limit on the number of concurrent connections. When the connections on an instance exceed the specified threshold, new connections cannot be established, resulting in packet loss. The troubleshooting procedure is as follows:

Note:

The connections refer to the number of sessions (including TCP, UDP, and ICMP sessions) of the CVM instance saved on the host. This value may be greater than the number of network connections obtained by running the ss or netstat command on the instance.

Compare the number of connections on your instance (see the sketch after this list) with the connection limit of the instance type and check whether the limit has been reached.

  • If yes, upgrade the instance or adjust your business volume.
  • If no, please submit a ticket for assistance.
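
As a rough approximation from inside the instance (a sketch; as noted above, the host-side session count also includes ICMP sessions and can be higher):

  # Summary of socket counts by protocol and state
  ss -s

  # Count established TCP connections (the header line does not match)
  ss -ant | grep -c ESTAB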