
RDMA Network Configuration Component RDMA-agent Description
Last updated:2026-02-04 10:48:03

Overview

RDMA (Remote Direct Memory Access) is a technology that accesses data in a remote host's memory while bypassing that host's operating system kernel. Because the data path does not pass through the kernel, RDMA saves CPU resources, improves system throughput, and reduces network communication latency. Tencent Hyper Computing Cluster (THCC) uses high-performance CVMs as nodes and interconnects them over RDMA, providing a high-bandwidth, ultra-low-latency network that meets the parallel computing needs of large-scale high-performance computing, artificial intelligence, big data recommendation, and other applications.
The rdma-agent component configures the RDMA network on Linux instances in a Tencent Cloud Hyper Computing Cluster. It runs on Linux as a standalone systemd service named rdma-agent.


Installing the rdma-agent Component

THCC GPU instances that support the RDMA network and use a Linux public image install the rdma-agent component by default during instance startup.
If your instance uses a Tencent Cloud shared custom image, the rdma-agent component may not be installed. In that case, the RDMA network configuration depends on the following two scripts, which /etc/rc.local starts at boot.
bash /usr/local/qcloud/rdma/set_bonding.sh
nohup bash /usr/local/qcloud/rdma/dscp.sh &
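To tell which of the two setups an instance uses, it can help to check whether /etc/rc.local still launches these two scripts. The following is a minimal sketch of such a check. The function name rc_local_has_rdma_scripts and its return codes are illustrative assumptions and are not part of the rdma-agent package.

```shell
#!/usr/bin/env bash
# Sketch: report whether an rc.local-style file still launches both RDMA scripts.
# The function name and return codes are hypothetical, chosen for illustration.

rc_local_has_rdma_scripts() {
    local rc_file="${1:-/etc/rc.local}"
    [ -r "$rc_file" ] || return 2   # file missing or unreadable

    # Drop commented-out lines so that disabled entries are not counted.
    local active
    active="$(grep -v '^[[:space:]]*#' "$rc_file" || true)"

    # Succeed only if both script paths appear in the remaining lines.
    printf '%s\n' "$active" | grep -qF '/usr/local/qcloud/rdma/set_bonding.sh' &&
        printf '%s\n' "$active" | grep -qF '/usr/local/qcloud/rdma/dscp.sh'
}
```

For example, `rc_local_has_rdma_scripts && echo "RDMA is configured from /etc/rc.local"` prints the message only when both entries are present and not commented out.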
To convert the script configuration in /etc/rc.local into a systemd-managed rdma-agent service, run the following command while no business workloads are running. After installation, the corresponding script configuration in /etc/rc.local is canceled so that the RDMA network is not configured twice during startup. If your rdma-agent component version is outdated, you can run the same command to upgrade it, again while no business workloads are running.
wget http://mirrors.tencentyun.com/install/cvm/rdma/bs2_rdma.tgz -O /tmp/bs2_rdma.tgz && tar -axf /tmp/bs2_rdma.tgz -C /tmp && chmod a+x /tmp/bs2_rdma/install.sh && cd /tmp/bs2_rdma/ && bash install.sh

To upgrade the rdma-agent component without reinitializing the network configuration, and therefore without affecting running workloads, run the following command instead.
wget http://mirrors.tencentyun.com/install/cvm/rdma/bs2_rdma.tgz -O /tmp/bs2_rdma.tgz && tar -axf /tmp/bs2_rdma.tgz -C /tmp && chmod a+x /tmp/bs2_rdma/lossless_upgrade.sh && cd /tmp/bs2_rdma/ && bash lossless_upgrade.sh


Checking the rdma-agent Service Status

Run the following command to check whether the rdma-agent service is running normally. The status is normal if the output shows active (running).
systemctl status rdma-agent
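In a script, the same check can be reduced to an exit code instead of reading the status output by eye. The sketch below uses `systemctl is-active --quiet`, which exits 0 only when the unit is active. The function name service_is_running and its optional second argument (a substitute for the systemctl binary, useful for dry runs) are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: turn the manual status check into an exit code.
# service_is_running and its second argument are hypothetical helpers.

service_is_running() {
    local unit="${1:-rdma-agent}"
    local systemctl_bin="${2:-systemctl}"
    # `is-active --quiet` prints nothing and exits 0 only for an active unit.
    "$systemctl_bin" is-active --quiet "$unit"
}
```

For example, `service_is_running rdma-agent || echo "rdma-agent is not active"` reports a stopped or failed service.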



Reconfiguring the RDMA Network

Warning: Reconfiguring the RDMA network may affect business operations. Reconfigure the network only while no business workloads are running.
If your RDMA-capable GPU instance in a hyper computing cluster has the rdma-agent component installed, restart the rdma-agent service with the following command to reconfigure the RDMA network. The reconfiguration takes a few minutes.
systemctl restart rdma-agent
If your RDMA-capable GPU instance in a hyper computing cluster does not have the rdma-agent component installed and depends on the two script entries in /etc/rc.local, run the following commands again to reconfigure the RDMA network. The reconfiguration takes a few minutes.
bash /usr/local/qcloud/rdma/set_bonding.sh
nohup bash /usr/local/qcloud/rdma/dscp.sh &
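Because reconfiguration takes a few minutes, a script that triggers it usually needs to wait before continuing. The following is a minimal polling sketch under stated assumptions: wait_until is a hypothetical helper, and the timeout and interval are whole seconds chosen by the caller rather than values documented for rdma-agent.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or a timeout (in seconds) elapses.
# wait_until is a hypothetical helper; it is not shipped with rdma-agent.

wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    # Re-run the given command until it exits 0.
    while ! "$@"; do
        # Give up once the accumulated wait reaches the timeout.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

A possible use after `systemctl restart rdma-agent` is `wait_until 300 10 systemctl is-active --quiet rdma-agent`, where 300 and 10 are assumed values. Note that an active service does not by itself prove that the network configuration has finished, so the RDMA check script remains the final confirmation.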


Checking the RDMA Network and Configuration

Run the following command to check whether the RDMA network of the instance is functioning properly. The check command does not affect running workloads.
bash /usr/local/qcloud/rdma/rdma_check.sh -f
If the instance's container network uses host mode, the check results are normally all in OKKKK status once configuration is complete.

If an ERROR is reported, first confirm whether RDMA network initialization has completed. RDMA network configuration normally takes a few minutes. If initialization has not completed, wait for it to finish and then run the check command again.
If the instance's container network is not in host mode, some items may report ERROR. In that case, suspend your workloads, restart the rdma-agent service to reconfigure the RDMA network in host mode, and then check again whether the RDMA network is functioning normally.
If the check reports RDMA network errors in any other scenario, submit a ticket to contact Tencent Cloud Technical Support.
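For automated health checks, the output of the check script can be summarized by counting the lines that mention ERROR. The sketch below assumes only that failed items contain the literal string ERROR, as described above. The exact output format of rdma_check.sh is not specified here, and count_error_lines is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: count lines on standard input that contain the string ERROR.
# Assumes failed check items include the literal word ERROR in their line.

count_error_lines() {
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so `|| true` keeps the function's exit status at 0.
    grep -c 'ERROR' || true
}
```

A possible use is `bash /usr/local/qcloud/rdma/rdma_check.sh -f | count_error_lines`, which prints 0 when no item reports an error.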
