
Using AIBrix for Multi-Node Distributed Inference on TKE

Last updated: 2025-04-30 16:02:27

Overview

AIBrix is an open-source, cloud-native large model inference control plane project launched in February 2025, designed specifically to optimize the production deployment of large language models (LLMs). As the first full-stack Kubernetes solution deeply integrated with vLLM, it provides core features such as dynamic LoRA loading, multi-node inference, heterogeneous GPU scheduling, and a distributed KV cache.
Distributed inference is the technique of splitting an LLM across multiple nodes or devices for processing. It is particularly useful for large models that do not fit into the memory of a single machine. AIBrix uses Ray as its distributed computing framework, coordinating Ray clusters through KubeRay to implement distributed inference.
AIBrix introduces two key APIs for managing RayCluster resources: RayClusterReplicaSet and RayClusterFleet. A RayClusterFleet manages RayClusterReplicaSets, and a RayClusterReplicaSet manages RayClusters, mirroring the relationship among the core Kubernetes concepts Deployment, ReplicaSet, and Pod. In most cases, users only need to work with RayClusterFleet.

This document describes how to use AIBrix for distributed inference on a TKE cluster.
Image description:
The image used in the examples in this document is vllm/vllm-openai, hosted on DockerHub, and it is relatively large (about 8 GB).
TKE provides a free DockerHub image acceleration service by default, so users in the Chinese mainland can pull the image directly, though the speed may be slow. It is advisable to synchronize the image to Tencent Container Registry (TCR) to speed up image pulls, and to replace the corresponding image address in the YAML files.
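As a sketch, synchronizing the image to TCR with Docker might look like the following; the registry address and namespace are placeholders you would replace with your own TCR instance:

```shell
# Pull the public image from DockerHub, retag it for your TCR instance, and push it.
# The registry address and <your-namespace> below are placeholders.
docker pull vllm/vllm-openai:v0.7.1
docker tag vllm/vllm-openai:v0.7.1 ccr.ccs.tencentyun.com/<your-namespace>/vllm-openai:v0.7.1
docker push ccr.ccs.tencentyun.com/<your-namespace>/vllm-openai:v0.7.1
```

After pushing, update the image fields in the YAML manifests in this document accordingly.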

Operation Steps

1. Creating a TKE Cluster

Log in to the TKE console and follow the steps in Create a Cluster to create a TKE cluster.
Cluster type: TKE standard cluster.
Kubernetes version: Must be 1.28 or later; the latest available version is recommended (this document uses 1.30).
Basic configuration: Select CFS as the storage component, as shown below:


2. Creating a Super Node

In the cluster list, click the cluster ID to enter the cluster details page, then refer to Creating a Super Node to create a super node pool.

3. Downloading a Model

3.1 Creating a StorageClass

Create a StorageClass through the console
1. In the cluster list, click the cluster ID to enter the cluster details page.
2. Select Storage in the left sidebar and click Create on the StorageClass page.
3. On the Create Storage page, create a CFS-type StorageClass according to actual needs, as shown below:


3.2 Creating a PVC

Create a PVC through the console
1. In the cluster list, click the cluster ID to enter the cluster details page.
2. Select Storage in the left sidebar and click Create on the PersistentVolumeClaim page.
3. On the Create Storage page, create a PVC to store the model files according to actual needs, as shown below:


3.3 Using a Job to Download Model Files

Create a Job to download the large model files to CFS.
Note:
The model used in this document's example is the 7B version of Qwen2.5-Coder.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model
  labels:
    app: download-model
spec:
  template:
    metadata:
      name: download-model
      labels:
        app: download-model
      annotations:
        eks.tke.cloud.tencent.com/root-cbs-size: "100" # The system disk of a super node is only 20 Gi by default and fills up after the vllm image is decompressed. Use this annotation to customize the system disk capacity (the portion exceeding 20 Gi is charged).
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.1
        command:
        - modelscope
        - download
        - --local_dir=/data/model/Qwen2.5-Coder-7B-Instruct
        - --model=Qwen/Qwen2.5-Coder-7B-Instruct
        volumeMounts:
        - name: data
          mountPath: /data/model
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model # Name of the created PVC
      restartPolicy: OnFailure
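Assuming the manifest above is saved as download-model.yaml, you can create the Job and follow the download progress with kubectl:

```shell
# Create the Job and follow its logs until the download completes.
kubectl apply -f download-model.yaml
kubectl logs -f job/download-model
# Block until the Job reports completion (downloading the model weights can take a while).
kubectl wait --for=condition=complete job/download-model --timeout=3600s
```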

4. Installing AIBrix

Refer to the official AIBrix documentation (Installation | AIBrix) to install AIBrix.
# Install component dependencies
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml

# Install aibrix components
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml
Check the AIBrix installation and confirm that all Pods are in the Running state.
kubectl -n aibrix-system get pods

5. Deploying a Model

Create a RayClusterFleet deployment for the Qwen2.5-Coder-7B-Instruct model.
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: RayClusterFleet
metadata:
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
  name: qwen-coder-7b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: qwen-coder-7b-instruct
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        model.aibrix.ai/name: qwen-coder-7b-instruct
      annotations:
        ray.io/overwrite-container-cmd: "true"
    spec:
      rayVersion: "2.10.0" # Must match the Ray version within the container
      headGroupSpec:
        rayStartParams:
          dashboard-host: "0.0.0.0"
        template:
          metadata:
            annotations:
              eks.tke.cloud.tencent.com/gpu-type: V100 # Specify the GPU model
              eks.tke.cloud.tencent.com/root-cbs-size: '100' # The system disk of a super node is only 20 Gi by default and fills up after the vllm image is decompressed. Use this annotation to customize the system disk capacity (the portion exceeding 20 Gi is charged).
          spec:
            containers:
            - name: ray-head
              image: vllm/vllm-openai:v0.7.1
              ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: service
              command: ["/bin/bash", "-lc", "--"]
              args:
              - |
                ulimit -n 65536;
                echo head;
                $KUBERAY_GEN_RAY_START_CMD & $KUBERAY_GEN_WAIT_FOR_RAY_NODES_CMDS;
                vllm serve /data/model/Qwen2.5-Coder-7B-Instruct \
                  --served-model-name Qwen/Qwen2.5-Coder-7B-Instruct \
                  --tensor-parallel-size 2 \
                  --distributed-executor-backend ray \
                  --dtype=half
              resources:
                limits:
                  cpu: "4"
                  nvidia.com/gpu: 1
                requests:
                  cpu: "4"
                  nvidia.com/gpu: 1
              volumeMounts:
              - name: data
                mountPath: /data/model
            volumes:
            - name: data
              persistentVolumeClaim:
                claimName: ai-model # Name of the created PVC
      workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          metadata:
            annotations:
              eks.tke.cloud.tencent.com/gpu-type: V100 # Specify the GPU model
              eks.tke.cloud.tencent.com/root-cbs-size: '100' # The system disk of a super node is only 20 Gi by default and fills up after the vllm image is decompressed. Use this annotation to customize the system disk capacity (the portion exceeding 20 Gi is charged).
          spec:
            containers:
            - name: ray-worker
              image: vllm/vllm-openai:v0.7.1
              command: ["/bin/bash", "-lc", "--"]
              args:
              - "ulimit -n 65536; echo worker; $KUBERAY_GEN_RAY_START_CMD"
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:
                  cpu: "4"
                  nvidia.com/gpu: 1
                requests:
                  cpu: "4"
                  nvidia.com/gpu: 1
              volumeMounts:
              - name: data
                mountPath: /data/model
            volumes:
            - name: data
              persistentVolumeClaim:
                claimName: ai-model # Name of the created PVC
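Assuming the manifest above is saved as qwen-coder-7b-fleet.yaml, deploy it and watch the Ray head and worker Pods come up:

```shell
# Create the fleet, then watch its Pods until they reach the Running state.
kubectl apply -f qwen-coder-7b-fleet.yaml
kubectl get rayclusterfleet qwen-coder-7b-instruct
kubectl get pods -l model.aibrix.ai/name=qwen-coder-7b-instruct -w
```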

6. Verifying the API

After the Pods created by the RayClusterFleet are running, you can quickly verify the API through kubectl port-forward.
# Get the service name
svc=$(kubectl get svc -o name | grep qwen-coder-7b-instruct)

# Forward the service's port 8000 to local port 18000
kubectl port-forward $svc 18000:8000

# In another terminal, run the following command to test the API
curl -X POST "http://localhost:18000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are an AI programming assistant."},
      {"role": "user", "content": "Implement quick sort algorithm in Python"}
    ],
    "temperature": 0.3,
    "max_tokens": 512,
    "top_p": 0.9
  }'
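For programmatic access, the same request can be issued with Python's standard library; the endpoint below assumes the port-forward from the previous step is still running:

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion payload. The model name matches
# the --served-model-name configured in step 5.
def build_chat_request(model, system_prompt, user_prompt,
                       temperature=0.3, max_tokens=512, top_p=0.9):
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
    }

# POST the payload to the inference endpoint and return the parsed response.
def chat(endpoint, payload):
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    "You are an AI programming assistant.",
    "Implement quick sort algorithm in Python",
)
# With the port-forward active, uncomment to send the request:
# result = chat("http://localhost:18000/v1/chat/completions", payload)
# print(result["choices"][0]["message"]["content"])
```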

FAQs

aibrix-kuberay-operator fails to start with the error runtime/cgo: pthread_create failed: Operation not permitted

Check whether aibrix-kuberay-operator is scheduled onto a super node. If it is, use either of the following solutions:
1. Modify the aibrix-kuberay-operator Deployment and add the following annotation to the Pod template:
eks.tke.cloud.tencent.com/cpu-type: intel # Specify Intel as the CPU type
2. Refer to Setting Scheduling Rules for Workloads to schedule aibrix-kuberay-operator onto a regular node.
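As an illustration of solution 1, the annotation goes under the Deployment's Pod template; the fragment below shows only the added field, with all other fields of the Deployment unchanged:

```yaml
# Fragment of the aibrix-kuberay-operator Deployment; only the added annotation is shown.
spec:
  template:
    metadata:
      annotations:
        eks.tke.cloud.tencent.com/cpu-type: intel # schedule the Pod onto an Intel CPU
```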

How to set up API Key to restrict access?

vLLM provides the following two methods to set the API key:
1. Set the --api-key parameter.
2. Set the environment variable VLLM_API_KEY.
Modify the RayClusterFleet definition to set the API key in headGroupSpec using either of the above methods, then include the following header in requests:
Authorization: Bearer <VLLM_API_KEY>
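As a sketch of method 2, the environment variable can be set on the ray-head container in the manifest from step 5; the key value below is a placeholder:

```yaml
# Fragment of headGroupSpec.template.spec.containers in the RayClusterFleet;
# only the added env section is shown. Replace the value with your own key
# (referencing a Secret is preferable in production).
- name: ray-head
  image: vllm/vllm-openai:v0.7.1
  env:
  - name: VLLM_API_KEY
    value: "<your-api-key>"
```

Requests must then carry the header, for example curl -H "Authorization: Bearer <your-api-key>"; requests without it are rejected.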

