

Downloading the AI large model in initContainers at startup would make Pod startup take far too long. It is therefore recommended to serve the model from shared storage: first download the model to shared storage with a Job, then mount that storage volume into the Pods that run the model. Subsequent Pod startups can then skip the download step entirely. The model still has to be loaded from shared storage over the network, but with a high-performance shared storage class (such as the Turbo type) this remains fast.
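The overall workflow therefore looks like the following sketch. The manifests themselves are given in the sections below; the file names (pvc.yaml, download-job.yaml, vllm-deployment.yaml) are only illustrative:

```bash
# 1. Create the PVCs backed by the CFS StorageClass (manifests below).
kubectl apply -f pvc.yaml

# 2. Run the one-off download Job and wait until the model has been written to shared storage.
kubectl apply -f download-job.yaml
kubectl wait --for=condition=complete job/vllm-download-model --timeout=2h

# 3. Only then deploy the inference workload, which mounts the same PVC.
kubectl apply -f vllm-deployment.yaml
```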

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-model
  labels:
    app: ai-model
spec:
  storageClassName: cfs-ai
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
The storageClassName field references the CFS StorageClass created earlier (cfs-ai in this example). A second PVC is created for OpenWebUI data:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webui
  labels:
    app: webui
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: cfs-ai
  resources:
    requests:
      storage: 100Gi
```

Specify the model to download via the LLM_MODEL environment variable, and use the USE_MODELSCOPE environment variable to control whether the model is downloaded from ModelScope.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-download-model
  labels:
    app: vllm-download-model
spec:
  template:
    metadata:
      name: vllm-download-model
      labels:
        app: vllm-download-model
      annotations:
        eks.tke.cloud.tencent.com/root-cbs-size: '100' # If using a super node, the default system disk is only 20Gi and fills up after the vllm image is unpacked. Use this annotation to customize the system disk capacity (the portion exceeding 20Gi is billed).
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        - name: USE_MODELSCOPE
          value: "1"
        command:
        - bash
        - -c
        - |
          set -ex
          if [[ "$USE_MODELSCOPE" == "1" ]]; then
            exec modelscope download --local_dir=/data/$LLM_MODEL --model="$LLM_MODEL"
          else
            exec huggingface-cli download --local-dir=/data/$LLM_MODEL $LLM_MODEL
          fi
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sglang-download-model
  labels:
    app: sglang-download-model
spec:
  template:
    metadata:
      name: sglang-download-model
      labels:
        app: sglang-download-model
      annotations:
        eks.tke.cloud.tencent.com/root-cbs-size: '100' # If using a super node, the default system disk is only 20Gi and fills up after the sglang image is unpacked. Use this annotation to customize the system disk capacity (the portion exceeding 20Gi is billed).
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        - name: USE_MODELSCOPE
          value: "1"
        command:
        - bash
        - -c
        - |
          set -ex
          if [[ "$USE_MODELSCOPE" == "1" ]]; then
            exec modelscope download --local_dir=/data/$LLM_MODEL --model="$LLM_MODEL"
          else
            exec huggingface-cli download --local-dir=/data/$LLM_MODEL $LLM_MODEL
          fi
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-download-model
  labels:
    app: ollama-download-model
spec:
  template:
    metadata:
      name: ollama-download-model
      labels:
        app: ollama-download-model
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        env:
        - name: LLM_MODEL
          value: deepseek-r1:7b
        command:
        - bash
        - -c
        - |
          set -ex
          ollama serve &
          sleep 5 # sleep 5 seconds to wait for ollama to start
          exec ollama pull $LLM_MODEL
        volumeMounts:
        - name: data
          mountPath: /root/.ollama # Ollama stores model data in the /root/.ollama directory. Mount the CFS-type PVC at this path.
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: OnFailure
```
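Before deploying the inference service, it may help to confirm that the download Job finished and that the model files actually landed on the shared volume. A minimal check, using the vLLM download Job above as an example (adjust the Job name for the sglang or ollama variants):

```bash
# Inspect the download logs of the completed Job.
kubectl logs job/vllm-download-model

# Optionally list the files written to the shared volume from any Pod that mounts the ai-model PVC,
# e.g. (the Pod name here is a placeholder):
# kubectl exec <pod-mounting-ai-model> -- ls -lh /data
```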
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  replicas: 1
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: expandable_segments:True
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        command:
        - bash
        - -c
        - |
          vllm serve /data/$LLM_MODEL \
            --served-model-name $LLM_MODEL \
            --host 0.0.0.0 \
            --port 8000 \
            --trust-remote-code \
            --enable-chunked-prefill \
            --max_num_batched_tokens 1024 \
            --max_model_len 1024 \
            --enforce-eager \
            --tensor-parallel-size 1
        securityContext:
          runAsNonRoot: false
        ports:
        - containerPort: 8000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      # vLLM needs to access shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-api
spec:
  selector:
    app: vllm
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 8000
    targetPort: 8000
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  replicas: 1
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100 # Specify the GPU card model
        eks.tke.cloud.tencent.com/root-cbs-size: '100' # For a super node, the default system disk is only 20Gi and fills up after the vllm image is unpacked. Use this annotation to customize the system disk capacity (the portion exceeding 20Gi is billed).
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        imagePullPolicy: Always
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: expandable_segments:True
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
        command:
        - bash
        - -c
        - |
          vllm serve /data/$LLM_MODEL \
            --served-model-name $LLM_MODEL \
            --host 0.0.0.0 \
            --port 8000 \
            --trust-remote-code \
            --enable-chunked-prefill \
            --max_num_batched_tokens 1024 \
            --max_model_len 1024 \
            --enforce-eager \
            --tensor-parallel-size 1
        securityContext:
          runAsNonRoot: false
        ports:
        - containerPort: 8000
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      # vLLM needs to access shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-api
spec:
  selector:
    app: vllm
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 8000
    targetPort: 8000
```
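Once the vLLM Deployment is ready, the OpenAI-compatible API it exposes can be smoke-tested locally. A minimal sketch, assuming the vllm-api Service and model name from the manifests above (the "model" field must match the --served-model-name value):

```bash
# Forward the vLLM Service to the local machine.
kubectl port-forward service/vllm-api 8000:8000 &

# Call the OpenAI-compatible chat completions endpoint.
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```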
In the vLLM Deployments above, the LLM_MODEL environment variable specifies the model name, which should be consistent with the name used in the download Job; the downloaded model is mounted from the ai-model PVC to the /data directory.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang
  labels:
    app: sglang
spec:
  selector:
    matchLabels:
      app: sglang
  replicas: 1
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        command:
        - bash
        - -c
        - |
          set -x
          exec python3 -m sglang.launch_server \
            --host 0.0.0.0 \
            --port 30000 \
            --model-path /data/$LLM_MODEL
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 30000
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 40Gi
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: sglang
spec:
  selector:
    app: sglang
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 30000
    targetPort: 30000
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang
  labels:
    app: sglang
spec:
  selector:
    matchLabels:
      app: sglang
  replicas: 1
  template:
    metadata:
      labels:
        app: sglang
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100 # Specify the GPU card model
        eks.tke.cloud.tencent.com/root-cbs-size: '100' # For a super node, the default system disk is only 20Gi and fills up after the sglang image is unpacked. Use this annotation to customize the system disk capacity (the portion exceeding 20Gi is billed).
    spec:
      containers:
      - name: sglang
        image: lmsysorg/sglang:latest
        env:
        - name: LLM_MODEL
          value: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
        command:
        - bash
        - -c
        - |
          set -x
          exec python3 -m sglang.launch_server \
            --host 0.0.0.0 \
            --port 30000 \
            --model-path /data/$LLM_MODEL
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 30000
        volumeMounts:
        - name: data
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 40Gi
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: sglang
spec:
  selector:
    app: sglang
  type: ClusterIP
  ports:
  - name: api
    protocol: TCP
    port: 30000
    targetPort: 30000
```
The LLM_MODEL environment variable specifies the large model name, which should be consistent with the name specified in the previous Job; the downloaded model is mounted to the /data directory.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  replicas: 1
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: IfNotPresent
        command: ["ollama", "serve"]
        env:
        - name: OLLAMA_HOST
          value: ":11434"
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: 4000m
            memory: 4Gi
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 11434
          name: ollama
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  type: ClusterIP
  ports:
  - name: server
    protocol: TCP
    port: 11434
    targetPort: 11434
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  replicas: 1
  template:
    metadata:
      labels:
        app: ollama
      annotations:
        eks.tke.cloud.tencent.com/gpu-type: V100
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: IfNotPresent
        command: ["ollama", "serve"]
        env:
        - name: OLLAMA_HOST
          value: ":11434"
        resources:
          requests:
            cpu: 2000m
            memory: 2Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: 4000m
            memory: 4Gi
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 11434
          name: ollama
        volumeMounts:
        - name: data
          mountPath: /root/.ollama
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ai-model
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  type: ClusterIP
  ports:
  - name: server
    protocol: TCP
    port: 11434
    targetPort: 11434
```
- Ollama stores its model data in the /root/.ollama directory, so the CFS-type PVC containing the downloaded AI large model must be mounted at this path.
- By default, Ollama listens only on the loopback address (127.0.0.1). Setting the OLLAMA_HOST environment variable forces it to expose port 11434 on all interfaces.
- The nvidia.com/gpu resource is specified in requests/limits so that Pods are scheduled onto GPU machine types and are allocated GPU cards.
- On super nodes, specify the GPU card model with the annotation eks.tke.cloud.tencent.com/gpu-type. The options include V100, T4, A10*PNV4, A10*GNV4. For details, see GPU specification.

You can also configure a HorizontalPodAutoscaler so that the inference service scales automatically based on GPU utilization, for example for the vllm Deployment:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  minReplicas: 1
  maxReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  metrics: # See more GPU metrics at https://www.tencentcloud.com/document/product/457/38929?from_cn_redirect=1#gpu
  - pods:
      metric:
        name: k8s_pod_rate_gpu_used_request # GPU utilization (as a percentage of the request)
      target:
        averageValue: "80"
        type: AverageValue
    type: Pods
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 300
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      - periodSeconds: 15
        type: Pods
        value: 4
      selectPolicy: Max
      stabilizationWindowSeconds: 0
```
To disable automatic scale-down entirely, set selectPolicy to Disabled in the scaleDown behavior:

```yaml
behavior:
  scaleDown:
    selectPolicy: Disabled
```
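After the HPA is created, its current metric value and replica count can be checked with kubectl; a minimal sketch:

```bash
# Show the HPA for the vllm Deployment, including current GPU utilization vs. the 80% target.
kubectl get hpa vllm

# Watch replica changes as load on the inference service rises and falls.
kubectl get hpa vllm --watch
```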


```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # Image mirror on Docker Hub, automatically kept in sync long-term; safe to use
        env:
        - name: OPENAI_API_BASE_URL
          value: http://vllm-api:8000/v1 # Domain name or IP address of vLLM
        - name: ENABLE_OLLAMA_API # Disable the Ollama API, keep only the OpenAI API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # Image mirror on Docker Hub, automatically kept in sync long-term; safe to use
        env:
        - name: OPENAI_API_BASE_URL
          value: http://sglang:30000/v1 # Domain name or IP address of sglang
        - name: ENABLE_OLLAMA_API # Disable the Ollama API, keep only the OpenAI API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webui
  template:
    metadata:
      labels:
        app: webui
    spec:
      containers:
      - name: webui
        image: imroc/open-webui:main # Image mirror on Docker Hub, automatically kept in sync long-term; safe to use
        env:
        - name: OLLAMA_BASE_URL
          value: http://ollama:11434 # Domain name or IP address of ollama
        - name: ENABLE_OPENAI_API # Disable the OpenAI API, keep only the Ollama API
          value: "False"
        tty: true
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: webui
---
apiVersion: v1
kind: Service
metadata:
  name: webui
  labels:
    app: webui
spec:
  type: ClusterIP
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: webui
```
OpenWebUI stores its data (such as account passwords, chat history, and other data) in the /app/backend/data directory; this document mounts the webui PVC at that path. You can use the kubectl port-forward command to expose the service locally:

```bash
kubectl port-forward service/webui 8080:8080
```
Then access http://127.0.0.1:8080 in the browser. To expose OpenWebUI externally through a Gateway, create an HTTPRoute:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    namespace: envoy-gateway-system
    name: ai-gateway
  hostnames:
  - "ai.your.domain"
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: webui
      port: 8080
```
- parentRefs: references an already-defined Gateway (normally, one Gateway corresponds to one CLB).
- hostnames: replace with your own domain name and ensure the domain name resolves to the CLB address corresponding to the Gateway.
- backendRefs: specifies the Service for OpenWebUI.

Alternatively, you can expose OpenWebUI with an Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webui
spec:
  rules:
  - host: "ai.your.domain"
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webui
            port:
              number: 8080
```
- host: enter your custom domain name and ensure it resolves to the CLB address corresponding to the Ingress.
- backend.service: must be set to the Service for OpenWebUI.
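Once DNS for the domain points at the CLB, a quick way to confirm end-to-end connectivity is to request OpenWebUI through the Gateway or Ingress; a minimal sketch, using the placeholder domain ai.your.domain from the examples above:

```bash
# Confirm the domain resolves to the CLB address.
nslookup ai.your.domain

# Request OpenWebUI through the CLB; an HTTP 200 (or a redirect to the login page) indicates the route works.
curl -i http://ai.your.domain/
```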