
TACO LLM Inference Acceleration Engine

Last updated: 2025-04-30 16:03:16

1. Product Introduction

TACO-LLM (Tencent Cloud Accelerated Computing Optimization LLM) is an inference acceleration engine for large language models (LLMs), built on Tencent Cloud's heterogeneous computing products to improve LLM inference efficiency. By fully exploiting the parallel computing capabilities of the underlying hardware, TACO-LLM can serve more inference requests concurrently, offering solutions that balance high throughput with low latency. It shortens the wait for generated results, improves the efficiency of the inference process, and helps you optimize business costs.
Advantages of TACO-LLM:
High ease of use
TACO-LLM is designed around a simple, easy-to-use API and is fully compatible with vLLM, the widely used open-source LLM inference framework. If you currently use vLLM as your inference engine, you can migrate to TACO-LLM seamlessly and obtain better performance than vLLM; users of other inference frameworks can also get started quickly. A brief illustration of the vLLM-style API follows this list.
Support for multiple computing platforms
TACO-LLM supports multiple computing platforms, including GPUs (NVIDIA/AMD/Intel), CPUs (Intel/AMD), and TPUs, and will subsequently support major domestic computing platforms.
High efficiency
TACO-LLM applies multiple LLM inference acceleration technologies, such as continuous batching, paged attention, speculative sampling, automatic prefix caching, CPU-assisted acceleration, and long-sequence optimization. It tunes performance for different computing resources and improves LLM inference performance across the board.
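For reference, the snippet below is a minimal offline-inference script written against the vLLM Python API that TACO-LLM states it is compatible with. The model name is only an example; whether this exact script runs unchanged under TACO-LLM, and under which import path, depends on the package you install.

from vllm import LLM, SamplingParams

# Load an example model and define sampling behavior.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a small batch of prompts.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)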

2. Supported Models

TACO-LLM supports a variety of generative Transformer models in the Hugging Face model format. The tables below list the currently supported model architectures and corresponding commonly used models.

Decoder-Only Language Model

| Architecture | Models | Example HuggingFace Models | LoRA |
| --- | --- | --- | --- |
| BaiChuanForCausalLM | Baichuan & Baichuan2 | baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc. | |
| BloomForCausalLM | BLOOM, BLOOMZ, BLOOMChat | bigscience/bloom, bigscience/bloomz, etc. | - |
| ChatGLMModel | ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc. | |
| FalconForCausalLM | Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc. | - |
| GemmaForCausalLM | Gemma | google/gemma-2b, google/gemma-7b, etc. | |
| Gemma2ForCausalLM | Gemma2 | google/gemma-2-9b, google/gemma-2-27b, etc. | |
| GPT2LMHeadModel | GPT-2 | gpt2, gpt2-xl, etc. | - |
| GPTBigCodeForCausalLM | StarCoder, SantaCoder, WizardCoder | bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc. | |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc. | - |
| GPTNeoXForCausalLM | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc. | - |
| InternLMForCausalLM | InternLM | internlm/internlm-7b, internlm/internlm-chat-7b, etc. | |
| InternLM2ForCausalLM | InternLM2 | internlm/internlm2-7b, internlm/internlm2-chat-7b, etc. | - |
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc. | |
| MistralForCausalLM | Mistral, Mistral-Instruct | mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc. | |
| MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc. | |
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base, mgoin/Nemotron-4-340B-Base-hf-FP8, etc. | |
| OPTForCausalLM | OPT, OPT-IML | facebook/opt-66b, facebook/opt-iml-max-30b, etc. | |
| PhiForCausalLM | Phi | microsoft/phi-1_5, microsoft/phi-2, etc. | |
| Phi3ForCausalLM | Phi-3 | microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, microsoft/Phi-3-medium-128k-instruct, etc. | - |
| Phi3SmallForCausalLM | Phi-3-Small | microsoft/Phi-3-small-8k-instruct, microsoft/Phi-3-small-128k-instruct, etc. | - |
| PhiMoEForCausalLM | Phi-3.5-MoE | microsoft/Phi-3.5-MoE-instruct, etc. | - |
| QWenLMHeadModel | Qwen | Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc. | - |
| Qwen2ForCausalLM | Qwen2 | Qwen/Qwen2-beta-7B, Qwen/Qwen2-beta-7B-Chat, etc. | |
| Qwen2MoeForCausalLM | Qwen2MoE | Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc. | - |
| StableLmForCausalLM | StableLM | stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc. | - |
| Starcoder2ForCausalLM | Starcoder2 | bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc. | - |
| XverseForCausalLM | Xverse | xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc. | - |

Multimodal Language Model

| Architecture | Models | Modalities | Example HuggingFace Models | LoRA |
| --- | --- | --- | --- | --- |
| InternVLChatModel | InternVL2 | Image(E+) | OpenGVLab/InternVL2-4B, OpenGVLab/InternVL2-8B, etc. | - |
| LlavaForConditionalGeneration | LLaVA-1.5 | Image(E+) | llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, etc. | - |
| LlavaNextForConditionalGeneration | LLaVA-NeXT | Image(E+) | llava-hf/llava-v1.6-mistral-7b-hf, llava-hf/llava-v1.6-vicuna-7b-hf, etc. | - |
| LlavaNextVideoForConditionalGeneration | LLaVA-NeXT-Video | Video | llava-hf/LLaVA-NeXT-Video-7B-hf, etc. (see note) | - |
| PaliGemmaForConditionalGeneration | PaliGemma | Image(E) | google/paligemma-3b-pt-224, google/paligemma-3b-mix-224, etc. | - |
| Phi3VForCausalLM | Phi-3-Vision, Phi-3.5-Vision | Image(E+) | microsoft/Phi-3-vision-128k-instruct, microsoft/Phi-3.5-vision-instruct, etc. | - |
| PixtralForConditionalGeneration | Pixtral | Image(+) | mistralai/Pixtral-12B-2409 | - |
| QWenLMHeadModel | Qwen-VL | Image(E+) | Qwen/Qwen-VL, Qwen/Qwen-VL-Chat, etc. | - |
| Qwen2VLForConditionalGeneration | Qwen2-VL (see note) | Image(+) / Video(+) | Qwen/Qwen2-VL-2B-Instruct, Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2-VL-72B-Instruct, etc. | - |
Note:
E: Pre-computed embeddings can be passed as multimodal input.
+: Multiple multimodal inputs can be inserted into a single prompt.
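As a hypothetical illustration (not taken from this document), the snippet below shows how an image could be passed to one of the multimodal models above through the OpenAI-compatible Chat API described in section 4, assuming the server accepts the standard image_url content format. The model name, image URL, port, and API key are placeholders.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="taco-llm-test")

# Hypothetical multimodal chat request: a text part plus an image_url part in one message.
response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # example model from the table above
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)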

3. Installing TACO LLM

Environment Preparation

TACO-LLM depends on GPU-related base software such as the GPU driver and CUDA. To avoid problems caused by missing or mismatched base software dependencies, we provide a TACO-LLM Docker image and recommend that you use it as the runtime environment for TACO-LLM. You can obtain the image and start a container with the following command:
docker run -it \
    --privileged \
    --net=host \
    --ipc=host \
    --shm-size=16g \
    --name=taco_llm \
    --gpus all \
    -v /home/workspace:/home/workspace \
    ccr.ccs.tencentyun.com/taco/tacollm-dev:latest /bin/bash
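As an optional sanity check (not part of the official steps), you can confirm that the GPUs are visible inside the container before installing TACO-LLM. The sketch below assumes PyTorch is present in the image, which this document does not state; if it is not, running nvidia-smi in the container shell serves the same purpose.

# Hypothetical check inside the container: confirm CUDA devices are visible.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())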

Installing Whl Package

Notes:
If you have business requirements and want to try out TACO-LLM, submit a ticket to contact the TACO team and obtain the installation package.
1. After obtaining the TACO-LLM whl installation package through a ticket, install TACO-LLM in the container environment with the following command:
pip3 install taco_llm-${version}-cp310-cp310-linux_x86_64.whl
2. When the TACO-LLM whl package is installed, the required Python dependency packages are installed automatically.
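To quickly confirm the installation, you can try importing the package. This is only a sketch: the importable module name taco_llm below is assumed from the wheel filename and is not confirmed by this document.

# Hypothetical post-install check; "taco_llm" as a module name is an assumption.
import taco_llm

print(getattr(taco_llm, "__version__", "installed"))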

4. Using TACO LLM

TACO-LLM provides an HTTP server that implements the OpenAI Completions and Chat APIs. You can use it by following the steps below.

Start Service

First, execute the following command to start the service:
taco_llm serve facebook/opt-125m --api-key taco-llm-test
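Once the service is up, you can optionally confirm that it is reachable and list the served model before sending requests. This sketch assumes the server exposes the standard OpenAI-compatible /v1/models endpoint on the default port 8000.

import requests

# Hypothetical health check against the OpenAI-compatible model listing endpoint.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer taco-llm-test"},
)
print(resp.status_code, resp.json())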

Send a Request

You can use OpenAI's official Python client to send a request:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="taco-llm-test",
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
You can also use an HTTP client to send a request:
import json

import requests

api_key = "taco-llm-test"

headers = {
    "Authorization": f"Bearer {api_key}"
}

pload = {
    "model": "facebook/opt-125m",
    "prompt": "Hello!",
    "stream": True,
    "max_tokens": 128,
}

response = requests.post("http://localhost:8000/v1/completions",
                         headers=headers,
                         json=pload,
                         stream=True)

for chunk in response.iter_lines(chunk_size=8192,
                                 decode_unicode=False,
                                 delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        output = data["text"][0]
        print(output)

Complete Client Parameter Configuration

Except for a few unsupported parameters, TACO-LLM fully supports OpenAI's parameter configuration. Refer to the OpenAI API official documentation for the complete set of parameters; an example of a typical configuration is shown after this list. The unsupported parameters are as follows:
Chat: tools and tool_choice.
Completions: suffix.
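For illustration only, the request below combines several commonly used OpenAI parameters (temperature, top_p, max_tokens, and streaming) against the service started earlier; the values are arbitrary examples, not recommendations from this document.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="taco-llm-test")

# Example of common OpenAI sampling parameters; values are placeholders.
stream = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Write a haiku about containers."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")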

