
TACO LLM Inference Acceleration Engine

Last updated: 2025-04-30 16:03:16

1. Product Introduction

TACO-LLM (Tencent Cloud Accelerated Computing Optimization LLM) is an inference acceleration engine for large language models (LLMs), built on Tencent Cloud's heterogeneous computing products to improve LLM inference efficiency. By fully exploiting the parallel computing capabilities of the underlying hardware, TACO-LLM can serve more inference requests concurrently, offering solutions that balance high throughput with low latency. It shortens the wait for generated results, improves the efficiency of the inference process, and helps you optimize business costs.
Advantages of TACO-LLM:
High ease of use
TACO-LLM is designed around a simple, easy-to-use API and is fully compatible with vLLM, the widely used open-source LLM inference framework. If you currently use vLLM as your inference engine, you can migrate to TACO-LLM seamlessly and obtain better performance than vLLM; users of other inference frameworks can also get started quickly. A brief illustration of the vLLM-style API follows this list.
Support for multiple computing platforms
TACO-LLM supports multiple computing platforms, including GPUs (NVIDIA/AMD/Intel), CPUs (Intel/AMD), and TPUs, and will subsequently support major domestic computing platforms.
High efficiency
TACO-LLM applies multiple LLM inference acceleration technologies, such as continuous batching, paged attention, speculative sampling, automatic prefix caching, CPU-assisted acceleration, and long-sequence optimization. It tunes performance for different computing resources and improves LLM inference performance across the board.
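For reference, the snippet below is a minimal offline-inference script written against the vLLM Python API that TACO-LLM states it is compatible with. The model name is only an example; whether this exact script runs unchanged under TACO-LLM, and under which import path, depends on the package you install.

from vllm import LLM, SamplingParams

# Load an example model and define sampling behavior.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a small batch of prompts.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)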

2. Supported Models

TACO-LLM supports a variety of generative Transformer models in the Hugging Face model format. The tables below list the currently supported model architectures and corresponding commonly used models.

Decoder-Only Language Model

| Architecture | Models | Example HuggingFace Models | LoRA |
| --- | --- | --- | --- |
| BaiChuanForCausalLM | Baichuan & Baichuan2 | baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc. | |
| BloomForCausalLM | BLOOM, BLOOMZ, BLOOMChat | bigscience/bloom, bigscience/bloomz, etc. | - |
| ChatGLMModel | ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc. | |
| FalconForCausalLM | Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc. | - |
| GemmaForCausalLM | Gemma | google/gemma-2b, google/gemma-7b, etc. | |
| Gemma2ForCausalLM | Gemma2 | google/gemma-2-9b, google/gemma-2-27b, etc. | |
| GPT2LMHeadModel | GPT-2 | gpt2, gpt2-xl, etc. | - |
| GPTBigCodeForCausalLM | StarCoder, SantaCoder, WizardCoder | bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc. | |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc. | - |
| GPTNeoXForCausalLM | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc. | - |
| InternLMForCausalLM | InternLM | internlm/internlm-7b, internlm/internlm-chat-7b, etc. | |
| InternLM2ForCausalLM | InternLM2 | internlm/internlm2-7b, internlm/internlm2-chat-7b, etc. | - |
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc. | |
| MistralForCausalLM | Mistral, Mistral-Instruct | mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc. | |
| MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc. | |
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base, mgoin/Nemotron-4-340B-Base-hf-FP8, etc. | |
| OPTForCausalLM | OPT, OPT-IML | facebook/opt-66b, facebook/opt-iml-max-30b, etc. | |
| PhiForCausalLM | Phi | microsoft/phi-1_5, microsoft/phi-2, etc. | |
| Phi3ForCausalLM | Phi-3 | microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, microsoft/Phi-3-medium-128k-instruct, etc. | - |
| Phi3SmallForCausalLM | Phi-3-Small | microsoft/Phi-3-small-8k-instruct, microsoft/Phi-3-small-128k-instruct, etc. | - |
| PhiMoEForCausalLM | Phi-3.5-MoE | microsoft/Phi-3.5-MoE-instruct, etc. | - |
| QWenLMHeadModel | Qwen | Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc. | - |
| Qwen2ForCausalLM | Qwen2 | Qwen/Qwen2-beta-7B, Qwen/Qwen2-beta-7B-Chat, etc. | |
| Qwen2MoeForCausalLM | Qwen2MoE | Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc. | - |
| StableLmForCausalLM | StableLM | stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc. | - |
| Starcoder2ForCausalLM | Starcoder2 | bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc. | - |
| XverseForCausalLM | Xverse | xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc. | - |

Multimodal Language Model

| Architecture | Models | Modalities | Example HuggingFace Models | LoRA |
| --- | --- | --- | --- | --- |
| InternVLChatModel | InternVL2 | Image(E+) | OpenGVLab/InternVL2-4B, OpenGVLab/InternVL2-8B, etc. | - |
| LlavaForConditionalGeneration | LLaVA-1.5 | Image(E+) | llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, etc. | - |
| LlavaNextForConditionalGeneration | LLaVA-NeXT | Image(E+) | llava-hf/llava-v1.6-mistral-7b-hf, llava-hf/llava-v1.6-vicuna-7b-hf, etc. | - |
| LlavaNextVideoForConditionalGeneration | LLaVA-NeXT-Video | Video | llava-hf/LLaVA-NeXT-Video-7B-hf, etc. (see note) | - |
| PaliGemmaForConditionalGeneration | PaliGemma | Image(E) | google/paligemma-3b-pt-224, google/paligemma-3b-mix-224, etc. | - |
| Phi3VForCausalLM | Phi-3-Vision, Phi-3.5-Vision | Image(E+) | microsoft/Phi-3-vision-128k-instruct, microsoft/Phi-3.5-vision-instruct, etc. | - |
| PixtralForConditionalGeneration | Pixtral | Image(+) | mistralai/Pixtral-12B-2409 | - |
| QWenLMHeadModel | Qwen-VL | Image(E+) | Qwen/Qwen-VL, Qwen/Qwen-VL-Chat, etc. | - |
| Qwen2VLForConditionalGeneration | Qwen2-VL (see note) | Image(+) / Video(+) | Qwen/Qwen2-VL-2B-Instruct, Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2-VL-72B-Instruct, etc. | - |
Note:
E: Pre-computed embeddings can be passed as multimodal input.
+: Multiple multimodal inputs can be inserted into a single prompt.
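As a hypothetical illustration (not taken from this document), the snippet below shows how an image could be passed to one of the multimodal models above through the OpenAI-compatible Chat API described in section 4, assuming the server accepts the standard image_url content format. The model name, image URL, port, and API key are placeholders.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="taco-llm-test")

# Hypothetical multimodal chat request: a text part plus an image_url part in one message.
response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # example model from the table above
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)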

3. Installing TACO LLM

Environment Preparation

TACO-LLM depends on GPU-related base software such as the GPU driver and CUDA. To avoid problems caused by missing or mismatched base software dependencies, we provide a TACO-LLM Docker image and recommend that you use it as the runtime environment for TACO-LLM. You can obtain the image and start a container with the following command:
docker run -it \
    --privileged \
    --net=host \
    --ipc=host \
    --shm-size=16g \
    --name=taco_llm \
    --gpus all \
    -v /home/workspace:/home/workspace \
    ccr.ccs.tencentyun.com/taco/tacollm-dev:latest /bin/bash
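As an optional sanity check (not part of the official steps), you can confirm that the GPUs are visible inside the container before installing TACO-LLM. The sketch below assumes PyTorch is present in the image, which this document does not state; if it is not, running nvidia-smi in the container shell serves the same purpose.

# Hypothetical check inside the container: confirm CUDA devices are visible.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())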

Installing Whl Package

Notes:
If you have business requirements and want to try out TACO-LLM, submit a ticket to contact the TACO team and obtain the installation package.
1. After obtaining the TACO-LLM whl installation package through a ticket, install TACO-LLM in the container environment with the following command:
pip3 install taco_llm-${version}-cp310-cp310-linux_x86_64.whl
2. When the TACO-LLM whl package is installed, the required Python dependency packages are installed automatically.
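To quickly confirm the installation, you can try importing the package. This is only a sketch: the importable module name taco_llm below is assumed from the wheel filename and is not confirmed by this document.

# Hypothetical post-install check; "taco_llm" as a module name is an assumption.
import taco_llm

print(getattr(taco_llm, "__version__", "installed"))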

4. Using TACO LLM

TACO-LLM provides an HTTP server that implements the OpenAI Completions and Chat APIs. You can use it by following the steps below.

Start Service

First, execute the following command to start the service:
taco_llm serve facebook/opt-125m --api-key taco-llm-test
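Once the service is up, you can optionally confirm that it is reachable and list the served model before sending requests. This sketch assumes the server exposes the standard OpenAI-compatible /v1/models endpoint on the default port 8000.

import requests

# Hypothetical health check against the OpenAI-compatible model listing endpoint.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer taco-llm-test"},
)
print(resp.status_code, resp.json())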

Send a Request

You can use OpenAI's official Python client to send a request:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="taco-llm-test",
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
You can also use an HTTP client to send a request:
import json

import requests

api_key = "taco-llm-test"

headers = {
    "Authorization": f"Bearer {api_key}"
}

pload = {
    "model": "facebook/opt-125m",
    "prompt": "Hello!",
    "stream": True,
    "max_tokens": 128,
}

response = requests.post("http://localhost:8000/v1/completions",
                         headers=headers,
                         json=pload,
                         stream=True)

for chunk in response.iter_lines(chunk_size=8192,
                                 decode_unicode=False,
                                 delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        output = data["text"][0]
        print(output)

Complete Client Parameter Configuration

Except for a few unsupported parameters, TACO-LLM fully supports OpenAI's parameter configuration. Refer to the OpenAI API official documentation for the complete set of parameters; an example of a typical configuration is shown after this list. The unsupported parameters are as follows:
Chat: tools and tool_choice.
Completions: suffix.
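For illustration only, the request below combines several commonly used OpenAI parameters (temperature, top_p, max_tokens, and streaming) against the service started earlier; the values are arbitrary examples, not recommendations from this document.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="taco-llm-test")

# Example of common OpenAI sampling parameters; values are placeholders.
stream = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Write a haiku about containers."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")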

