Source: llm/qwen
Serve Qwen3/Qwen2 on Your Kubernetes or Cloud#
Qwen2 is one of the top open-source LLMs. As of June 2024, Qwen1.5-110B-Chat ranked higher than GPT-4-0613 on the LMSYS Chatbot Arena leaderboard.
Update (Apr 28, 2025) - SkyPilot now supports the Qwen3 model!
📰 Update (Sep 18, 2024) - SkyPilot now supports the Qwen2.5 model!
📰 Update (Jun 6, 2024) - SkyPilot now supports the Qwen2 model! It further improves on the already competitive Qwen1.5 models.
📰 Update (Apr 26, 2024) - SkyPilot now supports the Qwen1.5-110B model! It performs competitively with Llama-3-70B across a series of evaluations. Use qwen15-110b.yaml to serve the 110B model.
Start Qwen3 with one command#
sky launch -c qwen qwen3-235b.yaml
Why use SkyPilot vs. commercial hosting solutions?#
Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and regions/clouds.
Pay the lowest cost - SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No markups from managed solutions.
Scale up to multiple replicas across different locations and accelerators, all served behind a single endpoint
Everything stays in your Kubernetes or cloud account (your VMs and buckets)
Completely private - no one else sees your chat history
Run your own Qwen with SkyPilot#
After installing SkyPilot, run your own Qwen model with a single command.
Start serving Qwen3-235B on a single instance of any GPU listed in qwen3-235b.yaml, exposing an OpenAI-compatible endpoint (you can also switch to qwen25-72b.yaml or qwen25-7b.yaml for a smaller model):
sky launch -c qwen qwen3-235b.yaml
Send a completion request to the endpoint:
ENDPOINT=$(sky status --endpoint 8000 qwen)
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-235B-A22B-FP8",
"prompt": "My favorite food is",
"max_tokens": 512
}' | jq -r '.choices[0].text'
Send a chat completion request:
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-235B-A22B-FP8",
"messages": [
{
"role": "system",
"content": "You are a helpful and honest chat expert."
},
{
"role": "user",
"content": "What is the best food?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
Qwen3 output:
The concept of "the best food" is highly subjective and depends on personal preferences, cultural background, dietary needs, and even mood! For example:
- **Some crave comfort foods** like macaroni and cheese, ramen, or dumplings.
- **Others prioritize health** and might highlight dishes like quinoa bowls, grilled salmon, or fresh salads.
- **Global favorites** often include pizza, sushi, tacos, or curry.
- **Unique or adventurous eaters** might argue for dishes like insects, fermented foods, or molecular gastronomy creations.
Could you clarify what you mean by "best"? For instance:
- Are you asking about taste, health benefits, cultural significance, or something else?
- Are you looking for a specific dish, ingredient, or cuisine?
This helps me tailor a more meaningful answer! 😊
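Besides curl, the endpoint can also be queried programmatically. Below is a minimal sketch using the openai Python client (an assumption: the openai package is installed and the placeholder endpoint address is filled in from sky status --endpoint 8000 qwen; the server does not enforce an API key by default, so any string works):

from openai import OpenAI

# Placeholder: fill in with the output of `sky status --endpoint 8000 qwen`.
ENDPOINT = "x.x.x.x:8000"

# The server is OpenAI-compatible and, by default, does not check API keys,
# so a placeholder string is fine here.
client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="placeholder")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful and honest chat expert."},
        {"role": "user", "content": "What is the best food?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)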
Running Multimodal Qwen2-VL#
Start serving Qwen2-VL:
sky launch -c qwen2-vl qwen2-vl-7b.yaml
Send a multimodal chat completion request to the endpoint:
ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "Qwen/Qwen2-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Covert this logo to ASCII art"},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 1024
}' | jq .
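The same request can be made with the openai Python client as well; the sketch below (same assumptions as the earlier Python example, with the placeholder endpoint taken from sky status --endpoint 8000 qwen2-vl) shows how the text and image_url parts are combined in a single user message:

from openai import OpenAI

# Placeholder: fill in with the output of `sky status --endpoint 8000 qwen2-vl`.
ENDPOINT = "x.x.x.x:8000"

client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="token")

# Multimodal requests pass "content" as a list of parts: text plus image_url.
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this logo to ASCII art"},
            {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)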
Scale up the service with SkyServe#
With SkyServe, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
sky serve up -n qwen ./qwen25-72b.yaml
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.
A single endpoint will be returned, and any request sent to it is routed to one of the ready replicas.
To check the status of the service, run:
sky serve status qwen
After a while, you will see the following output:
Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
Qwen  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED    RESOURCES                   STATUS  REGION
Qwen          1   1        -         2 mins ago  1x Azure({'A100-80GB': 8})  READY   eastus
Qwen          2   1        -         2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type chosen is the cheapest one available on the clouds. That is, it maximizes the availability of the service while minimizing the cost.
To access the model, we use a curl command to send a request to the endpoint:
ENDPOINT=$(sky serve status --endpoint qwen)
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful and honest code assistant expert in Python."
},
{
"role": "user",
"content": "Show me the python code for quick sorting a list of integers."
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
Optional: Accessing Qwen with a Chat GUI#
It is also possible to access the Qwen service with a GUI using vLLM.
Start the chat web UI (change the --env flag to the model you are running):
sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2.5-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen)
Then, we can access the GUI at the returned gradio link:
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
Included files#
gui.yaml
# Starts a GUI server that connects to the Qwen OpenAI API server.
#
# Refer to llm/qwen/README.md for more details.
#
# Usage:
#
# 1. If you have an endpoint started on a cluster (sky launch):
# `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky status --ip qwen):8000`
# 2. If you have a SkyPilot Service started (sky serve up) called qwen:
# `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.
envs:
ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen.
MODEL_NAME: Qwen/Qwen1.5-72B-Chat
resources:
cpus: 2
setup: |
conda activate qwen
if [ $? -ne 0 ]; then
conda create -n qwen python=3.10 -y
conda activate qwen
fi
# Install Gradio for web UI.
pip install gradio openai
run: |
conda activate qwen
export PATH=$PATH:/sbin
WORKER_IP=$(hostname -I | cut -d' ' -f1)
CONTROLLER_PORT=21001
WORKER_PORT=21002
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 | tee ~/gradio.log
qwen15-110b.yaml
envs:
MODEL_NAME: Qwen/Qwen1.5-110B-Chat
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen2-vl-7b.yaml
envs:
MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000
setup: |
# Install a newer transformers version for qwen2_vl support.
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 2048 | tee ~/openai_api_server.log
qwen25-72b.yaml
envs:
MODEL_NAME: Qwen/Qwen2.5-72B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen25-7b.yaml
envs:
MODEL_NAME: Qwen/Qwen2.5-7B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen3-235b.yaml
envs:
MODEL_NAME: Qwen/Qwen3-235B-A22B-FP8
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8, H100:8, H200:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
uv pip install "sglang>=0.4.6"
run: |
export PATH=$PATH:/sbin
export SGL_ENABLE_JIT_DEEPGEMM=1
# --tp 4 is required even with 8 GPUs, as the output size
# of qwen3 is not divisible by quantization block_n=128
python3 -m sglang.launch_server --model $MODEL_NAME \
--tp 4 --reasoning-parser qwen3 --port 8000 --host 0.0.0.0