Source: llm/qwen
Serve Qwen3/Qwen2 on Your Kubernetes or Cloud#
Qwen2 is one of the top open-source LLMs. As of June 2024, Qwen1.5-110B-Chat ranked higher than GPT-4-0613 on the LMSYS Chatbot Arena leaderboard.
Update (Apr 28, 2025) - SkyPilot now supports the Qwen3 model!
📰 Update (Sep 18, 2024) - SkyPilot now supports the Qwen2.5 model!
📰 Update (Jun 6, 2024) - SkyPilot now supports the Qwen2 model! It further improves on the already competitive Qwen1.5 models.
📰 Update (Apr 26, 2024) - SkyPilot now supports the Qwen1.5-110B model! It performs competitively with Llama-3-70B across a series of evaluations. Use qwen15-110b.yaml to serve the 110B model.
Start Qwen3 with one command#
sky launch -c qwen qwen3-235b.yaml
Why use SkyPilot vs. commercial hosting solutions?#
Get the best GPU availability by utilizing multiple resource pools across Kubernetes clusters and regions/clouds.
Pay the lowest cost - SkyPilot picks the cheapest resources across Kubernetes clusters and regions/clouds. No markups from managed solutions.
Scale up to multiple replicas across different locations and accelerators, all served behind a single endpoint
Everything stays in your Kubernetes or cloud account (your VMs and buckets)
Completely private - no one else sees your chat history
Run your own Qwen with SkyPilot#
After installing SkyPilot, run your own Qwen model with a single command.
Start serving Qwen3-235B on a single instance of any GPU listed in qwen3-235b.yaml, exposing an OpenAI-compatible endpoint (you can also switch to qwen25-72b.yaml or qwen25-7b.yaml for a smaller model):
sky launch -c qwen qwen3-235b.yaml
Send a completion request to the endpoint:
ENDPOINT=$(sky status --endpoint 8000 qwen)
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-235B-A22B-FP8",
"prompt": "My favorite food is",
"max_tokens": 512
}' | jq -r '.choices[0].text'
Send a chat completion request:
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-235B-A22B-FP8",
"messages": [
{
"role": "system",
"content": "You are a helpful and honest chat expert."
},
{
"role": "user",
"content": "What is the best food?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
Qwen3 output:
The concept of "the best food" is highly subjective and depends on personal preferences, cultural background, dietary needs, and even mood! For example:
- **Some crave comfort foods** like macaroni and cheese, ramen, or dumplings.
- **Others prioritize health** and might highlight dishes like quinoa bowls, grilled salmon, or fresh salads.
- **Global favorites** often include pizza, sushi, tacos, or curry.
- **Unique or adventurous eaters** might argue for dishes like insects, fermented foods, or molecular gastronomy creations.
Could you clarify what you mean by "best"? For instance:
- Are you asking about taste, health benefits, cultural significance, or something else?
- Are you looking for a specific dish, ingredient, or cuisine?
This helps me tailor a more meaningful answer! 😊
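Besides curl, the endpoint can also be queried programmatically. Below is a minimal sketch using the openai Python client (an assumption: the openai package is installed and the placeholder endpoint address is filled in from sky status --endpoint 8000 qwen; the server does not enforce an API key by default, so any string works):

from openai import OpenAI

# Placeholder: fill in with the output of `sky status --endpoint 8000 qwen`.
ENDPOINT = "x.x.x.x:8000"

# The server is OpenAI-compatible and, by default, does not check API keys,
# so a placeholder string is fine here.
client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="placeholder")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful and honest chat expert."},
        {"role": "user", "content": "What is the best food?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)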
Running Multimodal Qwen2-VL#
Start serving Qwen2-VL:
sky launch -c qwen2-vl qwen2-vl-7b.yaml
Send a multimodal chat completion request to the endpoint:
ENDPOINT=$(sky status --endpoint 8000 qwen2-vl)
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "Qwen/Qwen2-VL-7B-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Covert this logo to ASCII art"},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 1024
}' | jq .
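The same request can be made with the openai Python client as well; the sketch below (same assumptions as the earlier Python example, with the placeholder endpoint taken from sky status --endpoint 8000 qwen2-vl) shows how the text and image_url parts are combined in a single user message:

from openai import OpenAI

# Placeholder: fill in with the output of `sky status --endpoint 8000 qwen2-vl`.
ENDPOINT = "x.x.x.x:8000"

client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="token")

# Multimodal requests pass "content" as a list of parts: text plus image_url.
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this logo to ASCII art"},
            {"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)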
Scale up the service with SkyServe#
With SkyServe, a serving library built on top of SkyPilot, scaling up the Qwen service is as simple as running:
sky serve up -n qwen ./qwen25-72b.yaml
This will start the service with multiple replicas on the cheapest available locations and accelerators. SkyServe will automatically manage the replicas, monitor their health, autoscale based on load, and restart them when needed.
A single endpoint will be returned, and any request sent to it is routed to one of the ready replicas.
To check the status of the service, run:
sky serve status qwen
After a while, you will see the following output:
Services
NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
Qwen  1        -       READY   2/2       3.85.107.228:30002

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED    RESOURCES                   STATUS  REGION
Qwen          1   1        -         2 mins ago  1x Azure({'A100-80GB': 8})  READY   eastus
Qwen          2   1        -         2 mins ago  1x GCP({'L4': 8})           READY   us-east4-a
As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, and the accelerator type chosen is the cheapest one available on the clouds. That is, it maximizes the availability of the service while minimizing the cost.
To access the model, we use a curl command to send a request to the endpoint:
ENDPOINT=$(sky serve status --endpoint qwen)
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful and honest code assistant expert in Python."
},
{
"role": "user",
"content": "Show me the python code for quick sorting a list of integers."
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
Optional: Accessing Qwen with a Chat GUI#
It is also possible to access the Qwen service with a GUI using vLLM.
Start the chat web UI (change the --env flag to the model you are running):
sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2.5-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen)
Then, we can access the GUI at the returned gradio link:
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
Included files#
gui.yaml
# Starts a GUI server that connects to the Qwen OpenAI API server.
#
# Refer to llm/qwen/README.md for more details.
#
# Usage:
#
# 1. If you have an endpoint started on a cluster (sky launch):
# `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky status --ip qwen):8000`
# 2. If you have a SkyPilot Service started (sky serve up) called qwen:
# `sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen)`
#
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.
envs:
ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen.
MODEL_NAME: Qwen/Qwen1.5-72B-Chat
resources:
cpus: 2
setup: |
conda activate qwen
if [ $? -ne 0 ]; then
conda create -n qwen python=3.10 -y
conda activate qwen
fi
# Install Gradio for web UI.
pip install gradio openai
run: |
conda activate qwen
export PATH=$PATH:/sbin
WORKER_IP=$(hostname -I | cut -d' ' -f1)
CONTROLLER_PORT=21001
WORKER_PORT=21002
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://$ENDPOINT/v1 | tee ~/gradio.log
qwen15-110b.yaml
envs:
MODEL_NAME: Qwen/Qwen1.5-110B-Chat
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen2-vl-7b.yaml
envs:
MODEL_NAME: Qwen/Qwen2-VL-7B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000
setup: |
# Install a newer transformers version for qwen2_vl support.
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 2048 | tee ~/openai_api_server.log
qwen25-72b.yaml
envs:
MODEL_NAME: Qwen/Qwen2.5-72B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen25-7b.yaml
envs:
MODEL_NAME: Qwen/Qwen2.5-7B-Instruct
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
disk_tier: best
ports: 8000
setup: |
pip install vllm==0.6.1.post2
pip install vllm-flash-attn
run: |
export PATH=$PATH:/sbin
vllm serve $MODEL_NAME \
--host 0.0.0.0 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 1024 | tee ~/openai_api_server.log
qwen3-235b.yaml
envs:
MODEL_NAME: Qwen/Qwen3-235B-A22B-FP8
service:
# Specifying the path to the endpoint to check the readiness of the replicas.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
initial_delay_seconds: 1200
# How many replicas to manage.
replicas: 2
resources:
accelerators: {A100:8, A100-80GB:4, A100-80GB:8, H100:8, H200:8}
disk_size: 1024
disk_tier: best
memory: 32+
ports: 8000
setup: |
uv pip install "sglang>=0.4.6"
run: |
export PATH=$PATH:/sbin
export SGL_ENABLE_JIT_DEEPGEMM=1
# --tp 4 is required even with 8 GPUs, as the output size
# of qwen3 is not divisible by quantization block_n=128
python3 -m sglang.launch_server --model $MODEL_NAME \
--tp 4 --reasoning-parser qwen3 --port 8000 --host 0.0.0.0