High-Throughput Distributed Serving of DeepSeek-R1 with SGLang and SkyPilot

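On January 20, 2025, DeepSeek AI released DeepSeek-R1, a family of models with up to 671B parameters.

DeepSeek-R1 naturally exhibits many powerful and interesting reasoning behaviors. It outperforms state-of-the-art proprietary models such as OpenAI-o1-mini, and it is the first open-source LLM to rival closed-source models such as OpenAI-o1.

In this example, we use SGLang to serve the model in a distributed fashion with high throughput.

Note: This example covers the original DeepSeek-R1 671B model. For the smaller distilled models, see deepseek-r1-distilled.

Run the 671B DeepSeek-R1 on Kubernetes or any cloud

With SkyPilot, you can run the model across multiple nodes with a single command, using the SGLang framework.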

sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up

Below is the SkyPilot YAML for DeepSeek-R1 671B, located at llm/deepseek-r1/deepseek-r1-671B.yaml:

name: deepseek-r1

resources:
  accelerators: {H200:8, H100:8}
  disk_size: 1024 # Large disk for model weights
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 2 # Specify number of nodes to launch; requirements may vary based on accelerators

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 30000
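
The run section derives the tensor-parallel size and the rendezvous address from environment variables that SkyPilot sets on every node. The same arithmetic can be sketched in Python. This is only an illustration: the function name `sglang_launch_args` and the example IP addresses are hypothetical, and only the environment variable names and the flags are taken from the YAML above.

```python
def sglang_launch_args(env, model="deepseek-ai/DeepSeek-R1", init_port=5000):
    """Build the distributed flags for sglang.launch_server from SkyPilot env vars."""
    # SKYPILOT_NODE_IPS holds one IP per line; the first one is the head node
    # (the YAML uses `head -n1` for the same purpose).
    master_addr = env["SKYPILOT_NODE_IPS"].split()[0]
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    # TP = number of GPUs per node times number of nodes.
    tp = gpus_per_node * num_nodes
    return [
        "--model", model,
        "--tp", str(tp),
        "--dist-init-addr", f"{master_addr}:{init_port}",
        "--nnodes", str(num_nodes),
        "--node-rank", env["SKYPILOT_NODE_RANK"],
    ]


# Hypothetical environment for the second node of a 2 x 8-GPU cluster.
example_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
    "SKYPILOT_NUM_NODES": "2",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "1",
}
args = sglang_launch_args(example_env)
print(" ".join(args))
```

Every node computes the same TP value and the same head-node address; only the node rank differs between nodes.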

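You can also adjust accelerators and num_nodes to fit your needs. Common configurations include:

GPU	Number of nodes
H200:8	1
H100:8	2
A100-80GB:8	4
A100:8	8

You can override num_nodes (and the accelerators) on the command line without modifying the YAML file. For example: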

sky launch -c r1-A100 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up --gpus A100-80GB:8 --num-nodes 4

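Note: For A100 GPUs, use deepseek-r1-671B-A100.yaml, which includes a preprocessing step that converts the model from FP8 to BF16, because A100 does not support FP8. The conversion takes an additional 30-40 minutes. Alternatively, you can use a pre-converted BF16 model from the Hugging Face community to skip the conversion step.

Because the BF16 model takes more memory, an A100 deployment needs more nodes than an H100 deployment. If the H100 setup needs 2 nodes, the A100-80GB setup needs 4 nodes (twice as many) and the A100-40GB setup needs 8 nodes.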
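
The node counts can be sanity-checked with a back-of-the-envelope estimate. The sketch below is a rough approximation under stated assumptions, not a sizing tool: it counts weight memory only (1 byte per parameter for FP8, 2 bytes for BF16), ignores KV cache and activation memory, and rounds the node count up to a power of two, which is an assumption made here to match the table. The helper name `min_nodes` is hypothetical, and A100 is taken to mean the 40GB variant.

```python
import math

PARAMS_BILLIONS = 671
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}
# Per-GPU memory in GB; "A100" is assumed to be the 40GB variant.
GPU_MEM_GB = {"H200": 141, "H100": 80, "A100-80GB": 80, "A100": 40}


def min_nodes(gpu, dtype, gpus_per_node=8):
    """Smallest power-of-two node count whose total GPU memory holds the weights."""
    weights_gb = PARAMS_BILLIONS * BYTES_PER_PARAM[dtype]  # 1e9 params x bytes = GB
    node_gb = GPU_MEM_GB[gpu] * gpus_per_node
    needed = math.ceil(weights_gb / node_gb)
    # Round up to the next power of two (assumption, see the lead-in).
    return 1 << (needed - 1).bit_length()


for gpu, dtype in [("H200", "fp8"), ("H100", "fp8"),
                   ("A100-80GB", "bf16"), ("A100", "bf16")]:
    print(f"{gpu}:8 ({dtype}): {min_nodes(gpu, dtype)} node(s)")
```

Under these assumptions the BF16 weights come to about 1342 GB, in line with the "about 1.3TB" noted in the A100 YAML.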

For more configuration options, see the DeepSeek SGLang documentation.

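SkyPilot finds the cheapest candidate resources for you and automatically fails over across regions, cloud providers, or Kubernetes clusters to find the resources needed to launch the model.

It may take a while (30-40 minutes) for SGLang to download the model weights, compile, and start the server.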
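
Because startup is slow, a client script may want to wait for the server before sending requests. The following is a minimal polling sketch, not part of the example itself. It assumes the server answers on the /health path (the same path the readiness probe in the service section uses), and the helper names `http_health_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.request


def http_health_probe(base_url, timeout=5.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts are all OSErrors
        return False


def wait_until_ready(probe, max_wait_s=3600, interval_s=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Offline demonstration with a fake probe that succeeds on the third attempt.
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts), max_wait_s=10, interval_s=0,
                         sleep=lambda seconds: None)
print(ready)
```

Against a real deployment, the probe would be something like `lambda: http_health_probe("http://" + endpoint)`, with `endpoint` taken from `sky status --endpoint 30000 r1`.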

Query the endpoint

Once initialization completes, you can access the model through the following endpoint:

ENDPOINT=$(sky status --endpoint 30000 r1)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-671B",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "how many rs are in strawberry"
      }
    ]
  }' | jq .
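
The same request can be built from Python with only the standard library. This is a sketch under assumptions: the helper name `build_chat_request` is hypothetical, and the address 127.0.0.1:30000 is a placeholder for the value returned by `sky status --endpoint 30000 r1`. The URL path, headers, and payload mirror the curl command above.

```python
import json
import urllib.request


def build_chat_request(endpoint, user_prompt,
                       model="deepseek-ai/DeepSeek-R1-671B",
                       system_prompt="You are a helpful assistant."):
    """Build a POST request for the /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint; replace with the output of `sky status --endpoint 30000 r1`.
req = build_chat_request("127.0.0.1:30000", "how many rs are in strawberry")
print(req.full_url)

# To actually send the request once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```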

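You should get a response like the following. Interestingly, the reasoning is returned inline in the message content: it ends with a closing </think> tag but has no opening <think> tag.

How many Rs are in strawberry: So, the answer is 3. 🍓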

```console
{"id":"01add72820794f5c884c4d5c126d2a62","object":"chat.completion","created":1739493784,"model":"deepseek-ai/DeepSeek-R1-671B","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, let's figure out how many times the letter \"r\" appears in the word \"strawberry.\" First, I need to make sure I'm spelling \"strawberry\" correctly. Sometimes people might miss letters or add extra ones. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let's double-check. Strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Yes, that's correct. Now, I need to go through each letter one by one and count the number of \"r\"s.\n\nStarting with the first letter: S (no), T (no), R (yes, that's one). Then A (no), W (no), B (no), E (no), R (that's two), R (that's three), Y (no). Wait, wait, hold on. Let me write out the letters with their positions to be precise.\n\nBreaking down \"strawberry\" letter by letter:\n1. S\n2. T\n3. R\n4. A\n5. W\n6. B\n7. E\n8. R\n9. R\n10. Y\n\nSo, looking at positions 3, 8, and 9: that's three \"r\"s. But wait, does that match the actual spelling? Let me confirm again. The word is strawberry. Sometimes people might think it's \"strawberry\" with two \"r\"s, but actually, according to correct spelling, it's S-T-R-A-W-B-E-R-R-Y. So after the B and E, there are two R's, right? Let me check a dictionary or maybe think of the pronunciation. Straw-ber-ry. The \"ber\" part is one R, but the correct spelling includes two R's after the E. So yes, that makes three R's in total. Hmm, but let me make sure I'm not miscounting. So positions 3, 8, 9: R, then two R's at the end before Y. That's three R's. Wait, actually, in the breakdown above, position 3 is R, then positions 8 and 9 are the two R's. So total three. Yes, that's right. So the answer should be three. Let me see if I can find any source that confirms this. Alternatively, I can write the word again and count: S T R A W B E R R Y. 
So R appears once at the beginning (third letter) and then twice towards the end (8th and 9th letters). So total of three times. Therefore, the correct answer is three.\n</think>\n\nThe word \"strawberry\" contains **3** instances of the letter \"r\". Here's the breakdown:\n\n1. **S**  \n2. **T**  \n3. **R** (1st \"r\")  \n4. **A**  \n5. **W**  \n6. **B**  \n7. **E**  \n8. **R** (2nd \"r\")  \n9. **R** (3rd \"r\")  \n10. **Y**  \n\nSo, the answer is **3**. 🍓","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":1}],"usage":{"prompt_tokens":17,"total_tokens":688,"completion_tokens":671,"prompt_tokens_details":null}}
```
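
In the response above, the reasoning and the final answer share a single content string, separated by a closing `</think>` tag. A client that only wants the final answer could split on that tag. The sketch below assumes that layout; the helper name `split_reasoning` is hypothetical, and the sample string is an abbreviated stand-in for the real output.

```python
def split_reasoning(content, end_tag="</think>"):
    """Split message content into (reasoning, answer) at the closing think tag."""
    reasoning, sep, answer = content.partition(end_tag)
    if not sep:
        # No closing tag found: treat the whole content as the answer.
        return "", content.strip()
    reasoning = reasoning.strip()
    # Tolerate an optional opening tag in case the server includes one.
    if reasoning.startswith("<think>"):
        reasoning = reasoning[len("<think>"):].strip()
    return reasoning, answer.strip()


# Abbreviated sample in the same shape as the response above.
sample = "Okay, let's count the letters.\n</think>\n\nSo, the answer is **3**."
reasoning, answer = split_reasoning(sample)
print(answer)
```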

Generation speed

You can find the generation speed in the server logs.

Example speed for a single request on 2 H100:8 nodes on GCP (enabling gVNIC may give better performance):

(head, rank=0, pid=18260) [2025-02-14 00:42:22 DP2 TP2] Decode batch. #running-req: 1, #token: 210, token usage: 0.00, gen throughput (token/s): 11.45, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:25 DP2 TP2] Decode batch. #running-req: 1, #token: 250, token usage: 0.00, gen throughput (token/s): 11.53, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:29 DP2 TP2] Decode batch. #running-req: 1, #token: 290, token usage: 0.00, gen throughput (token/s): 11.42, #queue-req: 0
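
To summarize throughput over many log lines, the "gen throughput (token/s)" field can be pulled out with a regular expression. This is a small sketch that assumes the log format shown above stays the same; the helper name `decode_throughputs` is hypothetical.

```python
import re

THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9]+(?:\.[0-9]+)?)")


def decode_throughputs(log_text):
    """Return every decode throughput value (token/s) found in the log text."""
    return [float(value) for value in THROUGHPUT_RE.findall(log_text)]


log = """\
(head, rank=0, pid=18260) [2025-02-14 00:42:22 DP2 TP2] Decode batch. #running-req: 1, #token: 210, token usage: 0.00, gen throughput (token/s): 11.45, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:25 DP2 TP2] Decode batch. #running-req: 1, #token: 250, token usage: 0.00, gen throughput (token/s): 11.53, #queue-req: 0
(head, rank=0, pid=18260) [2025-02-14 00:42:29 DP2 TP2] Decode batch. #running-req: 1, #token: 290, token usage: 0.00, gen throughput (token/s): 11.42, #queue-req: 0
"""
values = decode_throughputs(log)
mean = sum(values) / len(values)
print(f"{mean:.2f} token/s over {len(values)} decode batches")
```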

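Deploy the service with multiple replicas

The launch command above starts only a single replica (2 nodes) of the service. SkyServe helps you deploy the service with multiple replicas, with out-of-the-box load balancing, autoscaling, and automatic recovery. Importantly, it can also serve on spot instances, which lowers the cost by 30%.

The only change needed is to add a service section with the serving-specific configuration: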

service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2

Then launch the SkyPilot YAML with a single command:

sky serve up -n r1-serve deepseek-r1-671B.yaml

Included files

deepseek-r1-671B-A100.yaml

# Adjusted from deepseek-r1-671B.yaml for A100.
name: deepseek-r1-A100

resources:
  accelerators: { A100-80GB:8 }
  disk_size: 2048 # The model in BF16 format takes about 1.3TB
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 4 # Specify number of nodes to launch, the requirement might be different for different accelerators

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

  echo "FP8 is not supported on A100, we need to convert the model to BF16 format"

  # Conversion script
  git clone https://github.com/deepseek-ai/DeepSeek-V3.git deepseek_repo
  # A workaround for running conversion script on A100. See https://github.com/deepseek-ai/DeepSeek-V3/issues/4
  CONVERSION_SCRIPT="deepseek_repo/inference/fp8_cast_bf16.py"
  sed -i 's/new_state_dict\[weight_name\] = weight_dequant(weight, scale_inv)/new_state_dict[weight_name] = weight_dequant(weight.float(), scale_inv)/' $CONVERSION_SCRIPT

  uv venv venv_convert && source venv_convert/bin/activate

  # setuptools is needed by triton
  uv pip install huggingface_hub setuptools -r deepseek_repo/inference/requirements.txt

  # Download the model weights and convert to BF16 format
  echo "Downloading model weights..."
  FP8_MODEL_DIR="DeepSeek-R1-FP8"
  python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='deepseek-ai/DeepSeek-R1', local_dir='./$FP8_MODEL_DIR')"

  # Convert the model to BF16 format
  MODEL_DIR="DeepSeek-R1-BF16"
  python $CONVERSION_SCRIPT \
    --input-fp8-hf-path $FP8_MODEL_DIR \
    --output-bf16-hf-path $MODEL_DIR

  if [ $? -ne 0 ]; then
    echo "BF16 conversion failed"
    exit 1
  fi

  MODEL_FILES=(
  "config.json"
  "generation_config.json"
  "modeling_deepseek.py"
  "configuration_deepseek.py"
  "tokenizer.json"
  "tokenizer_config.json"
  # the bf16 directory has its own model.safetensors.index.json
  )
  cp "${MODEL_FILES[@]/#/$FP8_MODEL_DIR/}" $MODEL_DIR/
  # See https://github.com/sgl-project/sglang/issues/3491
  sed -i '/"quantization_config": {/,/}/d' $MODEL_DIR/config.json

  echo "BF16 conversion completed. Model saved to $(realpath $MODEL_DIR)"
  ls -lh "$MODEL_DIR"  # List files for verification

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  # For A100, we only export the head node for serving requests
  if [ "$SKYPILOT_NODE_RANK" -eq 0 ]; then
      HEAD_NODE_ARGS="--host 0.0.0.0 --port 30000"
  else
      HEAD_NODE_ARGS=""
  fi

  python -m sglang.launch_server \
    --model-path DeepSeek-R1-BF16 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    $HEAD_NODE_ARGS

# Optional: Service configuration for SkyServe deployment
# This will be ignored when deploying with `sky launch`
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2

deepseek-r1-671B.yaml

name: deepseek-r1

resources:
  accelerators: {H200:8, H100:8, A100-80GB:8}
  disk_size: 1024 # Large disk for model weights
  disk_tier: best
  ports: 30000
  any_of:
    - use_spot: true
    - use_spot: false

num_nodes: 2 # Specify number of nodes to launch

setup: |
  # Install sglang with all dependencies using uv
  uv pip install "sglang[all]==0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer

  # Set up shared memory for better performance
  sudo bash -c "echo 'vm.max_map_count=655300' >> /etc/sysctl.conf"
  sudo sysctl -p

run: |
  # Launch the server with appropriate configuration
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # TP should be number of GPUs per node times number of nodes
  TP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --tp $TP \
    --dist-init-addr ${MASTER_ADDR}:5000 \
    --nnodes ${SKYPILOT_NUM_NODES} \
    --node-rank ${SKYPILOT_NODE_RANK} \
    --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile \
    --torch-compile-max-bs 8 \
    --host 0.0.0.0 \
    --port 30000

# Optional: Service configuration for SkyServe deployment
# This will be ignored when deploying with `sky launch`
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /health
    # Allow up to 1 hour for cold start
    initial_delay_seconds: 3600
  # Autoscaling from 0 to 2 replicas
  replica_policy:
    min_replicas: 0
    max_replicas: 2