Source: llm/llama-3_1

Serving Llama 3.1 on Your Own Infrastructure

Llama-3.1 on SkyPilot

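On July 23, 2024, Meta AI released the Llama 3.1 model family, including a 405B-parameter model available in both base and instruction-tuned forms.

At release, Llama 3.1 405B was the most capable open LLM to date, and the first open LLM to closely rival state-of-the-art proprietary models such as GPT-4o and Claude 3.5 Sonnet.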

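This guide walks through serving Llama 3.1 models entirely on your own infrastructure (cluster or cloud VPC). Supported infrastructure:

- Local GPU workstation
- Kubernetes cluster
- Cloud accounts (12+ clouds supported)

SkyPilot is used as the unified framework to launch serving on any (or multiple) infrastructure that you bring.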

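Serving Llama 3.1 on your infra

Below is a step-by-step guide to using SkyPilot to test a new model on a GPU dev node and then package it for one-click deployment across any infrastructure.

To skip directly to the packaged deployment YAML for Llama 3.1, see Step 3: Package and deploy using SkyPilot.

GPUs required for serving Llama 3.1

Llama 3.1 comes in different sizes, and each size has different GPU requirements. The model-GPU compatibility matrix below applies to both the pretrained and the instruction-tuned models: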

GPU	Meta-Llama-3.1-8B	Meta-Llama-3.1-70B	Meta-Llama-3.1-405B-FP8
L4:1	✅, with `--max-model-len 4096`	❌	❌
L4:8	✅	❌	❌
A100:8	✅	✅	❌
A100-80GB:8	✅	✅	✅, with `--max-model-len 4096`
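As a rough sanity check on the matrix, the sketch below compares the size of the model weights alone against the total GPU memory of each configuration. It assumes nominal parameter counts (8B, 70B, 405B), 2 bytes per parameter for the BF16 checkpoints and 1 byte for FP8, and 24 / 40 / 80 GB per L4 / A100 / A100-80GB. The helper names are hypothetical, and the check ignores KV cache, activations, and runtime overhead, so it is only a lower bound.

```python
# Nominal parameter counts (in billions) and bytes per parameter.
# Assumption: BF16 weights take 2 bytes each; FP8 weights take 1 byte each.
MODELS = {
    "Meta-Llama-3.1-8B": (8, 2),
    "Meta-Llama-3.1-70B": (70, 2),
    "Meta-Llama-3.1-405B-FP8": (405, 1),
}

# Memory per GPU in GB.
GPU_MEMORY_GB = {"L4": 24, "A100": 40, "A100-80GB": 80}


def weight_gb(model: str) -> int:
    """Approximate size of the model weights in GB."""
    params_billion, bytes_per_param = MODELS[model]
    return params_billion * bytes_per_param


def total_gpu_gb(gpu_spec: str) -> int:
    """Total GPU memory in GB for a spec such as 'A100-80GB:8'."""
    name, count = gpu_spec.rsplit(":", 1)
    return GPU_MEMORY_GB[name] * int(count)


def weights_fit(model: str, gpu_spec: str) -> bool:
    """Necessary (not sufficient) condition: the weights alone must fit."""
    return weight_gb(model) <= total_gpu_gb(gpu_spec)


if __name__ == "__main__":
    for spec in ("L4:1", "L4:8", "A100:8", "A100-80GB:8"):
        for model in MODELS:
            print(spec, model, weights_fit(model, spec))
```

On a single L4, the 8B weights (about 16 GB) leave only about 8 GB for everything else, which is consistent with the matrix capping the context length there. The matrix stays the reference where the two disagree: the 70B weights (about 140 GB) nominally fit in the 192 GB of L4:8, yet that cell is marked unsupported.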

Step 0: Bring your infra

Install SkyPilot on your local machine:

pip install 'skypilot-nightly[all]'

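Pick one of the following, depending on which infrastructure you want to run Llama 3.1 on.

If your local machine is a GPU node, use this command to bring up a lightweight Kubernetes cluster: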

sky local up

If you have a Kubernetes GPU cluster (e.g., on-prem, EKS / GKE / AKS / …):

# Should show Enabled if you have ~/.kube/config set up.
sky check kubernetes

If you want to use clouds (e.g., reserved instances), 12+ clouds are supported:

sky check

See the docs for details.

Step 1: Get a GPU dev node (pod or VM)

Tip: If you only want the final deployment YAML, skip directly to Step 3.

One command gets you a GPU dev pod/VM:

sky launch -c llama --gpus A100-80GB:8

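If you are using a local machine or Kubernetes, the above command creates a pod. If you are using clouds, it creates a VM.

You can add the -r / --retry-until-up flag to have SkyPilot retry automatically and guard against out-of-capacity errors.

Tip: Vary the --gpus flag to get different GPU types and counts. For example, --gpus H100:8 gets you a pod with 8x H100 GPUs.

Run sky show-gpus to see all the GPU types available on your infrastructure.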

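Once provisioned, you can connect to the node to start dev work. Two methods are recommended:

- Open VSCode, click the bottom-left corner, select Connect to Host, and type llama
- Or SSH into it with ssh llama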

Step 2: Inside the dev node, test serving

Once logged in, run the following to install and run vLLM (which automatically pulls the model weights from HuggingFace):

pip install vllm==0.5.3.post1 huggingface_hub

# Paste your HuggingFace token to get access to Meta Llama repos:
# https://hugging-face.cn/collections/meta-llama/llama-31-669fc079a0c406a149a5738f
huggingface-cli login

Now we are ready to start serving. If you have N=8 GPUs:

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 8

Change the value of --tensor-parallel-size to the number of GPUs you have.
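To keep the tensor-parallel size in step with the GPU request, the command can be derived from the same spec string passed to --gpus. This is a minimal sketch: the helper name vllm_serve_command is hypothetical, and it assumes a spec without a count (such as L4) means one GPU.

```python
import shlex


def vllm_serve_command(model: str, gpu_spec: str) -> str:
    """Build a `vllm serve` command whose tensor-parallel size equals the
    GPU count in a spec such as 'A100-80GB:8'."""
    _, sep, count = gpu_spec.partition(":")
    # Assumption: a spec without ':<count>' requests a single GPU.
    num_gpus = int(count) if sep else 1
    if num_gpus < 1:
        raise ValueError(f"GPU count must be positive, got {num_gpus}")
    return shlex.join(
        ["vllm", "serve", model, "--tensor-parallel-size", str(num_gpus)]
    )


print(vllm_serve_command("meta-llama/Meta-Llama-3.1-8B-Instruct", "A100-80GB:8"))
```

For the A100-80GB:8 dev node launched in Step 1, this prints the same command as shown above.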

Tip: Available model names can be found here and below.

Pretrained models:
- Meta-Llama-3.1-8B
- Meta-Llama-3.1-70B
- Meta-Llama-3.1-405B-FP8
Instruction-tuned models:
- Meta-Llama-3.1-8B-Instruct
- Meta-Llama-3.1-70B-Instruct
- Meta-Llama-3.1-405B-Instruct-FP8

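The full-precision 405B model, Meta-Llama-3.1-405B, requires multi-node inference and is a work in progress. Join the SkyPilot community Slack for discussions.

Test that curl works from within the node: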

ENDPOINT=127.0.0.1:8000
curl http://$ENDPOINT/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq

🎉 Voila! You should see results like this:

[Figure: Llama-3.1 on SkyPilot]

When you are done, terminate your cluster with:

sky down llama

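Step 3: Package and deploy using SkyPilot

Now that we have verified the model works, let's package it for hands-free deployment.

Whichever infrastructure you use for GPUs, SkyPilot abstracts away the mundane infra tasks (e.g., setting up services on K8s, opening ports for cloud VMs), so that AI models can be deployed with one command.

Deploying via SkyPilot has several key benefits:

- The control node and replicas stay completely within your infrastructure
- Automatic load balancing across multiple replicas
- Automatic recovery of replicas
- Replicas can use different infrastructure to save significant costs
  - e.g., a mix of clouds, or a mix of reserved and spot GPUs

The YAML, serve.yaml: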

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3.1-8B-Instruct
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  cpus: 32+
  disk_size: 1000  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  pip install vllm==0.5.3.post1
  pip install vllm-flash-attn==2.5.9.post1
  # Install Gradio for web UI.
  pip install gradio openai

run: |
  echo 'Starting vllm api server...'
  
  vllm serve $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 4096 \
    --port 8081 \
    2>&1 | tee api_server.log &

  while ! grep -q 'Uvicorn running on' api_server.log; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done
  
  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1

You can also get the full YAML file here.

Launch a fully managed service with load balancing and auto-recovery:

HF_TOKEN=xxx sky serve up llama-3_1.yaml -n llama31 --env HF_TOKEN --gpus L4:1 --env MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct

Wait until the service is ready:

watch -n10 sky serve status llama31
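The readiness_probe in the YAML above is an actual chat completion request limited to one output token, so a replica only counts as ready once the model can answer. The sketch below reproduces that request by hand for debugging a single replica. It assumes the probe is a plain JSON POST to the configured path and that HTTP 200 means ready; build_probe and is_ready are hypothetical helper names, not part of SkyPilot.

```python
import json
import urllib.request


def build_probe(base_url: str, model_name: str) -> urllib.request.Request:
    """Build the POST request described by readiness_probe in the YAML."""
    post_data = {
        "model": model_name,
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "max_tokens": 1,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(post_data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_ready(base_url: str, model_name: str, timeout: float = 15.0) -> bool:
    """Return True if the replica answers the probe with HTTP 200."""
    request = build_probe(base_url, model_name)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

For example, is_ready("http://localhost:8081", "meta-llama/Meta-Llama-3.1-8B-Instruct") can be run on a replica itself, where 8081 is the port set in the YAML.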

Get a single endpoint that load-balances across replicas:

ENDPOINT=$(sky serve status --endpoint llama31)

Query the endpoint in a terminal:

curl -L http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }' | jq .

Example output:

{
  "id": "chat-5cdbc2091c934e619e56efd0ed85e28f",
  "object": "chat.completion",
  "created": 1721784853,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am a helpful assistant, here to provide information and assist with tasks to the best of my abilities. I'm a computer program designed to simulate conversation and answer questions on a wide range of topics. I can help with things like:\n\n* Providing definitions and explanations\n* Answering questions on history, science, and technology\n* Generating text and ideas\n* Translating languages\n* Offering suggestions and recommendations\n* And more!\n\nI'm constantly learning and improving, so feel free to ask me anything. What can I help you with today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "total_tokens": 136,
    "completion_tokens": 111
  }
}
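To consume the response in a program rather than through jq, the reply text and token accounting can be read straight from the JSON. This is a small sketch using only the fields visible in the output above. The helper names are hypothetical, and the sample is a trimmed copy of that output with the message content shortened.

```python
import json


def extract_reply(response_text: str) -> str:
    """Return the assistant message from a chat completion response."""
    response = json.loads(response_text)
    return response["choices"][0]["message"]["content"]


def usage_is_consistent(response_text: str) -> bool:
    """Check that prompt and completion tokens add up to the total."""
    usage = json.loads(response_text)["usage"]
    return usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]


# Trimmed copy of the output above; the message content is shortened.
sample = """
{
  "object": "chat.completion",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "I am a helpful assistant."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "total_tokens": 136, "completion_tokens": 111}
}
"""

print(extract_reply(sample))
print(usage_is_consistent(sample))
```

In the output above, 25 prompt tokens plus 111 completion tokens give the reported total of 136.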

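🎉 Congratulations! You are now serving a Llama 3.1 8B model across two replicas. To recap, all model replicas stay within your own private infrastructure, and SkyPilot ensures they are healthy and available.

See the SkyServe docs for details on autoscaling, rolling updates, and more.

When you are done, shut down all resources: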

sky serve down llama31

Bonus: Finetuning Llama 3.1

You can also finetune Llama 3.1 on your infra with SkyPilot. Check out our blog for more details.

Included files

llama-3_1.yaml

# Serving Meta Llama-3.1 on your own infra.
#
# Usage:
#
#  # Launch Llama-3.1 8B on a single L4 GPU:
#  HF_TOKEN=xxx sky launch llama-3_1.yaml -c llama31 --gpus L4:1 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
#
#  # Launch Llama-3.1 405B-FP8 on 8x A100-80GB GPUs:
#  HF_TOKEN=xxx sky launch llama-3_1.yaml -c llama31 --gpus A100-80GB:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
#
# curl /v1/chat/completions:
#
#   ENDPOINT=$(sky status --endpoint 8081 llama31)
#
#   curl http://$ENDPOINT/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{
#       "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
#       "messages": [
#         {
#           "role": "system",
#           "content": "You are a helpful assistant."
#         },
#         {
#           "role": "user",
#           "content": "Who are you?"
#         }
#       ]
#     }'
#
# Chat with model with Gradio UI (URLs printed in logs):
#
#   Running on local URL:  http://127.0.0.1:8811
#   Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
#  HF_TOKEN=xxx sky serve up llama-3_1.yaml -n llama31 --env HF_TOKEN --gpus L4:1 --env MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
#
# curl /v1/chat/completions:
#
#   ENDPOINT=$(sky serve status --endpoint llama31)
#   curl -L $ENDPOINT/v1/models
#   curl -L http://$ENDPOINT/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{
#       "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
#       "messages": [
#         {
#           "role": "system",
#           "content": "You are a helpful assistant."
#         },
#         {
#           "role": "user",
#           "content": "Who are you?"
#         }
#       ]
#     }'


envs:
  MODEL_NAME: meta-llama/Meta-Llama-3.1-8B-Instruct
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  cpus: 32+
  disk_size: 1000  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  pip install vllm==0.5.3.post1
  pip install vllm-flash-attn==2.5.9.post1
  # Install Gradio for web UI.
  pip install gradio openai

run: |
  echo 'Starting vllm api server...'
  
  vllm serve $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 4096 \
    --port 8081 \
    2>&1 | tee api_server.log &

  while ! grep -q 'Uvicorn running on' api_server.log; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done
  
  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1