Source: llm/llama-3_2
Vision Llama 3.2#
The Llama 3.2 family was released by Meta on September 25, 2024. It includes not only the latest improved (and smaller) LLMs for chat, but also multimodal vision-language models. Let's point and launch it with SkyPilot.
Why use SkyPilot?#
Point, launch, and serve: simply point to the cloud/Kubernetes cluster you have access to, and launch the model there with a single command.
No lock-in: run on any supported cloud (AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI).
Everything stays in your cloud account (your VMs and buckets).
No one else sees your chat history.
Pay the absolute minimum: no markups from managed solutions.
Freely choose your own model size, GPU type, number of GPUs, etc., based on your scale and budget.
...and get all of this with one click, letting SkyPilot automate the infrastructure.
Prerequisites#
Go to the HuggingFace model pages and request access to the models meta-llama/Llama-3.2-3B-Instruct and meta-llama/Llama-3.2-11B-Vision-Instruct (the models served by the recipes below); once granted, you can verify access with the sketch after this list.
Check that you have installed SkyPilot (docs).
Check that sky check shows clouds or Kubernetes are enabled.
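If you want to confirm that your HuggingFace token actually has access to the gated repos before launching, the following is a minimal sketch using the huggingface_hub library (assumes HF_TOKEN is exported in your shell; this library is not part of the recipe itself):
# Optional: check gated-model access before launching.
# Assumes `pip install huggingface_hub` and that HF_TOKEN is set in your shell.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
for repo in ["meta-llama/Llama-3.2-3B-Instruct",
             "meta-llama/Llama-3.2-11B-Vision-Instruct"]:
    try:
        api.model_info(repo)
        print(f"OK: access granted for {repo}")
    except Exception as e:  # e.g. a gated-repo error if access was not granted
        print(f"NO ACCESS: {repo}: {e}")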
SkyPilot YAML#
Click to see the full recipe YAML
envs:
MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
# MODEL_NAME: meta-llama/Llama-3.2-3B-Vision
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
# accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
cpus: 8+
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
# Install huggingface transformers for the support of Llama 3.2
pip install git+https://github.com/huggingface/transformers.git@f0eabf6c7da2afbe8425546c092fa3722f9f219e
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096 \
2>&1
You can also get the full YAML file here.
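Note that the readiness_probe above issues a real /v1/chat/completions request with max_tokens: 1. Once a replica is up, you can send the same probe by hand when debugging; a minimal sketch in Python (the endpoint value is a placeholder):
# Reproduce the readiness probe request manually (useful when debugging a replica).
import requests

ENDPOINT = "1.2.3.4:8081"  # placeholder; obtain it with `sky status --endpoint 8081 llama3_2`
probe = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1,
}
resp = requests.post(f"http://{ENDPOINT}/v1/chat/completions", json=probe, timeout=30)
print(resp.status_code, resp.json())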
Point and launch Llama 3.2#
Launch an instance to serve Llama 3.2 on your infra:
$ HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
...
------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------------------------------
Kubernetes 4CPU--16GB--1L4 4 16 L4:1 kubernetes 0.00 ✔
RunPod 1x_L4_SECURE 4 24 L4:1 CA 0.44
GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
AWS g6.xlarge 4 16 L4:1 us-east-1 0.80
AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14
Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15
AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86
Cudo sapphire-rapids-h100_1x4v8gb 4 8 H100:1 ca-montreal-3 2.86
Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89
Azure Standard_NV36ads_A10_v5 36 440 A10:1 eastus 3.20
GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67
RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49
Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98
------------------------------------------------------------------------------------------------------------------
Wait until the model is ready (this can take 10+ minutes).
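Instead of watching the launch logs, you can poll the server until vLLM starts answering. A minimal sketch that reads the endpoint from sky status and polls the OpenAI-compatible /v1/models route (the 30-second interval is arbitrary):
# Poll the vLLM server until it is ready to accept requests.
import subprocess
import time

import requests

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

while True:
    try:
        r = requests.get(f"http://{endpoint}/v1/models", timeout=5)
        if r.ok:
            print("Server is ready:", r.json())
            break
    except requests.RequestException:
        pass
    print("Not ready yet, retrying in 30s...")
    time.sleep(30)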
🎉 Congratulations! 🎉 You have now launched the Llama 3.2 Instruct LLM on your own infra.
Chat with Llama 3.2 with the OpenAI API#
Curl the /v1/chat/completions endpoint:
ENDPOINT=$(sky status --endpoint 8081 llama3_2)
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}' | jq .
Example output:
{
"id": "chat-e7b6d2a2d2934bcab169f82812601baf",
"object": "chat.completion",
"created": 1727291780,
"model": "meta-llama/Llama-3.2-3B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I'm an artificial intelligence model known as Llama. Llama stands for \"Large Language Model Meta AI.\"",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 45,
"total_tokens": 68,
"completion_tokens": 23
},
"prompt_logprobs": null
}
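Since vLLM serves an OpenAI-compatible API, you can also talk to the model with the official openai Python client instead of curl. A minimal sketch (the dummy API key is ignored by the server unless you configured one):
# Chat with the served model through the OpenAI-compatible API.
import subprocess

from openai import OpenAI

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(resp.choices[0].message.content)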
To stop the instance:
sky stop llama3_2
To shut down all resources:
sky down llama3_2
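The CLI steps above can also be driven from Python via SkyPilot's programmatic API, which is handy for wiring the launch into scripts or CI. A minimal sketch of the launch step (assumes llama3_2.yaml is in the working directory; the exact API surface may differ slightly across SkyPilot versions):
# Launch the recipe programmatically, equivalent to `sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN`.
import os

import sky

task = sky.Task.from_yaml("llama3_2.yaml")
task.update_envs({"HF_TOKEN": os.environ["HF_TOKEN"]})
sky.launch(task, cluster_name="llama3_2")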
Point and launch Vision Llama 3.2#
Now, let's launch the vision Llama! Llama 3.2's multimodal capabilities open up many new use cases. We will use the 11B vision model here.
$ HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_TOKEN
------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------------------------------
Kubernetes 2CPU--8GB--1H100 2 8 H100:1 kubernetes 0.00 ✔
RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14
Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15
AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86
RunPod 1x_A100-80GB_SECURE 8 80 A100-80GB:1 CA 1.99
Cudo sapphire-rapids-h100_1x2v4gb 2 4 H100:1 ca-montreal-3 2.83
Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89
GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67
Azure Standard_NC24ads_A100_v4 24 220 A100-80GB:1 eastus 3.67
RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49
GCP a2-ultragpu-1g 12 170 A100-80GB:1 us-central1-a 5.03
Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98
------------------------------------------------------------------------------------------------------------------
Chat with Vision Llama 3.2#
ENDPOINT=$(sky status --endpoint 8081 llama3_2-vision)
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Turn this logo into ASCII art."},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 1024
}' | jq .
Example output (parsed)
Output 1:
-------------
- -
- - -
- - -
- -
-------------
Output 2:
^_________
/ \\
/ \\
/______________\\
| |
| |
|_______________|
\\ /
\\ /
\\________/
Raw output:
{
"id": "chat-c341b8a0b40543918f3bb2fef68b0952",
"object": "chat.completion",
"created": 1727295337,
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Sure, here is the logo in ASCII art:\n\n------------- \n- - \n- - - \n- - - \n- - \n------------- \n\nNote that this is a very simple representation and does not capture all the details of the original logo.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 18,
"total_tokens": 73,
"completion_tokens": 55
},
"prompt_logprobs": null
}
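The same OpenAI-compatible client works for the vision model; images are passed as image_url parts in the message content, mirroring the curl request above. A minimal sketch:
# Send a multimodal (text + image) request to the vision model.
import subprocess

from openai import OpenAI

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2-vision"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Turn this logo into ASCII art."},
            {"type": "image_url", "image_url": {
                "url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}},
        ],
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)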
Deploy Llama 3.2: scaling up with SkyServe#
After trying out the model, you can deploy it with autoscaling and load balancing using SkyServe.
With no change to the YAML, launch a fully managed service on your infra:
HF_TOKEN=xxx sky serve up llama3_2-vision-11b.yaml -n llama3_2 --env HF_TOKEN
Wait until the service is ready:
watch -n10 sky serve status llama3_2
Example output:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
llama3_2 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
llama3_2 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4
llama3_2 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4
Get a single endpoint that load-balances across the replicas:
ENDPOINT=$(sky serve status --endpoint llama3_2)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
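While a preempted spot replica is being replaced, individual requests may still fail transiently, so it is common to pair the endpoint with a small client-side retry. A minimal sketch (the retry count and backoff are illustrative, not prescriptive):
# Simple retry wrapper for requests against the load-balanced SkyServe endpoint.
import time

import requests

def chat_with_retry(endpoint, payload, retries=5, backoff=10):
    """POST to /v1/chat/completions, retrying on transient failures."""
    last_err = None
    for attempt in range(retries):
        try:
            r = requests.post(f"http://{endpoint}/v1/chat/completions",
                              json=payload, timeout=120)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as e:
            last_err = e
            print(f"Attempt {attempt + 1} failed: {e}; retrying in {backoff}s")
            time.sleep(backoff)
    raise RuntimeError(f"All {retries} attempts failed: {last_err}")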
Curl the endpoint:
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Covert this logo to ASCII art"},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 2048
}' | jq .
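Because the service endpoint load-balances across replicas, you can drive it with concurrent requests from a single client. A minimal sketch using a thread pool (the prompt and concurrency level are arbitrary):
# Fire several concurrent requests at the load-balanced SkyServe endpoint.
import subprocess
from concurrent.futures import ThreadPoolExecutor

import requests

endpoint = subprocess.run(
    ["sky", "serve", "status", "--endpoint", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [{"role": "user", "content": "Describe SkyPilot in one sentence."}],
    "max_tokens": 64,
}

def ask(i):
    r = requests.post(f"http://{endpoint}/v1/chat/completions",
                      json=payload, timeout=120)
    return i, r.json()["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for i, answer in pool.map(ask, range(4)):
        print(f"[{i}] {answer}")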
To shut down all resources:
sky serve down llama3_2
See the SkyServe docs for more details.
Develop and fine-tune the Llama 3 series#
Included files#
llama3_2-vision-11b.yaml
# Serving Meta Llama 3.2 on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky status --endpoint 8081 llama3_2)
#
# # We need to manually specify the stop_token_ids to make sure the model finish
# # on <|eot_id|>.
# curl http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Meta-Llama-3-8B-Instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ],
# "stop_token_ids": [128009, 128001]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
# HF_TOKEN=xxx sky serve up llama3_2.yaml -n llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky serve status --endpoint llama3_2)
# curl -L $ENDPOINT/v1/models
# curl -L http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/llama3-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
envs:
MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L40, L40S, A100, A100-80GB, H100}
disk_size: 1000 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--enforce-eager \
--limit-mm-per-prompt "image=1" \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096 \
--max-num-seqs 40 \
--port 8081 \
--disable-log-requests
llama3_2.yaml
# Serving Meta Llama 3.2 on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky status --endpoint 8081 llama3_2)
#
# # We need to manually specify the stop_token_ids to make sure the model finish
# # on <|eot_id|>.
# curl http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Meta-Llama-3-8B-Instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ],
# "stop_token_ids": [128009, 128001]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
# HF_TOKEN=xxx sky serve up llama3_2.yaml -n llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky serve status --endpoint llama3_2)
# curl -L $ENDPOINT/v1/models
# curl -L http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/llama3-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
envs:
MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
# MODEL_NAME: meta-llama/Llama-3.2-3B-Vision
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
# accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
cpus: 8+
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096