Source: llm/llama-3_2
Vision Llama 3.2#
The Llama 3.2 family was released by Meta on September 25, 2024. It includes not only the latest improved (and smaller) LLMs for chat, but also multimodal vision-language models. Let's point and launch it with SkyPilot.
Why use SkyPilot?#
Point, launch, and serve: simply point to the cloud/Kubernetes cluster you have access to, and launch the model there with a single command.
No lock-in: run on any supported cloud (AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI).
Everything stays in your cloud account (your VMs and buckets).
No one else sees your chat history.
Pay the absolute minimum: no markups from managed solutions.
Freely choose your own model size, GPU type, number of GPUs, etc., based on your scale and budget.
...and get all of this with one click, letting SkyPilot automate the infrastructure.
Prerequisites#
Go to the HuggingFace model pages and request access to the models meta-llama/Llama-3.2-3B-Instruct and meta-llama/Llama-3.2-11B-Vision-Instruct (the models served by the recipes below); once granted, you can verify access with the sketch after this list.
Check that you have installed SkyPilot (docs).
Check that sky check shows clouds or Kubernetes are enabled.
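If you want to confirm that your HuggingFace token actually has access to the gated repos before launching, the following is a minimal sketch using the huggingface_hub library (assumes HF_TOKEN is exported in your shell; this library is not part of the recipe itself):
# Optional: check gated-model access before launching.
# Assumes `pip install huggingface_hub` and that HF_TOKEN is set in your shell.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
for repo in ["meta-llama/Llama-3.2-3B-Instruct",
             "meta-llama/Llama-3.2-11B-Vision-Instruct"]:
    try:
        api.model_info(repo)
        print(f"OK: access granted for {repo}")
    except Exception as e:  # e.g. a gated-repo error if access was not granted
        print(f"NO ACCESS: {repo}: {e}")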
SkyPilot YAML#
Click to see the full recipe YAML
envs:
MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
# MODEL_NAME: meta-llama/Llama-3.2-3B-Vision
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
# accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
cpus: 8+
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
# Install huggingface transformers for the support of Llama 3.2
pip install git+https://github.com/huggingface/transformers.git@f0eabf6c7da2afbe8425546c092fa3722f9f219e
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096 \
2>&1
You can also get the full YAML file here.
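Note that the readiness_probe above issues a real /v1/chat/completions request with max_tokens: 1. Once a replica is up, you can send the same probe by hand when debugging; a minimal sketch in Python (the endpoint value is a placeholder):
# Reproduce the readiness probe request manually (useful when debugging a replica).
import requests

ENDPOINT = "1.2.3.4:8081"  # placeholder; obtain it with `sky status --endpoint 8081 llama3_2`
probe = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1,
}
resp = requests.post(f"http://{ENDPOINT}/v1/chat/completions", json=probe, timeout=30)
print(resp.status_code, resp.json())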
Point and launch Llama 3.2#
Launch an instance to serve Llama 3.2 on your infra:
$ HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
...
------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------------------------------
Kubernetes 4CPU--16GB--1L4 4 16 L4:1 kubernetes 0.00 ✔
RunPod 1x_L4_SECURE 4 24 L4:1 CA 0.44
GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
AWS g6.xlarge 4 16 L4:1 us-east-1 0.80
AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14
Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15
AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86
Cudo sapphire-rapids-h100_1x4v8gb 4 8 H100:1 ca-montreal-3 2.86
Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89
Azure Standard_NV36ads_A10_v5 36 440 A10:1 eastus 3.20
GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67
RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49
Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98
------------------------------------------------------------------------------------------------------------------
Wait until the model is ready (this can take 10+ minutes).
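Instead of watching the launch logs, you can poll the server until vLLM starts answering. A minimal sketch that reads the endpoint from sky status and polls the OpenAI-compatible /v1/models route (the 30-second interval is arbitrary):
# Poll the vLLM server until it is ready to accept requests.
import subprocess
import time

import requests

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

while True:
    try:
        r = requests.get(f"http://{endpoint}/v1/models", timeout=5)
        if r.ok:
            print("Server is ready:", r.json())
            break
    except requests.RequestException:
        pass
    print("Not ready yet, retrying in 30s...")
    time.sleep(30)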
🎉 Congratulations! 🎉 You have now launched the Llama 3.2 Instruct LLM on your own infra.
Chat with Llama 3.2 with the OpenAI API#
Curl the /v1/chat/completions endpoint:
ENDPOINT=$(sky status --endpoint 8081 llama3_2)
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}' | jq .
Example output:
{
"id": "chat-e7b6d2a2d2934bcab169f82812601baf",
"object": "chat.completion",
"created": 1727291780,
"model": "meta-llama/Llama-3.2-3B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "I'm an artificial intelligence model known as Llama. Llama stands for \"Large Language Model Meta AI.\"",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 45,
"total_tokens": 68,
"completion_tokens": 23
},
"prompt_logprobs": null
}
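Since vLLM serves an OpenAI-compatible API, you can also talk to the model with the official openai Python client instead of curl. A minimal sketch (the dummy API key is ignored by the server unless you configured one):
# Chat with the served model through the OpenAI-compatible API.
import subprocess

from openai import OpenAI

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(resp.choices[0].message.content)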
To stop the instance:
sky stop llama3_2
To shut down all resources:
sky down llama3_2
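The CLI steps above can also be driven from Python via SkyPilot's programmatic API, which is handy for wiring the launch into scripts or CI. A minimal sketch of the launch step (assumes llama3_2.yaml is in the working directory; the exact API surface may differ slightly across SkyPilot versions):
# Launch the recipe programmatically, equivalent to `sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN`.
import os

import sky

task = sky.Task.from_yaml("llama3_2.yaml")
task.update_envs({"HF_TOKEN": os.environ["HF_TOKEN"]})
sky.launch(task, cluster_name="llama3_2")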
Point and launch Vision Llama 3.2#
Now, let's launch the vision Llama! Llama 3.2's multimodal capabilities open up many new use cases. We will use the 11B vision model here.
$ HF_TOKEN=xxx sky launch llama3_2-vision-11b.yaml -c llama3_2-vision --env HF_TOKEN
------------------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
------------------------------------------------------------------------------------------------------------------
Kubernetes 2CPU--8GB--1H100 2 8 H100:1 kubernetes 0.00 ✔
RunPod 1x_L40_SECURE 16 48 L40:1 CA 1.14
Fluidstack L40_48GB::1 32 60 L40:1 CANADA 1.15
AWS g6e.xlarge 4 32 L40S:1 us-east-1 1.86
RunPod 1x_A100-80GB_SECURE 8 80 A100-80GB:1 CA 1.99
Cudo sapphire-rapids-h100_1x2v4gb 2 4 H100:1 ca-montreal-3 2.83
Fluidstack H100_PCIE_80GB::1 28 180 H100:1 CANADA 2.89
GCP a2-highgpu-1g 12 85 A100:1 us-central1-a 3.67
Azure Standard_NC24ads_A100_v4 24 220 A100-80GB:1 eastus 3.67
RunPod 1x_H100_SECURE 16 80 H100:1 CA 4.49
GCP a2-ultragpu-1g 12 170 A100-80GB:1 us-central1-a 5.03
Azure Standard_NC40ads_H100_v5 40 320 H100:1 eastus 6.98
------------------------------------------------------------------------------------------------------------------
Chat with Vision Llama 3.2#
ENDPOINT=$(sky status --endpoint 8081 llama3_2-vision)
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Turn this logo into ASCII art."},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 1024
}' | jq .
Example output (parsed)
Output 1:
-------------
- -
- - -
- - -
- -
-------------
Output 2:
^_________
/ \\
/ \\
/______________\\
| |
| |
|_______________|
\\ /
\\ /
\\________/
Raw output:
{
"id": "chat-c341b8a0b40543918f3bb2fef68b0952",
"object": "chat.completion",
"created": 1727295337,
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Sure, here is the logo in ASCII art:\n\n------------- \n- - \n- - - \n- - - \n- - \n------------- \n\nNote that this is a very simple representation and does not capture all the details of the original logo.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 18,
"total_tokens": 73,
"completion_tokens": 55
},
"prompt_logprobs": null
}
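The same OpenAI-compatible client works for the vision model; images are passed as image_url parts in the message content, mirroring the curl request above. A minimal sketch:
# Send a multimodal (text + image) request to the vision model.
import subprocess

from openai import OpenAI

endpoint = subprocess.run(
    ["sky", "status", "--endpoint", "8081", "llama3_2-vision"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Turn this logo into ASCII art."},
            {"type": "image_url", "image_url": {
                "url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}},
        ],
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)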
Deploy Llama 3.2: scaling up with SkyServe#
After trying out the model, you can deploy it with autoscaling and load balancing using SkyServe.
With no change to the YAML, launch a fully managed service on your infra:
HF_TOKEN=xxx sky serve up llama3_2-vision-11b.yaml -n llama3_2 --env HF_TOKEN
Wait until the service is ready:
watch -n10 sky serve status llama3_2
Example output:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
llama3_2 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
llama3_2 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4
llama3_2 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 8}) READY us-east4
Get a single endpoint that load-balances across the replicas:
ENDPOINT=$(sky serve status --endpoint llama3_2)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
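While a preempted spot replica is being replaced, individual requests may still fail transiently, so it is common to pair the endpoint with a small client-side retry. A minimal sketch (the retry count and backoff are illustrative, not prescriptive):
# Simple retry wrapper for requests against the load-balanced SkyServe endpoint.
import time

import requests

def chat_with_retry(endpoint, payload, retries=5, backoff=10):
    """POST to /v1/chat/completions, retrying on transient failures."""
    last_err = None
    for attempt in range(retries):
        try:
            r = requests.post(f"http://{endpoint}/v1/chat/completions",
                              json=payload, timeout=120)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as e:
            last_err = e
            print(f"Attempt {attempt + 1} failed: {e}; retrying in {backoff}s")
            time.sleep(backoff)
    raise RuntimeError(f"All {retries} attempts failed: {last_err}")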
Curl the endpoint:
curl http://$ENDPOINT/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer token' \
--data '{
"model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type" : "text", "text": "Covert this logo to ASCII art"},
{"type": "image_url", "image_url": {"url": "https://pbs.twimg.com/profile_images/1584596138635632640/HWexMoH5_400x400.jpg"}}
]
}],
"max_tokens": 2048
}' | jq .
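Because the service endpoint load-balances across replicas, you can drive it with concurrent requests from a single client. A minimal sketch using a thread pool (the prompt and concurrency level are arbitrary):
# Fire several concurrent requests at the load-balanced SkyServe endpoint.
import subprocess
from concurrent.futures import ThreadPoolExecutor

import requests

endpoint = subprocess.run(
    ["sky", "serve", "status", "--endpoint", "llama3_2"],
    capture_output=True, text=True, check=True,
).stdout.strip()

payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [{"role": "user", "content": "Describe SkyPilot in one sentence."}],
    "max_tokens": 64,
}

def ask(i):
    r = requests.post(f"http://{endpoint}/v1/chat/completions",
                      json=payload, timeout=120)
    return i, r.json()["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for i, answer in pool.map(ask, range(4)):
        print(f"[{i}] {answer}")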
To shut down all resources:
sky serve down llama3_2
See the SkyServe docs for more details.
Develop and fine-tune the Llama 3 series#
Included files#
llama3_2-vision-11b.yaml
# Serving Meta Llama 3.2 on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky status --endpoint 8081 llama3_2)
#
# # We need to manually specify the stop_token_ids to make sure the model finish
# # on <|eot_id|>.
# curl http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Meta-Llama-3-8B-Instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ],
# "stop_token_ids": [128009, 128001]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
# HF_TOKEN=xxx sky serve up llama3_2.yaml -n llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky serve status --endpoint llama3_2)
# curl -L $ENDPOINT/v1/models
# curl -L http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/llama3-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
envs:
MODEL_NAME: meta-llama/Llama-3.2-11B-Vision-Instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L40, L40S, A100, A100-80GB, H100}
disk_size: 1000 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--enforce-eager \
--limit-mm-per-prompt "image=1" \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096 \
--max-num-seqs 40 \
--port 8081 \
--disable-log-requests
llama3_2.yaml
# Serving Meta Llama 3.2 on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch llama3_2.yaml -c llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky status --endpoint 8081 llama3_2)
#
# # We need to manually specify the stop_token_ids to make sure the model finish
# # on <|eot_id|>.
# curl http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "meta-llama/Meta-Llama-3-8B-Instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ],
# "stop_token_ids": [128009, 128001]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
# HF_TOKEN=xxx sky serve up llama3_2.yaml -n llama3_2 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# ENDPOINT=$(sky serve status --endpoint llama3_2)
# curl -L $ENDPOINT/v1/models
# curl -L http://$ENDPOINT/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/llama3-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
envs:
MODEL_NAME: meta-llama/Llama-3.2-3B-Instruct
# MODEL_NAME: meta-llama/Llama-3.2-3B-Vision
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {L4:1, L40S:1, L40:1, A10g:1, A10:1, A100:1, H100:1}
# accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
cpus: 8+
disk_size: 512 # Ensure model checkpoints can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
pip install vllm==0.6.2
run: |
echo 'Starting vllm api server...'
vllm serve $MODEL_NAME \
--port 8081 \
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--max-model-len 4096