Source: llm/dbrx
Databricks DBRX: A State-of-the-Art Open LLM
DBRX is an open, general-purpose LLM created by Databricks. It uses a mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input.
In this tutorial, you will serve the databricks/dbrx-instruct model on your own infra (an existing Kubernetes cluster or cloud VMs) with one command.
Prerequisites
Go to the HuggingFace model page and request access to the model databricks/dbrx-instruct.
Check that you have installed SkyPilot (docs).
Check that sky check shows clouds or Kubernetes are enabled.
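For reference, a minimal setup might look like the following (a sketch assuming a pip-based install; pick the extras that match your clouds):

pip install "skypilot-nightly[aws,gcp,kubernetes]"  # or: pip install skypilot
sky check  # at least one cloud / Kubernetes should show as enabled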
SkyPilot YAML
Click to see the full recipe YAML
envs:
MODEL_NAME: databricks/dbrx-instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {A100-80GB:8, A100-80GB:4, A100:8, A100:16}
cpus: 32+
memory: 512+
use_spot: True
disk_size: 512 # Ensure model checkpoints (~246GB) can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.10 -y
conda activate vllm
fi
# DBRX merged on master, 3/27/2024
pip install git+https://github.com/vllm-project/vllm.git@e24336b5a772ab3aa6ad83527b880f9e5050ea2a
pip install gradio tiktoken==0.6.0 openai
run: |
conda activate vllm
echo 'Starting vllm api server...'
# https://github.com/vllm-project/vllm/issues/3098
export PATH=$PATH:/sbin
# NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--gpu-memory-utilization 0.95 \
2>&1 | tee api_server.log &
while ! grep -q 'Uvicorn running on' api_server.log; do
echo 'Waiting for vllm api server to start...'
sleep 5
done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1
You can also get the full YAML file here.
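If you prefer to fetch the file directly, the recipe lives under llm/dbrx in the SkyPilot repository (URL assumed from that source path; adjust if the repo layout has changed):

wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/dbrx/dbrx.yaml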
Serving DBRX: Single Instance
Launch a spot instance to serve DBRX on your infra:
HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN
Example outputs:
...
I 03-28 08:40:47 optimizer.py:690] == Optimizer ==
I 03-28 08:40:47 optimizer.py:701] Target: minimizing cost
I 03-28 08:40:47 optimizer.py:713] Estimated cost: $2.44 / hour
I 03-28 08:40:47 optimizer.py:713]
I 03-28 08:40:47 optimizer.py:836] Considered resources (1 node):
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
I 03-28 08:40:47 optimizer.py:906] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
I 03-28 08:40:47 optimizer.py:906] Azure Standard_NC96ads_A100_v4[Spot] 96 880 A100-80GB:4 eastus 2.44 ✔
I 03-28 08:40:47 optimizer.py:906] AWS p4d.24xlarge[Spot] 96 1152 A100:8 us-east-2b 4.15
I 03-28 08:40:47 optimizer.py:906] Azure Standard_ND96asr_v4[Spot] 96 900 A100:8 eastus 4.82
I 03-28 08:40:47 optimizer.py:906] Azure Standard_ND96amsr_A100_v4[Spot] 96 1924 A100-80GB:8 southcentralus 5.17
I 03-28 08:40:47 optimizer.py:906] GCP a2-ultragpu-4g[Spot] 48 680 A100-80GB:4 us-east4-c 7.39
I 03-28 08:40:47 optimizer.py:906] GCP a2-highgpu-8g[Spot] 96 680 A100:8 us-central1-a 11.75
I 03-28 08:40:47 optimizer.py:906] GCP a2-ultragpu-8g[Spot] 96 1360 A100-80GB:8 us-east4-c 14.79
I 03-28 08:40:47 optimizer.py:906] GCP a2-megagpu-16g[Spot] 96 1360 A100:16 us-central1-a 22.30
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
...
To run on Kubernetes or use on-demand instances, add --no-use-spot to the above command.
Example outputs with Kubernetes / on-demand instances:
$ HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN --no-use-spot
...
I 03-28 08:47:27 optimizer.py:690] == Optimizer ==
I 03-28 08:47:27 optimizer.py:701] Target: minimizing cost
I 03-28 08:47:27 optimizer.py:713] Estimated cost: $0.0 / hour
I 03-28 08:47:27 optimizer.py:713]
I 03-28 08:47:27 optimizer.py:836] Considered resources (1 node):
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
I 03-28 08:47:27 optimizer.py:906] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
I 03-28 08:47:27 optimizer.py:906] Kubernetes 32CPU--512GB--8A100 32 512 A100:8 kubernetes 0.00 ✔
I 03-28 08:47:27 optimizer.py:906] Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69
I 03-28 08:47:27 optimizer.py:906] Fluidstack recUYj6oGJCvAvCXC7KQo5Fc7 252 960 A100-80GB:8 generic_1_canada 19.79
I 03-28 08:47:27 optimizer.py:906] GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11
I 03-28 08:47:27 optimizer.py:906] Paperspace A100-80Gx8 96 640 A100-80GB:8 East Coast (NY2) 25.44
I 03-28 08:47:27 optimizer.py:906] Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20
I 03-28 08:47:27 optimizer.py:906] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
I 03-28 08:47:27 optimizer.py:906] Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77
I 03-28 08:47:27 optimizer.py:906] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
I 03-28 08:47:27 optimizer.py:906] GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22
I 03-28 08:47:27 optimizer.py:906] AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97
I 03-28 08:47:27 optimizer.py:906] GCP a2-megagpu-16g 96 1360 A100:16 us-central1-a 55.74
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
...
Wait until the model is ready (this can take 10+ minutes), as indicated by these lines:
...
(task, pid=17433) Waiting for vllm api server to start...
...
(task, pid=17433) INFO: Started server process [20621]
(task, pid=17433) INFO: Waiting for application startup.
(task, pid=17433) INFO: Application startup complete.
(task, pid=17433) INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
...
(task, pid=17433) Running on local URL: http://127.0.0.1:8811
(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live
...
(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
🎉 Congratulations! 🎉 You have now launched the DBRX Instruct LLM on your infra.
You can interact with the model via:
- Standard OpenAI API-compatible endpoints (e.g., /v1/chat/completions)
- Gradio UI (automatically launched)
To curl /v1/chat/completions:
IP=$(sky status --ip dbrx)
curl http://$IP:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dbrx-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
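You can also sanity-check that the server is up via the standard model-listing endpoint (the same call appears in the comments of the included dbrx.yaml):

curl http://$IP:8081/v1/models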
To use the Gradio UI, open the URL shown in the logs:
(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live
To stop the instance:
sky stop dbrx
To shut down all resources:
sky down dbrx
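Note that sky stop preserves the cluster's disk, including the downloaded checkpoints, so a later sky start dbrx resumes without re-downloading, while sky down releases all resources.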
Serving DBRX: Scaling Up with SkyServe
After playing with the model, you can deploy it with autoscaling and load balancing using SkyServe.
With no change to the YAML, launch a fully managed service on your infra:
HF_TOKEN=xxx sky serve up dbrx.yaml -n dbrx --env HF_TOKEN
Wait until the service is ready:
watch -n10 sky serve status dbrx
Example outputs:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
dbrx 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
dbrx 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4
dbrx 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4
Get a single endpoint that load-balances across the replicas:
ENDPOINT=$(sky serve status --endpoint dbrx)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
To curl the endpoint:
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dbrx-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
To shut down all resources:
sky serve down dbrx
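The recipe above pins replicas: 2. As a sketch, you could instead let SkyServe autoscale within a range by swapping replicas for a replica policy in the service section (field names follow the SkyServe service spec; the QPS target below is an assumed value to tune for your workload):

service:
  readiness_probe: ...  # unchanged from the recipe
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2.5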
See more details in the SkyServe docs.
Included files
dbrx.yaml
# Serving Databricks DBRX on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# IP=$(sky status --ip dbrx)
# curl $IP:8081/v1/models
# curl http://$IP:8081/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/dbrx-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
envs:
MODEL_NAME: databricks/dbrx-instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {A100-80GB:8, A100-80GB:4, A100:8, A100:16}
cpus: 32+
memory: 512+
use_spot: True
disk_size: 512 # Ensure model checkpoints (~246GB) can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.10 -y
conda activate vllm
fi
# DBRX merged on master, 3/27/2024
pip install git+https://github.com/vllm-project/vllm.git@e24336b5a772ab3aa6ad83527b880f9e5050ea2a
pip install gradio tiktoken==0.6.0 openai
run: |
conda activate vllm
echo 'Starting vllm api server...'
# https://github.com/vllm-project/vllm/issues/3098
export PATH=$PATH:/sbin
# NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--gpu-memory-utilization 0.95 \
2>&1 | tee api_server.log &
while ! grep -q 'Uvicorn running on' api_server.log; do
echo 'Waiting for vllm api server to start...'
sleep 5
done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1