Source: llm/dbrx
Databricks DBRX: A State-of-the-Art Open LLM
DBRX is an open, general-purpose LLM created by Databricks. It uses a mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input.
In this tutorial, you will serve the databricks/dbrx-instruct model on your own infra (an existing Kubernetes cluster or cloud VMs) with one command.
Prerequisites
Go to the HuggingFace model page and request access to the model databricks/dbrx-instruct.
Check that you have installed SkyPilot (docs).
Check that sky check shows clouds or Kubernetes are enabled.
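For reference, a minimal setup might look like the following (a sketch assuming a pip-based install; pick the extras that match your clouds):

pip install "skypilot-nightly[aws,gcp,kubernetes]"  # or: pip install skypilot
sky check  # at least one cloud / Kubernetes should show as enabled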
SkyPilot YAML
Click to see the full recipe YAML
envs:
MODEL_NAME: databricks/dbrx-instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {A100-80GB:8, A100-80GB:4, A100:8, A100:16}
cpus: 32+
memory: 512+
use_spot: True
disk_size: 512 # Ensure model checkpoints (~246GB) can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.10 -y
conda activate vllm
fi
# DBRX merged on master, 3/27/2024
pip install git+https://github.com/vllm-project/vllm.git@e24336b5a772ab3aa6ad83527b880f9e5050ea2a
pip install gradio tiktoken==0.6.0 openai
run: |
conda activate vllm
echo 'Starting vllm api server...'
# https://github.com/vllm-project/vllm/issues/3098
export PATH=$PATH:/sbin
# NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--gpu-memory-utilization 0.95 \
2>&1 | tee api_server.log &
while ! grep -q 'Uvicorn running on' api_server.log; do
echo 'Waiting for vllm api server to start...'
sleep 5
done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1
You can also get the full YAML file here.
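If you prefer to fetch the file directly, the recipe lives under llm/dbrx in the SkyPilot repository (URL assumed from that source path; adjust if the repo layout has changed):

wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/dbrx/dbrx.yaml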
Serving DBRX: Single Instance
Launch a spot instance to serve DBRX on your infra:
HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN
Example outputs:
...
I 03-28 08:40:47 optimizer.py:690] == Optimizer ==
I 03-28 08:40:47 optimizer.py:701] Target: minimizing cost
I 03-28 08:40:47 optimizer.py:713] Estimated cost: $2.44 / hour
I 03-28 08:40:47 optimizer.py:713]
I 03-28 08:40:47 optimizer.py:836] Considered resources (1 node):
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
I 03-28 08:40:47 optimizer.py:906] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
I 03-28 08:40:47 optimizer.py:906] Azure Standard_NC96ads_A100_v4[Spot] 96 880 A100-80GB:4 eastus 2.44 ✔
I 03-28 08:40:47 optimizer.py:906] AWS p4d.24xlarge[Spot] 96 1152 A100:8 us-east-2b 4.15
I 03-28 08:40:47 optimizer.py:906] Azure Standard_ND96asr_v4[Spot] 96 900 A100:8 eastus 4.82
I 03-28 08:40:47 optimizer.py:906] Azure Standard_ND96amsr_A100_v4[Spot] 96 1924 A100-80GB:8 southcentralus 5.17
I 03-28 08:40:47 optimizer.py:906] GCP a2-ultragpu-4g[Spot] 48 680 A100-80GB:4 us-east4-c 7.39
I 03-28 08:40:47 optimizer.py:906] GCP a2-highgpu-8g[Spot] 96 680 A100:8 us-central1-a 11.75
I 03-28 08:40:47 optimizer.py:906] GCP a2-ultragpu-8g[Spot] 96 1360 A100-80GB:8 us-east4-c 14.79
I 03-28 08:40:47 optimizer.py:906] GCP a2-megagpu-16g[Spot] 96 1360 A100:16 us-central1-a 22.30
I 03-28 08:40:47 optimizer.py:906] ----------------------------------------------------------------------------------------------------------------------
...
To run on Kubernetes or use on-demand instances, add --no-use-spot to the above command.
Example outputs with Kubernetes / on-demand instances:
$ HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN --no-use-spot
...
I 03-28 08:47:27 optimizer.py:690] == Optimizer ==
I 03-28 08:47:27 optimizer.py:701] Target: minimizing cost
I 03-28 08:47:27 optimizer.py:713] Estimated cost: $0.0 / hour
I 03-28 08:47:27 optimizer.py:713]
I 03-28 08:47:27 optimizer.py:836] Considered resources (1 node):
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
I 03-28 08:47:27 optimizer.py:906] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
I 03-28 08:47:27 optimizer.py:906] Kubernetes 32CPU--512GB--8A100 32 512 A100:8 kubernetes 0.00 ✔
I 03-28 08:47:27 optimizer.py:906] Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69
I 03-28 08:47:27 optimizer.py:906] Fluidstack recUYj6oGJCvAvCXC7KQo5Fc7 252 960 A100-80GB:8 generic_1_canada 19.79
I 03-28 08:47:27 optimizer.py:906] GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11
I 03-28 08:47:27 optimizer.py:906] Paperspace A100-80Gx8 96 640 A100-80GB:8 East Coast (NY2) 25.44
I 03-28 08:47:27 optimizer.py:906] Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20
I 03-28 08:47:27 optimizer.py:906] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
I 03-28 08:47:27 optimizer.py:906] Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77
I 03-28 08:47:27 optimizer.py:906] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
I 03-28 08:47:27 optimizer.py:906] GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22
I 03-28 08:47:27 optimizer.py:906] AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97
I 03-28 08:47:27 optimizer.py:906] GCP a2-megagpu-16g 96 1360 A100:16 us-central1-a 55.74
I 03-28 08:47:27 optimizer.py:906] ------------------------------------------------------------------------------------------------------------------
...
Wait until the model is ready (this can take 10+ minutes), as indicated by these lines:
...
(task, pid=17433) Waiting for vllm api server to start...
...
(task, pid=17433) INFO: Started server process [20621]
(task, pid=17433) INFO: Waiting for application startup.
(task, pid=17433) INFO: Application startup complete.
(task, pid=17433) INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
...
(task, pid=17433) Running on local URL: http://127.0.0.1:8811
(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live
...
(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
🎉 Congratulations! 🎉 You have now launched the DBRX Instruct LLM on your infra.
You can interact with the model via:
- Standard OpenAI API-compatible endpoints (e.g., /v1/chat/completions)
- Gradio UI (automatically launched)
To curl /v1/chat/completions:
IP=$(sky status --ip dbrx)
curl http://$IP:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dbrx-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
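You can also sanity-check that the server is up via the standard model-listing endpoint (the same call appears in the comments of the included dbrx.yaml):

curl http://$IP:8081/v1/models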
To use the Gradio UI, open the URL shown in the logs:
(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live
To stop the instance:
sky stop dbrx
To shut down all resources:
sky down dbrx
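Note that sky stop preserves the cluster's disk, including the downloaded checkpoints, so a later sky start dbrx resumes without re-downloading, while sky down releases all resources.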
Serving DBRX: Scaling Up with SkyServe
After playing with the model, you can deploy it with autoscaling and load balancing using SkyServe.
With no change to the YAML, launch a fully managed service on your infra:
HF_TOKEN=xxx sky serve up dbrx.yaml -n dbrx --env HF_TOKEN
Wait until the service is ready:
watch -n10 sky serve status dbrx
Example outputs:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
dbrx 1 35s READY 2/2 xx.yy.zz.100:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
dbrx 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4
dbrx 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4
Get a single endpoint that load-balances across the replicas:
ENDPOINT=$(sky serve status --endpoint dbrx)
Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
To curl the endpoint:
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dbrx-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
To shut down all resources:
sky serve down dbrx
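The recipe above pins replicas: 2. As a sketch, you could instead let SkyServe autoscale within a range by swapping replicas for a replica policy in the service section (field names follow the SkyServe service spec; the QPS target below is an assumed value to tune for your workload):

service:
  readiness_probe: ...  # unchanged from the recipe
  replica_policy:
    min_replicas: 2
    max_replicas: 4
    target_qps_per_replica: 2.5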
See more details in the SkyServe docs.
Included files
dbrx.yaml
# Serving Databricks DBRX on your own infra.
#
# Usage:
#
# HF_TOKEN=xxx sky launch dbrx.yaml -c dbrx --env HF_TOKEN
#
# curl /v1/chat/completions:
#
# IP=$(sky status --ip dbrx)
# curl $IP:8081/v1/models
# curl http://$IP:8081/v1/chat/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "databricks/dbrx-instruct",
# "messages": [
# {
# "role": "system",
# "content": "You are a helpful assistant."
# },
# {
# "role": "user",
# "content": "Who are you?"
# }
# ]
# }'
#
# Chat with model with Gradio UI:
#
# Running on local URL: http://127.0.0.1:8811
# Running on public URL: https://<hash>.gradio.live
envs:
MODEL_NAME: databricks/dbrx-instruct
HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
service:
replicas: 2
# An actual request for readiness probe.
readiness_probe:
path: /v1/chat/completions
post_data:
model: $MODEL_NAME
messages:
- role: user
content: Hello! What is your name?
max_tokens: 1
resources:
accelerators: {A100-80GB:8, A100-80GB:4, A100:8, A100:16}
cpus: 32+
memory: 512+
use_spot: True
disk_size: 512 # Ensure model checkpoints (~246GB) can fit.
disk_tier: best
ports: 8081 # Expose to internet traffic.
setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.10 -y
conda activate vllm
fi
# DBRX merged on master, 3/27/2024
pip install git+https://github.com/vllm-project/vllm.git@e24336b5a772ab3aa6ad83527b880f9e5050ea2a
pip install gradio tiktoken==0.6.0 openai
run: |
conda activate vllm
echo 'Starting vllm api server...'
# https://github.com/vllm-project/vllm/issues/3098
export PATH=$PATH:/sbin
# NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
--gpu-memory-utilization 0.95 \
2>&1 | tee api_server.log &
while ! grep -q 'Uvicorn running on' api_server.log; do
echo 'Waiting for vllm api server to start...'
sleep 5
done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1