vLLM: Easy, Fast, and Cheap LLM Inference#
This README contains instructions for running a demo of vLLM, an open-source library for fast LLM inference and serving that improves throughput by up to 24x compared to HuggingFace.
Prerequisites#
Install the latest version of SkyPilot and check your setup of the cloud credentials:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
Check out the vLLM SkyPilot YAML files.
Serve Llama-2 with vLLM's OpenAI-compatible API server#
Before you get started, you need access to the Llama-2 model weights on HuggingFace. Check the prerequisites section in the Llama-2 example for more details.
Start serving the Llama-2 model:
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Optional: Only GCP currently offers the specified L4 GPUs. To use other clouds, request other GPUs with the --gpus flag. For example, to use H100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus H100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Tip: You can also use the vLLM Docker container for faster setup. Refer to serve-openai-api-docker.yaml for more details.
Check the IP of the cluster with:
IP=$(sky status --ip vllm-llama2)
You can now use the OpenAI API to interact with the model.
Query the models hosted on the cluster:
curl http://$IP:8000/v1/models
Query a model with an input prompt for text completion:
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
You should get a response similar to the following:
{
"id":"cmpl-50a231f7f06a4115a1e4bd38c589cd8f",
"object":"text_completion","created":1692427390,
"model":"meta-llama/Llama-2-7b-chat-hf",
"choices":[{
"index":0,
"text":"city in Northern California that is known",
"logprobs":null,"finish_reason":"length"
}],
"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}
}
Query a model with input prompts for chat completion:
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
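For programmatic access, you can also point the official openai Python SDK at the self-hosted server instead of using curl. Below is a minimal sketch, assuming openai>=1.0 is installed locally and using 1.2.3.4 as a placeholder for the IP returned by sky status --ip vllm-llama2:
# pip install "openai>=1.0"
from openai import OpenAI

# Point the client at the self-hosted vLLM server instead of api.openai.com.
# Replace 1.2.3.4 with the IP returned by `sky status --ip vllm-llama2`.
# The server was started without --api-key, so any placeholder key works.
client = OpenAI(base_url="http://1.2.3.4:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)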
Serve Llama-2 with vLLM for more traffic using SkyServe#
To scale up the model serving for more traffic, we introduce SkyServe, which lets users easily deploy multiple replicas of the model.
To do so, add a service section to the serve-openai-api.yaml file above, turning it into a SkyServe service YAML:
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2
The complete service YAML can be found in service.yaml.
Start serving with the SkyServe CLI:
sky serve up -n vllm-llama2 service.yaml
Use sky serve status to check the status of the service:
sky serve status vllm-llama2
You should get output similar to the following:
Services
NAME UPTIME STATUS REPLICAS ENDPOINT
vllm-llama2 7m 43s READY 2/2 3.84.15.251:30001
Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
vllm-llama2 1 34.66.255.4 11 mins ago 1x GCP({'L4': 1}) READY us-central1
vllm-llama2 2 35.221.37.64 15 mins ago 1x GCP({'L4': 1}) READY us-east4
Check the endpoint of the service:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
Once the status shows READY, you can use the endpoint to interact with the model:
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
Note that this is the same curl command as before. You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
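Because SkyServe load-balances requests across the replicas, consecutive requests to the same endpoint may be handled by different replicas. Below is a quick sketch that fires several requests in a row, assuming openai>=1.0 is installed and the endpoint value (e.g. 3.84.15.251:30001) is filled in from sky serve status --endpoint vllm-llama2:
# pip install "openai>=1.0"
from openai import OpenAI

# Replace with the output of `sky serve status --endpoint vllm-llama2`.
ENDPOINT = "3.84.15.251:30001"
client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="EMPTY")

# Send a few requests; SkyServe distributes them across the replicas.
for i in range(4):
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": f"Request {i}: who are you?"}],
    )
    print(response.choices[0].message.content)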
Serve Mistral AI's Mixtral 8x7b model with vLLM#
Refer to the Mixtral 8x7b example for more details.
Included files#
serve-openai-api-docker.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  image_id: docker:vllm/vllm-openai:latest
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda deactivate
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve-openai-api.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve.yaml
envs:
  MODEL_NAME: decapoda-research/llama-65b-hf

resources:
  accelerators: A100-80GB:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  # Clone the vLLM repo to get the Gradio web server example used in `run`.
  git clone https://github.com/vllm-project/vllm.git || true
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer hf-internal-testing/llama-tokenizer 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
service-with-auth.yaml
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /v1/models
    # Set authorization headers here if needed.
    headers:
      Authorization: Bearer $AUTH_TOKEN
  # How many replicas to manage.
  replicas: 1

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports: 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
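With --api-key enabled as above, every request to this service must carry the token in an Authorization: Bearer header; the openai SDK does this automatically when the token is passed as the API key. A minimal client sketch, with the endpoint and token left as placeholders (use the same AUTH_TOKEN passed at launch):
# pip install "openai>=1.0"
from openai import OpenAI

# The SDK sends api_key as "Authorization: Bearer <token>", which matches
# the --api-key flag the vLLM server was started with.
client = OpenAI(
    base_url="http://<endpoint>/v1",  # e.g. the SkyServe endpoint
    api_key="<AUTH_TOKEN passed at launch>",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)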
service.yaml
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0