vLLM: Easy, Fast, and Cheap LLM Inference#
This README contains instructions for running a demo of vLLM, an open-source library for fast LLM inference and serving that improves throughput by up to 24x compared to HuggingFace.
Prerequisites#
Install the latest version of SkyPilot and check your setup of the cloud credentials:
pip install git+https://github.com/skypilot-org/skypilot.git
sky check
Check out the vLLM SkyPilot YAML files.
Serve Llama-2 with vLLM's OpenAI-compatible API server#
Before you get started, you need access to the Llama-2 model weights on HuggingFace. Check the prerequisites section in the Llama-2 example for more details.
Start serving the Llama-2 model:
sky launch -c vllm-llama2 serve-openai-api.yaml --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Optional: Only GCP currently offers the specified L4 GPUs. To use other clouds, request other GPUs with the --gpus flag. For example, to use H100 GPUs:
sky launch -c vllm-llama2 serve-openai-api.yaml --gpus H100:1 --env HF_TOKEN=YOUR_HUGGING_FACE_API_TOKEN
Tip: You can also use the vLLM Docker container for faster setup. Refer to serve-openai-api-docker.yaml for more details.
Check the IP of the cluster with:
IP=$(sky status --ip vllm-llama2)
You can now use the OpenAI API to interact with the model.
Query the models hosted on the cluster:
curl http://$IP:8000/v1/models
Query a model with an input prompt for text completion:
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
You should get a response similar to the following:
{
"id":"cmpl-50a231f7f06a4115a1e4bd38c589cd8f",
"object":"text_completion","created":1692427390,
"model":"meta-llama/Llama-2-7b-chat-hf",
"choices":[{
"index":0,
"text":"city in Northern California that is known",
"logprobs":null,"finish_reason":"length"
}],
"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}
}
Query a model with input prompts for chat completion:
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
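For programmatic access, you can also point the official openai Python SDK at the self-hosted server instead of using curl. Below is a minimal sketch, assuming openai>=1.0 is installed locally and using 1.2.3.4 as a placeholder for the IP returned by sky status --ip vllm-llama2:
# pip install "openai>=1.0"
from openai import OpenAI

# Point the client at the self-hosted vLLM server instead of api.openai.com.
# Replace 1.2.3.4 with the IP returned by `sky status --ip vllm-llama2`.
# The server was started without --api-key, so any placeholder key works.
client = OpenAI(base_url="http://1.2.3.4:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)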
Serve Llama-2 with vLLM for more traffic using SkyServe#
To scale up the model serving for more traffic, we introduce SkyServe, which lets users easily deploy multiple replicas of the model.
To do so, add a service section to the serve-openai-api.yaml file above, turning it into a SkyServe service YAML:
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2
The complete service YAML can be found in service.yaml.
Start serving with the SkyServe CLI:
sky serve up -n vllm-llama2 service.yaml
Use sky serve status to check the status of the service:
sky serve status vllm-llama2
You should get output similar to the following:
Services
NAME UPTIME STATUS REPLICAS ENDPOINT
vllm-llama2 7m 43s READY 2/2 3.84.15.251:30001
Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
vllm-llama2 1 34.66.255.4 11 mins ago 1x GCP({'L4': 1}) READY us-central1
vllm-llama2 2 35.221.37.64 15 mins ago 1x GCP({'L4': 1}) READY us-east4
Check the endpoint of the service:
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
Once the status shows READY, you can use the endpoint to interact with the model:
curl $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
Note that this is the same curl command as before. You should get a response similar to the following:
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
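Because SkyServe load-balances requests across the replicas, consecutive requests to the same endpoint may be handled by different replicas. Below is a quick sketch that fires several requests in a row, assuming openai>=1.0 is installed and the endpoint value (e.g. 3.84.15.251:30001) is filled in from sky serve status --endpoint vllm-llama2:
# pip install "openai>=1.0"
from openai import OpenAI

# Replace with the output of `sky serve status --endpoint vllm-llama2`.
ENDPOINT = "3.84.15.251:30001"
client = OpenAI(base_url=f"http://{ENDPOINT}/v1", api_key="EMPTY")

# Send a few requests; SkyServe distributes them across the replicas.
for i in range(4):
    response = client.chat.completions.create(
        model="meta-llama/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": f"Request {i}: who are you?"}],
    )
    print(response.choices[0].message.content)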
Serve Mistral AI's Mixtral 8x7b model with vLLM#
Refer to the Mixtral 8x7b example for more details.
Included files#
serve-openai-api-docker.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  image_id: docker:vllm/vllm-openai:latest
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda deactivate
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve-openai-api.yaml
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0
serve.yaml
envs:
  MODEL_NAME: decapoda-research/llama-65b-hf

resources:
  accelerators: A100-80GB:8

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  # Clone the vLLM repo to get the Gradio web server example used in `run`.
  git clone https://github.com/vllm-project/vllm.git || true
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer hf-internal-testing/llama-tokenizer 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
service-with-auth.yaml
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe:
    path: /v1/models
    # Set authorization headers here if needed.
    headers:
      Authorization: Bearer $AUTH_TOKEN
  # How many replicas to manage.
  replicas: 1

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports: 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
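With --api-key enabled as above, every request to this service must carry the token in an Authorization: Bearer header; the openai SDK does this automatically when the token is passed as the API key. A minimal client sketch, with the endpoint and token left as placeholders (use the same AUTH_TOKEN passed at launch):
# pip install "openai>=1.0"
from openai import OpenAI

# The SDK sends api_key as "Authorization: Bearer <token>", which matches
# the --api-key flag the vLLM server was started with.
client = OpenAI(
    base_url="http://<endpoint>/v1",  # e.g. the SkyServe endpoint
    api_key="<AUTH_TOKEN passed at launch>",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)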
service.yaml
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
  # Specifying the path to the endpoint to check the readiness of the service.
  readiness_probe: /v1/models
  # How many replicas to manage.
  replicas: 2

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
  ports:
    - 8000

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate vllm
  echo 'Starting vllm openai api server...'
  python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
    --host 0.0.0.0