Source: llm/tgi
TGI: Hugging Face Text Generation Inference#
Text Generation Inference (TGI) is Hugging Face's toolkit for deploying and serving large language models (LLMs) for text generation tasks.
Launch a single-instance TGI service#
We can host the model on a single instance using the serving YAML file:
sky launch -c tgi serve.yaml
Once the cluster is up, users can access the model with the following commands:
ENDPOINT=$(sky status --endpoint 8080 tgi)
curl $ENDPOINT/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20
        }
    }'
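The endpoint can also be queried programmatically. Below is a minimal Python sketch, assuming the requests library is installed and ENDPOINT is set in the environment as above; the http:// scheme and the variable names are illustrative assumptions:
import os

import requests

# Endpoint printed by `sky status --endpoint 8080 tgi`, assumed to be "IP:port".
endpoint = os.environ["ENDPOINT"]
resp = requests.post(
    f"http://{endpoint}/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])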
The output should look similar to the following:
{
    "generated_text": "What is Deep Learning? Deep Learning is a subfield of machine learning that is concerned with algorithms inspired by the structure and function of the brain called artificial neural networks."
}
Scale the service with SkyPilot Serve#
With the same YAML file, we can easily scale the model serving across multiple instances, regions, and clouds with SkyServe:
sky serve up -n tgi serve.yaml
After the service is launched, we can access the model with the following commands:
ENDPOINT=$(sky serve status --endpoint tgi)
curl $ENDPOINT/generate \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 20
        }
    }'
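Since TGI exposes the standard text-generation API, higher-level clients also work against the load-balanced SkyServe endpoint. A sketch using huggingface_hub's InferenceClient, assuming the package is installed and ENDPOINT is set as above:
import os

from huggingface_hub import InferenceClient

# Point the client at the SkyServe endpoint; requests are load-balanced
# across replicas.
client = InferenceClient(model=f"http://{os.environ['ENDPOINT']}")
# Mirrors the curl request above.
print(client.text_generation("What is Deep Learning?", max_new_tokens=20))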
Included files#
serve.yaml
# SkyServe YAML to run HuggingFace TGI
#
# Usage:
#   sky serve up -n tgi serve.yaml \
#     [--env MODEL_ID=<model-id-on-huggingface>]
# Then visit the endpoint printed in the console. You could also
# check the endpoint by running:
#   sky serve status --endpoint tgi
envs:
  MODEL_ID: lmsys/vicuna-13b-v1.5

service:
  readiness_probe: /health
  replicas: 2

resources:
  ports: 8080
  accelerators: A100:1
run: |
  docker run --gpus all --shm-size 1g -p 8080:80 \
    -v ~/data:/data ghcr.io/huggingface/text-generation-inference \
    --model-id $MODEL_ID
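As the usage comment notes, MODEL_ID can be overridden at launch time to serve a different model, for example (any valid Hugging Face model id works here):
sky serve up -n tgi serve.yaml --env MODEL_ID=mistralai/Mistral-7B-v0.1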