Source: llm/gemma

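Gemma: Open-source Gemini

Google has released Gemma, and it has drawn wide attention in the AI community. It gives the open-source community the opportunity to serve and fine-tune a private Gemini-style model.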

Serve Gemma on any cloud

Serving Gemma on any cloud is easy with SkyPilot. With the serve.yaml file in this directory, you can deploy the model on any cloud with a single command.

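Prerequisites

Apply for access to the Gemma model

Go to the application page and click Acknowledge license to apply for access to the model weights.

Get an access token from Hugging Face

Generate a read-only access token on Hugging Face, and make sure your Hugging Face account has access to the Gemma models.

Install SkyPilot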

pip install "skypilot-nightly[all]"

For detailed installation instructions, see the installation guide.

Host on a single instance

We can host the model on a single instance:

HF_TOKEN="xxx" sky launch -c gemma serve.yaml --env HF_TOKEN

After the cluster is launched, we can access the model with the following command:

IP=$(sky status --ip gemma)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
  }' | jq .
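
For callers that prefer Python over curl, the same completion request can be sketched with the standard library alone. The helper names build_completion_request and send, and the placeholder IP address, are assumptions made for illustration; the port, path, model name, and JSON fields mirror the curl command above.

```python
import json
import urllib.request


def build_completion_request(ip, prompt, max_tokens=25, model="google/gemma-7b-it"):
    """Build (but do not send) a POST request for the /v1/completions route."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{ip}:8000/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder; use the output of `sky status --ip gemma`.
    req = build_completion_request("203.0.113.10", "My favourite condiment is")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Calling send(req) against a running cluster should return the parsed JSON that curl would otherwise pipe into jq. Building and sending are kept separate so the request can be inspected without a live server.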

The chat API is also supported:

IP=$(sky status --ip gemma)

curl http://$IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "messages": [
        {
          "role": "user",
          "content": "Hello! What is your name?"
        }
      ],
      "max_tokens": 25
  }'
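
The chat route returns a JSON document in the OpenAI chat completion format, where the reply text sits under choices[0].message.content. The sketch below builds the request body used above and pulls the reply out of a response. The function names are hypothetical, and the sample response is hand-written to show the shape only; it is not real model output.

```python
def build_chat_body(user_text, max_tokens=25, model="google/gemma-7b-it"):
    """Return the JSON body for a single-turn /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def extract_reply(chat_response):
    """Return the assistant text from an OpenAI-style chat completion response."""
    choices = chat_response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


if __name__ == "__main__":
    # Hand-written sample with the OpenAI response shape (illustrative only).
    sample = {
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": "<reply text>"}}
        ]
    }
    print(build_chat_body("Hello! What is your name?"))
    print(extract_reply(sample))
```

Note that max_tokens of 25 will usually cut the reply short, which is fine for a smoke test but too low for real conversations.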

Scale the service with SkyServe

Using the same YAML file, we can easily scale the model serving across multiple instances, regions, and clouds with SkyServe:

HF_TOKEN="xxx" sky serve up -n gemma serve.yaml --env HF_TOKEN

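Note that the only change to the command is replacing sky launch -c with sky serve up -n. The same YAML file can be used without modification.

Once the service is up, we can access the model with the following command: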

ENDPOINT=$(sky serve status --endpoint gemma)

curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
  }' | jq .

The chat API is also supported:

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "google/gemma-7b-it",
      "messages": [
        {
          "role": "user",
          "content": "Hello! What is your name?"
        }
      ],
      "max_tokens": 25
  }'
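
When the service is called from a script rather than a shell, the endpoint lookup and URL construction can be wrapped in two small helpers. This is a minimal sketch: get_service_endpoint and endpoint_to_url are hypothetical names, and the code assumes the endpoint is printed as a bare host:port string, as the http://$ENDPOINT usage above suggests.

```python
import subprocess


def get_service_endpoint(service_name="gemma"):
    """Return the endpoint printed by `sky serve status --endpoint <name>`."""
    result = subprocess.run(
        ["sky", "serve", "status", "--endpoint", service_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def endpoint_to_url(endpoint, path):
    """Join a host:port endpoint and an API path into an http:// URL."""
    endpoint = endpoint.strip()
    # Assumption: a bare host:port endpoint is served over plain HTTP.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "http://" + endpoint
    return endpoint.rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    # 203.0.113.10:30001 is a placeholder endpoint used only for illustration.
    print(endpoint_to_url("203.0.113.10:30001", "/v1/chat/completions"))
```

Because all replicas sit behind this single endpoint, client code does not need to know how many replicas are running.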

Included files

serve.yaml

# An example YAML for serving the Gemma model from Google with an OpenAI-compatible API.
# Usage:
#  1. Launch on a single instance: `sky launch -c gemma ./serve.yaml`
#  2. Scale up to multiple instances with a single endpoint:
#     `sky serve up -n gemma ./serve.yaml`
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1
    initial_delay_seconds: 1200
  replicas: 2

envs:
  MODEL_NAME: google/gemma-7b-it
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources: 
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  ports: 8000
  disk_tier: best

setup: |
  conda activate gemma
  if [ $? -ne 0 ]; then
    conda create -n gemma -y python=3.10
    conda activate gemma
  fi
  pip install vllm==0.3.2
  pip install transformers==4.38.1
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate gemma
  export PATH=$PATH:/sbin
  # --max-model-len is set to 1024 to avoid taking too much GPU memory on L4 and
  # A10g with small memory.
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-model-len 1024 | tee ~/openai_api_server.log
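
The readiness_probe section in serve.yaml asks SkyServe to POST the post_data body to /v1/chat/completions and to treat a successful response as a sign that the replica is ready. The sketch below only reconstructs the body such a probe would carry. Resolving $MODEL_NAME with string.Template is an assumption made for illustration, not a description of how SkyPilot substitutes envs internally, and the resolve helper is hypothetical.

```python
from string import Template

# Values copied from the envs and readiness_probe sections of serve.yaml.
ENVS = {"MODEL_NAME": "google/gemma-7b-it"}

PROBE_POST_DATA = {
    "model": "$MODEL_NAME",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "max_tokens": 1,
}


def resolve(value, envs):
    """Recursively replace $NAME placeholders in strings with values from envs."""
    if isinstance(value, str):
        return Template(value).safe_substitute(envs)
    if isinstance(value, list):
        return [resolve(item, envs) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, envs) for key, item in value.items()}
    return value


if __name__ == "__main__":
    print(resolve(PROBE_POST_DATA, ENVS))
```

The probe sets max_tokens to 1 so that each check costs as little generation as possible. The initial_delay_seconds value of 1200 gives a new replica 20 minutes to install dependencies and load the model before probe failures count against it.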