Source: llm/llama-2

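Llama 2: Open LLM from Meta

At the time of writing, Llama-2 was the top-ranked open-source model on the Open LLM Leaderboard. It is released under a license that permits commercial use. With SkyPilot, you can deploy a private Llama-2 chatbot in your own cloud with one simple command.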

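Why use SkyPilot to deploy instead of a commercial hosted solution?

No lock-in: run on any supported cloud (AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI)
Everything stays in your cloud account (your VMs and buckets)
No one else can see your chat history
Pay the absolute minimum, with no managed-solution markup
Freely choose your own model size, GPU type, number of GPUs, and so on, based on your scale and budget.

...and you get all of this with one click: let SkyPilot automate the infrastructure.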

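Prerequisites

Apply for access to the Llama-2 model

Go to the application page and apply for access to the model weights.

Get an access token from Hugging Face

Generate a read-only access token in your Hugging Face account settings, and make sure your Hugging Face account has been granted access to the Llama-2 models.

Fill in the access token in both chatbot-hf.yaml and chatbot-meta.yaml.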

envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
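
As the TODO comment notes, the token can also be passed with --env instead of being written into the YAML file. The sketch below wraps that in a small helper. `launch_llama` is a hypothetical name, not part of SkyPilot, and the sketch assumes your SkyPilot version accepts `--env NAME` without a value to forward the variable from the local environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SkyPilot): launch the chatbot with the
# Hugging Face token taken from the local environment, not the YAML file.
launch_llama() {
  local model_size="${1:-7}"   # 7, 13, or 70; defaults to 7 as in the YAML

  # Refuse to launch without a token, since the setup step needs it.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before launching." >&2
    return 1
  fi

  # Export so that `--env HF_TOKEN` (no value) can read it locally.
  export HF_TOKEN
  sky launch -c llama-serve -s chatbot-hf.yaml \
    --env MODEL_SIZE="${model_size}" \
    --env HF_TOKEN
}

# Usage (assumes HF_TOKEN was set earlier in this shell session):
#   launch_llama 13
```

Keeping the token out of the YAML file avoids committing it to version control by accident.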

Run your own Llama-2 chatbot with SkyPilot

You can now host your own Llama-2 chatbot with SkyPilot using a single command.

Start serving the Llama-2 7B Chat model on a single A100 GPU

sky launch -c llama-serve -s chatbot-hf.yaml

Check the output of the command. It contains a shareable Gradio link (similar to the last line of the output below). Open it in your browser to chat with Llama-2.

(task, pid=20933) 2023-04-12 22:08:49 | INFO | gradio_web_server | Namespace(host='0.0.0.0', port=None, controller_url='http://localhost:21001', concurrency_count=10, model_list_mode='once', share=True, moderate=False)
(task, pid=20933) 2023-04-12 22:08:49 | INFO | stdout | Running on local URL:  http://0.0.0.0:7860
(task, pid=20933) 2023-04-12 22:08:51 | INFO | stdout | Running on public URL: https://<random-hash>.gradio.live
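
If the launch output was saved to a file, the public link can be pulled out of it rather than found by eye. This is a minimal sketch: `extract_gradio_url` and `launch.log` are hypothetical names, and the pattern assumes the link always has the `https://<hash>.gradio.live` form shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first public Gradio URL found in the given
# log file(s), or on stdin when no file is given.
extract_gradio_url() {
  cat "$@" | grep -oE 'https://[A-Za-z0-9-]+\.gradio\.live' | head -n 1
}

# Usage (assumes the launch output was saved, e.g. with `| tee launch.log`):
#   extract_gradio_url launch.log
```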

Optional: Try other GPUs

sky launch -c llama-serve-l4 -s chatbot-hf.yaml --gpus L4

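L4 is a recent-generation NVIDIA GPU built for large AI inference workloads. More details are available on NVIDIA's L4 product page.

Optional: Serve the 13B model instead of the default 7B model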

sky launch -c llama-serve -s chatbot-hf.yaml --env MODEL_SIZE=13

Optional: Serve the 70B Llama-2 model

sky launch -c llama-serve-70b -s chatbot-hf.yaml --env MODEL_SIZE=70 --gpus A100-80GB:2
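
The GPU choices above follow from a rough memory estimate. The sketch below assumes the weights are loaded in 16-bit precision (2 bytes per parameter) and counts weights only, ignoring activations and the KV cache, so treat the result as a lower bound rather than a sizing guarantee. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: fp16/bf16 weights take 2 bytes per parameter, so a model with
# N billion parameters needs roughly 2*N GB for the weights alone.
weights_gb() {
  echo $(( $1 * 2 ))
}

# Minimum number of GPUs whose combined memory can hold the weights
# (ceiling division of required GB by per-GPU GB).
min_gpus() {
  local size_b="$1" gpu_mem_gb="$2"
  local need
  need="$(weights_gb "$size_b")"
  echo $(( (need + gpu_mem_gb - 1) / gpu_mem_gb ))
}

min_gpus 70 80   # 70B on 80 GB A100s: about 140 GB of weights, prints 2
min_gpus 13 40   # 13B on a 40 GB A100: about 26 GB of weights, prints 1
```

Under these assumptions the 70B model does not fit on a single 80 GB GPU, which is consistent with the `--gpus A100-80GB:2` setting in the command above.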

How do I run the Llama-2 chatbot with the FAIR model?

You can also host the official FAIR model without using Hugging Face and Gradio.

Launch the Llama-2 chatbot on the cloud
```
sky launch -c llama chatbot-meta.yaml
```
Open another terminal and run
```
ssh -L 7681:localhost:7681 llama
```
Open http://localhost:7681 in your browser and start chatting!

Included files

chatbot-hf.yaml

resources:
  accelerators: A100:1
  disk_size: 1024
  disk_tier: best
  memory: 32+

envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

setup: |
  conda activate chatbot
  if [ $? -ne 0 ]; then
    conda create -n chatbot python=3.9 -y
    conda activate chatbot
  fi

  # Install dependencies
  pip install "fschat[model_worker,webui]"

  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

run: |
  conda activate chatbot
  
  echo 'Starting controller...'
  python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 &
  sleep 10
  echo 'Starting model worker...'
  python -u -m fastchat.serve.model_worker \
            --model-path meta-llama/Llama-2-${MODEL_SIZE}b-chat-hf \
            --num-gpus $SKYPILOT_NUM_GPUS_PER_NODE \
            --host 127.0.0.1 2>&1 \
            | tee model_worker.log &

  echo 'Waiting for model worker to start...'
  # Poll the worker log until the HTTP server reports that it is up.
  while ! grep -q 'Uvicorn running on' model_worker.log; do sleep 1; done

  echo 'Starting gradio server...'
  python -u -m fastchat.serve.gradio_web_server --share | tee ~/gradio.log
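
The wait loop in the run section above polls the worker log indefinitely, so a worker that crashes on startup would leave the job hanging. One possible variant adds a timeout. `wait_for_log` is a hypothetical helper and the 600-second default is an arbitrary assumption, not a value from the original file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until PATTERN appears in FILE, giving up after
# TIMEOUT seconds (default 600). Returns 0 on success, 1 on timeout.
wait_for_log() {
  local pattern="$1" file="$2" timeout="${3:-600}"
  local waited=0

  # The file may not exist yet, so grep errors are silenced.
  until grep -q -- "$pattern" "$file" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for '${pattern}' in ${file}" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Usage inside the run section, in place of the bare while loop:
#   wait_for_log 'Uvicorn running on' model_worker.log 600 || exit 1
```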

chatbot-meta.yaml

resources:
  memory: 32+
  accelerators: A100:1
  disk_size: 1024
  disk_tier: best

envs:
  MODEL_SIZE: 7
  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

setup: |
  set -ex

  git clone https://github.com/facebookresearch/llama.git || true
  cd ./llama
  pip install -e .
  cd -

  git clone https://github.com/skypilot-org/sky-llama.git || true
  cd sky-llama
  pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install -r requirements.txt
  pip install -e .
  cd -

  # Download the model weights from the Hugging Face Hub, because the official
  # download script has some problems.
  git config --global credential.helper cache
  sudo apt -y install git-lfs
  pip install transformers
  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}', add_to_git_credential=True)"
  git clone https://huggingface.co/meta-llama/Llama-2-${MODEL_SIZE}b-chat

  wget https://github.com/tsl0922/ttyd/releases/download/1.7.2/ttyd.x86_64
  sudo mv ttyd.x86_64 /usr/local/bin/ttyd
  sudo chmod +x /usr/local/bin/ttyd

run: |
  cd sky-llama
  ttyd /bin/bash -c "torchrun --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE chat.py --ckpt_dir ~/sky_workdir/Llama-2-${MODEL_SIZE}b-chat --tokenizer_path ~/sky_workdir/Llama-2-${MODEL_SIZE}b-chat/tokenizer.model"