Code Llama: Serve Your Private Code Model
Code Llama is a code-specialized version of Llama 2, created by further training Llama 2 on its code-specific datasets and sampling more data from those same datasets for longer. On January 29, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.
Below is a demo of Code Llama 70B hosted with SkyPilot Serve (aka SkyServe); see the sections below for setup details.
Demo


References
Why use SkyPilot for deployment instead of commercial hosting solutions?
Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds. No managed-solution markups.
Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
Everything stays in your cloud account (your VMs and buckets).
Completely private: no one else sees your chat history.
Run your own Code Llama with SkyPilot
After installing SkyPilot, run your own Code Llama on vLLM with a single command.
Start serving Code Llama 70B on a single instance, using any available GPU from the list specified in endpoint.yaml, behind an OpenAI-compatible endpoint powered by vLLM:
sky launch -c code-llama ./endpoint.yaml
----------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------------------
Azure Standard_NC48ads_A100_v4 48 440 A100-80GB:2 eastus 7.35 ✔
GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98
GCP a2-ultragpu-2g 24 340 A100-80GB:2 us-central1-a 10.06
Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69
GCP a2-highgpu-4g 48 340 A100:4 us-central1-a 14.69
AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11
Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20
GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77
GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22
AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97
----------------------------------------------------------------------------------------------------------
Launching a cluster 'code-llama'. Proceed? [Y/n]:
Send a code-completion request to the endpoint:
IP=$(sky status --ip code-llama)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
This returns the following completion:
    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
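The same completion request can also be issued from Python with the openai client library; below is a minimal sketch (assuming openai>=1 is installed locally and the code-llama cluster launched above is running):

```python
import subprocess

import openai

# Query SkyPilot for the head-node IP of the cluster launched above.
ip = subprocess.run(['sky', 'status', '--ip', 'code-llama'],
                    capture_output=True, text=True, check=True).stdout.strip()

client = openai.OpenAI(
    base_url=f'http://{ip}:8000/v1',
    # No API key is required for the self-hosted vLLM endpoint.
    api_key='EMPTY')

completion = client.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    prompt='def quick_sort(a: List[int]):',
    max_tokens=512)
print(completion.choices[0].text)
```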
Scale up the service with SkyServe
With SkyServe, a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
sky serve up -n code-llama ./endpoint.yaml
This starts the service with multiple replicas on the cheapest available locations and accelerators. SkyServe automatically manages the replicas, monitors their health, autoscales them based on load, and restarts them when needed.
A single endpoint is returned, and any request sent to it is routed to one of the ready replicas.
To check the status of the service, run:
sky serve status code-llama
After a while, you will see the following output:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
code-llama 1 - READY 2/2 3.85.107.228:30002
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
code-llama 1 1 - 2 mins ago 1x Azure({'A100-80GB': 2}) READY eastus
code-llama 2 1 - 2 mins ago 1x GCP({'L4': 8}) READY us-east4-a
As shown, the service is now backed by two replicas, one on Azure and one on GCP, with the accelerator type chosen as the cheapest available on each cloud. In other words, it maximizes the service's availability while minimizing cost.
To access the model, we send requests to the single endpoint with the same curl command:
ENDPOINT=$(sky serve status --endpoint code-llama)

curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
Optional: Accessing Code Llama with the Chat API
We can also access the Code Llama service with the OpenAI Chat API.
ENDPOINT=$(sky serve status --endpoint code-llama)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest code assistant expert in Python."
      },
      {
        "role": "user",
        "content": "Show me the python code for quick sorting a list of integers."
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
You will see output similar to the following:
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)
print(sorted_numbers)
```
This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.
Alternatively, we can access the model from Python with OpenAI's client library (see complete.py):
python complete.py
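For interactive applications, the same client can also stream the response token by token, since vLLM's OpenAI-compatible server supports streaming. Below is a minimal sketch building on complete.py (the stream handling assumes the openai>=1 client):

```python
import openai
import sky

# Discover the service endpoint, as in complete.py.
service_records = sky.serve.status('code-llama')
endpoint = service_records[0]['endpoint']

client = openai.OpenAI(base_url=f'http://{endpoint}/v1', api_key='EMPTY')

# Request a streamed chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'user',
        'content': 'Show me the code for quick sort a list of integers.'
    }],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()
```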
Optional: Accessing Code Llama with a chat GUI
The Code Llama service can also be accessed through a GUI powered by FastChat. Check out the demo at the top.
Start the chat web UI:
sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
Then, open the GUI at the returned Gradio link:
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
Note that you may get better results by using higher temperature and top_p values.
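For example, when calling the OpenAI-compatible endpoint directly, temperature and top_p can be set per request; a sketch with illustrative (not tuned) values:

```python
import openai

# Replace with the value printed by `sky serve status --endpoint code-llama`.
ENDPOINT = '<ip>:<port>'

client = openai.OpenAI(base_url=f'http://{ENDPOINT}/v1', api_key='EMPTY')

chat_completion = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'user',
        'content': 'Write a Python function that merges two sorted lists.'
    }],
    max_tokens=512,
    temperature=0.8,  # illustrative value: higher means more diverse samples
    top_p=0.95,       # illustrative value
)
print(chat_completion.choices[0].message.content)
```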
Optional: Using Code Llama as a coding assistant in VSCode
Tabby is an open-source, self-hosted AI coding assistant that lets you connect your own AI models and use them as a coding assistant in VSCode. Check out the demo at the top.
To start a Tabby server that connects to the Code Llama service, run:
sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
To get the endpoint of the Tabby server, run:
IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080
You can then connect to the Tabby server from VSCode by installing the Tabby extension and configuring the API endpoint in the Tabby settings.
Included files
complete.py
import openai
import sky

service_records = sky.serve.status('code-llama')
endpoint = service_records[0]['endpoint']
print('Using endpoint:', endpoint)

client = openai.OpenAI(
    base_url=f'http://{endpoint}/v1',
    # No API key is required when self-hosted.
    api_key='EMPTY')

chat_completion = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'system',
        'content': 'You are a helpful and honest code assistant expert in Python.'
    }, {
        'role': 'user',
        'content': 'Show me the code for quick sort a list of integers.'
    }],
    max_tokens=300,
)
print(chat_completion.model_dump())
endpoint.yaml
# An example yaml for serving Code Llama model from Meta with an OpenAI API.
# Usage:
#  1. Launch on a single instance: `sky launch -c code-llama ./endpoint.yaml`
#  2. Scale up to multiple replicas with a single endpoint:
#     `sky serve up -n code-llama ./endpoint.yaml`

service:
  readiness_probe:
    path: /v1/completions
    post_data:
      model: codellama/CodeLlama-70b-Instruct-hf
      prompt: "def hello_world():"
      max_tokens: 1
    initial_delay_seconds: 1800
  replicas: 2

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2

run: |
  conda activate codellama
  export PATH=$PATH:/sbin
  # Reduce --max-num-seqs to avoid OOM during loading model on L4:8
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model codellama/CodeLlama-70b-Instruct-hf \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-num-seqs 64 | tee ~/openai_api_server.log
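The readiness_probe above issues a one-token completion against /v1/completions to decide when a replica is ready. The same check can be run by hand against a replica or the service endpoint; a quick sketch (not one of the included files) using the third-party requests library, with a placeholder address:

```python
import requests

# Placeholder: an address such as `$(sky status --ip code-llama):8000`
# or the SkyServe endpoint.
BASE_URL = 'http://<ip-or-endpoint>'

# Mirror the readiness probe's post_data: a one-token completion request.
resp = requests.post(
    f'{BASE_URL}/v1/completions',
    json={
        'model': 'codellama/CodeLlama-70b-Instruct-hf',
        'prompt': 'def hello_world():',
        'max_tokens': 1,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```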
gui.yaml
# Starts a GUI server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml, please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi

  pip install "fschat[model_worker,webui]"
  pip install "openai<1"

run: |
  conda activate codellama
  export PATH=$PATH:/sbin

  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  cat <<EOF > ~/model_info.json
  {
    "codellama/CodeLlama-70b-Instruct-hf": {
      "model_name": "codellama/CodeLlama-70b-Instruct-hf",
      "api_base": "http://${ENDPOINT}/v1",
      "api_key": "empty",
      "model_path": "codellama/CodeLlama-70b-Instruct-hf",
      "anony_only": false,
      "api_type": "openai"
    }
  }
  EOF

  python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 &

  echo 'Starting gradio server...'
  python -u -m fastchat.serve.gradio_web_server --share \
    --register ~/model_info.json | tee ~/gradio.log
tabby.yaml
# Starts a Tabby server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml, please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the Tabby server is started, you can add the endpoint (URL:port) to the
# VSCode Tabby extension and start using it.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2
  ports: 8080

setup: |
  wget https://github.com/TabbyML/tabby/releases/download/v0.8.0-rc.1/tabby_x86_64-manylinux2014 -O tabby
  chmod +x tabby

run: |
  ./tabby serve --device experimental-http \
    --model "{\"kind\": \"openai\", \"model_name\": \"codellama/CodeLlama-70b-Instruct-hf\", \"api_endpoint\": \"http://$ENDPOINT/v1/completions\", \"prompt_template\": \"{prefix}\"}"