Code Llama: Serve Your Private Code Model
Code Llama is a code-specialized version of Llama 2, created by further training Llama 2 on its code-specific datasets and sampling more data from those same datasets for longer. On January 29, 2024, Meta released Code Llama 70B, the largest and best-performing model in the Code Llama family.
Below is a demo of Code Llama 70B hosted with SkyPilot Serve (aka SkyServe); see the sections below for setup details.
Demo


References
Why use SkyPilot for deployment instead of commercial hosting solutions?
Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds. No managed-solution markups.
Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
Everything stays in your cloud account (your VMs and buckets).
Completely private: no one else sees your chat history.
Run your own Code Llama with SkyPilot
After installing SkyPilot, run your own Code Llama on vLLM with a single command.
Start serving Code Llama 70B on a single instance, using any available GPU from the list specified in endpoint.yaml, behind an OpenAI-compatible endpoint powered by vLLM:
sky launch -c code-llama ./endpoint.yaml
----------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------------------
Azure Standard_NC48ads_A100_v4 48 440 A100-80GB:2 eastus 7.35 ✔
GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98
GCP a2-ultragpu-2g 24 340 A100-80GB:2 us-central1-a 10.06
Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69
GCP a2-highgpu-4g 48 340 A100:4 us-central1-a 14.69
AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11
Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20
GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77
GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22
AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97
----------------------------------------------------------------------------------------------------------
Launching a cluster 'code-llama'. Proceed? [Y/n]:
Send a code-completion request to the endpoint:
IP=$(sky status --ip code-llama)

curl http://$IP:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
This returns the following completion:
    if len(a) <= 1:
        return a
    pivot = a.pop(len(a)//2)
    b = []
    c = []
    for i in a:
        if i > pivot:
            b.append(i)
        else:
            c.append(i)
    b = quick_sort(b)
    c = quick_sort(c)
    res = []
    res.extend(c)
    res.append(pivot)
    res.extend(b)
    return res
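The same completion request can also be issued from Python with the openai client library; below is a minimal sketch (assuming openai>=1 is installed locally and the code-llama cluster launched above is running):

```python
import subprocess

import openai

# Query SkyPilot for the head-node IP of the cluster launched above.
ip = subprocess.run(['sky', 'status', '--ip', 'code-llama'],
                    capture_output=True, text=True, check=True).stdout.strip()

client = openai.OpenAI(
    base_url=f'http://{ip}:8000/v1',
    # No API key is required for the self-hosted vLLM endpoint.
    api_key='EMPTY')

completion = client.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    prompt='def quick_sort(a: List[int]):',
    max_tokens=512)
print(completion.choices[0].text)
```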
Scale up the service with SkyServe
With SkyServe, a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
sky serve up -n code-llama ./endpoint.yaml
This starts the service with multiple replicas on the cheapest available locations and accelerators. SkyServe automatically manages the replicas, monitors their health, autoscales them based on load, and restarts them when needed.
A single endpoint is returned, and any request sent to it is routed to one of the ready replicas.
To check the status of the service, run:
sky serve status code-llama
After a while, you will see the following output:
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
code-llama 1 - READY 2/2 3.85.107.228:30002
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
code-llama 1 1 - 2 mins ago 1x Azure({'A100-80GB': 2}) READY eastus
code-llama 2 1 - 2 mins ago 1x GCP({'L4': 8}) READY us-east4-a
As shown, the service is now backed by two replicas, one on Azure and one on GCP, with the accelerator type chosen as the cheapest available on each cloud. In other words, it maximizes the service's availability while minimizing cost.
To access the model, we send requests to the single endpoint with the same curl command:
ENDPOINT=$(sky serve status --endpoint code-llama)

curl http://$ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "prompt": "def quick_sort(a: List[int]):",
    "max_tokens": 512
  }' | jq -r '.choices[0].text'
Optional: Accessing Code Llama with the Chat API
We can also access the Code Llama service with the OpenAI Chat API.
ENDPOINT=$(sky serve status --endpoint code-llama)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-70b-Instruct-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful and honest code assistant expert in Python."
      },
      {
        "role": "user",
        "content": "Show me the python code for quick sorting a list of integers."
      }
    ],
    "max_tokens": 512
  }' | jq -r '.choices[0].message.content'
You will see output similar to the following:
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
numbers = [10, 2, 44, 15, 30, 11, 50]
sorted_numbers = quicksort(numbers)
print(sorted_numbers)
```
This code defines a function `quicksort` that takes a list of integers as input. It divides the list into three parts based on the pivot element, which is the middle element of the list. It then recursively sorts the left and right partitions and combines them with the middle partition.
Alternatively, we can access the model from Python with OpenAI's client library (see complete.py):
python complete.py
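For interactive applications, the same client can also stream the response token by token, since vLLM's OpenAI-compatible server supports streaming. Below is a minimal sketch building on complete.py (the stream handling assumes the openai>=1 client):

```python
import openai
import sky

# Discover the service endpoint, as in complete.py.
service_records = sky.serve.status('code-llama')
endpoint = service_records[0]['endpoint']

client = openai.OpenAI(base_url=f'http://{endpoint}/v1', api_key='EMPTY')

# Request a streamed chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'user',
        'content': 'Show me the code for quick sort a list of integers.'
    }],
    max_tokens=300,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()
```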
Optional: Accessing Code Llama with a chat GUI
The Code Llama service can also be accessed through a GUI powered by FastChat. Check out the demo at the top.
Start the chat web UI:
sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
Then, open the GUI at the returned Gradio link:
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
Note that you may get better results by using higher temperature and top_p values.
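For example, when calling the OpenAI-compatible endpoint directly, temperature and top_p can be set per request; a sketch with illustrative (not tuned) values:

```python
import openai

# Replace with the value printed by `sky serve status --endpoint code-llama`.
ENDPOINT = '<ip>:<port>'

client = openai.OpenAI(base_url=f'http://{ENDPOINT}/v1', api_key='EMPTY')

chat_completion = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'user',
        'content': 'Write a Python function that merges two sorted lists.'
    }],
    max_tokens=512,
    temperature=0.8,  # illustrative value: higher means more diverse samples
    top_p=0.95,       # illustrative value
)
print(chat_completion.choices[0].message.content)
```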
Optional: Using Code Llama as a coding assistant in VSCode
Tabby is an open-source, self-hosted AI coding assistant that lets you connect your own AI models and use them as a coding assistant in VSCode. Check out the demo at the top.
To start a Tabby server that connects to the Code Llama service, run:
sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)
To get the endpoint of the Tabby server, run:
IP=$(sky status --ip tabby)
echo Endpoint: http://$IP:8080
You can then connect to the Tabby server from VSCode by installing the Tabby extension and configuring the API endpoint in the Tabby settings.
Included files
complete.py
import openai
import sky

service_records = sky.serve.status('code-llama')
endpoint = service_records[0]['endpoint']
print('Using endpoint:', endpoint)

client = openai.OpenAI(
    base_url=f'http://{endpoint}/v1',
    # No API key is required when self-hosted.
    api_key='EMPTY')

chat_completion = client.chat.completions.create(
    model='codellama/CodeLlama-70b-Instruct-hf',
    messages=[{
        'role': 'system',
        'content': 'You are a helpful and honest code assistant expert in Python.'
    }, {
        'role': 'user',
        'content': 'Show me the code for quick sort a list of integers.'
    }],
    max_tokens=300,
)
print(chat_completion.model_dump())
endpoint.yaml
# An example yaml for serving Code Llama model from Meta with an OpenAI API.
# Usage:
#  1. Launch on a single instance: `sky launch -c code-llama ./endpoint.yaml`
#  2. Scale up to multiple replicas with a single endpoint:
#     `sky serve up -n code-llama ./endpoint.yaml`

service:
  readiness_probe:
    path: /v1/completions
    post_data:
      model: codellama/CodeLlama-70b-Instruct-hf
      prompt: "def hello_world():"
      max_tokens: 1
    initial_delay_seconds: 1800
  replicas: 2

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  disk_size: 1024
  disk_tier: best
  memory: 32+
  ports: 8000

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi
  pip install transformers==4.38.0
  pip install vllm==0.3.2

run: |
  conda activate codellama
  export PATH=$PATH:/sbin
  # Reduce --max-num-seqs to avoid OOM during loading model on L4:8
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model codellama/CodeLlama-70b-Instruct-hf \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --max-num-seqs 64 | tee ~/openai_api_server.log
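The readiness_probe above issues a one-token completion against /v1/completions to decide when a replica is ready. The same check can be run by hand against a replica or the service endpoint; a quick sketch (not one of the included files) using the third-party requests library, with a placeholder address:

```python
import requests

# Placeholder: an address such as `$(sky status --ip code-llama):8000`
# or the SkyServe endpoint.
BASE_URL = 'http://<ip-or-endpoint>'

# Mirror the readiness probe's post_data: a one-token completion request.
resp = requests.post(
    f'{BASE_URL}/v1/completions',
    json={
        'model': 'codellama/CodeLlama-70b-Instruct-hf',
        'prompt': 'def hello_world():',
        'max_tokens': 1,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```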
gui.yaml
# Starts a GUI server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml, please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c code-llama-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the GUI server is started, you will see a gradio link in the output and
# you can click on it to open the GUI.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2

setup: |
  conda activate codellama
  if [ $? -ne 0 ]; then
    conda create -n codellama python=3.10 -y
    conda activate codellama
  fi

  pip install "fschat[model_worker,webui]"
  pip install "openai<1"

run: |
  conda activate codellama
  export PATH=$PATH:/sbin

  WORKER_IP=$(hostname -I | cut -d' ' -f1)
  CONTROLLER_PORT=21001
  WORKER_PORT=21002

  cat <<EOF > ~/model_info.json
  {
    "codellama/CodeLlama-70b-Instruct-hf": {
      "model_name": "codellama/CodeLlama-70b-Instruct-hf",
      "api_base": "http://${ENDPOINT}/v1",
      "api_key": "empty",
      "model_path": "codellama/CodeLlama-70b-Instruct-hf",
      "anony_only": false,
      "api_type": "openai"
    }
  }
  EOF

  python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 &

  echo 'Starting gradio server...'
  python -u -m fastchat.serve.gradio_web_server --share \
    --register ~/model_info.json | tee ~/gradio.log
tabby.yaml
# Starts a Tabby server that connects to the Code Llama OpenAI API server.
# This works with the endpoint.yaml, please refer to llm/codellama/README.md
# for more details.
# Usage:
#  1. If you have an endpoint started on a cluster (sky launch):
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky status --ip code-llama):8000`
#  2. If you have a SkyPilot Service started (sky serve up) called code-llama:
#     `sky launch -c tabby ./tabby.yaml --env ENDPOINT=$(sky serve status --endpoint code-llama)`
# After the Tabby server is started, you can add the endpoint (URL:port) to the
# VSCode Tabby extension and start using it.

envs:
  ENDPOINT: x.x.x.x:3031 # Address of the API server running codellama.

resources:
  cpus: 2
  ports: 8080

setup: |
  wget https://github.com/TabbyML/tabby/releases/download/v0.8.0-rc.1/tabby_x86_64-manylinux2014 -O tabby
  chmod +x tabby

run: |
  ./tabby serve --device experimental-http \
    --model "{\"kind\": \"openai\", \"model_name\": \"codellama/CodeLlama-70b-Instruct-hf\", \"api_endpoint\": \"http://$ENDPOINT/v1/completions\", \"prompt_template\": \"{prefix}\"}"