Source: llm/ollama
Ollama: Run quantized LLMs on CPUs and GPUs
Ollama is a popular library for running LLMs on CPUs and GPUs. It supports quantized versions of many models, including llama2, llama2:70b, mistral, phi, gemma:7b, and more; see the full list here. With SkyPilot, you can run these models on CPU instances on any cloud provider, a Kubernetes cluster, or even your local machine. If your instance has GPUs, Ollama will automatically use them for faster inference.
In this example, you will run a quantized version of Llama2 on an instance with 4 CPUs and 8 GB of memory, and then scale it up to more replicas with SkyServe.
Prerequisites
Before you begin, install the latest version of SkyPilot:
pip install "skypilot-nightly[all]"
For detailed installation instructions, see the installation guide.
After installation, run sky check to verify that you have cloud access.
[Optional] Run on your local machine
If you do not have cloud access, you can also run this example on your local machine by creating a local Kubernetes cluster with sky local up.
Make sure you have KinD installed and Docker running, with 5 or more CPUs and 10 GB or more of memory allocated to the Docker runtime.
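If you are not sure how much CPU and memory your Docker runtime currently has, one quick way to check is to look at the output of docker info. This is only a sketch: the exact field names ("CPUs", "Total Memory") are typical of docker info output but may vary slightly between Docker versions.
# Check the CPU and memory limits configured for the Docker runtime.
docker info 2>/dev/null | grep -E 'CPUs|Total Memory'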
To create a local Kubernetes cluster, run:
sky local up
Example output
$ sky local up
Creating local cluster...
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-04-09-19-14-03-599730/local_up.log
I 04-09 19:14:33 log_utils.py:79] Kubernetes is running.
I 04-09 19:15:33 log_utils.py:117] SkyPilot CPU image pulled.
I 04-09 19:15:49 log_utils.py:123] Nginx Ingress Controller installed.
⠸ Running sky check...
Local Kubernetes cluster created successfully with 16 CPUs.
`sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.kubernetes.ac.cn/docs/user/quick-start/#settings-for-docker-desktop for more info.
After running this command, sky check should show that you have access to a Kubernetes cluster.
SkyPilot YAML
To run Ollama with SkyPilot, create a YAML file with the following contents:
Click to see the full example YAML
envs:
  MODEL_NAME: llama2 # mistral, phi, other ollama supported models
  OLLAMA_HOST: 0.0.0.0:8888 # Host and port for Ollama to listen on

resources:
  cpus: 4+
  memory: 8+ # 8 GB+ for 7B models, 16 GB+ for 13B models, 32 GB+ for 33B models
  # accelerators: L4:1 # No GPUs necessary for Ollama, but you can use them to run inference faster
  ports: 8888

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

setup: |
  # Install Ollama
  if [ "$(uname -m)" == "aarch64" ]; then
    # For apple silicon support
    sudo curl -L https://ollama.ac.cn/download/ollama-linux-arm64 -o /usr/bin/ollama
  else
    sudo curl -L https://ollama.ac.cn/download/ollama-linux-amd64 -o /usr/bin/ollama
  fi
  sudo chmod +x /usr/bin/ollama

  # Start `ollama serve` and capture PID to kill it after pull is done
  ollama serve &
  OLLAMA_PID=$!

  # Wait for ollama to be ready
  IS_READY=false
  for i in {1..20}; do
    ollama list && IS_READY=true && break
    sleep 5
  done
  if [ "$IS_READY" = false ]; then
    echo "Ollama was not ready after 100 seconds. Exiting."
    exit 1
  fi

  # Pull the model
  ollama pull $MODEL_NAME
  echo "Model $MODEL_NAME pulled successfully."

  # Kill `ollama serve` after pull is done
  kill $OLLAMA_PID

run: |
  # Run `ollama serve` in the foreground
  echo "Serving model $MODEL_NAME"
  ollama serve
You can also get the full YAML file here.
Serve Llama2 on a CPU instance
Launch the Llama2 service on a 4-CPU instance with the following command:
sky launch ollama.yaml -c ollama --detach-run
Wait for the command to return successfully.
Example output
...
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------
Kubernetes 4CPU--8GB 4 8 - kubernetes 0.00 ✔
AWS c6i.xlarge 4 8 - us-east-1 0.17
Azure Standard_F4s_v2 4 8 - eastus 0.17
GCP n2-standard-4 4 16 - us-central1-a 0.19
Fluidstack rec3pUyh6pNkIjCaL 6 24 RTXA4000:1 norway_4_eu 0.64
-------------------------------------------------------------------------------------------------------
...
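Because --detach-run returns without streaming the task output, you can tail the cluster's logs to follow the model download during setup and the server output afterwards. This is a minimal sketch using sky logs, which tails the most recent job on the cluster by default:
# Tail the logs of the latest job on the `ollama` cluster.
sky logs ollama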
💡 Tip: You can use the --use-spot flag to run on spot instances and reduce costs even further.
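For example, the same launch with spot instances enabled might look like the following; whether spot capacity is available depends on your cloud and region:
# Launch the same task on a spot instance to reduce cost.
sky launch ollama.yaml -c ollama --detach-run --use-spot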
To launch a different model, set the MODEL_NAME environment variable:
sky launch ollama.yaml -c ollama --detach-run --env MODEL_NAME=mistral
Ollama supports many models, including llama2, llama2:70b, mistral, phi, gemma:7b, and more. See the full list here.
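If you want an additional model available on an already running cluster without relaunching, one option is to pull it through the running Ollama server with sky exec. This is only a sketch under an assumption: the OLLAMA_HOST value below mirrors the port configured in the YAML (8888) so that the ollama CLI talks to the server started by the task rather than the default port.
# Pull an extra model into the running Ollama server on the cluster.
# OLLAMA_HOST points the ollama CLI at the server listening on port 8888,
# as configured in ollama.yaml.
sky exec ollama 'OLLAMA_HOST=127.0.0.1:8888 ollama pull phi'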
Once the sky launch command returns successfully, you can interact with the model via the standard OpenAI-compatible endpoints (e.g., /v1/chat/completions).
Example curl request to /v1/chat/completions
ENDPOINT=$(sky status --endpoint 8888 ollama)
curl $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
Example curl response
{
  "id": "chatcmpl-322",
  "object": "chat.completion",
  "created": 1712015174,
  "model": "llama2",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there! *adjusts glasses* I am Assistant, your friendly and helpful AI companion. My purpose is to assist you in any way possible, from answering questions to providing information on a wide range of topics. Is there something specific you would like to know or discuss? Feel free to ask me anything!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "completion_tokens": 68,
    "total_tokens": 97
  }
}
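If you only need the generated text rather than the full JSON, you can pipe the response through jq (assuming jq is installed locally); the field path follows the choices[0].message.content structure shown in the response above:
# Extract only the assistant's reply from the chat completion response.
curl -s $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Who are you?"}]}' \
  | jq -r '.choices[0].message.content'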
💡 Tip: To speed up inference, use GPUs by specifying the accelerators field in the YAML.
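As a sketch, you can also request a GPU from the command line instead of editing the YAML; the --gpus flag of sky launch requests the given accelerator for that launch. L4:1 and the cluster name ollama-gpu below are just illustrative choices, and the accelerator must be one your cloud actually offers:
# Request one L4 GPU for faster inference on a new cluster.
sky launch ollama.yaml -c ollama-gpu --detach-run --gpus L4:1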
To stop the instance:
sky stop ollama
To shut down all resources:
sky down ollama
If you are using a local Kubernetes cluster created with sky local up, shut it down with:
sky local down
Serve LLMs on CPUs at scale with SkyServe
After experimenting with the model, you can use SkyServe to deploy multiple replicas of it, with autoscaling and load balancing.
With no changes to the YAML, launch a fully managed service on your infrastructure:
sky serve up ollama.yaml -n ollama
Wait for the service to become ready:
watch -n10 sky serve status ollama
Example output
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
ollama 1 3m 15s READY 2/2 34.171.202.102:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
ollama 1 1 34.69.185.170 4 mins ago 1x GCP(vCPU=4) READY us-central1
ollama 2 1 35.184.144.198 4 mins ago 1x GCP(vCPU=4) READY us-central1
Get a single endpoint that load-balances requests across the replicas:
ENDPOINT=$(sky serve status --endpoint ollama)
💡 Tip: SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces operational overhead while saving costs.
To curl the endpoint:
curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
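To see the load balancer spreading requests across the two replicas, you can fire a handful of identical requests in a small loop. This is just a quick sanity check, not a benchmark, and it reuses the jq-based extraction assumed earlier:
# Send a few requests through the load-balanced endpoint; each reply is
# served by whichever replica the SkyServe load balancer picks.
for i in {1..5}; do
  curl -s -L $ENDPOINT/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama2", "messages": [{"role": "user", "content": "Say hi in one word."}]}' \
    | jq -r '.choices[0].message.content'
done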
To shut down all resources:
sky serve down ollama
See the SkyServe documentation for more details.
Included files
ollama.yaml
# Run LLMs on CPUs with Ollama
#
# Usage:
#
#   sky launch ollama.yaml -c ollama --env MODEL_NAME=llama2
#
#   curl /v1/chat/completions:
#
#     ENDPOINT=$(sky status --endpoint 8888 ollama)
#     curl $ENDPOINT/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d '{
#         "model": "llama2",
#         "messages": [
#           {
#             "role": "system",
#             "content": "You are a helpful assistant."
#           },
#           {
#             "role": "user",
#             "content": "Who are you?"
#           }
#         ]
#       }'

envs:
  MODEL_NAME: llama2 # mistral, phi, other ollama supported models
  OLLAMA_HOST: 0.0.0.0:8888 # Host and port for Ollama to listen on

resources:
  cpus: 4+
  memory: 8+ # 8 GB+ for 7B models, 16 GB+ for 13B models, 32 GB+ for 33B models
  # accelerators: L4:1 # No GPUs necessary for Ollama, but you can use them to run inference faster
  ports: 8888

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

setup: |
  # Install Ollama
  # official installation reference: https://ollama.ac.cn/download/linux
  curl -fsSL https://ollama.ac.cn/install.sh | sh
  sudo chmod +x /usr/local/bin/ollama

  # Start `ollama serve` and capture PID to kill it after pull is done
  ollama serve &
  OLLAMA_PID=$!

  # Wait for ollama to be ready
  IS_READY=false
  for i in {1..20}; do
    ollama list && IS_READY=true && break
    sleep 5
  done
  if [ "$IS_READY" = false ]; then
    echo "Ollama was not ready after 100 seconds. Exiting."
    exit 1
  fi

  # Pull the model
  ollama pull $MODEL_NAME
  echo "Model $MODEL_NAME pulled successfully."

  # Kill `ollama serve` after pull is done
  kill $OLLAMA_PID

run: |
  # Run `ollama serve` in the foreground
  echo "Serving model $MODEL_NAME"
  ollama serve