Source: llm/gpt-2
Run GPT-2 with llm.c on Any Cloud with SkyPilot#
This is a reproducible package of the GPT-2 (124M) training in llm.c by @karpathy ([karpathy/llm.c#481](https://github.com/karpathy/llm.c/discussions/481)). With SkyPilot, you can run GPT-2 (124M) training on any cloud. SkyPilot looks for the cheapest available resources across the clouds you have enabled, launches and manages the whole data processing and training pipeline, and gets you close to the ~$20 target cost that @karpathy mentions in the discussion.
Prerequisites#
Install SkyPilot:
pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Choose the clouds you want to enable
Enable clouds for SkyPilot:
sky check
Please check the instructions for enabling clouds in the SkyPilot documentation.
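If you only plan to use a subset of clouds, you can also check just those, e.g.:
sky check aws gcp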
Download the YAML file for launching the training:
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml
Run GPT-2 training#
Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs:
sky launch -c gpt2 gpt2.yaml
Alternatively, you can train the model on a single A100 by adding --gpus A100:
sky launch -c gpt2 gpt2.yaml --gpus A100
You can also speed up the training with 8 H100s (~2.3x the tok/s of 8x A100s):
sky launch -c gpt2 gpt2.yaml --gpus H100:8
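Once launched, the job keeps running on the cluster even after the command returns. A few handy commands for managing it (using the cluster name gpt2 from above):
sky logs gpt2    # stream the training logs
sky status       # check cluster status
sky down gpt2    # tear down the cluster when you are done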
Download logs and visualizations#
After the training is finished, you can download the logs and visualization files with the following command:
scp -r gpt2:~/llm.c/log124M .
We can visualize the training progress with the notebook provided in llm.c. (Note: we cut off the training after 10K steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.)
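For a quick look at the numbers without the notebook, you can grep the raw log. A minimal sketch, assuming llm.c's main.log records validation loss as lines of the form s:<step> tel:<loss> (check the notebook for the exact format):
grep "tel:" log124M/main.log | tail -n 5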

Yes! We are able to reproduce the GPT-2 (124M) training run on any cloud with SkyPilot.
Advanced: Run GPT-2 training in two stages#
The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound, so running the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily split the data processing and training into two stages and execute them sequentially yourself, or let SkyPilot manage the dependency between the two stages.
This way, the data processing runs on a cheaper CPU VM (e.g., ~$0.4/hour), while the training runs on more expensive GPU VMs (e.g., ~$1.3-$3.6/hour for a single A100 GPU, or ~$10.3-$32.8/hour for 8 A100 GPUs). Since tokenization takes roughly an hour, the data stage then costs well under a dollar instead of $10+ on an 8x A100 VM.
We can run the data processing on a CPU VM and store the processed data in a cloud bucket, then run the training on a GPU VM using the processed data:
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
Run the two stages manually#
Data processing#
Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace your-bucket-name with your bucket name):
sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name
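When the job finishes, you can sanity-check that the processed shards landed in the bucket. A sketch assuming BUCKET_STORE=s3 (use the matching CLI for gcs or r2):
aws s3 ls s3://your-bucket-name/fineweb10B/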
Training#
After the data is processed, you can train the model on a GPU VM with 8 A100 GPUs (replace your-bucket-name with your bucket name):
sky launch -c gpt2-train gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
Alternatively, you can train the model on a single A100 by adding --gpus A100:
sky launch -c gpt2-train gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
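As in the single-stage setup, you can download the logs and visualization files once training completes (note the cluster name is gpt2-train here):
scp -r gpt2-train:~/llm.c/log124M .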
Run in a pipeline#
We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependency between them. Here is an example of how to do it (replace your-bucket-name with your bucket name):
sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name
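Since the pipeline runs as a managed job, you can monitor both stages with the jobs commands:
sky jobs queue           # check the status of the pipeline
sky jobs logs <job-id>   # stream the logs of a stage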
Note: the pipeline YAML can be generated with the following command:
cat gpt2-data.yaml > gpt2-pipeline.yaml; echo "---" >> gpt2-pipeline.yaml; cat gpt2-train.yaml >> gpt2-pipeline.yaml
SkyPilot will first download and process the dataset on a CPU VM and store the processed data in a cloud bucket. It will then launch a GPT-2 training job on a GPU VM, which trains GPT-2 (124M) on the processed data.
Included files#
gpt2-data.yaml
name: gpt2-data

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # Pin llm.c to a known-good revision (git clone does not accept a
  # @revision suffix, so clone first and then check out the commit).
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py

run: |
  cd llm.c
  # Tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
  python dev/data/fineweb.py --version 10B
  # Copy the tokenized data (and the HF cache, minus raw downloads) to the bucket.
  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
gpt2-pipeline.yaml
name: gpt2-data

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # Pin llm.c to a known-good revision (git clone does not accept a
  # @revision suffix, so clone first and then check out the commit).
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py

run: |
  cd llm.c
  # Tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
  python dev/data/fineweb.py --version 10B
  # Copy the tokenized data (and the HF cache, minus raw downloads) to the bucket.
  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
---
name: gpt2-train

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  accelerators: A100:8
  # Use a docker image to get a recent g++ for compiling llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Skip the docker image on Lambda, as docker is not supported
    # there yet, but the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: COPY

setup: |
  cd ~
  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (run `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate release-info changes (e.g., from a problematic kubernetes.list source).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/.
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as it requires NCCL, which needs
  # to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Link the tokenized data copied from the bucket into the expected path.
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # First compilation is ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Train on multiple GPUs. Key flags: -e "d12" initializes a depth-12
  # (124M-parameter) GPT-2; -b/-t are micro-batch size and sequence
  # length; -d is the total batch size in tokens; -l/-q/-u set the LR,
  # final LR fraction, and warmup steps; -z 1 enables ZeRO-1 optimizer
  # sharding; -v/-s/-n control eval, sampling, and checkpoint cadence;
  # -h 1 runs HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
  # Upload the log and model to the bucket.
  rsync -Pavz log124M ~/.cache/huggingface
gpt2-train.yaml
name: gpt2-train

envs:
  BUCKET_NAME: # TODO: Fill in your bucket name
  BUCKET_STORE: s3 # Can be s3, gcs, or r2.

resources:
  accelerators: A100:8
  # Use a docker image to get a recent g++ for compiling llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Skip the docker image on Lambda, as docker is not supported
    # there yet, but the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    store: $BUCKET_STORE
    mode: COPY

setup: |
  cd ~
  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (run `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate release-info changes (e.g., from a problematic kubernetes.list source).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/.
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as it requires NCCL, which needs
  # to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Link the tokenized data copied from the bucket into the expected path.
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # First compilation is ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Train on multiple GPUs. Key flags: -e "d12" initializes a depth-12
  # (124M-parameter) GPT-2; -b/-t are micro-batch size and sequence
  # length; -d is the total batch size in tokens; -l/-q/-u set the LR,
  # final LR fraction, and warmup steps; -z 1 enables ZeRO-1 optimizer
  # sharding; -v/-s/-n control eval, sampling, and checkpoint cadence;
  # -h 1 runs HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
  # Upload the log and model to the bucket.
  rsync -Pavz log124M ~/.cache/huggingface
gpt2.yaml
name: train

resources:
  accelerators: A100:8
  # Use a docker image to get a recent g++ for compiling llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Skip the docker image on Lambda, as docker is not supported
    # there yet, but the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets
  # Training dependencies
  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (run `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate release-info changes (e.g., from a problematic kubernetes.list source).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/.
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI, as it requires NCCL, which needs
  # to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Pin the dataset revision, as the latest fineweb dataset removed
  # the samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # First compilation is ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Process the data:
  # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
  python dev/data/fineweb.py --version 10B
  # Start training on multiple GPUs. Key flags: -e "d12" initializes a
  # depth-12 (124M-parameter) GPT-2; -b/-t are micro-batch size and
  # sequence length; -d is the total batch size in tokens; -l/-q/-u set
  # the LR, final LR fraction, and warmup steps; -z 1 enables ZeRO-1
  # optimizer sharding; -v/-s/-n control eval, sampling, and checkpoint
  # cadence; -h 1 runs HellaSwag eval.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1