Source: examples/distributed-pytorch

Distributed Training with PyTorch#

This example shows how to run PyTorch distributed training with SkyPilot.

It is based on PyTorch's official minGPT example.

Overview#

There are two ways to run distributed training with PyTorch:

Using plain torchrun
Using the rdzv backend

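For fixed-size distributed training, the main difference is that the rdzv backend assigns each node's rank automatically, whereas plain torchrun requires you to set the rank manually.

SkyPilot provides built-in environment variables that make it easy to launch distributed training.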
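As a rough illustration, the sketch below prints the SkyPilot environment variables that the run sections on this page rely on and derives the head node address from them. The `head_ip` helper and the fallback sample values are assumptions added for illustration so the snippet also runs outside a SkyPilot task; inside a task, SkyPilot sets the real values.

```shell
# Return the head node's IP: the first line of a newline-separated IP list.
# (head_ip is a hypothetical helper, not part of SkyPilot.)
head_ip() {
  echo "$1" | head -n1
}

# Fall back to sample values when run outside a SkyPilot task (illustration only).
if [ -z "${SKYPILOT_NODE_IPS:-}" ]; then
  SKYPILOT_NODE_IPS=$(printf '10.0.0.1\n10.0.0.2')
fi
: "${SKYPILOT_NUM_NODES:=2}"
: "${SKYPILOT_NODE_RANK:=0}"
: "${SKYPILOT_NUM_GPUS_PER_NODE:=1}"

MASTER_ADDR=$(head_ip "$SKYPILOT_NODE_IPS")
echo "head node:     $MASTER_ADDR"
echo "num nodes:     $SKYPILOT_NUM_NODES"
echo "this node:     rank $SKYPILOT_NODE_RANK"
echo "GPUs per node: $SKYPILOT_NUM_GPUS_PER_NODE"
```

Placing lines like these at the top of a run section can be a quick way to confirm what each node sees before torchrun starts.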

Using plain `torchrun`#

The following command launches 2 nodes, each with the single L4 GPU requested by accelerators: L4 in train.yaml:

sky launch -c train train.yaml

In train.yaml, we launch training with torchrun and set the distributed training arguments from the environment variables provided by SkyPilot.

run: |
    cd examples/mingpt
    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
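To see why --node_rank must differ on every node, it can help to work out which global ranks each node's workers receive. The sketch below assumes the usual layout for a homogeneous static launch, where a worker's global rank is node_rank * nproc_per_node + local_rank. The `global_ranks` helper and the 2-node, 2-GPU-per-node cluster are hypothetical and only serve the illustration.

```shell
# global_ranks NODE_RANK NPROC_PER_NODE
# Print the global ranks of the workers started on one node
# (hypothetical helper; assumes rank = node_rank * nproc_per_node + local_rank).
global_ranks() {
  node_rank=$1
  nproc=$2
  local_rank=0
  ranks=""
  while [ "$local_rank" -lt "$nproc" ]; do
    ranks="$ranks $((node_rank * nproc + local_rank))"
    local_rank=$((local_rank + 1))
  done
  # Strip the leading space before printing.
  echo "${ranks# }"
}

# Hypothetical cluster: 2 nodes with 2 GPUs each.
echo "node 0 ranks: $(global_ranks 0 2)"
echo "node 1 ranks: $(global_ranks 1 2)"
```

If two nodes were started with the same --node_rank, their workers would claim the same global ranks, which is why the run section passes SKYPILOT_NODE_RANK through.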

Using the `rdzv` backend#

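rdzv (rendezvous) is the other way to launch distributed training: all nodes meet at a shared endpoint on the head node, and each node's rank is assigned automatically.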

sky launch -c train-rdzv train-rdzv.yaml

In train-rdzv.yaml, we launch training with torchrun and set the rendezvous arguments from the environment variables provided by SkyPilot.

run: |
    cd examples/mingpt
    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
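One practical consequence of the two launch styles is that the plain torchrun command differs per node, while the rdzv command is identical on every node. The dry-run sketch below only prints the commands rather than running them. The `static_cmd` and `rdzv_cmd` helpers, the sample address 10.0.0.1, and the rendezvous id example-task are assumptions for illustration.

```shell
# Sample head node address (illustration only).
HEAD=10.0.0.1

# Print the plain torchrun command for a given node rank (hypothetical helper).
static_cmd() {
  echo "torchrun --nnodes=2 --nproc_per_node=1 --master_addr=$HEAD --master_port=8008 --node_rank=$1 main.py"
}

# Print the rdzv command; it takes no per-node argument (hypothetical helper).
rdzv_cmd() {
  echo "torchrun --nnodes=2 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=$HEAD:29500 --rdzv_id=example-task main.py"
}

echo "plain, node 0: $(static_cmd 0)"
echo "plain, node 1: $(static_cmd 1)"
echo "rdzv, any node: $(rdzv_cmd)"
```

In the real task, the rendezvous id comes from SKYPILOT_TASK_ID, so every node of the same task joins the same rendezvous.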

Scaling up#

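To scale up training, just change the resource requirements; SkyPilot's built-in environment variables are updated automatically.

For example, the following command launches 4 nodes, each with 4 L4 GPUs: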

sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+

We also raise --cpus to 8+ so that the CPUs do not become a performance bottleneck.
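As a quick sanity check on the scaled-up configuration, the total number of training workers is the number of nodes times the GPUs per node, since torchrun starts one process per GPU with the arguments used on this page. The `world_size` helper below is hypothetical and only does the arithmetic.

```shell
# world_size NUM_NODES GPUS_PER_NODE
# Print the total number of worker processes (hypothetical helper).
world_size() {
  echo $(($1 * $2))
}

# The scaled-up launch above: 4 nodes with 4 L4 GPUs each.
echo "4 nodes x 4 GPUs -> $(world_size 4 4) workers"
```

Inside a task, the same figure can be computed from SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE, which is why the run section needs no changes when the cluster grows.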

Included files#

train-rdzv.yaml

name: minGPT-ddp-rdzv

resources:
    cpus: 4+
    accelerators: L4

num_nodes: 2

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv venv --python 3.10
    source .venv/bin/activate
    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
    cd examples
    source .venv/bin/activate
    cd mingpt
    export LOGLEVEL=INFO

    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py

train.yaml

name: minGPT-ddp

resources:
    cpus: 4+
    accelerators: L4

num_nodes: 2

setup: |
    git clone --depth 1 https://github.com/pytorch/examples || true
    cd examples
    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
    uv venv --python 3.10
    source .venv/bin/activate
    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
    cd examples
    source .venv/bin/activate
    cd mingpt
    export LOGLEVEL=INFO

    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    echo "Starting distributed training, head node: $MASTER_ADDR"

    # Explicit check for torchrun
    if ! command -v torchrun >/dev/null 2>&1; then
        echo "ERROR: torchrun command not found" >&2
        exit 1
    fi

    torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
