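Many Parallel Jobs

SkyPilot lets you run many jobs in parallel and manage them in a single system. This is useful for hyperparameter tuning, data processing, and other batch jobs.

This guide walks through a typical workflow for running many jobs with SkyPilot.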

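Why Use SkyPilot to Run Many Jobs

Unified: use any one or several of your own infrastructures (Kubernetes, cloud VMs, reservations, etc.).
Elastic: scale up and down based on demand.
Cost-effective: only pay for the cheapest resources.
Robust: automatically recover jobs from failures.
Observable: monitor and manage all jobs in a single pane of glass.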

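Write a YAML for One Job

Before scaling up to many jobs, first write a SkyPilot YAML for a single job and make sure it runs correctly. This saves time because you avoid debugging many jobs at once.

Here is the same example YAML as in Tutorial: AI Training: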

# train.yaml
name: huggingface

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb  \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --max_steps 50 \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16

First, launch the job to check that it starts successfully and runs correctly:

sky launch -c train train.yaml

If there are any errors, fix your code and/or the YAML, then launch the job again on the same cluster:

# Cancel the latest job.
sky cancel train -y
# Run the job again on the same cluster.
sky launch -c train train.yaml

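Sometimes it is more efficient to log into the cluster and debug the job interactively. You can do so by connecting to the cluster directly over SSH or by using VSCode's remote SSH.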

# Log into the cluster.
ssh train

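Next, after confirming that the job works correctly, add (hyper)parameters to the job YAML so that all job variants can be specified.

1. Add Hyperparameters

To launch jobs with different hyperparameters, add them as environment variables to the SkyPilot YAML, and have your main program read these environment variables:

Updated SkyPilot YAML: train-template.yaml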

# train-template.yaml
name: huggingface

envs:
  LR: 2e-5
  MAX_STEPS: 50

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb  \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate ${LR} \
    --max_steps ${MAX_STEPS} \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16

You can now use --env to launch a job with different hyperparameters:

sky launch -c train train-template.yaml \
  --env LR=1e-5 \
  --env MAX_STEPS=100
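
The YAML above forwards LR and MAX_STEPS to run_glue.py as command-line flags. If your own training script reads the environment directly instead, a minimal sketch could look like the following. The function name read_hyperparameters and the fallback values (copied from the envs defaults in the YAML) are illustrative assumptions, not part of SkyPilot or of run_glue.py.

```python
import os


def read_hyperparameters(environ=os.environ):
    """Read LR and MAX_STEPS from the environment.

    Hypothetical helper: falls back to the defaults declared under
    `envs:` in train-template.yaml when a variable is not set.
    """
    # Environment variables are always strings, so convert explicitly.
    learning_rate = float(environ.get('LR', '2e-5'))
    max_steps = int(environ.get('MAX_STEPS', '50'))
    return {'learning_rate': learning_rate, 'max_steps': max_steps}
```

Passing a plain dict as environ makes the function easy to check locally, for example read_hyperparameters({'LR': '1e-5', 'MAX_STEPS': '100'}).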

Alternatively, store the environment variable values in a dotenv file and launch with --env-file:

# configs/job1
LR=1e-5
MAX_STEPS=100

sky launch -c train train-template.yaml \
  --env-file configs/job1
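
To make the file format concrete, here is a rough sketch of how KEY=VALUE lines like those in configs/job1 map to environment variables. parse_env_text is a hypothetical helper written for illustration only. It is not how SkyPilot itself parses --env-file, and it ignores quoting and other dotenv features.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines into a dict of strings.

    Illustrative only: blank lines and lines starting with '#' are
    skipped; quoting and variable expansion are not handled.
    """
    envs = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue
        # Split on the first '=' so values may themselves contain '='.
        key, sep, value = line.partition('=')
        if not sep:
            raise ValueError(f'Expected KEY=VALUE, got: {raw_line!r}')
        envs[key.strip()] = value.strip()
    return envs
```

For the file shown above, parse_env_text(open('configs/job1').read()) would yield the two entries LR and MAX_STEPS, both as strings.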

2. Log Job Outputs

When running many jobs, it is useful to log the outputs of all jobs. You can use tools such as W&B for this purpose:

SkyPilot YAML with W&B: train-template.yaml

# train-template.yaml
name: huggingface

envs:
  LR: 2e-5
  MAX_STEPS: 50
  WANDB_API_KEY: # Empty field means this field is required when launching the job.

resources:
  accelerators: V100:4

setup: |
  set -e  # Exit if any command failed.
  git clone https://github.com/huggingface/transformers/ || true
  cd transformers
  pip install .
  cd examples/pytorch/text-classification
  pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
  pip install wandb

run: |
  set -e  # Exit if any command failed.
  cd transformers/examples/pytorch/text-classification
  python run_glue.py \
    --model_name_or_path bert-base-cased \
    --dataset_name imdb  \
    --do_train \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate ${LR} \
    --max_steps ${MAX_STEPS} \
    --output_dir /tmp/imdb/ --overwrite_output_dir \
    --fp16 \
    --report_to wandb

You can now launch the job with the following command (WANDB_API_KEY should be set in your local environment):

sky launch -c train train-template.yaml \
  --env-file configs/job1 \
  --env WANDB_API_KEY
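
Since --env WANDB_API_KEY takes its value from your local environment, it can be handy to confirm the variable is set before launching. A small pre-flight sketch, where missing_env_vars is a hypothetical helper and not a SkyPilot API:

```python
import os


def missing_env_vars(required, environ=os.environ):
    """Return the names in `required` that are unset or empty locally."""
    return [name for name in required if not environ.get(name)]


# Report any variable that the launch command expects to pick up locally.
missing = missing_env_vars(['WANDB_API_KEY'])
if missing:
    print('Set these variables before launching:', ', '.join(missing))
```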

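Scale Out to Many Jobs

With the above setup, you can now scale out to run many jobs in parallel.

To run many jobs at once, we will launch them as SkyPilot managed jobs. The hyperparameter environment variables of each managed job can be controlled independently.

You can use a normal loop in bash or Python to iterate over the possible hyperparameters: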

CLI

job_idx=1  # Start at 1, matching the Python version below.
for lr in 0.01 0.03 0.1 0.3 1.0; do
    for max_steps in 100 300 1000; do
        sky jobs launch -n train-job${job_idx} -y --async \
          train-template.yaml \
          --env LR="${lr}" --env MAX_STEPS="${max_steps}" \
          --env WANDB_API_KEY # pick up from environment
        ((job_idx++))
    done
done

Python

import os
import sky

LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
MAX_STEPS_CANDIDATES = [100, 300, 1000]
task = sky.Task.from_yaml('train-template.yaml')

job_idx = 1
requests_ids = []
for lr in LR_CANDIDATES:
  for max_steps in MAX_STEPS_CANDIDATES:
    task.update_envs({'LR': lr, 'MAX_STEPS': max_steps})
    requests_ids.append(
      sky.jobs.launch(
        task,
        name=f'train-job{job_idx}',
      )
    )
    job_idx += 1

# Wait for all jobs to finish
for request_id in requests_ids:
  sky.get(request_id)

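The jobs are submitted asynchronously (--async in the CLI loop), so submitting one job does not wait for the previous one to finish, and they run in parallel.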
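
To preview the grid before submitting anything, the same loops can be written as a dry run that only builds the command lines. This sketch reuses the flags from the CLI loop above. build_launch_commands is a hypothetical helper, and nothing is executed.

```python
import itertools
import shlex

LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
MAX_STEPS_CANDIDATES = [100, 300, 1000]


def build_launch_commands(template='train-template.yaml'):
    """Build one `sky jobs launch` argv per (LR, MAX_STEPS) combination."""
    commands = []
    grid = itertools.product(LR_CANDIDATES, MAX_STEPS_CANDIDATES)
    for job_idx, (lr, max_steps) in enumerate(grid, start=1):
        commands.append([
            'sky', 'jobs', 'launch', '-n', f'train-job{job_idx}', '-y', '--async',
            template,
            '--env', f'LR={lr}',
            '--env', f'MAX_STEPS={max_steps}',
            '--env', 'WANDB_API_KEY',  # picked up from the local environment
        ])
    return commands


# Dry run: print the commands instead of executing them.
commands = build_launch_commands()
for argv in commands:
    print(shlex.join(argv))
```

With five learning rates and three step counts, this produces 15 commands, which is also a quick way to see how many managed jobs a grid will create.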

Job statuses can be checked with sky jobs queue:

$ sky jobs queue

Fetching managed jobs...
Managed jobs
In progress tasks: 10 RUNNING
ID  TASK  NAME        RESOURCES  SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
10  -     train-job10 1x[V100:4] 5 mins ago   5m 5s          1m 12s        0            RUNNING
9   -     train-job9  1x[V100:4] 6 mins ago   6m 11s         2m 23s        0            RUNNING
8   -     train-job8  1x[V100:4] 7 mins ago   7m 15s         3m 31s        0            RUNNING
...

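Use Config Files

For more fine-grained control, you can also create a separate environment variable config file for each job.

First, create a config file for each job (for example, in a configs directory):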

# configs/job-1
LR=1e-5
MAX_STEPS=100

# configs/job-2
LR=2e-5
MAX_STEPS=200

...

An example Python script to generate the config files:

import os

CONFIG_PATH = 'configs'
LR_CANDIDATES = [0.01, 0.03, 0.1, 0.3, 1.0]
MAX_STEPS_CANDIDATES = [100, 300, 1000]

os.makedirs(CONFIG_PATH, exist_ok=True)

job_idx = 1
for lr in LR_CANDIDATES:
  for max_steps in MAX_STEPS_CANDIDATES:
    config_file = f'{CONFIG_PATH}/job-{job_idx}'
    with open(config_file, 'w') as f:
      print(f'LR={lr}', file=f)
      print(f'MAX_STEPS={max_steps}', file=f)
    job_idx += 1

Then, submit all jobs by iterating over the config files and calling sky jobs launch on each one:

for config_file in configs/*; do
  job_name=$(basename "$config_file")
  # -y: yes to all prompts.
  # --async: submit without blocking, so the next job can be submitted
  #          without waiting for the previous job to finish.
  sky jobs launch -n "train-$job_name" -y --async \
    train-template.yaml \
    --env-file "$config_file" \
    --env WANDB_API_KEY
done
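
If you prefer to drive the submission from Python, the loop above could be sketched roughly as follows. It shells out to the same sky jobs launch command with the same flags. commands_for_configs and submit_all are hypothetical helper names, and dry_run defaults to True so that nothing is launched by accident.

```python
import subprocess
from pathlib import Path


def commands_for_configs(config_files, template='train-template.yaml'):
    """Build one `sky jobs launch` argv per config file, in sorted order."""
    commands = []
    for config_file in sorted(Path(p) for p in config_files):
        commands.append([
            'sky', 'jobs', 'launch', '-n', f'train-{config_file.name}', '-y', '--async',
            template,
            '--env-file', str(config_file),
            '--env', 'WANDB_API_KEY',  # picked up from the local environment
        ])
    return commands


def submit_all(config_dir='configs', dry_run=True):
    """Print the commands for every file in config_dir, or run them."""
    for argv in commands_for_configs(Path(config_dir).iterdir()):
        if dry_run:
            print(' '.join(argv))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(argv, check=True)
```

Call submit_all() to preview the commands, and submit_all(dry_run=False) to actually submit the jobs.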

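Best Practices for Scaling

By default, around 90 jobs can be managed at once. However, with some simple configuration, SkyPilot can reliably support 2000 jobs running in parallel. See the best practices for more information.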