SkyPilot YAML#

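SkyPilot provides an intuitive YAML interface for specifying a cluster, job, or service (resource requirements, setup commands, run commands, file mounts, storage mounts, and so on).

All fields in the YAML specification are optional. When a field is unspecified, its default value is used. You can specify only the fields that are relevant to your task.

YAML files can be used with the CLI or with the programmatic API (for example, sky.Task.from_yaml()).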

Syntax#

Below is the configuration syntax with some example values. See the details under each field.

name: my-task

workdir: ~/my-task-code

num_nodes: 4

resources:
  # Location.
  cloud: aws
  region: us-east-1
  zone: us-east-1a

  # Hardware.
  accelerators: H100:8
  accelerator_args:
    runtime_version: tpu-vm-base
  cpus: 4+
  memory: 32+
  instance_type: p3.8xlarge
  use_spot: false
  disk_size: 256
  disk_tier: medium

  # Config.
  image_id: ami-0868a20f5a3bf9702
  ports: 8081
  labels:
    my-label: my-value

  any_of:
    - cloud: aws
      region: us-west-2
      accelerators: H100
    - cloud: gcp
      accelerators: H100

  ordered:
    - cloud: aws
      region: us-east-1
    - cloud: aws
      region: us-west-2

  job_recovery: none

envs:
  MY_BUCKET: skypilot-temp-gcs-test
  MY_LOCAL_PATH: tmp-workdir
  MODEL_SIZE: 13b

file_mounts:
  /remote/path: /local/path
  /checkpoints:
    source: s3://existing-bucket
    mode: MOUNT
  /datasets-s3: s3://my-awesome-dataset

setup: |
  echo "Begin setup."
  pip install -r requirements.txt
  echo "Setup complete."

run: |
  echo "Begin run."
  python train.py
  echo Env var MODEL_SIZE has value: ${MODEL_SIZE}

config:
  kubernetes:
    provision_timeout: 600

Fields#

`name`#

Task name (optional), used for display purposes.

name: my-task

`workdir`#

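Working directory (optional), synced to ~/sky_workdir on the remote cluster each time launch or exec is run with the YAML file.

Commands in setup and run are executed under it.

If a relative path is used, it is evaluated relative to the location from which sky is called.

To exclude files from syncing, see https://docs.skypilot.org.cn/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files

workdir: ~/my-task-code

Or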

workdir: ../my-project  # Relative path

`num_nodes`#

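Number of nodes to launch (optional; defaults to 1), including the head node.

A task can set this to a value smaller than the size of the cluster.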

num_nodes: 4

`resources`#

Per-node resource requirements (optional).

resources:
  cloud: aws
  instance_type: p3.8xlarge

`resources.cloud`#

The cloud to use (optional).

resources:
  cloud: aws

Or

resources:
  cloud: gcp

`resources.region`#

The region to use (optional).

Auto-failover is disabled if this is specified.

resources:
  region: us-east-1

`resources.zone`#

The zone to use (optional).

Auto-failover is disabled if this is specified.

resources:
  zone: us-east-1a

`resources.accelerators`#

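Accelerator name and count per node (optional).

Use sky show-gpus to view the available accelerator configurations.

There are three ways to specify accelerators for a cluster:

To specify a single type of accelerator:

Format: <name>:<count> (or just <name>, which means a count of 1).

Example: H100:4

To specify an ordered list of accelerators (the accelerators are tried in the specified order):

Format: [<name>:<count>, ...]

Example: ['L4:1', 'H100:1', 'A100:1']

To specify an unordered set of accelerators (all specified accelerators are optimized together, and the cheapest one is tried first):

Format: {<name>:<count>, ...}

Example: {'L4:1', 'H100:1', 'A100:1'}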

resources:
  accelerators: V100:8

Or

resources:
  accelerators:
    - A100:1
    - V100:1

Or

resources:
  accelerators: {A100:1, V100:1}
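The `<name>:<count>` format described above can be read as a name plus an optional count. The following is a minimal Python sketch of that rule only; `parse_accelerator` is a hypothetical helper written for illustration and is not SkyPilot's own parser.

```python
from typing import Tuple


def parse_accelerator(spec: str) -> Tuple[str, int]:
    """Split '<name>:<count>' into (name, count).

    A bare '<name>' is treated as a count of 1, as the format note says.
    """
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"missing accelerator name in {spec!r}")
    # `sep` is empty when no ':' was present, i.e. the count was omitted.
    return name, int(count) if sep else 1


# An ordered list keeps its order; each entry is parsed the same way.
ordered = [parse_accelerator(s) for s in ["L4:1", "H100:1", "A100:1"]]
```

An unordered set would be parsed entry by entry in the same way; only the order in which candidates are tried differs.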

`resources.accelerator_args`#

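Additional accelerator metadata (optional); only used for TPU nodes and TPU VMs.

Example usage

To request a TPU VM:

resources:
  accelerator_args:
    tpu_vm: true  # optional, default: True

To request a TPU node:

resources:
  accelerator_args:
    tpu_name: mytpu
    tpu_vm: false

By default, the value of runtime_version is chosen based on which of the two is requested, and should work in either case. If an incompatible version is passed in, GCP throws an error during provisioning.

Example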

resources:
  accelerator_args:
    # Default is "tpu-vm-base" for TPU VM and "2.12.0" for TPU node.
    runtime_version: tpu-vm-base
    # tpu_name: mytpu
    # tpu_vm: false  # True to use TPU VM (the default); False to use TPU node.

`resources.cpus`#

Number of vCPUs per node (optional).

Format:

<count>: exactly <count> vCPUs

<count>+: at least <count> vCPUs

resources:
  cpus: 4+

Or

resources:
  cpus: 16

Example: `4+` means first try to find an instance type with >= 4 vCPUs. If none is found, use the next cheapest instance with more than 4 vCPUs.
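To make the `<count>` versus `<count>+` distinction concrete, here is a small Python sketch that parses such a spec and picks the cheapest match from a catalog. The catalog entries, their prices, and the helper names are made up for illustration; SkyPilot's real optimizer is more involved than this.

```python
from typing import List, Optional, Tuple


def parse_count(spec) -> Tuple[float, bool]:
    """Return (count, at_least) for specs such as 4, '16', or '4+'."""
    text = str(spec).strip()
    at_least = text.endswith("+")
    return float(text.rstrip("+")), at_least


def cheapest_match(spec, catalog: List[Tuple[str, int, float]]) -> Optional[str]:
    """Pick the cheapest (name, vcpus, hourly_price) entry satisfying spec."""
    count, at_least = parse_count(spec)
    matches = [
        entry for entry in catalog
        if (entry[1] >= count if at_least else entry[1] == count)
    ]
    return min(matches, key=lambda entry: entry[2])[0] if matches else None


# Hypothetical catalog: instance names and prices are invented.
CATALOG = [("small", 2, 0.10), ("medium", 4, 0.20), ("large", 8, 0.40)]
```

With this catalog, `cheapest_match("3+", CATALOG)` falls through to the 4-vCPU entry because no 3-vCPU instance exists, while an exact spec such as `16` finds nothing.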

resources.memory#

Memory per node, in GiB (optional).

Format:

<num>: exactly <num> GiB

<num>+: at least <num> GiB

resources:
  memory: 32+

Or

resources:
  memory: 64

Example: `32+` means first try to find an instance type with >= 32 GiB. If none is found, use the next cheapest instance with more than 32 GiB.

resources.instance_type#

The instance type to use (optional).

resources:
  instance_type: p3.8xlarge

If `accelerators` is specified, the corresponding instance type is inferred automatically.

resources.use_spot#

Whether the cluster should use spot instances (optional).

resources:
  use_spot: true

If unspecified, defaults to `false` (on-demand instances).

resources.disk_size#

Disk size to allocate for the OS, in GB (mounted at /).

resources:
  disk_size: 256

Increase this value if your working directory is large or your task writes out a lot of data.

resources.disk_tier#

Disk tier to use for the OS (optional).

Can be one of 'low', 'medium', 'high', 'ultra', or 'best' (default: 'medium').

If 'best' is specified, the best disk tier enabled is used.

Rough performance estimates (measured with `examples/perf/storage_rawperf.yaml`):

low: 1000 IOPS; read 90 MB/s; write 90 MB/s

medium: 3000 IOPS; read 220 MB/s; write 220 MB/s

high: 6000 IOPS; read 400 MB/s; write 400 MB/s

ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s

resources:
  disk_tier: medium

Or

resources:
  disk_tier: best

resources.ports#

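Ports to expose (optional).

All ports specified here are exposed to the public Internet. Under the hood, firewall rules / inbound rules are added automatically to allow inbound traffic to these ports.

This applies to all VMs of a cluster created with this field set.

Currently only the TCP protocol is supported.

Port lifecycle

A cluster's ports are updated whenever sky launch is executed. When an existing cluster is launched, any newly specified ports are opened for the cluster, and the firewall rules for old ports are never removed until the cluster is terminated.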

Can be an integer, a range, or a list of integers and ranges:

To specify a single port: 8081

To specify a port range: 10052-10100

To specify multiple ports / port ranges:

resources:
  ports:
    - 8080
    - 10022-10040

Or

resources:
  ports: 8081

Or

resources:
  ports: 10052-10100
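The three accepted shapes (a single integer, a range, or a list mixing both) can be normalized into one flat list of port numbers. The sketch below shows one way to do that in Python; `expand_ports` is a hypothetical helper for illustration, not part of SkyPilot.

```python
from typing import List, Union

PortSpec = Union[int, str, List[Union[int, str]]]


def expand_ports(spec: PortSpec) -> List[int]:
    """Expand 8081, '10052-10100', or a list of both into sorted port numbers."""
    items = spec if isinstance(spec, list) else [spec]
    ports = set()
    for item in items:
        text = str(item).strip()
        if "-" in text:
            low, high = (int(part) for part in text.split("-", 1))
            if low > high:
                raise ValueError(f"invalid port range: {text}")
            ports.update(range(low, high + 1))
        else:
            ports.add(int(text))
    for port in ports:
        # TCP port numbers must fit in 1..65535.
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return sorted(ports)
```

For example, `expand_ports([8080, "10022-10024"])` yields the four ports 8080, 10022, 10023, and 10024.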

resources.image_id#

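Custom image ID (optional, advanced).

The image ID used to boot the instances. Only supported for AWS, GCP, OCI, and IBM (for non-Docker images).

If unspecified, SkyPilot uses a default debian-based image suitable for machine learning tasks.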

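Docker support

You can specify a Docker image to use on Azure, AWS, and GCP by setting image_id to docker:<image name>. For example,

resources:
  image_id: docker:ubuntu:latest

Currently, only debian and ubuntu images are supported.

If you want to use a Docker image from a private registry, you can specify your username, password, and registry server as task environment variables. For details, see the envs section below.

AWS

To find AWS AMI IDs: https://leaherb.com/how-to-find-an-aws-marketplace-ami-image-id

You can also change the default OS version by choosing from the following image tags provided by SkyPilot:

resources:
  # Pick one of the following tags.
  image_id: skypilot:gpu-ubuntu-2004
  image_id: skypilot:k80-ubuntu-2004
  image_id: skypilot:gpu-ubuntu-1804
  image_id: skypilot:k80-ubuntu-1804

It is also possible to specify a per-region image ID (failover only goes through the regions specified as keys; useful when you have custom images in multiple regions):

resources:
  image_id:
    us-east-1: ami-0729d913a335efca7
    us-west-2: ami-050814f384259894c

GCP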

To find GCP images: https://cloud.google.com/compute/docs/images

resources:
  image_id: projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310

Or a machine image: https://cloud.google.com/compute/docs/machine-images

resources:
  image_id: projects/my-project/global/machineImages/my-machine-image

Azure

To find Azure images: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage

resources:
  image_id: microsoft-dsvm:ubuntu-2004:2004:21.11.04

OCI

To find OCI images: https://docs.oracle.com/en-us/iaas/images

You can choose an image with a specific OS version from the following image tags provided by SkyPilot:

resources:
  # Pick one of the following tags.
  image_id: skypilot:gpu-ubuntu-2204
  image_id: skypilot:gpu-ubuntu-2004
  image_id: skypilot:gpu-oraclelinux9
  image_id: skypilot:gpu-oraclelinux8
  image_id: skypilot:cpu-ubuntu-2204
  image_id: skypilot:cpu-ubuntu-2004
  image_id: skypilot:cpu-oraclelinux9
  image_id: skypilot:cpu-oraclelinux8

It is also possible to specify your custom image's OCID together with the OS type, for example:

resources:
  # Pick one of the following.
  image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaaywwfvy67wwe7f24juvjwhyjn3u7g7s3wzkhduxcbewzaeki2nt5q:oraclelinux
  image_id: ocid1.image.oc1.us-sanjose-1.aaaaaaaa5tnuiqevhoyfnaa5pqeiwjv6w5vf6w4q2hpj3atyvu3yd6rhlhyq:ubuntu

IBM

Create a private VPC image and paste its ID in the following format:

resources:
  image_id: <unique_image_id>

To create an image manually: https://cloud.ibm.com/docs/vpc?topic=vpc-creating-and-using-an-image-from-volume

To use the official VPC image creation tool: https://www.ibm.com/cloud/blog/use-ibm-packer-plugin-to-create-custom-images-on-ibm-cloud-vpc-infrastructure

To use a more limited but easier-to-manage tool: IBM/vpc-img-inst

Example

resources:
  image_id: ami-0868a20f5a3bf9702  # AWS example
  # image_id: projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310  # GCP example
  # image_id: docker:pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime # Docker example

Or

resources:
  image_id:
    us-east-1: ami-123
    us-west-2: ami-456

resources.labels#

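Labels to apply to the instances (optional).

If specified, these labels are applied to the VMs or pods created by SkyPilot.

They are useful for assigning metadata that may be used by external tools.

Implementation differs by cloud provider:

AWS: labels are mapped to instance tags

GCP: labels are mapped to instance labels

Kubernetes: labels are mapped to pod labels

Other: labels are not supported and are ignored

Example

resources:
  labels:
    project: my-project
    department: research

Note: labels are applied only on the first launch of the cluster. They are not updated on subsequent launches.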

resources.any_of#

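Candidate resources (optional).

If specified, SkyPilot only uses these candidate resources to launch the cluster.

Fields specified outside of any_of are used as defaults for all candidate resources, and any duplicate fields specified inside any_of override the defaults.

Example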

resources:
  any_of:
    - cloud: aws
      region: us-west-2
      accelerators: H100
    - cloud: gcp
      accelerators: H100

`any_of` means that SkyPilot tries to find a resource matching any one of the candidates; that is, the failover order is decided by the optimizer.
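The defaults-plus-override rule for candidates can be pictured as a dictionary merge. The following Python sketch illustrates that rule on plain dictionaries; `expand_candidates` is a hypothetical helper, and the actual merging inside SkyPilot may handle more cases.

```python
from typing import Any, Dict, List

CANDIDATE_KEYS = ("any_of", "ordered")


def expand_candidates(resources: Dict[str, Any], key: str = "any_of") -> List[Dict[str, Any]]:
    """Apply fields outside `key` as defaults to every candidate inside it."""
    defaults = {k: v for k, v in resources.items() if k not in CANDIDATE_KEYS}
    # Later entries win in a dict merge, so candidate fields override defaults.
    return [{**defaults, **candidate} for candidate in resources.get(key, [])]


resources = {
    "accelerators": "H100",
    "use_spot": True,
    "any_of": [
        {"cloud": "aws", "region": "us-west-2"},
        {"cloud": "gcp", "use_spot": False},
    ],
}
candidates = expand_candidates(resources)
```

Here both candidates inherit `accelerators: H100`, and the GCP candidate overrides `use_spot`. The same merge applies to `ordered` by passing `key="ordered"`; only the order in which candidates are tried differs.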

resources.ordered#

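Ordered candidate resources (optional).

If specified, SkyPilot fails over through the candidate resources in the specified order.

Fields specified outside of ordered are used as defaults for all candidate resources, and any duplicate fields specified inside ordered override the defaults.

Example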

resources:
  ordered:
    - cloud: aws
      region: us-east-1
    - cloud: aws
      region: us-west-2

`ordered` means that SkyPilot fails over through the candidate resources in the specified order.

resources.job_recovery#

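Recovery strategy for managed jobs (optional).

Takes effect only for managed jobs. Possible values are FAILOVER and EAGER_NEXT_REGION.

If FAILOVER is specified, the job is restarted in the same region when a node fails, and moves to the next region only if no resources are available in the same region.

If EAGER_NEXT_REGION is specified, the job moves directly to the next region when a node fails. This is useful for spot instances, because in practice a preemption in a region usually indicates a resource shortage in that region.

Example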

resources:
  job_recovery:
    strategy: FAILOVER

Or

resources:
  job_recovery:
    strategy: EAGER_NEXT_REGION
    max_restarts_on_errors: 3

Default: `EAGER_NEXT_REGION`

envs#

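Environment variables (optional).

These values can be accessed in the file_mounts, setup, and run sections below.

Values set here can be overridden by a CLI flag: sky launch/exec --env ENV=val (if ENV is present).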

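For customized non-root Docker images in RunPod, you need to set SKYPILOT_RUNPOD_DOCKER_USERNAME to specify the login username for the Docker image. For more information, see Using Containers as Runtime Environments.

If you want to use a Docker image from a private registry as the runtime environment, you can specify your username, password, and registry server as task environment variables. For example,

envs:
  SKYPILOT_DOCKER_USERNAME: <username>
  SKYPILOT_DOCKER_PASSWORD: <password>
  SKYPILOT_DOCKER_SERVER: <registry server>

SkyPilot executes docker login --username <username> --password <password> <registry server> before pulling the Docker image. For docker login, see https://docs.dockerd.com.cn/engine/reference/commandline/login/

If you do not want to store these values in the YAML file, or want to generate them for passwords that change constantly, you can also specify any of them through the CLI flag. For example,

sky launch --env SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)

For more information about Docker support in SkyPilot, see the image_id section above.

Example of using envs

envs:
  MY_BUCKET: skypilot-data
  MODEL_SIZE: 13b
  MY_LOCAL_PATH: tmp-workdir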
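To show how values from envs end up in paths such as `/checkpoint/${MODEL_SIZE}`, and how a CLI `--env` override takes precedence, here is a rough Python sketch using the standard library's `string.Template`. The `resolve` helper is hypothetical; SkyPilot's own substitution logic is not shown in this document and may differ.

```python
from string import Template
from typing import Dict, Optional


def resolve(text: str, envs: Dict[str, str],
            overrides: Optional[Dict[str, str]] = None) -> str:
    """Substitute ${VAR} references, letting CLI-style overrides win."""
    merged = {**envs, **(overrides or {})}
    # substitute() raises KeyError for a variable that is not defined.
    return Template(text).substitute(merged)


envs = {"MY_BUCKET": "skypilot-data", "MODEL_SIZE": "13b", "MY_LOCAL_PATH": "tmp-workdir"}

remote_path = resolve("/checkpoint/${MODEL_SIZE}", envs)
local_path = resolve("~/${MY_LOCAL_PATH}", envs)
# Equivalent in spirit to passing `--env MODEL_SIZE=70b` on the command line.
overridden = resolve("/checkpoint/${MODEL_SIZE}", envs, {"MODEL_SIZE": "70b"})
```

The override dictionary is merged last, which mirrors the documented behavior that a CLI flag replaces the value set in the YAML file.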

file_mounts#

Example

file_mounts:
  # Uses rsync to sync local files/directories to all nodes of the cluster.
  #
  # If a relative path is used, it's evaluated relative to the location from
  # which `sky` is called.
  #
  # If symlinks are present, they are copied as symlinks, and their targets
  # must also be synced using file_mounts to ensure correctness.
  /remote/dir1/file: /local/dir1/file
  /remote/dir2: /local/dir2

  # Create a S3 bucket named sky-dataset, uploads the contents of
  # /local/path/datasets to the bucket, and marks the bucket as persistent
  # (it will not be deleted after the completion of this task).
  # Symlinks and their contents are NOT copied.
  #
  # Mounts the bucket at /datasets-storage on every node of the cluster.
  /datasets-storage:
    name: sky-dataset  # Name of storage, optional when source is bucket URI
    source: /local/path/datasets  # Source path, can be local or bucket URI. Optional, do not specify to create an empty bucket.
    store: s3  # Could be either 's3', 'gcs', 'azure', 'r2', 'oci', or 'ibm'; default: None. Optional.
    persistent: True  # Defaults to True; can be set to false to delete bucket after cluster is downed. Optional.
    mode: MOUNT  # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.

  # Copies a cloud object store URI to the cluster. Can be private buckets.
  /datasets-s3: s3://my-awesome-dataset

  # Demoing env var usage.
  /checkpoint/${MODEL_SIZE}: ~/${MY_LOCAL_PATH}
  /mydir:
    name: ${MY_BUCKET}  # Name of the bucket.
    mode: MOUNT

Or

file_mounts:
  /remote/data: ./local_data  # Local to remote
  /remote/output: s3://my-bucket/outputs  # Cloud storage
  /remote/models:
    name: my-models-bucket
    source: ~/local_models
    store: gcs
    mode: MOUNT

File mounts configuration.

setup#

Setup script (optional), executed on every sky launch.

This is executed before the run commands.

Example

To specify a single command:

setup: pip install -r requirements.txt

setup: |
  echo "Begin setup."
  pip install -r requirements.txt
  echo "Setup complete."

Or

setup: |
  conda create -n myenv python=3.9 -y
  conda activate myenv
  pip install torch torchvision

The `|` separator indicates a multi-line string.

run#

Example

run: |
  echo "Beginning task."
  python train.py

  # Demoing env var usage.
  echo Env var MODEL_SIZE has value: ${MODEL_SIZE}

Or

run: |
  conda activate myenv
  python my_script.py --data-dir /remote/data --output-dir /remote/output

Main program to run on every node of the cluster (optional, but recommended).

config#

Example

config:
  docker:
    run_options: ...
  kubernetes:
    pod_config: ...
    provision_timeout: ...
  gcp:
    managed_instance_group: ...
  nvidia_gpus:
    disable_ecc: ...

Advanced configuration options to apply to the task.

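SkyServe Service#

To define a YAML for a service, use the previously mentioned fields to describe each replica, then add a service section to describe the entire service.

Syntax#

service:
  # Detailed form.
  readiness_probe:
    path: /v1/models
    post_data: {'model_name': 'model'}
    initial_delay_seconds: 1200
    timeout_seconds: 15

  # Or the simplified form (path only).
  readiness_probe: /v1/models

  # Autoscaling policy.
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

  # Or a fixed number of replicas instead of replica_policy.
  replicas: 2

resources:
  ports: 8080

Fields#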

service.readiness_probe#

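Readiness probe configuration (required).

SkyServe uses this to check whether your service replicas are ready to accept traffic.

If the readiness probe returns 200, SkyServe starts routing traffic to that replica.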

service:
  readiness_probe: /v1/models

Or

service:
  readiness_probe:
    path: /v1/models
    post_data: '{"model_name": "my_model"}'
    initial_delay_seconds: 600
    timeout_seconds: 10

Can be defined as a path string (for a default GET request) or as a detailed dictionary.

service.readiness_probe.path#

Endpoint path for readiness checks (required).

service:
  readiness_probe:
    path: /v1/models

The path to probe. SkyServe sends periodic requests to this path after the initial delay.

service.readiness_probe.post_data#

POST request payload (optional).

service:
  readiness_probe:
    path: /v1/models
    post_data: '{"model_name": "my_model"}'

If this is specified, the readiness probe uses POST instead of GET, and the post data is sent as the request body.

service.readiness_probe.initial_delay_seconds#

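Grace period before health checks start (default: 1200).

Initial delay, in seconds. Any readiness probe failures during this period are ignored.

service:
  readiness_probe:
    initial_delay_seconds: 600

This depends heavily on your service, so it is recommended to set this value based on your service's startup time.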

service.readiness_probe.timeout_seconds#

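Maximum wait time for each probe request (default: 15).

Timeout for a readiness probe request, in seconds.

If the readiness probe takes longer than this to respond, the probe is considered failed.

This is useful when your service is slow to respond to readiness probe requests.

service:
  readiness_probe:
    timeout_seconds: 10

Note that setting the timeout too high delays the detection of actual failures of your service replicas.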
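The interaction between the 200 check, the per-request timeout, and the initial delay can be summarized as a small decision function. The Python sketch below is only an illustration of the rules stated above, using the documented defaults of 1200 and 15 seconds; `classify_probe` and its return labels are hypothetical and do not correspond to SkyServe internals.

```python
from typing import Optional


def classify_probe(status_code: Optional[int],
                   response_seconds: float,
                   seconds_since_start: float,
                   initial_delay_seconds: float = 1200,
                   timeout_seconds: float = 15) -> str:
    """Classify one probe result as 'ready', 'ignored', or 'failed'.

    status_code is None when the replica did not respond at all.
    """
    succeeded = status_code == 200 and response_seconds <= timeout_seconds
    if succeeded:
        return "ready"
    # Failures inside the initial delay window do not count against the replica.
    if seconds_since_start < initial_delay_seconds:
        return "ignored"
    return "failed"
```

For instance, a 503 response 30 seconds after startup is ignored under the default initial delay, while the same response after 1300 seconds counts as a failure.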

service.replica_policy#

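Autoscaling configuration for service replicas (one of replica_policy or replicas is required).

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 5
    target_qps_per_replica: 10

Describes how SkyServe autoscales your service based on the service's QPS (queries per second).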

service.replica_policy.min_replicas#

Minimum number of active replicas (required).

service:
  replica_policy:
    min_replicas: 1

The service never scales down below this count.

service.replica_policy.max_replicas#

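Maximum number of replicas allowed (optional).

service:
  replica_policy:
    max_replicas: 3

If unspecified, SkyServe uses a fixed number of replicas (the same as min_replicas) and ignores any QPS threshold specified below.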

service.replica_policy.target_qps_per_replica#

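Target queries per second per replica (optional).

SkyServe scales your service so that, eventually, each replica handles roughly target_qps_per_replica queries per second.

service:
  replica_policy:
    target_qps_per_replica: 5

Autoscaling is enabled only when this value is specified.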

service.replica_policy.upscale_delay_seconds#

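Stabilization period before adding replicas (default: 300).

service:
  replica_policy:
    upscale_delay_seconds: 300

Upscale delay, in seconds. To avoid aggressive autoscaling, SkyServe scales your service up only when its QPS stays above the target QPS for this period of time.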

service.replica_policy.downscale_delay_seconds#

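Cooldown period before removing replicas (default: 1200).

service:
  replica_policy:
    downscale_delay_seconds: 1200

Downscale delay, in seconds. To avoid aggressive autoscaling, SkyServe scales your service down only when its QPS stays below the target QPS for this period of time.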
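As a rough picture of how the replica_policy fields fit together, the Python sketch below computes a target replica count from the observed QPS. The rounding rule (ceiling of QPS divided by the per-replica target) is an assumption made for illustration, since the exact formula is not given here, and the upscale and downscale delays are deliberately left out. `desired_replicas` is a hypothetical helper.

```python
import math
from typing import Optional


def desired_replicas(qps: float,
                     min_replicas: int,
                     max_replicas: Optional[int] = None,
                     target_qps_per_replica: Optional[float] = None) -> int:
    """Return a target replica count, clamped to [min_replicas, max_replicas]."""
    # Without max_replicas or a QPS target, the count stays fixed at min_replicas.
    if max_replicas is None or target_qps_per_replica is None:
        return min_replicas
    # Assumed rule: enough replicas that each handles at most the target QPS.
    wanted = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With min_replicas 1, max_replicas 3, and target_qps_per_replica 5, a load of 7 QPS maps to 2 replicas and a load of 100 QPS is capped at 3. In a real autoscaler the delays described above would decide when such a change is actually applied.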

service.replicas#

Fixed number of replicas, as an alternative to autoscaling.

service:
  replicas: 2

A simplified version of the replica policy that uses a fixed number of replicas.

resources.ports#

Port to expose for service traffic.

resources:
  ports: 8080
