Kubernetes

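When running outside a Kubernetes cluster, SkyPilot authenticates with your local ~/.kube/config file and creates resources on your Kubernetes cluster.

When running inside a Kubernetes cluster (for example, as a remote API server, jobs controller, or serve controller), SkyPilot can use one of the following three authentication methods: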

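  1. Automatically created Service Account: SkyPilot can automatically create a Service Account and roles for itself to manage resources in the Kubernetes cluster. This is the default method when running inside the cluster and requires no additional configuration.

    For details on the permissions granted to the Service Account, see the "Minimum permissions required by SkyPilot" section below.

  2. Custom Service Account: If you have a custom Service Account with the required permissions, you can configure SkyPilot to use it by adding the following to your ~/.sky/config.yaml file: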

    kubernetes:
      remote_identity: your-service-account-name
    
  3. Local kubeconfig file: In this case, SkyPilot copies your local ~/.kube/config file into the controller Pod and uses it for authentication. To use this method, set remote_identity to LOCAL_CREDENTIALS under the kubernetes section of your ~/.sky/config.yaml file:

    kubernetes:
      remote_identity: LOCAL_CREDENTIALS
    

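    Note

    If your cluster uses exec-based authentication in its ~/.kube/config file (for example, GKE uses exec authentication by default), SkyPilot may be unable to authenticate with this method. In that case, consider using one of the Service Account methods above instead.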
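
Before choosing LOCAL_CREDENTIALS, it can help to check whether the current kubeconfig context relies on an exec plugin. The following is a minimal sketch, not part of SkyPilot: the function name kubeconfig_uses_exec_auth is hypothetical, and it assumes kubectl is installed and pointed at the cluster in question.

```shell
# Hypothetical helper: succeeds (exit 0) if the current kubeconfig context
# authenticates through an exec plugin, and fails (exit 1) otherwise.
kubeconfig_uses_exec_auth() {
  local exec_cmd
  # --minify restricts the output to the current context;
  # --raw keeps the user entry unredacted so the exec block is visible.
  exec_cmd="$(kubectl config view --minify --raw \
    -o jsonpath='{.users[*].user.exec.command}')"
  # A non-empty command means an exec plugin is configured.
  [ -n "$exec_cmd" ]
}

# Example usage (run manually):
#   if kubeconfig_uses_exec_auth; then
#     echo "exec-based auth detected; consider a Service Account instead"
#   fi
```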

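Note

Service Account based authentication applies only when the remote SkyPilot cluster (including the spot and serve controllers) is launched inside the Kubernetes cluster. When running outside the cluster (for example, on AWS), SkyPilot uses the local ~/.kube/config file for authentication.

Below are the permissions SkyPilot requires, along with example YAML for creating a Service Account that has them.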

Minimum permissions required by SkyPilot

SkyPilot needs permissions equivalent to the following roles to manage resources in the Kubernetes cluster:

# Namespaced role for the service account
# Required for creating pods, services and other necessary resources in the namespace.
# Note these permissions only apply in the namespace where SkyPilot is deployed, and the namespace can be changed below.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
rules:
  # Required for managing pods and their lifecycle
  - apiGroups: [ "" ]
    resources: [ "pods", "pods/status", "pods/exec", "pods/portforward" ]
    verbs: [ "*" ]
  # Required for managing services for SkyPilot Pods
  - apiGroups: [ "" ]
    resources: [ "services" ]
    verbs: [ "*" ]
  # Required for managing SSH keys
  - apiGroups: [ "" ]
    resources: [ "secrets" ]
    verbs: [ "*" ]
  # Required for retrieving reason when Pod scheduling fails.
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "get", "list", "watch" ]
---
# ClusterRole for accessing cluster-wide resources. Details for each resource below:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-cluster-role  # Can be changed if needed. ClusterRoles are cluster-scoped, so no namespace is set.
  labels:
    parent: skypilot
rules:
  # Required for getting node resources.
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  # Required for autodetecting the runtime class of the nodes.
  - apiGroups: ["node.k8s.io"]
    resources: ["runtimeclasses"]
    verbs: ["get", "list", "watch"]

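Tip

If you are using a namespace other than default, make sure to change the namespace in the manifests above.

These roles must apply to both the user account configured in the kubeconfig file and the Service Account used by SkyPilot (if configured).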
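
One way to confirm that a Service Account actually holds these permissions is to query the API server with kubectl auth can-i. The sketch below is illustrative only: the function name check_sky_permissions is hypothetical, the list covers a sample of the rules above rather than all of them, and it assumes the caller is allowed to impersonate Service Accounts (which --as requires).

```shell
# Hypothetical helper: check_sky_permissions NAMESPACE SERVICE_ACCOUNT
# Prints one line per check with the answer reported by the API server.
check_sky_permissions() {
  local ns="$1" sa="$2"
  # Service Accounts are impersonated with this user name format.
  local subject="system:serviceaccount:${ns}:${sa}"
  local check verb resource answer
  # verb:resource pairs sampled from the Role and ClusterRole above.
  for check in create:pods create:services create:secrets \
               list:events list:nodes list:runtimeclasses.node.k8s.io; do
    verb="${check%%:*}"
    resource="${check#*:}"
    # can-i prints "yes" or "no"; warnings about cluster-scoped
    # resources go to stderr and are discarded here.
    answer="$(kubectl auth can-i "$verb" "$resource" \
      --as="$subject" -n "$ns" 2>/dev/null)"
    printf '%-36s %s\n' "$verb $resource" "${answer:-unknown}"
  done
}

# Example usage (run manually):
#   check_sky_permissions default sky-sa
```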

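If you need to view real-time GPU availability with sky show-gpus, if your tasks use object store mounting, or if your tasks need access to ingress resources, you must grant the additional permissions described below.

Permissions for sky show-gpus

sky show-gpus needs to list all Pods across all namespaces to calculate GPU availability. To do this, SkyPilot needs get and list permissions on Pods in a ClusterRole.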

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sky-sa-cluster-role-pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

Tip

If this role is not granted to the Service Account, sky show-gpus still works, but it shows only the total number of GPUs on the nodes, not the number of free GPUs.
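
The ClusterRole above takes effect only once it is bound to the Service Account. A minimal sketch of that binding using kubectl create clusterrolebinding follows; the function name bind_pod_reader and the binding name suffix -pod-reader-binding are assumptions made for illustration.

```shell
# Hypothetical helper: bind_pod_reader NAMESPACE SERVICE_ACCOUNT
# Binds the sky-sa-cluster-role-pod-reader ClusterRole to the given
# Service Account so it can get and list Pods in every namespace.
bind_pod_reader() {
  local ns="$1" sa="$2"
  # The binding name is illustrative; any unused name works.
  kubectl create clusterrolebinding "${sa}-pod-reader-binding" \
    --clusterrole=sky-sa-cluster-role-pod-reader \
    --serviceaccount="${ns}:${sa}"
}

# Example usage (run manually):
#   bind_pod_reader default sky-sa
```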

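Permissions for object store mounting

If your tasks use object store mounting (for example, S3 or GCS), SkyPilot needs to run a DaemonSet in the Kubernetes cluster to expose the FUSE device as a Kubernetes resource to SkyPilot Pods.

To allow this, you also need to create a skypilot-system namespace, which runs the DaemonSet, and grant the Service Account the necessary permissions in that namespace.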

# Required only if using object store mounting
# Create namespace for SkyPilot system
apiVersion: v1
kind: Namespace
metadata:
  name: skypilot-system  # Do not change this
  labels:
    parent: skypilot
---
# Role for the skypilot-system namespace to create fusermount-server and
# any other system components required by SkyPilot.
# This role must be bound in the skypilot-system namespace to the service account used for SkyPilot.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skypilot-system-service-account-role  # Can be changed if needed
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  # Required for managing the DaemonSet that exposes the FUSE device
  - apiGroups: [ "apps" ]
    resources: [ "daemonsets" ]
    verbs: [ "*" ]
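
As the comment in the manifest notes, this Role must be bound in the skypilot-system namespace to the Service Account that SkyPilot uses, which may live in a different namespace. A sketch of that cross-namespace binding is shown below; the function name bind_skypilot_system_role is hypothetical, and the binding name follows the pattern of the Service Account name plus a fixed suffix purely for illustration.

```shell
# Hypothetical helper: bind_skypilot_system_role SA_NAMESPACE SERVICE_ACCOUNT
# Creates a RoleBinding inside skypilot-system whose subject is a
# Service Account that may be defined in another namespace.
bind_skypilot_system_role() {
  local sa_ns="$1" sa="$2"
  kubectl create rolebinding "${sa}-skypilot-system-role-binding" \
    --namespace=skypilot-system \
    --role=skypilot-system-service-account-role \
    --serviceaccount="${sa_ns}:${sa}"
}

# Example usage (run manually):
#   bind_skypilot_system_role default sky-sa
```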

Permissions for using Ingress

If your tasks use Ingress to expose ports, you need to grant the Service Account the necessary permissions in the ingress-nginx namespace.

# Required only if using ingresses
# Role for accessing ingress service IP
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ingress-nginx  # Do not change this
  name: sky-sa-role-ingress-nginx  # Can be changed if needed
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "get"]

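Example using a custom Service Account

To create a Service Account that has all the permissions SkyPilot needs (including access to object stores), you can use the following YAML.

Tip

In this example, the Service Account is named sky-sa and is created in the default namespace. Change the namespace and Service Account name as needed.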

# create-sky-sa.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  name: sky-sa  # Change to your service account name
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
---
# Role for the service account
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
rules:
  # Required for managing pods and their lifecycle
  - apiGroups: [ "" ]
    resources: [ "pods", "pods/status", "pods/exec", "pods/portforward" ]
    verbs: [ "*" ]
  # Required for managing services for SkyPilot Pods
  - apiGroups: [ "" ]
    resources: [ "services" ]
    verbs: [ "*" ]
  # Required for managing SSH keys
  - apiGroups: [ "" ]
    resources: [ "secrets" ]
    verbs: [ "*" ]
  # Required for retrieving reason when Pod scheduling fails.
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "get", "list", "watch" ]
---
# RoleBinding for the service account
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-rb  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
roleRef:
  kind: Role
  name: sky-sa-role  # Use the same name as the Role defined above
  apiGroup: rbac.authorization.k8s.io
---
# ClusterRole for the service account
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-cluster-role  # Can be changed if needed
  labels:
    parent: skypilot
rules:
  - apiGroups: [""]
    resources: ["nodes"]  # Required for getting node resources.
    verbs: ["get", "list", "watch"]
  - apiGroups: ["node.k8s.io"]
    resources: ["runtimeclasses"]   # Required for autodetecting the runtime class of the nodes.
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]   # Required for exposing services through ingresses
    resources: ["ingressclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]                 # Required for `sky show-gpus` command
    resources: ["pods"]
    verbs: ["get", "list"]
---
# ClusterRoleBinding for the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sky-sa-cluster-role-binding  # Can be changed if needed
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change to your namespace if using a different one.
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role  # Use the same name as the ClusterRole defined above
  apiGroup: rbac.authorization.k8s.io
---
# Optional: If using object store mounting, create the skypilot-system namespace
apiVersion: v1
kind: Namespace
metadata:
  name: skypilot-system  # Do not change this
  labels:
    parent: skypilot
---
# Optional: If using object store mounting, create role in the skypilot-system
# namespace to create fusermount-server.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skypilot-system-service-account-role  # Can be changed if needed
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  - apiGroups: [ "apps" ]
    resources: [ "daemonsets" ]
    verbs: [ "*" ]
---
# Optional: If using object store mounting, create rolebinding in the skypilot-system
# namespace to create fusermount-server.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sky-sa-skypilot-system-role-binding
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change this to the namespace where the service account is created
roleRef:
  kind: Role
  name: skypilot-system-service-account-role  # Use the same name as the role above
  apiGroup: rbac.authorization.k8s.io
---
# Optional: Role for accessing ingress resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sky-sa-role-ingress-nginx  # Can be changed if needed
  namespace: ingress-nginx  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "get", "watch"]
---
# Optional: RoleBinding for accessing ingress resources
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sky-sa-rolebinding-ingress-nginx  # Can be changed if needed
  namespace: ingress-nginx  # Do not change this namespace
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change this to the namespace where the service account is created
roleRef:
  kind: Role
  name: sky-sa-role-ingress-nginx  # Use the same name as the role above
  apiGroup: rbac.authorization.k8s.io

Create the Service Account with the following command:

$ kubectl apply -f create-sky-sa.yaml

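After the Service Account is created, the cluster administrator can distribute a kubeconfig file containing the sky-sa Service Account to users who need access to the cluster.

Users should also configure SkyPilot to use the sky-sa Service Account through ~/.sky/config.yaml: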
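
The source does not prescribe how the administrator builds that kubeconfig file. One possible approach, sketched below under several assumptions, is to read the cluster endpoint and CA data from the administrator's current context and pair them with a token issued by kubectl create token (available in kubectl 1.24 and later). The function name write_sa_kubeconfig and the cluster alias sky-cluster are hypothetical, and tokens issued this way are short-lived by default, so a real deployment may need a different credential strategy.

```shell
# Hypothetical helper: write_sa_kubeconfig NAMESPACE SERVICE_ACCOUNT OUTPUT_FILE
# Writes a standalone kubeconfig that authenticates as the Service Account.
write_sa_kubeconfig() {
  local ns="$1" sa="$2" out="$3"
  local server ca token
  # Read the API server URL and CA bundle from the current context.
  # --flatten embeds certificate files as base64 data.
  server="$(kubectl config view --minify --raw --flatten \
    -o jsonpath='{.clusters[0].cluster.server}')"
  ca="$(kubectl config view --minify --raw --flatten \
    -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
  # Request a bound token for the Service Account (expires by default).
  token="$(kubectl create token "$sa" --namespace "$ns")"
  cat > "$out" <<EOF
apiVersion: v1
kind: Config
clusters:
  - name: sky-cluster
    cluster:
      server: ${server}
      certificate-authority-data: ${ca}
users:
  - name: ${sa}
    user:
      token: ${token}
contexts:
  - name: ${sa}@sky-cluster
    context:
      cluster: sky-cluster
      user: ${sa}
      namespace: ${ns}
current-context: ${sa}@sky-cluster
EOF
  # The file contains a credential, so restrict access to the owner.
  chmod 600 "$out"
}

# Example usage (run manually):
#   write_sa_kubeconfig default sky-sa ./sky-sa.kubeconfig
```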

# ~/.sky/config.yaml
kubernetes:
  remote_identity: sky-sa   # Or your service account name