Kubernetes#
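When running outside your Kubernetes cluster, SkyPilot uses your local ~/.kube/config file to authenticate and to create resources on the cluster.
When running inside your Kubernetes cluster (for example, as a remote API server, jobs controller, or serve controller), SkyPilot can use one of the following three authentication methods: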
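Automatically created Service Account: SkyPilot can automatically create a Service Account and roles for itself to manage resources in the Kubernetes cluster. This is the default method when running inside the cluster, and it requires no additional configuration.
For details on the permissions granted to this Service Account, see the section on minimum permissions below.
Custom Service Account: If you have a custom Service Account with the required permissions, you can configure SkyPilot to use it by adding the following to your ~/.sky/config.yaml file:
kubernetes:
  remote_identity: your-service-account-name
Local kubeconfig file: In this case, SkyPilot copies your local ~/.kube/config file into the controller Pod and uses it for authentication. To use this method, set remote_identity to LOCAL_CREDENTIALS under the kubernetes section of your ~/.sky/config.yaml file:
kubernetes:
  remote_identity: LOCAL_CREDENTIALS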
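Note
If your cluster uses exec-based authentication in your ~/.kube/config file (for example, GKE uses exec auth by default), SkyPilot may not be able to authenticate with the local kubeconfig method. In that case, consider using one of the Service Account methods instead.
Note
Service Account based authentication applies only when the remote SkyPilot cluster (including spot and serve controllers) is launched inside the Kubernetes cluster. When running outside the cluster (for example, on AWS), SkyPilot uses the local ~/.kube/config file for authentication.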
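All three methods are selected through the same remote_identity key, so they can be compared side by side. The following is a minimal sketch, not an authoritative reference: the SERVICE_ACCOUNT literal for the default method does not appear on this page and is assumed here from the SkyPilot configuration reference, so verify it against your SkyPilot version.

```yaml
# ~/.sky/config.yaml
kubernetes:
  # Set exactly one value:
  #   SERVICE_ACCOUNT            auto-created Service Account (assumed literal for the default)
  #   your-service-account-name  a custom Service Account that already has the required permissions
  #   LOCAL_CREDENTIALS          copy the local ~/.kube/config into the controller Pod
  remote_identity: SERVICE_ACCOUNT
```

Because the automatically created Service Account is the default, omitting the key entirely should have the same effect as the value shown.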
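The permissions SkyPilot requires are listed below, together with an example YAML that creates a Service Account with those permissions.
Minimum permissions required for SkyPilot#
SkyPilot requires permissions equivalent to the following roles to manage resources in the Kubernetes cluster: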
# Namespaced role for the service account
# Required for creating pods, services and other necessary resources in the namespace.
# Note these permissions only apply in the namespace where SkyPilot is deployed, and the namespace can be changed below.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
rules:
  # Required for managing pods and their lifecycle
  - apiGroups: [ "" ]
    resources: [ "pods", "pods/status", "pods/exec", "pods/portforward" ]
    verbs: [ "*" ]
  # Required for managing services for SkyPilot Pods
  - apiGroups: [ "" ]
    resources: [ "services" ]
    verbs: [ "*" ]
  # Required for managing SSH keys
  - apiGroups: [ "" ]
    resources: [ "secrets" ]
    verbs: [ "*" ]
  # Required for retrieving reason when Pod scheduling fails.
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "get", "list", "watch" ]
---
# ClusterRole for accessing cluster-wide resources. Details for each resource below:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-cluster-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
rules:
  # Required for getting node resources.
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  # Required for autodetecting the runtime class of the nodes.
  - apiGroups: ["node.k8s.io"]
    resources: ["runtimeclasses"]
    verbs: ["get", "list", "watch"]
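Tip
If you are using a namespace other than default, make sure to change the namespace in the manifests above.
These roles must apply to both the user account configured in the kubeconfig file and the Service Account used by SkyPilot (if configured).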
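Granting the roles to a kubeconfig user means binding them to a User subject instead of a ServiceAccount subject. The manifest below is a sketch of how that might look; the user name jane@example.com and the binding names sky-user-rb and sky-user-cluster-role-binding are hypothetical, and the user name must match whatever identity your cluster's authenticator reports for that kubeconfig user.

```yaml
# Bind the namespaced Role to a kubeconfig user (hypothetical names).
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-user-rb  # Hypothetical name
  namespace: default  # Must match the namespace of sky-sa-role
subjects:
  - kind: User
    name: jane@example.com  # Hypothetical; use the identity your cluster sees
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: sky-sa-role
  apiGroup: rbac.authorization.k8s.io
---
# Bind the ClusterRole to the same user.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-user-cluster-role-binding  # Hypothetical name
subjects:
  - kind: User
    name: jane@example.com  # Hypothetical; same user as above
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role
  apiGroup: rbac.authorization.k8s.io
```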
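If you need to view real-time GPU availability with sky show-gpus, your tasks use object store mounting, or your tasks need access to ingress resources, you will need to grant the additional permissions described below.
Permissions for sky show-gpus#
sky show-gpus needs to list all Pods across all namespaces to calculate GPU availability. To do this, SkyPilot needs get and list permissions on Pods in a ClusterRole.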
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sky-sa-cluster-role-pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
Tip
If this role is not granted to the Service Account, sky show-gpus will still work, but it will show only the total number of GPUs on the nodes, not the number of free GPUs.
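A ClusterRole has no effect until it is bound to a subject. The sketch below shows one way to bind the pod-reader ClusterRole; the binding name is hypothetical, and the subject assumes the sky-sa Service Account in the default namespace that the custom Service Account example on this page uses.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sky-sa-cluster-role-pod-reader-binding  # Hypothetical name
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Assumed; change to your service account name
    namespace: default  # Namespace where the service account lives
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role-pod-reader
  apiGroup: rbac.authorization.k8s.io
```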
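Permissions for object store mounting#
If your tasks use object store mounting (for example, S3 or GCS), SkyPilot needs to run a DaemonSet in the Kubernetes cluster to expose the FUSE device as a Kubernetes resource to SkyPilot Pods.
To allow this, you also need to create a skypilot-system namespace, where the DaemonSet will run, and grant the necessary permissions to the Service Account in that namespace.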
# Required only if using object store mounting
# Create namespace for SkyPilot system
apiVersion: v1
kind: Namespace
metadata:
  name: skypilot-system  # Do not change this
  labels:
    parent: skypilot
---
# Role for the skypilot-system namespace to create fusermount-server and
# any other system components required by SkyPilot.
# This role must be bound in the skypilot-system namespace to the service account used for SkyPilot.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skypilot-system-service-account-role  # Can be changed if needed
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  # DaemonSets belong to the "apps" API group
  - apiGroups: [ "apps" ]
    resources: [ "daemonsets" ]
    verbs: [ "*" ]
Permissions for using Ingress#
If your tasks use Ingress to expose ports, you will need to grant the necessary permissions to the Service Account in the ingress-nginx namespace.
# Required only if using ingresses
# Role for accessing ingress service IP
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ingress-nginx  # Do not change this
  name: sky-sa-role-ingress-nginx  # Can be changed if needed
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "get"]
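Example using a custom Service Account#
To create a Service Account that has all the permissions SkyPilot needs (including those for accessing object stores), you can use the following YAML.
Tip
In this example, the Service Account is named sky-sa and is created in the default namespace. Change the namespace and Service Account name as needed.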
# create-sky-sa.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  name: sky-sa  # Change to your service account name
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
---
# Role for the service account
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
rules:
  # Required for managing pods and their lifecycle
  - apiGroups: [ "" ]
    resources: [ "pods", "pods/status", "pods/exec", "pods/portforward" ]
    verbs: [ "*" ]
  # Required for managing services for SkyPilot Pods
  - apiGroups: [ "" ]
    resources: [ "services" ]
    verbs: [ "*" ]
  # Required for managing SSH keys
  - apiGroups: [ "" ]
    resources: [ "secrets" ]
    verbs: [ "*" ]
  # Required for retrieving reason when Pod scheduling fails.
  - apiGroups: [ "" ]
    resources: [ "events" ]
    verbs: [ "get", "list", "watch" ]
---
# RoleBinding for the service account
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-rb  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
roleRef:
  kind: Role
  name: sky-sa-role  # Use the same name as the Role above
  apiGroup: rbac.authorization.k8s.io
---
# ClusterRole for the service account
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: sky-sa-cluster-role  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
rules:
  - apiGroups: [""]
    resources: ["nodes"]  # Required for getting node resources.
    verbs: ["get", "list", "watch"]
  - apiGroups: ["node.k8s.io"]
    resources: ["runtimeclasses"]  # Required for autodetecting the runtime class of the nodes.
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]  # Required for exposing services through ingresses
    resources: ["ingressclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]  # Required for `sky show-gpus` command
    resources: ["pods"]
    verbs: ["get", "list"]
---
# ClusterRoleBinding for the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sky-sa-cluster-role-binding  # Can be changed if needed
  namespace: default  # Change to your namespace if using a different one.
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change to your namespace if using a different one.
roleRef:
  kind: ClusterRole
  name: sky-sa-cluster-role  # Use the same name as the ClusterRole above
  apiGroup: rbac.authorization.k8s.io
---
# Optional: If using object store mounting, create the skypilot-system namespace
apiVersion: v1
kind: Namespace
metadata:
  name: skypilot-system  # Do not change this
  labels:
    parent: skypilot
---
# Optional: If using object store mounting, create role in the skypilot-system
# namespace to create fusermount-server.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skypilot-system-service-account-role  # Can be changed if needed
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  - apiGroups: [ "apps" ]
    resources: [ "daemonsets" ]
    verbs: [ "*" ]
---
# Optional: If using object store mounting, create rolebinding in the skypilot-system
# namespace to create fusermount-server.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sky-sa-skypilot-system-role-binding
  namespace: skypilot-system  # Do not change this namespace
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change this to the namespace where the service account is created
roleRef:
  kind: Role
  name: skypilot-system-service-account-role  # Use the same name as the role above
  apiGroup: rbac.authorization.k8s.io
---
# Optional: Role for accessing ingress resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sky-sa-role-ingress-nginx  # Can be changed if needed
  namespace: ingress-nginx  # Do not change this namespace
  labels:
    parent: skypilot
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["list", "get", "watch"]
---
# Optional: RoleBinding for accessing ingress resources
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sky-sa-rolebinding-ingress-nginx  # Can be changed if needed
  namespace: ingress-nginx  # Do not change this namespace
  labels:
    parent: skypilot
subjects:
  - kind: ServiceAccount
    name: sky-sa  # Change to your service account name
    namespace: default  # Change this to the namespace where the service account is created
roleRef:
  kind: Role
  name: sky-sa-role-ingress-nginx  # Use the same name as the role above
  apiGroup: rbac.authorization.k8s.io
Create the Service Account with the following command:
$ kubectl apply -f create-sky-sa.yaml
After the Service Account is created, the cluster administrator can distribute a kubeconfig file containing the sky-sa Service Account to users who need access to the cluster.
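For illustration, a kubeconfig that authenticates as the sky-sa Service Account with a bearer token might look like the sketch below. The cluster name my-cluster, the context name, and every value in angle brackets are placeholders rather than values from this guide. One way to obtain a token is kubectl create token sky-sa, which issues a short-lived token by default, so administrators may prefer a different issuance method for kubeconfig files that are handed out.

```yaml
# kubeconfig for the sky-sa Service Account (all values in <...> are placeholders)
apiVersion: v1
kind: Config
clusters:
  - name: my-cluster  # Hypothetical cluster name
    cluster:
      server: "https://<api-server-host>:<port>"
      certificate-authority-data: "<base64-encoded CA certificate>"
users:
  - name: sky-sa
    user:
      token: "<service account token>"
contexts:
  - name: sky-sa@my-cluster  # Hypothetical context name
    context:
      cluster: my-cluster
      user: sky-sa
      namespace: default  # Namespace where the Service Account was created
current-context: sky-sa@my-cluster
```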
Users should also configure SkyPilot to use the sky-sa Service Account through ~/.sky/config.yaml:
# ~/.sky/config.yaml
kubernetes:
  remote_identity: sky-sa  # Or your service account name