Dask Distributed Cluster Deployment (Bare Metal and Kubernetes) in Practice
I. Deploying the Cluster
1. Start the scheduler
dask-scheduler
Startup log:
PS C:\Users\40754> dask-scheduler
2024-05-28 16:11:28,123 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,843 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-05-28 16:11:28,888 - distributed.scheduler - INFO - State start
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - Clear task state
2024-05-28 16:11:28,907 - distributed.scheduler - INFO - Scheduler at: tcp://192.168.10.62:8786
2024-05-28 16:11:28,907 - distributed.scheduler - INFO - dashboard at: :8787
2. Start a worker node
Here the scheduler and the worker run on the same machine. Open a new terminal and start the worker, pointing it at the scheduler address:
dask-worker 192.168.10.62:8786
Log output:
PS C:\Users\40754> dask-worker 192.168.10.62:8786
2024-05-28 16:18:38,703 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.10.62:59899'
2024-05-28 16:18:40,380 - distributed.diskutils - INFO - Found stale lock file and directory 'C:\\Users\\40754\\AppData\\Local\\Temp\\dask-worker-space\\worker-5ol4ypq3', purging
...
'C:\\Users\\40754\\AppData\\Local\\Temp\\dask-worker-space\\worker-st_zgbrw', purging
2024-05-28 16:18:41,806 - distributed.worker - INFO - Start worker at: tcp://192.168.10.62:59906
2024-05-28 16:18:41,807 - distributed.worker - INFO - Listening to: tcp://192.168.10.62:59906
2024-05-28 16:18:41,807 - distributed.worker - INFO - dashboard at: 192.168.10.62:59907
2024-05-28 16:18:41,807 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.10.62:8786
2024-05-28 16:18:41,807 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,807 - distributed.worker - INFO - Threads: 16
2024-05-28 16:18:41,808 - distributed.worker - INFO - Memory: 15.73 GiB
2024-05-28 16:18:41,808 - distributed.worker - INFO - Local Directory: C:\Users\40754\AppData\Local\Temp\dask-worker-space\worker-sc0gqvhr
2024-05-28 16:18:41,808 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,826 - distributed.worker - INFO - Registered to: tcp://192.168.10.62:8786
2024-05-28 16:18:41,826 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,829 - distributed.core - INFO - Starting established connection
The worker has registered with the scheduler.
The scheduler terminal now also logs the registration:
PS C:\Users\40754> dask-scheduler
2024-05-28 16:11:28,123 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,843 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-05-28 16:11:28,888 - distributed.scheduler - INFO - State start
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - Clear task state
2024-05-28 16:11:28,907 - distributed.scheduler - INFO - Scheduler at: tcp://192.168.10.62:8786
2024-05-28 16:11:28,907 - distributed.scheduler - INFO - dashboard at: :8787
2024-05-28 16:18:41,822 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.10.62:59906', status: init, memory: 0, processing: 0>
2024-05-28 16:18:41,826 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.10.62:59906
2024-05-28 16:18:41,826 - distributed.core - INFO - Starting established connection
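Before submitting any work, a quick connectivity check from Python confirms that the worker shows up in the scheduler's state (a minimal sketch; it assumes the local dask/distributed version matches the one running on the cluster):

from dask.distributed import Client

# Connect to the scheduler started above and list the registered workers
client = Client("192.168.10.62:8786")
print(list(client.scheduler_info()["workers"]))  # e.g. ['tcp://192.168.10.62:59906']
client.close()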
II. Functional Tests
1. Test Dask
from dask.distributed import Client

# Connect to the scheduler started in Part I
client = Client('192.168.10.62:8786')

def square(x):
    return x ** 2

def neg(x):
    return -x

# Fan tasks out across the cluster: square each number, negate, then sum
A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
total.result()  # -285
2. Test PyCaret on the cluster
import pandas as pd
df = pd.read_csv('train.csv')

# init setup
from pycaret.classification import *
clf1 = setup(data=df, target='Survived', n_jobs=-1)

# import the parallel back-end and compare models on Dask
from pycaret.parallel import FugueBackend
compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)
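As written, FugueBackend("dask") leaves it to Fugue to obtain a Dask client, which normally means a local cluster. To run compare_models on the remote cluster from Part I instead, one option is to connect a Client to the scheduler beforehand; this is a sketch under the assumption that Fugue's dask engine reuses an already-active client rather than starting a LocalCluster:

import pandas as pd
from dask.distributed import Client
from pycaret.classification import setup, compare_models
from pycaret.parallel import FugueBackend

# Connect to the remote scheduler first (assumption: Fugue's "dask" engine
# picks up the active client instead of creating a local cluster).
client = Client("192.168.10.62:8786")

df = pd.read_csv("train.csv")
setup(data=df, target="Survived", n_jobs=-1)
best3 = compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)

Note that the worker environments also need pycaret (with matching dependency versions) installed for the tasks to deserialize and run.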
III. Deploying on Kubernetes
Create the k8s cluster (master node: 11.0.1.10)
# Create the cluster nodes
./kk create cluster -f config-sample-single-master.yaml
# Tear down the cluster nodes
./kk delete cluster -f config-sample-single-master.yaml
# Upgrade the Kubernetes version
# export KKZONE=cn
# ./kk upgrade --with-kubernetes v1.20.15 -f config-sample-single-master.yaml
Install KubeSphere v3.4.1:
# Download KubeKey:
curl -sfL https://get-kk.kubesphere.io | VERSION=v3.0.13 sh -
chmod +x kk
# Generate the config file, then edit it as needed
./kk create config --with-kubesphere v3.4.1
# Deploy the cluster
export KKZONE=cn
./kk create cluster --with-kubernetes v1.23.10 --with-kubesphere v3.4.1
[root@k8s-master01 k8s-install]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master01 Ready control-plane,master 6h38m v1.23.10
k8s-node01 Ready worker 6h38m v1.23.10
k8s-node02 Ready worker 6h38m v1.23.10
[root@k8s-master01 k8s-install]#
Check the node information:
[root@k8s-master01 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master01 Ready master 23h v1.19.9
k8s-node01 Ready worker 23h v1.19.9
k8s-node02 Ready worker 23h v1.19.9
[root@k8s-master01 ~]#
Check the configuration:
cd /k8s/softwares
Kubernetes version of the test environment:
[wangchao@bxyh-master-node1 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
bxyh-master-node1 Ready,SchedulingDisabled control-plane,etcd,lb,master 71d v1.23.17
bxyh-master-node2 Ready,SchedulingDisabled control-plane,etcd,lb,master 71d v1.23.17
bxyh-worker-node1 Ready worker 71d v1.23.17
bxyh-worker-node2 Ready worker 71d v1.23.17
bxyh-worker-node3 Ready worker 71d v1.23.17
bxyh-worker-node4 Ready application,worker 69d v1.23.17
bxyh-worker-node5 Ready feature,worker 67d v1.23.17
Configuration file:
[root@k8s-master01 k8s-install]# cat config-sample.yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: k8s-master01, address: 11.0.1.10, internalAddress: 11.0.1.10, user: root, password: "123456"}
  - {name: k8s-node01, address: 11.0.1.20, internalAddress: 11.0.1.20, user: root, password: "123456"}
  - {name: k8s-node02, address: 11.0.1.21, internalAddress: 11.0.1.21, user: root, password: "123456"}
  roleGroups:
    etcd:
    - k8s-master01
    control-plane:
    - k8s-master01
    worker:
    - k8s-node01
    - k8s-node02
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers
    # internalLoadbalancer: haproxy
    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.23.10
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []
---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.4.1
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  local_registry: ""
  # dev_tag: ""
  etcd:
    monitoring: false
    endpointIps: localhost
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort
    # apiserver:
    #   resources: {}
    # controllerManager:
    #   resources: {}
    redis:
      enabled: false
      enableHA: false
      volumeSize: 2Gi
    openldap:
      enabled: false
      volumeSize: 2Gi
    minio:
      volumeSize: 20Gi
    monitoring:
      # type: external
      endpoint: http://prometheus-operated.kubesphere-monitoring-system.svc:9090
      GPUMonitoring:
        enabled: false
    gpu:
      kinds:
      - resourceName: "nvidia.com/gpu"
        resourceType: "GPU"
        default: true
    es:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: false
      logMaxAge: 7
      elkPrefix: logstash
      basicAuth:
        enabled: false
        username: ""
        password: ""
      externalElasticsearchHost: ""
      externalElasticsearchPort: ""
    opensearch:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: true
      logMaxAge: 7
      opensearchPrefix: whizard
      basicAuth:
        enabled: true
        username: "admin"
        password: "admin"
      externalOpensearchHost: ""
      externalOpensearchPort: ""
      dashboard:
        enabled: false
  alerting:
    enabled: false
    # thanosruler:
    #   replicas: 1
    #   resources: {}
  auditing:
    enabled: false
    # operator:
    #   resources: {}
    # webhook:
    #   resources: {}
  devops:
    enabled: false
    jenkinsCpuReq: 0.5
    jenkinsCpuLim: 1
    jenkinsMemoryReq: 4Gi
    jenkinsMemoryLim: 4Gi
    jenkinsVolumeSize: 16Gi
  events:
    enabled: false
    # operator:
    #   resources: {}
    # exporter:
    #   resources: {}
    ruler:
      enabled: true
      replicas: 2
      # resources: {}
  logging:
    enabled: false
    logsidecar:
      enabled: true
      replicas: 2
      # resources: {}
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    node_exporter:
      port: 9100
      # resources: {}
    # kube_rbac_proxy:
    #   resources: {}
    # kube_state_metrics:
    #   resources: {}
    # prometheus:
    #   replicas: 1
    #   volumeSize: 20Gi
    #   resources: {}
    # operator:
    #   resources: {}
    # alertmanager:
    #   replicas: 1
    #   resources: {}
    # notification_manager:
    #   resources: {}
    #   operator:
    #     resources: {}
    #   proxy:
    #     resources: {}
    gpu:
      nvidia_dcgm_exporter:
        enabled: false
        # resources: {}
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: false
    ippool:
      type: none
    topology:
      type: none
  openpitrix:
    store:
      enabled: false
  servicemesh:
    enabled: false
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
    kubeedge:
      enabled: false
      cloudCore:
        cloudHub:
          advertiseAddress:
            - ""
        service:
          cloudhubNodePort: "30000"
          cloudhubQuicNodePort: "30001"
          cloudhubHttpsNodePort: "30002"
          cloudstreamNodePort: "30003"
          tunnelNodePort: "30004"
        # resources: {}
        # hostNetWork: false
      iptables-manager:
        enabled: true
        mode: "external"
        # resources: {}
      # edgeService:
      #   resources: {}
  gatekeeper:
    enabled: false
    # controller_manager:
    #   resources: {}
    # audit:
    #   resources: {}
  terminal:
    timeout: 600
Install Helm
# Install Helm
wget https://get.helm.sh/helm-v3.10.0-linux-amd64.tar.gz
tar -zxvf helm-v3.10.0-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm
echo "source <(helm completion bash)" >> .bashrc
# Add chart repositories:
helm repo add stable http://mirror.azure.cn/kubernetes/charts
helm repo add aliyun https://kubernetes.oss-cn-hangzhou.aliyuncs.com/charts
helm repo update  # refresh the repo indexes
# List the repositories:
helm repo list
Install the Dask operator
# Install the Dask Kubernetes operator
helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator
[root@k8s-master01 helm]# kubectl get pods -n dask-operator
NAME READY STATUS RESTARTS AGE
dask-kubernetes-operator-1716913996-84f6f544d7-vcxjk 1/1 Running 0 3m51s
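The operator also installs the Dask Custom Resource Definitions (daskclusters, daskworkergroups, daskjobs, daskautoscalers). If a later step complains that the CRDs are missing, they can be checked with a standard kubectl query, for example:
kubectl get crds | grep dask.org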
Install JupyterHub
vim jph-config.yaml
# Add the JupyterHub Helm repo
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
# Create the jhub namespace
kubectl create ns jhub
helm upgrade --cleanup-on-fail \
--install jupyterhubtest jupyterhub/jupyterhub \
--namespace jhub \
--version=1.0.1 \
--values jph-config.yaml
# To uninstall:
# helm uninstall jupyterhubtest -n jhub
[root@k8s-master01 helm]# kubectl get pods -n jhub
NAME READY STATUS RESTARTS AGE
continuous-image-puller-hvpmw 1/1 Running 0 7m31s
continuous-image-puller-s7wqx 1/1 Running 0 7m31s
hub-6768db5ddd-6fxhr 1/1 Running 0 7m30s
proxy-8554987959-z8m5c 1/1 Running 0 7m30s
user-scheduler-69cf8bd99d-6mftc 1/1 Running 0 7m30s
user-scheduler-69cf8bd99d-88hzc 1/1 Running 0 7m31s
[root@k8s-master01 helm]#
Check the services:
[root@k8s-master01 helm]# kubectl get svc -n jhub
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.233.4.184 <none> 8081/TCP 7h4m
proxy-api ClusterIP 10.233.4.5 <none> 8001/TCP 7h4m
proxy-public LoadBalancer 10.233.21.205 <pending> 80:31243/TCP 7h4m
[root@k8s-master01 helm]#
Edit the proxy-public service and change type: LoadBalancer to type: NodePort:
kubectl edit svc proxy-public -n jhub
Check again:
[root@k8s-master01 helm]# kubectl get svc -n jhub
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.233.4.184 <none> 8081/TCP 7h15m
proxy-api ClusterIP 10.233.4.5 <none> 8001/TCP 7h15m
proxy-public NodePort 10.233.21.205 <none> 80:31243/TCP 7h15m
Exec into the hub pod:
[root@k8s-master01 helm]# kubectl exec -it hub-6768db5ddd-6fxhr /bin/bash -n jhub
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
jovyan@hub-6768db5ddd-6fxhr:/srv/jupyterhub$
Install dask-kubernetes inside the container
# Install dask-kubernetes
# https://github.com/dask/dask-kubernetes
# https://pypi.org/project/dask-kubernetes/2023.3.2/
# k8s==v1.23.10, dask-k8s==2023.1.0
pip install dask-kubernetes==2023.1.0
# pip uninstall dask-kubernetes
# ---
from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
# cluster.scale(10)
cluster.scale(3)
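For reference, once the DaskCluster resource is created successfully, a client can be attached to the KubeCluster object directly and the resources cleaned up from the same script (a short sketch based on the dask_kubernetes KubeCluster API):

from dask.distributed import Client

# Attach a client to the cluster object created above and run a trivial job
client = Client(cluster)
futures = client.map(lambda x: x ** 2, range(10))
print(client.gather(futures))

# Tear down the Kubernetes resources when finished
client.close()
cluster.close()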
Running dask_cluster.py (which contains the KubeCluster snippet above) fails:
jovyan@hub-6768db5ddd-6fxhr:~/work$ /usr/bin/python3.8 dask_cluster.py
╭─────────────────── Creating KubeCluster 'my-dask-cluster' ───────────────────╮
│ │
│ DaskCluster - │
│ Scheduler Pod - │
│ Scheduler Service - │
│ Default Worker Group - │
│ │
│ ⠋ │
╰──────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 318, in _create_cluster
await custom_objects_api.create_namespaced_custom_object(
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
raise e
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
response_data = await self.request(
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 238, in POST
return (await self.request("POST", url,
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 195, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Wed, 29 May 2024 16:45:32 GMT', 'Content-Length': '367')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskclusters.kubernetes.dask.org is forbidden: User \"system:serviceaccount:jhub:hub\" cannot create resource \"daskclusters\" in API group \"kubernetes.dask.org\" in the namespace \"jhub\"","reason":"Forbidden","details":{"group":"kubernetes.dask.org","kind":"daskclusters"},"code":403}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dask_cluster.py", line 2, in <module>
cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 244, in __init__
self.sync(self._start)
...
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 280, in _start
await self._create_cluster()
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 326, in _create_cluster
raise RuntimeError(
RuntimeError: Failed to create DaskCluster resource. Are the Dask Custom Resource Definitions installed? https://kubernetes.dask.org/en/latest/operator.html#installing-the-operator
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 1054, in _reap_clusters
cluster.close(timeout=10)
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 702, in close
return self.sync(self._close, timeout=timeout)
...
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 228, in DELETE
return (await self.request("DELETE", url,
File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 195, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Wed, 29 May 2024 16:45:32 GMT', 'Content-Length': '412')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskclusters.kubernetes.dask.org \"my-dask-cluster\" is forbidden: User \"system:serviceaccount:jhub:hub\" cannot delete resource \"daskclusters\" in API group \"kubernetes.dask.org\" in the namespace \"jhub\"","reason":"Forbidden","details":{"name":"my-dask-cluster","group":"kubernetes.dask.org","kind":"daskclusters"},"code":403}
Judging from the output above, the dask-operator itself is deployed successfully, so the Dask Custom Resource Definitions should be available in the cluster. That points to the 'daskclusters.kubernetes.dask.org "my-dask-cluster" is forbidden' error being a permissions problem rather than a missing CRD.
The message "User 'system:serviceaccount:jhub:hub' cannot delete resource 'daskclusters' in API group 'kubernetes.dask.org' in the namespace 'jhub'" shows that the hub service account in the jhub namespace has no rights on the daskclusters resource.
To resolve it: confirm that the service account has sufficient permissions on daskclusters (ask the cluster administrator to grant them if not), and check that the permissions are granted in the correct namespace.
AI-suggested fix:
To grant the system:serviceaccount:jhub:hub service account permission to manage daskclusters resources in the jhub namespace, create an RBAC Role and bind it to that service account. The following example YAML creates the Role and the RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: jhub
  name: daskclusters-operator
rules:
- apiGroups: ["kubernetes.dask.org"]
  resources: ["daskclusters"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods", "deployments", "services"]
  verbs: ["create", "get", "list", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: daskclusters-operator-binding
  namespace: jhub
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: daskclusters-operator
subjects:
- kind: ServiceAccount
  name: hub
  namespace: jhub
Save the configuration to a file, for example daskclusters-rbac.yaml, and apply it with kubectl apply:
kubectl apply -f daskclusters-rbac.yaml
This grants the service account get, list, watch, create, update, patch and delete permissions on daskclusters resources in that namespace.
Apply it:
[root@k8s-master01 helm]# vim daskclusters-rbac.yaml
[root@k8s-master01 helm]# kubectl apply -f daskclusters-rbac.yaml
role.rbac.authorization.k8s.io/daskclusters-operator created
rolebinding.rbac.authorization.k8s.io/daskclusters-operator-binding created
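Whether the binding takes effect can be verified with kubectl auth can-i, impersonating the hub service account:
kubectl auth can-i create daskclusters.kubernetes.dask.org -n jhub --as=system:serviceaccount:jhub:hub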
Then re-run the cluster creation script:
> cd /home/jovyan/work
jovyan@hub-6768db5ddd-6fxhr:~/work$ /usr/bin/python3.8 dask_cluster.py
╭─────────────────── Creating KubeCluster 'my-dask-cluster' ───────────────────╮
│ │
│ DaskCluster - │
│ Scheduler Pod - │
│ Scheduler Service - │
│ Default Worker Group - │
│ │
│ ⠴ Waiting for controller to action cluster │
╰──────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
File "dask_cluster.py", line 2, in <module>
cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 244, in __init__
self.sync(self._start)
File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
return sync(
File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 418, in sync
raise exc.with_traceback(tb)
File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 391, in f
result = yield future
File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 280, in _start
await self._create_cluster()
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 340, in _create_cluster
await wait_for_scheduler(
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/common/networking.py", line 199, in wait_for_scheduler
pod = await Pod.objects(api, namespace=namespace).get_by_name(
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/aiopykube/query.py", line 19, in get_by_name
return await self._sync(super().get_by_name, name=name)
File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/aiopykube/mixins.py", line 9, in _sync
return await asyncio.get_event_loop().run_in_executor(
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/jovyan/.local/lib/python3.8/site-packages/pykube/query.py", line 117, in get_by_name
raise ObjectDoesNotExist(f"{name} does not exist.")
pykube.exceptions.ObjectDoesNotExist: my-dask-cluster-scheduler does not exist.
Check the RoleBinding:
[root@k8s-master01 helm]# kubectl describe rolebinding daskclusters-operator-binding -n jhub
Name: daskclusters-operator-binding
Labels: <none>
Annotations: <none>
Role:
Kind: Role
Name: daskclusters-operator
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount hub jhub
Permissions from the official documentation, daskclusters-rbac.yaml:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: jhub
  name: dask-cluster-role
rules:
  # Application: watching & handling for the custom resource we declare.
  - apiGroups: [kubernetes.dask.org]
    resources: [daskclusters, daskworkergroups, daskworkergroups/scale, daskjobs, daskautoscalers]
    verbs: [get, list, watch, patch, create, delete]
  # Application: other resources it needs to watch and get information from.
  - apiGroups:
      - "" # indicates the core API group
    resources: [pods, pods/status]
    verbs:
      - "get"
      - "list"
      - "watch"
  - apiGroups:
      - "" # indicates the core API group
    resources: [services]
    verbs:
      - "get"
      - "list"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  namespace: jhub
  name: dask-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dask-cluster-role
subjects:
  - kind: ServiceAccount
    name: hub # adjust name based on the service account you created
    namespace: jhub
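This ClusterRole/ClusterRoleBinding pair is applied the same way (kubectl apply -f daskclusters-rbac.yaml). Note that ClusterRole and ClusterRoleBinding are cluster-scoped resources, so the metadata.namespace field shown above is ignored by Kubernetes.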
https://kubernetes.dask.org/en/latest/operator_kubecluster.html
Creating a Dask cluster with Custom Resources
cluster.yaml
# cluster.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple
spec:
  worker:
    replicas: 2
    spec:
      containers:
        - name: worker
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-worker
            - --name
            - $(DASK_WORKER_NAME)
            - --dashboard
            - --dashboard-address
            - "8788"
          ports:
            - name: http-dashboard
              containerPort: 8788
              protocol: TCP
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-scheduler
          ports:
            - name: tcp-comm
              containerPort: 8786
              protocol: TCP
            - name: http-dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: simple
        dask.org/component: scheduler
      ports:
        - name: tcp-comm
          protocol: TCP
          port: 8786
          targetPort: "tcp-comm"
        - name: http-dashboard
          protocol: TCP
          port: 8787
          targetPort: "http-dashboard"
Create it:
kubectl apply -f cluster.yaml
Check the deployed resources:
[root@k8s-master01 helm]# kubectl get pods
NAME READY STATUS RESTARTS AGE
simple-default-worker-441dfc8e17-7df786b74c-fpqkq 1/1 Running 0 55m
simple-default-worker-8359eff942-766d557f7f-c7vqb 1/1 Running 0 55m
simple-scheduler-b4f46d497-8bmqw 1/1 Running 0 55m
[root@k8s-master01 helm]#
[root@k8s-master01 helm]# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.233.0.1 <none> 443/TCP 5d11h
simple-scheduler NodePort 10.233.59.125 <none> 8786:32389/TCP,8787:31256/TCP 56m
In-cluster address of the scheduler service:
simple-scheduler.jhub.cluster.local
10.233.57.184:8786
from dask.distributed import Client
client = Client("10.233.59.125:8786")
Exec into a worker pod to test:
(base) root@simple-default-worker-441dfc8e17-7df786b74c-fpqkq:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 02:24 ? 00:00:00 tini -g -- /usr/bin/prepare.sh dask-worker --name simple-default-worker-441dfc8e17 --dashboard --dashboard-address 8788
root 7 1 2 02:24 ? 00:01:00 /opt/conda/bin/python3.10 /opt/conda/bin/dask-worker --name simple-default-worker-441dfc8e17 --dashboard --dashboard-addres
root 11 7 0 02:24 ? 00:00:00 /opt/conda/bin/python3.10 -c from multiprocessing.resource_tracker import main;main(11)
root 14 7 3 02:24 ? 00:01:46 /opt/conda/bin/python3.10 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=12, pipe_handle=18) --mult
root 23 0 0 03:13 pts/0 00:00:00 /bin/bash
root 41 23 0 03:13 pts/0 00:00:00 ps -ef
(base) root@simple-default-worker-441dfc8e17-7df786b74c-fpqkq:/# python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask.distributed import Client
>>> client = Client("10.233.59.125:8786")
>>> print(client)
<Client: 'tcp://10.233.123.40:8786' processes=2 threads=4, memory=5.48 GiB>
>>>
The JupyterHub chart installs NetworkPolicies that block the JupyterHub pods from reaching the Dask cluster services, so delete them:
kubectl delete networkpolicy --all -n jhub
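For reference, the policies the chart created can be listed first with:
kubectl get networkpolicy -n jhub
Alternatively, the JupyterHub chart exposes networkPolicy.enabled switches (under hub, proxy and singleuser in its values), which may be a cleaner way to relax this than deleting the policies by hand.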
Python version in the dask-worker image:
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# pip list | grep dask
dask 2024.5.1
dask-expr 1.1.1
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python version
python: can't open file '//version': [Errno 2] No such file or directory
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python --version
Python 3.10.12
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask.distributed import Client
>>>
The worker image ships Python 3.10.12; a Python 3.11 client is not supported for now (the client should match the worker's Python version).
Installing JupyterHub with Helm (chart 3.2.0)
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
# Create the jhub namespace
kubectl create ns jhub
helm upgrade --cleanup-on-fail \
--install jupyterhubtest jupyterhub/jupyterhub \
--namespace jhub \
--version=3.2.0 \
--values config.yaml
config.yaml:
hub:
  config:
    Authenticator:
      admin_users:            # JupyterHub admin users
        - admin
      allowed_users:          # if not set, any user can register
        - kaiyi
    DummyAuthenticator:
      password: hub123        # shared password for everyone (essentially no security)
    JupyterHub:
      admin_access: true      # whether admin accounts are allowed
      authenticator_class: dummy  # authenticator type for all accounts (dummy)
Admin permission configuration:
https://z2jh.jupyter.org/en/stable/administrator/authentication.html
Apply the configuration by upgrading the Helm release:
helm upgrade --cleanup-on-fail \
--install jupyterhubtest jupyterhub/jupyterhub \
--namespace jhub \
--version=3.2.0 \
--values config.yaml
Check the resources:
[root@k8s-master01 jupyterhub]# kubectl get pods -n jhub
NAME READY STATUS RESTARTS AGE
continuous-image-puller-7s4h6 1/1 Running 0 5m
continuous-image-puller-b89hn 1/1 Running 0 5m
hub-db8677cdc-mplwq 1/1 Running 0 5m
proxy-5bb68bd87-dw665 1/1 Running 0 5m
user-scheduler-786796cbd-4rlsz 1/1 Running 0 5m
user-scheduler-786796cbd-zbr6d 1/1 Running 0 5m
[root@k8s-master01 jupyterhub]# kubectl get svc -n jhub
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.233.33.86 <none> 8081/TCP 5m18s
proxy-api ClusterIP 10.233.7.91 <none> 8001/TCP 5m18s
proxy-public LoadBalancer 10.233.36.149 <pending> 80:30241/TCP 5m18s
[root@k8s-master01 jupyterhub]# kubectl edit svc proxy-public -n jhub
service/proxy-public edited
[root@k8s-master01 jupyterhub]# kubectl get svc -n jhub
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.233.33.86 <none> 8081/TCP 8m41s
proxy-api ClusterIP 10.233.7.91 <none> 8001/TCP 8m41s
proxy-public NodePort 10.233.36.149 <none> 80:30241/TCP 8m41s
[root@k8s-master01 jupyterhub]#
Log in to JupyterHub:
http://11.0.1.10:30241/hub/login?next=%2Fhub%2F
Default JupyterHub user: bi / bi
Administrator: admin / hub123
Logging in as the admin user also spawns its own single-user pod:
[root@k8s-master01 jupyterhub]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
dask-operator dask-kubernetes-operator-1718350109-9c976d4f6-rkxqs 0/1 CrashLoopBackOff 103 (4m37s ago) 5d19h
jhub continuous-image-puller-bkbhz 1/1 Running 0 3d11h
jhub continuous-image-puller-qvtzf 1/1 Running 0 3d11h
jhub hub-7d47c446bb-4xnzw 1/1 Running 2 (15m ago) 28m
jhub jupyter-admin 0/1 Pending 0 106s
jhub jupyter-bi 1/1 Running 0 10m
jhub proxy-6d56fcc89c-h48v8 1/1 Running 0 28m
jhub simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb 1/1 Running 0 9h
jhub simple-default-worker-4aa3828538-54c4488c5-6fbmv 1/1 Running 0 10h
jhub simple-scheduler-5df9969c5b-sqpb5 1/1 Running 0 9h
jhub user-scheduler-786796cbd-cjpbv 1/1 Running 91 (4m23s ago) 3d11h
jhub user-scheduler-786796cbd-k2cdl 1/1 Running 89 (5m44s ago) 10h