Dask Distributed Cluster Deployment (Bare Metal and Kubernetes) in Practice

1. Deploying the Cluster

1.1 Start the scheduler

 dask-scheduler

Startup log:

PS C:\Users\40754> dask-scheduler
2024-05-28 16:11:28,123 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,843 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-05-28 16:11:28,888 - distributed.scheduler - INFO - State start
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - Clear task state
2024-05-28 16:11:28,907 - distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.10.62:8786
2024-05-28 16:11:28,907 - distributed.scheduler - INFO -   dashboard at:                     :8787

1.2 Start a worker node

Here the scheduler and the worker run on the same machine: open a new terminal and start the worker, pointing it at the scheduler address:

dask-worker 192.168.10.62:8786

Log output:

PS C:\Users\40754> dask-worker 192.168.10.62:8786
2024-05-28 16:18:38,703 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.10.62:59899'
2024-05-28 16:18:40,380 - distributed.diskutils - INFO - Found stale lock file and directory 'C:\\Users\\40754\\AppData\\Local\\Temp\\dask-worker-space\\worker-5ol4ypq3', purging
...

'C:\\Users\\40754\\AppData\\Local\\Temp\\dask-worker-space\\worker-st_zgbrw', purging
2024-05-28 16:18:41,806 - distributed.worker - INFO -       Start worker at:  tcp://192.168.10.62:59906
2024-05-28 16:18:41,807 - distributed.worker - INFO -          Listening to:  tcp://192.168.10.62:59906
2024-05-28 16:18:41,807 - distributed.worker - INFO -          dashboard at:        192.168.10.62:59907
2024-05-28 16:18:41,807 - distributed.worker - INFO - Waiting to connect to:   tcp://192.168.10.62:8786
2024-05-28 16:18:41,807 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,807 - distributed.worker - INFO -               Threads:                         16
2024-05-28 16:18:41,808 - distributed.worker - INFO -                Memory:                  15.73 GiB
2024-05-28 16:18:41,808 - distributed.worker - INFO -       Local Directory: C:\Users\40754\AppData\Local\Temp\dask-worker-space\worker-sc0gqvhr
2024-05-28 16:18:41,808 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,826 - distributed.worker - INFO -         Registered to:   tcp://192.168.10.62:8786
2024-05-28 16:18:41,826 - distributed.worker - INFO - -------------------------------------------------
2024-05-28 16:18:41,829 - distributed.core - INFO - Starting established connection

The worker has registered with the scheduler.

The scheduler's terminal now shows the registration:

PS C:\Users\40754> dask-scheduler
2024-05-28 16:11:28,123 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,843 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-05-28 16:11:28,888 - distributed.scheduler - INFO - State start
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - -----------------------------------------------
2024-05-28 16:11:28,906 - distributed.scheduler - INFO - Clear task state
2024-05-28 16:11:28,907 - distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.10.62:8786
2024-05-28 16:11:28,907 - distributed.scheduler - INFO -   dashboard at:                     :8787
2024-05-28 16:18:41,822 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://192.168.10.62:59906', status: init, memory: 0, processing: 0>
2024-05-28 16:18:41,826 - distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.10.62:59906
2024-05-28 16:18:41,826 - distributed.core - INFO - Starting established connection
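
As a quick sanity check before the functional tests, the scheduler can be queried from any machine that can reach it. A minimal sketch:

from dask.distributed import Client

client = Client("192.168.10.62:8786")      # scheduler address from the log above
info = client.scheduler_info()             # the scheduler's current view of the cluster
for addr, worker in info["workers"].items():
    print(addr, worker["nthreads"], "threads")
client.close()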

2. Functional Tests

2.1 Test Dask

from dask.distributed import Client
client = Client('192.168.10.62:8786')    # connect to the standalone scheduler

def square(x):
    return x ** 2

def neg(x):
    return -x

A = client.map(square, range(10))        # squares computed on the workers
B = client.map(neg, A)                   # negate each square
total = client.submit(sum, B)            # reduce on a worker

total.result()                           # -285
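
The futures in A and B stay on the workers until you ask for them; client.gather pulls the intermediate values back to the client if you want to inspect them:

client.gather(B)    # [0, -1, -4, -9, -16, -25, -36, -49, -64, -81]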

2.2 Test PyCaret cluster computation

import pandas as pd
df = pd.read_csv('train.csv')
# init setup
from pycaret.classification import *
clf1 = setup(data=df, target='Survived', n_jobs=-1)
# import parallel back-end
from pycaret.parallel import FugueBackend
compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)
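
Note that FugueBackend("dask") by itself may fall back to a local Dask runtime. A sketch, under the assumption (worth verifying for your PyCaret/Fugue versions) that Fugue's Dask engine reuses an already-connected distributed Client, so the comparison runs on the standalone cluster from section 1:

import pandas as pd
from dask.distributed import Client
from pycaret.classification import setup, compare_models
from pycaret.parallel import FugueBackend

client = Client("192.168.10.62:8786")      # connect to the cluster before creating the backend
df = pd.read_csv("train.csv")
setup(data=df, target="Survived", n_jobs=-1)
best = compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)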

3. Deploying on Kubernetes

Create the Kubernetes cluster (master node: 11.0.1.10)

# Create the cluster nodes
 ./kk create cluster -f config-sample-single-master.yaml

# Delete the cluster nodes
./kk delete cluster -f config-sample-single-master.yaml

# Upgrade the Kubernetes version
# export KKZONE=cn
# ./kk upgrade --with-kubernetes v1.20.15 -f config-sample-single-master.yaml

Install KubeSphere v3.4.1:

# Download and install kk:
curl -sfL https://get-kk.kubesphere.io | VERSION=v3.0.13 sh -

chmod +x kk

# Generate the config file, then edit it as needed
./kk create config --with-kubesphere v3.4.1

# Deploy the cluster
export KKZONE=cn
./kk create cluster --with-kubernetes v1.23.10 --with-kubesphere v3.4.1

[root@k8s-master01 k8s-install]# kubectl get nodes
NAME           STATUS   ROLES                  AGE     VERSION
k8s-master01   Ready    control-plane,master   6h38m   v1.23.10
k8s-node01     Ready    worker                 6h38m   v1.23.10
k8s-node02     Ready    worker                 6h38m   v1.23.10
[root@k8s-master01 k8s-install]# 

Check the node status:

[root@k8s-master01 ~]# kubectl get nodes
NAME           STATUS   ROLES    AGE   VERSION
k8s-master01   Ready    master   23h   v1.19.9
k8s-node01     Ready    worker   23h   v1.19.9
k8s-node02     Ready    worker   23h   v1.19.9
[root@k8s-master01 ~]# 

Check the configuration:

cd /k8s/softwares

Kubernetes version in the test environment:

[wangchao@bxyh-master-node1 ~]$ kubectl get nodes
NAME                STATUS                     ROLES                          AGE   VERSION
bxyh-master-node1   Ready,SchedulingDisabled   control-plane,etcd,lb,master   71d   v1.23.17
bxyh-master-node2   Ready,SchedulingDisabled   control-plane,etcd,lb,master   71d   v1.23.17
bxyh-worker-node1   Ready                      worker                         71d   v1.23.17
bxyh-worker-node2   Ready                      worker                         71d   v1.23.17
bxyh-worker-node3   Ready                      worker                         71d   v1.23.17
bxyh-worker-node4   Ready                      application,worker             69d   v1.23.17
bxyh-worker-node5   Ready                      feature,worker                 67d   v1.23.17

The cluster configuration file:

[root@k8s-master01 k8s-install]# cat config-sample.yaml 

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: k8s-master01, address: 11.0.1.10, internalAddress: 11.0.1.10, user: root, password: "123456"}
  - {name: k8s-node01, address: 11.0.1.20, internalAddress: 11.0.1.20, user: root, password: "123456"}
  - {name: k8s-node02, address: 11.0.1.21, internalAddress: 11.0.1.21, user: root, password: "123456"}
  roleGroups:
    etcd:
    - k8s-master01
    control-plane: 
    - k8s-master01
    worker:
    - k8s-node01
    - k8s-node02
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers 
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.23.10
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.4.1
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  local_registry: ""
  # dev_tag: ""
  etcd:
    monitoring: false
    endpointIps: localhost
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort
    # apiserver:
    #  resources: {}
    # controllerManager:
    #  resources: {}
    redis:
      enabled: false
      enableHA: false
      volumeSize: 2Gi
    openldap:
      enabled: false
      volumeSize: 2Gi
    minio:
      volumeSize: 20Gi
    monitoring:
      # type: external
      endpoint: http://prometheus-operated.kubesphere-monitoring-system.svc:9090
      GPUMonitoring:
        enabled: false
    gpu:
      kinds:
      - resourceName: "nvidia.com/gpu"
        resourceType: "GPU"
        default: true
    es:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: false
      logMaxAge: 7
      elkPrefix: logstash
      basicAuth:
        enabled: false
        username: ""
        password: ""
      externalElasticsearchHost: ""
      externalElasticsearchPort: ""
    opensearch:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      enabled: true
      logMaxAge: 7
      opensearchPrefix: whizard
      basicAuth:
        enabled: true
        username: "admin"
        password: "admin"
      externalOpensearchHost: ""
      externalOpensearchPort: ""
      dashboard:
        enabled: false
  alerting:
    enabled: false
    # thanosruler:
    #   replicas: 1
    #   resources: {}
  auditing:
    enabled: false
    # operator:
    #   resources: {}
    # webhook:
    #   resources: {}
  devops:
    enabled: false
    jenkinsCpuReq: 0.5
    jenkinsCpuLim: 1
    jenkinsMemoryReq: 4Gi
    jenkinsMemoryLim: 4Gi
    jenkinsVolumeSize: 16Gi
  events:
    enabled: false
    # operator:
    #   resources: {}
    # exporter:
    #   resources: {}
    ruler:
      enabled: true
      replicas: 2
    #   resources: {}
  logging:
    enabled: false
    logsidecar:
      enabled: true
      replicas: 2
      # resources: {}
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    node_exporter:
      port: 9100
      # resources: {}
    # kube_rbac_proxy:
    #   resources: {}
    # kube_state_metrics:
    #   resources: {}
    # prometheus:
    #   replicas: 1
    #   volumeSize: 20Gi
    #   resources: {}
    #   operator:
    #     resources: {}
    # alertmanager:
    #   replicas: 1
    #   resources: {}
    # notification_manager:
    #   resources: {}
    #   operator:
    #     resources: {}
    #   proxy:
    #     resources: {}
    gpu:
      nvidia_dcgm_exporter:
        enabled: false
        # resources: {}
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: false
    ippool:
      type: none
    topology:
      type: none
  openpitrix:
    store:
      enabled: false
  servicemesh:
    enabled: false
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
    kubeedge:
      enabled: false
      cloudCore:
        cloudHub:
          advertiseAddress:
            - ""
        service:
          cloudhubNodePort: "30000"
          cloudhubQuicNodePort: "30001"
          cloudhubHttpsNodePort: "30002"
          cloudstreamNodePort: "30003"
          tunnelNodePort: "30004"
        # resources: {}
        # hostNetWork: false
      iptables-manager:
        enabled: true
        mode: "external"
        # resources: {}
      # edgeService:
      #   resources: {}
  gatekeeper:
    enabled: false
    # controller_manager:
    #   resources: {}
    # audit:
    #   resources: {}
  terminal:
    timeout: 600

Install Helm

# Install Helm
wget https://get.helm.sh/helm-v3.10.0-linux-amd64.tar.gz
tar -zxvf helm-v3.10.0-linux-amd64.tar.gz

mv linux-amd64/helm /usr/local/bin/helm
echo "source <(helm completion bash)" >> .bashrc

# Add chart repositories:
helm repo add stable http://mirror.azure.cn/kubernetes/charts
helm repo add aliyun https://kubernetes.oss-cn-hangzhou.aliyuncs.com/charts
helm repo update  # refresh the repo index

# List the repositories:
helm repo list

Install the Dask operator

# Install the Dask Kubernetes operator
helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator

[root@k8s-master01 helm]# kubectl get pods -n dask-operator
NAME                                                   READY   STATUS    RESTARTS   AGE
dask-kubernetes-operator-1716913996-84f6f544d7-vcxjk   1/1     Running   0          3m51s
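
The operator chart also installs the Dask Custom Resource Definitions (daskclusters, daskworkergroups, and so on). If you want to double-check them from Python rather than with kubectl get crd, a sketch using the official kubernetes client (assumed to be installed, with a working kubeconfig):

from kubernetes import client, config

config.load_kube_config()   # use config.load_incluster_config() when running inside a pod
crds = client.ApiextensionsV1Api().list_custom_resource_definition()
for crd in crds.items:
    if "dask.org" in crd.metadata.name:
        print(crd.metadata.name)   # e.g. daskclusters.kubernetes.dask.org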

Install JupyterHub

vim jph-config.yaml

# Add the JupyterHub chart repository
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update

# Create the jhub namespace
kubectl create ns jhub

helm upgrade --cleanup-on-fail \
  --install jupyterhubtest jupyterhub/jupyterhub \
  --namespace jhub \
  --version=1.0.1 \
  --values jph-config.yaml

# To uninstall:
# helm uninstall jupyterhubtest -n jhub

[root@k8s-master01 helm]# kubectl get pods -n jhub
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-hvpmw     1/1     Running   0          7m31s
continuous-image-puller-s7wqx     1/1     Running   0          7m31s
hub-6768db5ddd-6fxhr              1/1     Running   0          7m30s
proxy-8554987959-z8m5c            1/1     Running   0          7m30s
user-scheduler-69cf8bd99d-6mftc   1/1     Running   0          7m30s
user-scheduler-69cf8bd99d-88hzc   1/1     Running   0          7m31s
[root@k8s-master01 helm]# 

Check the services:

[root@k8s-master01 helm]# kubectl get svc -n jhub
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
hub            ClusterIP      10.233.4.184    <none>        8081/TCP       7h4m
proxy-api      ClusterIP      10.233.4.5      <none>        8001/TCP       7h4m
proxy-public   LoadBalancer   10.233.21.205   <pending>     80:31243/TCP   7h4m
[root@k8s-master01 helm]# 

Change the proxy-public service type from LoadBalancer to NodePort:

kubectl edit svc proxy-public -n jhub

Check again:

[root@k8s-master01 helm]# kubectl get svc -n jhub
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
hub            ClusterIP   10.233.4.184    <none>        8081/TCP       7h15m
proxy-api      ClusterIP   10.233.4.5      <none>        8001/TCP       7h15m
proxy-public   NodePort    10.233.21.205   <none>        80:31243/TCP   7h15m

Exec into the hub pod:

[root@k8s-master01 helm]# kubectl exec -it hub-6768db5ddd-6fxhr  /bin/bash -n jhub
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
jovyan@hub-6768db5ddd-6fxhr:/srv/jupyterhub$ 

Install dask-kubernetes inside the container

# Install dask-kubernetes
# https://github.com/dask/dask-kubernetes
# https://pypi.org/project/dask-kubernetes/2023.3.2/
# k8s==v1.23.10, dask-k8s==2023.1.0
pip install dask-kubernetes==2023.1.0
# pip uninstall dask-kubernetes

# ---
from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
# cluster.scale(10)
cluster.scale(3)

Running the above fails with:

jovyan@hub-6768db5ddd-6fxhr:~/work$ /usr/bin/python3.8 dask_cluster.py 
╭─────────────────── Creating KubeCluster 'my-dask-cluster' ───────────────────╮
│                                                                              │
│   DaskCluster                                                            -   │
│   Scheduler Pod                                                          -   │
│   Scheduler Service                                                      -   │
│   Default Worker Group                                                   -   │
│                                                                              │
│ ⠋                                                                            │
╰──────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 318, in _create_cluster
    await custom_objects_api.create_namespaced_custom_object(
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 238, in POST
    return (await self.request("POST", url,
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 195, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Wed, 29 May 2024 16:45:32 GMT', 'Content-Length': '367')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskclusters.kubernetes.dask.org is forbidden: User \"system:serviceaccount:jhub:hub\" cannot create resource \"daskclusters\" in API group \"kubernetes.dask.org\" in the namespace \"jhub\"","reason":"Forbidden","details":{"group":"kubernetes.dask.org","kind":"daskclusters"},"code":403}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dask_cluster.py", line 2, in <module>
    cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 244, in __init__
    self.sync(self._start)
  ...
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 280, in _start
    await self._create_cluster()
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 326, in _create_cluster
    raise RuntimeError(
RuntimeError: Failed to create DaskCluster resource. Are the Dask Custom Resource Definitions installed? https://kubernetes.dask.org/en/latest/operator.html#installing-the-operator
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 1054, in _reap_clusters
    cluster.close(timeout=10)
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 702, in close
    return self.sync(self._close, timeout=timeout)
  ...
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 228, in DELETE
    return (await self.request("DELETE", url,
  File "/home/jovyan/.local/lib/python3.8/site-packages/kubernetes_asyncio/client/rest.py", line 195, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Wed, 29 May 2024 16:45:32 GMT', 'Content-Length': '412')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskclusters.kubernetes.dask.org \"my-dask-cluster\" is forbidden: User \"system:serviceaccount:jhub:hub\" cannot delete resource \"daskclusters\" in API group \"kubernetes.dask.org\" in the namespace \"jhub\"","reason":"Forbidden","details":{"name":"my-dask-cluster","group":"kubernetes.dask.org","kind":"daskclusters"},"code":403}

Since the dask-kubernetes operator is already deployed and running, the Dask Custom Resource Definitions should be available in the cluster. The Forbidden error on daskclusters.kubernetes.dask.org "my-dask-cluster" is therefore most likely a permissions problem rather than a missing CRD.

The message "User 'system:serviceaccount:jhub:hub' cannot delete resource 'daskclusters' in API group 'kubernetes.dask.org' in the namespace 'jhub'" shows that the hub service account in the jhub namespace has no rights on the daskclusters resource.

Possible ways to resolve it:

Confirm that the service account has sufficient permissions on the daskclusters resource; ask the cluster administrator to grant them if needed.
Check that the permissions are bound to the correct service account and namespace.

Suggested fix (AI-assisted): grant the system:serviceaccount:jhub:hub service account permissions on the daskclusters resource in the jhub namespace by creating an RBAC Role and binding it to that service account. An example YAML:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: jhub
  name: daskclusters-operator
rules:
- apiGroups: ["kubernetes.dask.org"]
  resources: ["daskclusters"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods", "deployments", "services"]
  verbs: ["create", "get", "list", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: daskclusters-operator-binding
  namespace: jhub
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: daskclusters-operator
subjects:
- kind: ServiceAccount
  name: hub
  namespace: jhub

Save this configuration to a file, for example daskclusters-rbac.yaml, and apply it with kubectl apply:

kubectl apply -f daskclusters-rbac.yaml

This grants the service account get, list, watch, create, update, patch and delete permissions on the daskclusters resource in the given namespace.

Apply it:

[root@k8s-master01 helm]# vim daskclusters-rbac.yaml
[root@k8s-master01 helm]# kubectl apply -f daskclusters-rbac.yaml
role.rbac.authorization.k8s.io/daskclusters-operator created
rolebinding.rbac.authorization.k8s.io/daskclusters-operator-binding created

Then run the cluster-creation script again:

> cd /home/jovyan/work
jovyan@hub-6768db5ddd-6fxhr:~/work$ /usr/bin/python3.8 dask_cluster.py
╭─────────────────── Creating KubeCluster 'my-dask-cluster' ───────────────────╮
│                                                                              │
│   DaskCluster                                                            -   │
│   Scheduler Pod                                                          -   │
│   Scheduler Service                                                      -   │
│   Default Worker Group                                                   -   │
│                                                                              │
│ ⠴ Waiting for controller to action cluster                                   │
╰──────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
  File "dask_cluster.py", line 2, in <module>
    cluster = KubeCluster(name="my-dask-cluster", image='ghcr.io/dask/dask:latest')
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 244, in __init__
    self.sync(self._start)
  File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
    return sync(
  File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 418, in sync
    raise exc.with_traceback(tb)
  File "/home/jovyan/.local/lib/python3.8/site-packages/distributed/utils.py", line 391, in f
    result = yield future
  File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 280, in _start
    await self._create_cluster()
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 340, in _create_cluster
    await wait_for_scheduler(
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/common/networking.py", line 199, in wait_for_scheduler
    pod = await Pod.objects(api, namespace=namespace).get_by_name(
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/aiopykube/query.py", line 19, in get_by_name
    return await self._sync(super().get_by_name, name=name)
  File "/home/jovyan/.local/lib/python3.8/site-packages/dask_kubernetes/aiopykube/mixins.py", line 9, in _sync
    return await asyncio.get_event_loop().run_in_executor(
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jovyan/.local/lib/python3.8/site-packages/pykube/query.py", line 117, in get_by_name
    raise ObjectDoesNotExist(f"{name} does not exist.")
pykube.exceptions.ObjectDoesNotExist: my-dask-cluster-scheduler does not exist.

Check the RoleBinding:

[root@k8s-master01 helm]# kubectl describe rolebinding daskclusters-operator-binding -n jhub
Name:         daskclusters-operator-binding
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  Role
  Name:  daskclusters-operator
Subjects:
  Kind            Name  Namespace
  ----            ----  ---------
  ServiceAccount  hub   jhub

Permissions from the official docs (daskclusters-rbac.yaml):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: jhub
  name: dask-cluster-role
rules:
  # Application: watching & handling for the custom resource we declare.
  - apiGroups: [kubernetes.dask.org]
    resources: [daskclusters, daskworkergroups, daskworkergroups/scale, daskjobs, daskautoscalers]
    verbs: [get, list, watch, patch, create, delete]

  # Application: other resources it needs to watch and get information from.
  - apiGroups:
    - ""  # indicates the core API group
    resources: [pods, pods/status]
    verbs:
    - "get"
    - "list"
    - "watch"

  - apiGroups:
    - ""  # indicates the core API group
    resources: [services]
    verbs:
    - "get"
    - "list"
    - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  namespace: jhub
  name: dask-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dask-cluster-role
subjects:
  - kind: ServiceAccount
    name: hub  # adjust name based on the service account you created
    namespace: jhub

https://kubernetes.dask.org/en/latest/operator_kubecluster.html

Creating a Dask cluster with Custom Resources

Dask on k8s Custom Resources

cluster.yaml

# cluster.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple
spec:
  worker:
    replicas: 2
    spec:
      containers:
      - name: worker
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-worker
          - --name
          - $(DASK_WORKER_NAME)
          - --dashboard
          - --dashboard-address
          - "8788"
        ports:
          - name: http-dashboard
            containerPort: 8788
            protocol: TCP
  scheduler:
    spec:
      containers:
      - name: scheduler
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-scheduler
        ports:
          - name: tcp-comm
            containerPort: 8786
            protocol: TCP
          - name: http-dashboard
            containerPort: 8787
            protocol: TCP
        readinessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 15
          periodSeconds: 20
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: simple
        dask.org/component: scheduler
      ports:
      - name: tcp-comm
        protocol: TCP
        port: 8786
        targetPort: "tcp-comm"
      - name: http-dashboard
        protocol: TCP
        port: 8787
        targetPort: "http-dashboard"

Create it:

kubectl apply -f cluster.yaml

Check the deployed resources:

[root@k8s-master01 helm]# kubectl get pods 
NAME                                                READY   STATUS      RESTARTS   AGE
simple-default-worker-441dfc8e17-7df786b74c-fpqkq   1/1     Running     0          55m
simple-default-worker-8359eff942-766d557f7f-c7vqb   1/1     Running     0          55m
simple-scheduler-b4f46d497-8bmqw                    1/1     Running     0          55m
[root@k8s-master01 helm]# 

[root@k8s-master01 helm]# kubectl get svc
NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
kubernetes         ClusterIP   10.233.0.1      <none>        443/TCP                         5d11h
simple-scheduler   NodePort    10.233.59.125   <none>        8786:32389/TCP,8787:31256/TCP   56m

In-cluster address of the scheduler:

simple-scheduler.jhub.cluster.local
10.233.57.184:8786

from dask.distributed import Client
client = Client("10.233.59.125:8786")

Exec into a worker to test the connection:

(base) root@simple-default-worker-441dfc8e17-7df786b74c-fpqkq:/# ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 02:24 ?        00:00:00 tini -g -- /usr/bin/prepare.sh dask-worker --name simple-default-worker-441dfc8e17 --dashboard --dashboard-address 8788
root         7     1  2 02:24 ?        00:01:00 /opt/conda/bin/python3.10 /opt/conda/bin/dask-worker --name simple-default-worker-441dfc8e17 --dashboard --dashboard-addres
root        11     7  0 02:24 ?        00:00:00 /opt/conda/bin/python3.10 -c from multiprocessing.resource_tracker import main;main(11)
root        14     7  3 02:24 ?        00:01:46 /opt/conda/bin/python3.10 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=12, pipe_handle=18) --mult
root        23     0  0 03:13 pts/0    00:00:00 /bin/bash
root        41    23  0 03:13 pts/0    00:00:00 ps -ef
(base) root@simple-default-worker-441dfc8e17-7df786b74c-fpqkq:/# python 
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask.distributed import Client
>>> client = Client("10.233.59.125:8786")
>>> print(client)
<Client: 'tcp://10.233.123.40:8786' processes=2 threads=4, memory=5.48 GiB>
>>> 

The JupyterHub chart installs NetworkPolicies that prevent the JupyterHub pods from reaching the Dask cluster services, so delete them:

kubectl delete networkpolicy --all -n jhub
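
A quick, standard-library-only check from inside a JupyterHub pod that the scheduler is reachable again after the policies are removed (a sketch; substitute your scheduler service IP or in-cluster DNS name):

import socket

# substitute your scheduler service address (ClusterIP or DNS name)
with socket.create_connection(("10.233.59.125", 8786), timeout=5):
    print("scheduler port reachable from this pod")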

Python and dask versions inside the dask-worker image:

(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# pip list | grep dask
dask                    2024.5.1
dask-expr               1.1.1
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python version
python: can't open file '//version': [Errno 2] No such file or directory
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python --version
Python 3.10.12
(base) root@simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb:/# python
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask.distributed import Client
>>> 

The worker image ships Python 3.10.12; Python 3.11 is not supported at the moment.
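
Since the client environment (for example the JupyterHub single-user image) should stay close to the worker image's Python and package versions, the distributed Client can compare them for you. A minimal sketch (depending on the distributed version, a mismatch either raises an error or only prints a warning):

from dask.distributed import Client

client = Client("10.233.59.125:8786")
versions = client.get_versions(check=True)   # compares client, scheduler and worker versions
print(versions["client"]["packages"]["python"])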

Installing JupyterHub with Helm (chart version 3.2.0)

Reference: official docs | Installing JupyterHub with Helm

helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update

# Create the jhub namespace
kubectl create ns jhub

helm upgrade --cleanup-on-fail \
  --install jupyterhubtest jupyterhub/jupyterhub \
  --namespace jhub \
  --version=3.2.0 \
  --values config.yaml

The config.yaml file:

hub:
  config:
    Authenticator:
      admin_users:                          # JupyterHub administrator users
      - admin
      allowed_users:                        # if omitted, user sign-up is open
      - kaiyi
    DummyAuthenticator:
      password: hub123                      # shared password (provides essentially no security)
    JupyterHub:
      admin_access: true                    # allow admin account access
      authenticator_class: dummy            # authenticator class for all accounts (dummy auth)

Admin permission configuration:
https://z2jh.jupyter.org/en/stable/administrator/authentication.html

To apply changes to config.yaml, re-run the helm upgrade:

helm upgrade --cleanup-on-fail \
  --install jupyterhubtest jupyterhub/jupyterhub \
  --namespace jhub \
  --version=3.2.0 \
  --values config.yaml

Check the resources:

[root@k8s-master01 jupyterhub]# kubectl get pods -n jhub

NAME                             READY   STATUS    RESTARTS   AGE
continuous-image-puller-7s4h6    1/1     Running   0          5m
continuous-image-puller-b89hn    1/1     Running   0          5m
hub-db8677cdc-mplwq              1/1     Running   0          5m
proxy-5bb68bd87-dw665            1/1     Running   0          5m
user-scheduler-786796cbd-4rlsz   1/1     Running   0          5m
user-scheduler-786796cbd-zbr6d   1/1     Running   0          5m

[root@k8s-master01 jupyterhub]# kubectl get svc -n jhub
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
hub            ClusterIP      10.233.33.86    <none>        8081/TCP       5m18s
proxy-api      ClusterIP      10.233.7.91     <none>        8001/TCP       5m18s
proxy-public   LoadBalancer   10.233.36.149   <pending>     80:30241/TCP   5m18s

[root@k8s-master01 jupyterhub]# kubectl edit svc proxy-public -n jhub
service/proxy-public edited

[root@k8s-master01 jupyterhub]# kubectl get svc -n jhub
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
hub            ClusterIP   10.233.33.86    <none>        8081/TCP       8m41s
proxy-api      ClusterIP   10.233.7.91     <none>        8001/TCP       8m41s
proxy-public   NodePort    10.233.36.149   <none>        80:30241/TCP   8m41s
[root@k8s-master01 jupyterhub]# 

Log in to JupyterHub:
http://11.0.1.10:30241/hub/login?next=%2Fhub%2F

Default JupyterHub user: bi / bi
Administrator: admin / hub123

Logging in as the admin user spawns its own single-user pod:

[root@k8s-master01 jupyterhub]# kubectl get pods -A
NAMESPACE                      NAME                                                  READY   STATUS             RESTARTS          AGE
dask-operator                  dask-kubernetes-operator-1718350109-9c976d4f6-rkxqs   0/1     CrashLoopBackOff   103 (4m37s ago)   5d19h
jhub                           continuous-image-puller-bkbhz                         1/1     Running            0                 3d11h
jhub                           continuous-image-puller-qvtzf                         1/1     Running            0                 3d11h
jhub                           hub-7d47c446bb-4xnzw                                  1/1     Running            2 (15m ago)       28m
jhub                           jupyter-admin                                         0/1     Pending            0                 106s
jhub                           jupyter-bi                                            1/1     Running            0                 10m
jhub                           proxy-6d56fcc89c-h48v8                                1/1     Running            0                 28m
jhub                           simple-default-worker-23c530bbbd-6b4c8fd7bf-zkpgb     1/1     Running            0                 9h
jhub                           simple-default-worker-4aa3828538-54c4488c5-6fbmv      1/1     Running            0                 10h
jhub                           simple-scheduler-5df9969c5b-sqpb5                     1/1     Running            0                 9h
jhub                           user-scheduler-786796cbd-cjpbv                        1/1     Running            91 (4m23s ago)    3d11h
jhub                           user-scheduler-786796cbd-k2cdl                        1/1     Running            89 (5m44s ago)    10h

Related articles:
Dask on k8s
Jupyter Helm Chart versions
Official docs | Installing JupyterHub with Helm
K8S + Helm: Installing JupyterHub
Dask on k8s Custom Resources
Python: distributed computing with Dask on k8s
kubeflow-with-the-dask-kubernetes-operator
Python | dask-kubernetes versions
Deploying a JupyterHub cluster on K8S | configuring administrators
Digtime | Offline deployment of a Dask distributed cluster and JupyterHub on K8S
