Tool Series: Introduction to PyCaret - Building a Distributed Computing Cluster with Dask
I. Purpose
1. Single-machine capacity limits
A single machine typically has around 64 GB of memory, while the data needed for modeling can exceed 100 GB, which is beyond what one machine can hold; a distributed cluster is therefore needed to process data at that scale.
2. Making use of existing compute resources
To improve computational efficiency and make full use of the compute resources already available, multiple servers and multiple cores can be used to process large-scale data.
II. Architecture
Dask sits downstream of the resource manager: it combines the resources of the virtual machines into a distributed cluster, and PyCaret runs its machine learning computation on top of Dask.
About Dask
Dask essentially consists of two parts: dynamic task scheduling with cluster management, and a high-level DataFrame API module; these correspond roughly to Spark and pandas, respectively. Dask implements distributed scheduling internally, so users do not have to write complex scheduling logic or programs themselves; distributed computation is achieved through a few simple calls, and some models can be processed in parallel (for example distributed algorithms such as XGBoost, logistic regression, and scikit-learn estimators). Dask focuses on the data science domain and its API is very close to pandas, but not fully compatible with it.
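To make the pandas comparison concrete, here is a minimal sketch (not part of the original setup) of the Dask DataFrame API, assuming the Titanic-style train.csv used later in this article, with columns such as Pclass and Fare:

import dask.dataframe as dd

# Reading is lazy: this only builds a task graph, it does not load the file yet.
ddf = dd.read_csv('train.csv')

# The API mirrors pandas; the aggregation below is still just a task graph.
mean_fare = ddf.groupby('Pclass')['Fare'].mean()

# compute() triggers the actual (possibly distributed) execution.
print(mean_fare.compute())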
Cluster setup:
A Dask cluster involves three roles: client, scheduler, and worker (a short sketch follows the list below).
- 1. client: handles the interaction between the user (client side) and the cluster
- 2. scheduler: the master node and the cluster's registration point; it manages the tasks submitted by the client and dispatches them to worker nodes according to different strategies
- 3. worker: a worker node managed by the scheduler; it carries out the actual data computation
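As a rough illustration of how the three roles interact (a sketch only, assuming a scheduler is already listening on 192.168.1.21:8786 as in the steps below):

from dask.distributed import Client

# The client registers with the scheduler; it does not place tasks on workers directly.
client = Client('tcp://192.168.1.21:8786')

# The scheduler picks a worker for this task; the worker computes it and the client gathers the result.
future = client.submit(sum, range(100))
print(future.result())  # 4950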
1. Scheduler node (scheduler):
- scheduler: default port 8786
a. Dependencies: dask, distributed
b. Install: pip install dask distributed
c. Start: dask-scheduler
2. Worker nodes (worker):
a. Dependencies: dask, distributed
b. Install: pip install dask distributed
c. Start: using 192.168.1.22 as an example (192.168.1.23 is set up the same way), connect each worker to the scheduler at 192.168.1.21:
> dask-worker 192.168.1.21:8786
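If you only want to verify the installation on a single machine first, a LocalCluster can stand in for the scheduler/worker pair above (a sketch, not part of the original deployment):

from dask.distributed import LocalCluster, Client

# Starts an in-process scheduler plus two workers, mirroring the roles described above.
cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)
print(cluster.scheduler_address)  # e.g. tcp://127.0.0.1:<port>, the local equivalent of 192.168.1.21:8786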
III. References
1. Dask
https://www.dask.org
2. PyCaret
http://www.pycaret.org/
IV. Deployment and Installation
1. Install Dask
pip install "dask[complete]"
2. Configure the Dask cluster
(1) Scheduler (master) server
# run from the command line
dask-scheduler
root@notebook-rn-20231114171325952s4sc-2qr37-0:~/.local/bin# ./dask-scheduler
/root/.local/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
warnings.warn(
2023-11-15 05:07:23,025 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-15 05:07:23,660 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-11-15 05:07:23,735 - distributed.scheduler - INFO - State start
2023-11-15 05:07:23,741 - distributed.scheduler - INFO - -----------------------------------------------
2023-11-15 05:07:23,743 - distributed.scheduler - INFO - Scheduler at: tcp://172.20.12.148:8786
2023-11-15 05:07:23,743 - distributed.scheduler - INFO - dashboard at: http://172.20.12.148:8787/status
(2) Worker server
# run from the command line
dask-worker 172.20.12.148:8786
/root/.local/lib/python3.8/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
warnings.warn(
2023-11-15 05:07:27,093 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.20.12.178:34413'
2023-11-15 05:07:28,847 - distributed.worker - INFO - Start worker at: tcp://172.20.12.178:41461
2023-11-15 05:07:28,847 - distributed.worker - INFO - Listening to: tcp://172.20.12.178:41461
2023-11-15 05:07:28,847 - distributed.worker - INFO - dashboard at: 172.20.12.178:37669
2023-11-15 05:07:28,848 - distributed.worker - INFO - Waiting to connect to: tcp://172.20.12.148:8786
2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:28,848 - distributed.worker - INFO - Threads: 2
2023-11-15 05:07:28,848 - distributed.worker - INFO - Memory: 8.00 GiB
2023-11-15 05:07:28,848 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-w6vw3yg0
2023-11-15 05:07:28,848 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:30,199 - distributed.worker - INFO - Registered to: tcp://172.20.12.148:8786
2023-11-15 05:07:30,200 - distributed.worker - INFO - -------------------------------------------------
2023-11-15 05:07:30,203 - distributed.core - INFO - Starting established connection to tcp://172.20.12.148:8786
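Once the worker logs "Registered to", the cluster can be inspected from any client. A minimal check (a sketch, using the scheduler address from the logs above):

from dask.distributed import Client

client = Client('tcp://172.20.12.148:8786')

# scheduler_info() reports the workers the scheduler currently knows about;
# the worker started above (tcp://172.20.12.178:41461) should appear in this list.
info = client.scheduler_info()
print(list(info['workers'].keys()))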
3. Install PyCaret
pip install "pycaret[full]"
V. Functional Tests
1. Test Dask
from dask.distributed import Client

# Connect to the scheduler (172.20.12.148:8786 in the logs above), not to a worker address.
client = Client('172.20.12.148:8786')

def square(x):
    return x ** 2

def neg(x):
    return -x

# map/submit send the tasks to the scheduler, which dispatches them to the workers.
A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
total.result()
# -285
2. Test PyCaret computation on the cluster
import pandas as pd

df = pd.read_csv('train.csv')

# init setup
from pycaret.classification import *
clf1 = setup(data=df, target='Survived', n_jobs=-1)

# import parallel back-end
from pycaret.parallel import FugueBackend
compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)
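Note that FugueBackend("dask") on its own may start a local Dask cluster. To push compare_models onto the cluster built above, one option is to connect a Client to the remote scheduler first. This is only a sketch; it assumes that Fugue's Dask engine reuses the active distributed client, and the exact behavior depends on the PyCaret/Fugue/Dask versions installed.

import pandas as pd
from dask.distributed import Client
from pycaret.classification import setup, compare_models
from pycaret.parallel import FugueBackend

# Assumption: registering this client as the active one lets the "dask" engine reuse
# the remote cluster instead of spinning up a local one.
client = Client('tcp://172.20.12.148:8786')

df = pd.read_csv('train.csv')
clf1 = setup(data=df, target='Survived', n_jobs=-1)
best3 = compare_models(n_select=3, parallel=FugueBackend("dask"), verbose=True)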
Appendix: Dask as a Distributed Machine Learning Framework
1. Deploying and running Dask
# 1. Prepare several Linux virtual machines
# 2. Set up the host and the Linux virtual machines with Anaconda3
# 3. Create a conda virtual environment (on every machine)
conda create -n dask python=3.9 -y
# 4. Activate the virtual environment
conda activate dask
pip install scikit-learn
pip install dask[complete]
pip install dask_ml[complete]
# 5. Start the cluster
## On the host, in the dask environment (cmd):
dask-scheduler
# scheduler address reported in the log: tcp://192.168.31.79:8786
## On the Linux cluster nodes, in the dask environment (cmd):
dask-worker 192.168.31.79:8786
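With dask_ml installed and the cluster from step 5 running, distributed training can be exercised with a minimal sketch like the one below; the dataset sizes are illustrative assumptions, and the scheduler address is the one reported in step 5.

from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

# Connect to the scheduler started on the host in step 5.
client = Client('tcp://192.168.31.79:8786')

# A Dask-backed dataset split into chunks that can live on different workers.
X, y = make_classification(n_samples=100_000, n_features=20, chunks=10_000)

clf = LogisticRegression()
clf.fit(X, y)                        # training is coordinated across the workers
print(clf.predict(X)[:5].compute())  # predictions are lazy until compute()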