MXNet Distributed Training (1)


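Possible causes of hanging

When I first started launching distributed training, the program often hung even though I believed I had followed the official instructions exactly. On the surface, the launch host starts the launch.py process and then the shell simply stops, which is the most frustrating part. In this situation, first check the following: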

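  1. The environment is identical on every machine, including the code path and the Python environment.
  2. Whether the processes to be launched already exist; if they do, kill them first.
  3. Whether the firewall has been turned off.
  4. Whether the two machines can SSH to each other without a password.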
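Items 3 and 4 of the checklist can be probed from the launch host before starting a job. The following is a minimal sketch, not part of MXNet: the function names are my own, and it assumes the port you pass in is the one a remote process is listening on (for example the SSH port, or the scheduler port once the scheduler is up). BatchMode=yes and ConnectTimeout are standard OpenSSH client options.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, unreachable, or timed out.
        return False


def ssh_batch_check_command(host, ssh_port=22):
    """Build an ssh command that fails instead of prompting for a password.

    Run it with subprocess.call(...); exit code 0 suggests that
    passwordless login to the host works.
    """
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            "-p", str(ssh_port), host, "true"]
```

If can_connect() returns False for a port that should be open, a firewall or a process that never started is a likely suspect.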

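Launch methods

  1. Launch with the officially provided launch.py

    Reference: https://github.com/apache/incubator-mxnet/tree/master/example/image-classification

  2. To understand what happens during launch, look at another way of starting the job: by hand

First start the scheduler; the scheduler process blocks and waits. Then start two servers, each configured with the IP address of the PS root (DMLC_PS_ROOT_URI). Finally start two workers. At that point the whole distributed program starts running and the workers' shells begin producing output.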

For MNIST

# scheduler
export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_ROLE=scheduler; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=2; export DMLC_NUM_SERVER=2;
cd /path/to;
python train_mnist.py --kv-store dist_sync

# server 1
export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=2; export DMLC_NUM_SERVER=2;
cd /path/to;
python train_mnist.py --kv-store dist_sync

# server 2
export DMLC_SERVER_ID=1; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=2; export DMLC_NUM_SERVER=2;
cd /path/to;
python train_mnist.py --kv-store dist_sync

# worker 1
export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=2; export DMLC_NUM_SERVER=2;
cd /path/to;
python train_mnist.py --kv-store dist_sync

# worker 2 (IDs are zero-based, so the second worker is 1)
export DMLC_WORKER_ID=1; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9001; export DMLC_NUM_WORKER=2; export DMLC_NUM_SERVER=2;
cd /path/to;
python train_mnist.py --kv-store dist_sync
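The five command groups above share every setting except DMLC_ROLE and the per-node ID. As a rough sketch of that pattern, the helper below builds the shared DMLC_* variables for one role. The function name dmlc_env is hypothetical (it is not an MXNet API), the address in the usage note is a documentation placeholder, and the per-node ID variables are left out for brevity.

```python
def dmlc_env(role, root_uri, root_port, num_worker, num_server):
    """Return the DMLC_* environment variables for one process.

    Only variables that appear in the manual commands are set here.
    """
    if role not in ("scheduler", "server", "worker"):
        raise ValueError("unknown role: {!r}".format(role))
    return {
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,          # address of the scheduler
        "DMLC_PS_ROOT_PORT": str(root_port),   # environment values must be strings
        "DMLC_NUM_WORKER": str(num_worker),
        "DMLC_NUM_SERVER": str(num_server),
    }
```

A process could then be started with something like subprocess.Popen(["python", "train_mnist.py", "--kv-store", "dist_sync"], env={**os.environ, **dmlc_env("worker", "192.0.2.1", 9001, 2, 2)}), in the same order as above: scheduler first, then servers, then workers.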

Launch process analysis

The experiment was run on two machines, x.x.x.x/x.

Launch script:

python ../../tools/launch.py -n 2 --launcher ssh -H hosts `which python` train_mnist.py --kv-store=dist_sync

Processes running on the two machines after launch

  • Launch host

The command /home/xxx/anaconda2/envs/ps_lite/bin/python train_mnist.py --kv-store=dist_sync ran 3 times. The first is the scheduler process of the parameter server, started by the tracker code pserver = PSTracker(hostIP=hostIP, cmd=pscmd, envs=envs); the scheduler is launched inside the PSTracker constructor. The other two are the server and worker processes that the launch host starts over ssh. All of these processes are started from asynchronous threads.
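To make "started from asynchronous threads" concrete, here is a minimal sketch of the pattern: each shell command runs in its own daemon thread, so the caller does not block on any single process. It illustrates the idea only and is not the tracker's code; launch_async and its runner parameter are names I made up.

```python
import subprocess
from threading import Thread


def launch_async(commands, runner=None):
    """Start each shell command in its own daemon thread; return the threads."""
    if runner is None:
        # Default: run the command through the shell and wait for it in the thread.
        def runner(cmd):
            subprocess.check_call(cmd, shell=True)
    threads = []
    for cmd in commands:
        t = Thread(target=runner, args=(cmd,))
        t.daemon = True  # do not keep the interpreter alive for these threads
        t.start()
        threads.append(t)
    return threads
```

Because the threads are daemons, the main program has to stay alive on its own (for example by joining the threads) or the child commands may be cut short.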

ssh -o StrictHostKeyChecking=no x.x.x.x -p 22 export LD_LIBRARY_PATH=.::/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/xxx/xxx-workspace/cuda-8.0-cudnn-6.0/lib64; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_NUM_SERVER=2; export DMLC_NUM_WORKER=2; cd /path/to/example/image-classification/; `which python` train_mnist.py --kv-store=dist_sync

This process was started four times: one ssh process for each of the two workers and the two servers. The IPs are read from the hosts file, and the PS root is always the machine x.x.x.x.
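The ssh command line above has a regular shape: a list of exports, a cd into the working directory, then the training command, all sent to one host. The sketch below rebuilds a string of that shape. It is an illustration under my own assumptions rather than the code in ssh.py: build_ssh_command is a hypothetical name, the host address is a placeholder, and shlex.quote is used here for the remote part.

```python
import shlex


def build_ssh_command(node, ssh_port, envs, working_dir, command):
    """Assemble an ssh command that exports envs, changes directory, and runs command."""
    exports = " ".join("export {}={};".format(k, v) for k, v in envs.items())
    remote = "{} cd {}; {}".format(exports, working_dir, " ".join(command))
    # Quote the remote script so the local shell passes it to ssh as one argument.
    return "ssh -o StrictHostKeyChecking=no {} -p {} {}".format(
        node, ssh_port, shlex.quote(remote))


cmd = build_ssh_command(
    "192.0.2.10", 22,
    {"DMLC_ROLE": "server", "DMLC_PS_ROOT_PORT": "9091"},
    "/path/to/example/image-classification/",
    ["python", "train_mnist.py", "--kv-store=dist_sync"],
)
```

Only DMLC_ROLE changes between the server and worker variants of the command, which matches what the process list shows.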

bash -c export LD_LIBRARY_PATH=.::/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/xxx/xxx-workspace/cuda-8.0-cudnn-6.0/lib64; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9092; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_NUM_SERVER=2; export DMLC_NUM_WORKER=2; cd /path/to/example/image-classification/; `which python` train_mnist.py --kv-store=dist_sync

This process was started 2 times: the machine received two requests from the launch host and started a server process and a worker process respectively.

  • Worker node

The command /home/xxx/anaconda2/envs/ps_lite/bin/python train_mnist.py --kv-store=dist_sync ran 2 times.

bash -c export LD_LIBRARY_PATH=.::/usr/local/cuda:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/home/xxx/xxx-workspace/cuda-8.0-cudnn-6.0/lib64; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9092; export DMLC_PS_ROOT_URI=x.x.x.x; export DMLC_NUM_SERVER=2; export DMLC_NUM_WORKER=2; cd /path/to/example/image-classification/; `which python` train_mnist.py --kv-store=dist_sync

This process was started 2 times: the machine received two requests from the launch host and started a server process and a worker process respectively.

Parameters after launch

Namespace(archives=[], auto_file_cache=True, cluster='ssh', command=['`which', 'python`', 'train_mnist.py', '--kv-store=dist_sync'], env=[], files=[], hdfs_tempdir='/tmp', host_file='hosts', host_ip=None, jobname=None, kube_namespace='default', kube_server_image='mxnet/python', kube_server_template=None, kube_worker_image='mxnet/python', kube_worker_template=None, log_file=None, log_level='INFO', mesos_master=None, num_servers=2, num_workers=2, queue='default', server_cores=1, server_memory='1g', server_memory_mb=1024, sge_log_dir=None, ship_libcxx=None, slurm_server_nodes=None, slurm_worker_nodes=None, sync_dst_dir='None', worker_cores=1, worker_memory='1g', worker_memory_mb=1024, yarn_app_classpath=None, yarn_app_dir='/path/to/tools/../dmlc-core/tracker/dmlc_tracker/../yarn')

Of all the parameters above, only num_workers, num_servers, cluster, host_file, sync_dst_dir, and command are supplied from outside. The rest are loaded by the last line of the following code:

try:
    from dmlc_tracker import opts
except ImportError:
    print("Can't load dmlc_tracker package.  Perhaps you need to run")
    print("    git submodule update --init --recursive")
    raise
dmlc_opts = opts.get_opts(args)

opts.py defines a large number of argument parsers.
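The reason a handful of command-line flags turns into such a long Namespace is ordinary argparse behaviour: every option that is not passed falls back to its default. The parser below is a small hypothetical stand-in that shows the effect. It is not the parser defined in dmlc_tracker, and the option names are assumptions picked to mirror a few fields of the Namespace shown above.

```python
import argparse


def build_parser():
    """Hypothetical stand-in parser; not the one defined in dmlc_tracker."""
    parser = argparse.ArgumentParser(description="illustrative launcher options")
    parser.add_argument("--num-workers", type=int, required=True)
    parser.add_argument("--num-servers", type=int, default=0)
    parser.add_argument("--cluster", default="ssh")
    parser.add_argument("--host-file", default=None)
    # Options below are never passed in the example, so their defaults are used.
    parser.add_argument("--log-level", default="INFO")
    parser.add_argument("--server-memory", default="1g")
    parser.add_argument("--worker-cores", type=int, default=1)
    return parser


ns = build_parser().parse_args(
    ["--num-workers", "2", "--num-servers", "2", "--host-file", "hosts"])
```

Printing ns shows both the values that were passed and the defaults that were filled in, which is the same effect seen in the Namespace dump.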

ssh.py->submit(args)->tracker.py:submit()->fun_submit->ssh.py:submit():ssh_submit()

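The hosts object wraps the IP addresses from the hosts file together with their corresponding ports.

In the ssh_submit() method in ssh.py, a for loop takes the IPs from the hosts file in turn and starts the servers and workers.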
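The two steps just described can be sketched as follows. This is my reading of the behaviour, not a copy of ssh.py: it assumes each hosts line is either ip or ip:port with 22 as the default port, and that roles are handed out round-robin, servers first. That assumption is consistent with each machine ending up with one server and one worker in the experiment above. The function names and addresses are placeholders.

```python
def parse_hosts(lines, default_port="22"):
    """Turn hosts-file lines into (host, port) pairs, skipping blank lines."""
    hosts = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        hosts.append((host, port if sep and port else default_port))
    return hosts


def assign_roles(hosts, num_server, num_worker):
    """Assign servers first, then workers, cycling through the hosts."""
    plan = []
    for i in range(num_server + num_worker):
        role = "server" if i < num_server else "worker"
        host, port = hosts[i % len(hosts)]
        plan.append((role, host, port))
    return plan


hosts = parse_hosts(["192.0.2.1\n", "\n", "192.0.2.2:2222\n"])
plan = assign_roles(hosts, num_server=2, num_worker=2)
```

With two hosts, two servers, and two workers, plan contains four entries, one per ssh process, which lines up with the four ssh processes seen on the launch host.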

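Code analysis

I have only tried the ssh launch method, and so far I have only read the Python-level code, which is enough to see each process being started. Four classes are mainly involved: launch.py calls ssh.py, and ssh.py calls tracker.py, which starts the scheduler, server, and worker processes in turn.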

