Environment
Linux
GPU Tesla K80
Steps
0. Download DeepBench
Download the DeepBench package from the official repository at https://github.com/baidu-research/DeepBench
With git:
git clone https://github.com/baidu-research/DeepBench
1. Build
-
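Environment setup
The NVIDIA benchmarks require CUDA, cuDNN, MPI, and NCCL.
The first three can be loaded directly with module. Here I use CUDA 8.0, cuDNN 5.1, and openmpi 1.10.2. For NCCL I use the path of my own installation.
Most of the problems that come up later are caused by the versions of these libraries.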
export MODULEPATH=/BIGDATA/app/modulefiles_GPU/:/BIGDATA/app/modulefiles
module load CUDA/8.0
module load cudnn/5.1-CUDA8.0
module load openmpi/1.10.2-gcc4.9.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/HOME/user_name/nccl/path/lib
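Since most later failures trace back to library versions, it can help to confirm that the loaded modules actually put the expected tools on PATH before building. A minimal sketch, assuming nvcc and mpirun are the commands those modules provide; check_tool is a made-up helper name, not part of DeepBench or the cluster tools:

```shell
# check_tool is a hypothetical helper for illustration.
# It prints where a command resolves on PATH, or returns non-zero if it is missing.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "$1 -> $tool_path"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}
```

After the module load lines, calling check_tool nvcc and check_tool mpirun should each print a path under the versions you loaded.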
From the DeepBench directory, change into the NVIDIA directory:
cd code/nvidia
-
build
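Use the build method given in the official README. The build seems to work without yhrun. The ARCH setting has to be appended to the make command. Note that sm_70 needs CUDA 9.0 or later, so with CUDA 8.0 it should be dropped from the list, as in the conv command further down.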
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70
Alternatively, edit the Makefile.
You can also build the targets separately, for example conv:
make conv
# In full:
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62 conv
A successful build prints:
mkdir -p bin
/BIGDATA/app/CUDA/8.0/bin/nvcc conv_bench.cu -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /BIGDATA/app/CUDA/8.0/include -I /BIGDATA/app/cuDNN/5.1-CUDA8.0/include/ -L /BIGDATA/app/cuDNN/5.1-CUDA8.0/lib64/ -L /BIGDATA/app/CUDA/8.0/lib64 -lcurand -lcudnn --generate-code arch=compute_30,code=sm_30 --generate-code arch=compute_32,code=sm_32 --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_52,code=sm_52 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_62,code=sm_62 -std=c++11
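Comparing the make invocation with the nvcc line above, each entry of the comma-separated ARCH list appears to become one --generate-code arch=compute_NN,code=sm_NN pair. The sketch below reproduces that expansion so you can preview the flags for a given list. arch_flags is a hypothetical helper written for illustration; it is not the Makefile's actual code.

```shell
# arch_flags is a hypothetical helper that mimics the expansion seen in the
# nvcc command line; it does not call make or nvcc.
arch_flags() {
    echo "$1" | awk -F, '{
        for (i = 1; i <= NF; i++) {
            num = $i
            sub(/^sm_/, "", num)   # "sm_35" -> "35"
            printf "%s--generate-code arch=compute_%s,code=sm_%s", (i > 1 ? " " : ""), num, num
        }
        printf "\n"
    }'
}
```

For example, arch_flags sm_30,sm_35 prints the two flag pairs for those architectures on one line.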
Set LD_LIBRARY_PATH before running:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/BIGDATA/app/CUDA/8.0:/BIGDATA/app/cuDNN/5.1-CUDA8.0:/BIGDATA/app/PGIcompiler/17.1/linux86-64/2017/mpi/openmpi-1.10.2:/HOME/user_name/nccl
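If a benchmark binary refuses to start, one quick check is whether the dynamic loader can resolve every shared library with the LD_LIBRARY_PATH set above. The sketch below filters ldd output, which marks unresolved libraries with "not found". missing_libs is a hypothetical helper name.

```shell
# missing_libs is a hypothetical helper: it reads ldd output on stdin and
# prints the names of the libraries the loader could not resolve.
missing_libs() {
    awk '/not found/ { print $1 }'
}
```

Usage would be ldd ./bin/conv_bench | missing_libs, where empty output means everything resolved. If something is listed, it may be worth checking whether the library sits in a lib or lib64 subdirectory of the install roots (the nvcc line links against the lib64 directories), since I am only assuming the module load commands already added those.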
2. Running the tests
-
gemm benchmark
In the nvidia directory:
yhrun -n 1 ./bin/gemm_bench
With CUDA 8.0 and cuDNN 5.1 the run fails. CUDA is preconfigured on Tianhe, so I don't know how to fix this.
terminate called after throwing an instance of 'std::runtime_error'
what(): sgemm failed
1760 16 1760 0 0
halfyhrun: error: gn26: task 0: Aborted (core dumped)
With CUDA 7.0 and cuDNN 4.0 it runs normally.
Partial results:
### CUDA7.0 cudnn4.0 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
m n k a_t b_t precision time (usec)
1760 16 1760 0 0 float 340 .
...
(omitted)
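To compare runs it can be handier to convert the reported time into throughput. A GEMM of size m x n x k performs roughly 2*m*n*k floating-point operations, so the 1760 x 16 x 1760 row above works out to 2*1760*16*1760 / 340 usec, about 291.5 GFLOP/s. A small awk sketch, assuming the column layout shown above (m n k a_t b_t precision time); gemm_gflops is a made-up name:

```shell
# gemm_gflops is a hypothetical helper: it reads gemm_bench result rows on
# stdin and prints "m n k GFLOP/s" for each row with a numeric, non-zero time.
gemm_gflops() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ && $7 > 0 {
        flops = 2 * $1 * $2 * $3
        # flops per microsecond is MFLOP/s; divide by 1000 for GFLOP/s
        printf "%d %d %d %.1f\n", $1, $2, $3, flops / $7 / 1000
    }'
}
```

Header lines, separators, and error lines are skipped because they do not have numeric values in those columns.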
-
conv benchmark
In the nvidia directory:
yhrun -n 1 ./bin/conv_bench
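With CUDA 8.0 and cuDNN 6.0 it compiles but does not run.
With CUDA 7.0 and cuDNN 4.0 it does not compile. The build reports many missing items, probably because these versions are too old.
With CUDA 8.0 and cuDNN 5.1 it fails partway through: on the 11th case a runtime_error aborts the run.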
Illegal algorithm passed to get_fwd_algo_string. Algo: 7
In conv_bench.cu, change the final branch of the std::string get_fwd_algo_string() function from
else {
std::stringstream ss;
ss << "Illegal algorithm passed to get_fwd_algo_string. Algo: " << fwd_algo_ << std::endl;
throw std::runtime_error(ss.str());
}
to
else {
return "#unknown";
}
After rebuilding and rerunning, the benchmark gets past the failing case. The 11th case now shows #unknown, and so do many of the later ones.
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
w h c n k f_w f_h pad_w pad_h stride_w stride_h precision fwd_time (usec) bwd_inputs_time (usec) bwd_params_time (usec) total_time (usec) fwd_algo
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
700 161 1 4 32 20 5 0 0 2 2 float 929 1136 1074 3139 IMPLICIT_GEMM
700 161 1 8 32 20 5 0 0 2 2 float 1587 2168 1928 5683 IMPLICIT_GEMM
700 161 1 16 32 20 5 0 0 2 2 float 2813 4337 3508 10658 IMPLICIT_PRECOMP_GEMM
700 161 1 32 32 20 5 0 0 2 2 float 6368 8659 6899 21926 IMPLICIT_GEMM
341 79 32 4 32 10 5 0 0 2 2 float 2174 4076 2506 8756 IMPLICIT_PRECOMP_GEMM
341 79 32 8 32 10 5 0 0 2 2 float 4211 8128 5007 17346 IMPLICIT_PRECOMP_GEMM
341 79 32 16 32 10 5 0 0 2 2 float 8459 16200 9985 34644 IMPLICIT_PRECOMP_GEMM
341 79 32 32 32 10 5 0 0 2 2 float 16903 32380 20188 69471 IMPLICIT_PRECOMP_GEMM
480 48 1 16 16 3 3 1 1 1 1 float 752 1014 1515 3281 IMPLICIT_GEMM
240 24 16 16 32 3 3 1 1 1 1 float 863 1332 1258 3453 IMPLICIT_GEMM
120 12 32 16 64 3 3 1 1 1 1 float 613 652 1005 2270 #unknown
...
(omitted)
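To see how many cases fell into the patched branch, one option is to tally the last column of the saved output. A sketch assuming the 17-column row layout shown above (first field numeric, algorithm name last); count_algos is a hypothetical name:

```shell
# count_algos is a hypothetical helper: it reads conv_bench output on stdin
# and prints "count algorithm" for each distinct value in the fwd_algo column.
count_algos() {
    awk '$1 ~ /^[0-9]+$/ && NF == 17 { count[$NF]++ }
         END { for (a in count) print count[a], a }' | LC_ALL=C sort -k2
}
```

With the output saved to a file, count_algos < conv.log (file name is just an example) gives one line per algorithm, including #unknown.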
-
rnn benchmark
In the nvidia directory:
yhrun -n 1 ./bin/rnn_bench
With CUDA 8.0 and cuDNN 5.1 it runs normally.
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
type hidden N timesteps precision fwd_time (usec) bwd_time (usec)
vanilla 1760 16 50 float 19590 17450
vanilla 1760 32 50 float 18289 18044
...
lstm 512 16 25 float 3888 5551
lstm 512 32 25 float 3922 5603
...
gru 2816 32 1500 float 2638524 2475404
gru 2816 32 750 float 1319982 1240556
...
(omitted)
-
all reduce benchmark
nccl_single_all_reduce
In the nvidia directory:
yhrun -n 1 ./bin/nccl_single_all_reduce 2
It runs normally.
NCCL AllReduce
Num Ranks: 2
---------------------------------------------------------------------------
# of floats bytes transferred Time (msec)
---------------------------------------------------------------------------
100000 400000 0.109
3097600 12390400 1.344
...
(omitted)
nccl_mpi_all_reduce
In the nvidia directory:
yhrun -n 2 -N 2 mpirun -np 2 ./bin/nccl_mpi_all_reduce
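It runs but prints no results. When I run it in that directory it reports a missing file, and I don't know why. Since the message ends with (ignored), it may be only a warning rather than the cause of the missing output: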
mca: base: component_find: unable to open /BIGDATA/app/openmpi/1.10.2-gcc4.9.2/lib/openmpi/mca_btl_scif: libscif.so.0: cannot open shared object file: No such file or directory (ignored)
3. Testing with yhbatch
The tests take a long time and my VPN keeps dropping, so yhbatch is a better way to run them.
Create a file test.sh with the following content:
#! /bin/bash
yhrun -n xx xxx_bench (the yhrun command)
Then run the yhbatch command:
yhbatch -n 1 ./test.sh
This submits the job.
When the job finishes there will be a slurm-<jobid>.out file, which contains everything that would otherwise have been printed to the console.
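The wrapper script can also be generated instead of typed by hand. A sketch, where write_job is a hypothetical helper and the command passed to it is whatever yhrun line you would otherwise put in test.sh:

```shell
# write_job is a hypothetical helper: it creates an executable bash script
# whose body is the single command given after the file name.
write_job() {
    script=$1
    shift
    {
        printf '%s\n' '#!/bin/bash'
        printf '%s\n' "$*"
    } > "$script"
    chmod +x "$script"
}
```

For example, write_job test.sh yhrun -n 1 ./bin/conv_bench followed by yhbatch -n 1 ./test.sh submits the conv benchmark as a batch job.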