Very high CPU load and slow training speed when training with Horovod
nordysu opened this issue
Description
When running allreduce with Horovod, training performance is very poor. How can I speed things up?
Environment info (Required)
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Nov 12 2018 14:36:49'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '18.1')
('Directory :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version :', '1.4.0')
('Directory :', '/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-3.10.0-862.9.1.el7.x86_64-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'da7e5697cb17')
('release :', '3.10.0-862.9.1.el7.x86_64')
('version :', '#1 SMP Mon Jul 16 16:29:36 UTC 2018')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2300.000
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4605.32
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp
Package used (Python/R/Scala/Julia):
MXNet: https://github.com/apache/incubator-mxnet on branch master
Horovod: https://github.com/ctcyang/horovod on branch mxnet_feature_fp16
Python: 2.7.12
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): GCC
MXNet commit hash:
MXNet: d2102faa228bdc6723a9da299c6ff5999cbbdcdb
Horovod: 10c35d0b54dd033b6e2d97c623d2afcbff445630
Build config:
USE_DIST_KVSTORE=1
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_NCCL=1
USE_S3=1
USE_PROFILER=1
Error Message:
When training the InsightFace project (https://github.com/deepinsight/insightface), I get a very slow training speed, as shown in the log below, and the CPU load is quite high.
2018-12-06 10:08:42,891 Node[5] Epoch[0] Batch [0-20] Speed: 14.93 samples/sec acc=0.000000
2018-12-06 10:08:42,893 Node[1] Epoch[0] Batch [0-20] Speed: 14.87 samples/sec acc=0.000000
2018-12-06 10:08:42,895 Node[6] Epoch[0] Batch [0-20] Speed: 14.92 samples/sec acc=0.000000
2018-12-06 10:08:42,902 Node[3] Epoch[0] Batch [0-20] Speed: 14.92 samples/sec acc=0.000000
2018-12-06 10:08:43,541 Node[4] Epoch[0] Batch [0-20] Speed: 14.91 samples/sec acc=0.000000
2018-12-06 10:08:54,391 Node[0] Epoch[0] Batch [0-20] Speed: 13.90 samples/sec acc=0.000000
2018-12-06 10:08:55,067 Node[7] Epoch[0] Batch [0-20] Speed: 13.79 samples/sec acc=0.000000
2018-12-06 10:08:55,227 Node[2] Epoch[0] Batch [0-20] Speed: 13.80 samples/sec acc=0.000000
2018-12-06 10:10:57,954 Node[1] Epoch[0] Batch [20-40] Speed: 14.81 samples/sec acc=0.000000
2018-12-06 10:10:58,477 Node[6] Epoch[0] Batch [20-40] Speed: 14.75 samples/sec acc=0.000000
2018-12-06 10:11:02,376 Node[4] Epoch[0] Batch [20-40] Speed: 14.41 samples/sec acc=0.000000
2018-12-06 10:11:02,539 Node[3] Epoch[0] Batch [20-40] Speed: 14.32 samples/sec acc=0.000000
2018-12-06 10:11:03,064 Node[5] Epoch[0] Batch [20-40] Speed: 14.27 samples/sec acc=0.000000
2018-12-06 10:11:06,079 Node[0] Epoch[0] Batch [20-40] Speed: 15.19 samples/sec acc=0.000000
2018-12-06 10:11:08,218 Node[7] Epoch[0] Batch [20-40] Speed: 15.02 samples/sec acc=0.000000
2018-12-06 10:11:11,488 Node[2] Epoch[0] Batch [20-40] Speed: 14.68 samples/sec acc=0.000000
2018-12-06 10:13:19,883 Node[4] Epoch[0] Batch [40-60] Speed: 14.54 samples/sec acc=0.000000
2018-12-06 10:13:19,884 Node[1] Epoch[0] Batch [40-60] Speed: 14.09 samples/sec acc=0.000000
2018-12-06 10:13:19,888 Node[5] Epoch[0] Batch [40-60] Speed: 14.62 samples/sec acc=0.000000
2018-12-06 10:13:19,889 Node[3] Epoch[0] Batch [40-60] Speed: 14.56 samples/sec acc=0.000000
2018-12-06 10:13:20,714 Node[6] Epoch[0] Batch [40-60] Speed: 14.06 samples/sec acc=0.000000
2018-12-06 10:13:26,432 Node[2] Epoch[0] Batch [40-60] Speed: 14.82 samples/sec acc=0.000000
2018-12-06 10:13:28,200 Node[0] Epoch[0] Batch [40-60] Speed: 14.07 samples/sec acc=0.000000
2018-12-06 10:13:32,749 Node[7] Epoch[0] Batch [40-60] Speed: 13.84 samples/sec acc=0.000000
2018-12-06 10:15:40,223 Node[5] Epoch[0] Batch [60-80] Speed: 14.25 samples/sec acc=0.000000
2018-12-06 10:15:40,223 Node[4] Epoch[0] Batch [60-80] Speed: 14.25 samples/sec acc=0.000000
2018-12-06 10:15:40,224 Node[1] Epoch[0] Batch [60-80] Speed: 14.25 samples/sec acc=0.000000
2018-12-06 10:15:40,228 Node[3] Epoch[0] Batch [60-80] Speed: 14.25 samples/sec acc=0.000000
2018-12-06 10:15:41,993 Node[0] Epoch[0] Batch [60-80] Speed: 14.95 samples/sec acc=0.000000
2018-12-06 10:15:43,600 Node[6] Epoch[0] Batch [60-80] Speed: 14.00 samples/sec acc=0.000000
2018-12-06 10:15:47,586 Node[2] Epoch[0] Batch [60-80] Speed: 14.17 samples/sec acc=0.000000
2018-12-06 10:15:52,918 Node[7] Epoch[0] Batch [60-80] Speed: 14.27 samples/sec acc=0.000000
2018-12-06 10:18:00,695 Node[1] Epoch[0] Batch [80-100] Speed: 14.24 samples/sec acc=0.000000
2018-12-06 10:18:00,782 Node[0] Epoch[0] Batch [80-100] Speed: 14.41 samples/sec acc=0.000000
2018-12-06 10:18:01,198 Node[3] Epoch[0] Batch [80-100] Speed: 14.19 samples/sec acc=0.000000
2018-12-06 10:18:01,380 Node[5] Epoch[0] Batch [80-100] Speed: 14.17 samples/sec acc=0.000000
2018-12-06 10:18:02,143 Node[4] Epoch[0] Batch [80-100] Speed: 14.09 samples/sec acc=0.000000
2018-12-06 10:18:03,768 Node[6] Epoch[0] Batch [80-100] Speed: 14.27 samples/sec acc=0.000000
2018-12-06 10:18:07,044 Node[2] Epoch[0] Batch [80-100] Speed: 14.34 samples/sec acc=0.000000
2018-12-06 10:18:13,723 Node[7] Epoch[0] Batch [80-100] Speed: 14.20 samples/sec acc=0.000000
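As a sanity check on how to read these numbers, the sketch below recomputes the speed for one worker from two consecutive log timestamps. It assumes that each Speed line covers 20 batches (as the Batch [0-20], [20-40] ranges suggest) and that every batch holds 100 samples per worker (from --per-batch-size 100). Under those assumptions the derived value matches the logged one, which suggests that Speed is reported per worker and that each batch takes roughly 6.75 seconds.

```python
from datetime import datetime
import re

# Two consecutive lines copied from the log above, both from Node[1].
LOG = """\
2018-12-06 10:08:42,893 Node[1] Epoch[0] Batch [0-20] Speed: 14.87 samples/sec acc=0.000000
2018-12-06 10:10:57,954 Node[1] Epoch[0] Batch [20-40] Speed: 14.81 samples/sec acc=0.000000
"""

PER_BATCH_SIZE = 100     # assumption: taken from --per-batch-size 100
BATCHES_PER_REPORT = 20  # assumption: each Speed line covers 20 batches

LINE_RE = re.compile(r"^(\S+ \S+) Node\[(\d+)\] .* Speed: ([\d.]+) samples/sec")


def parse(line):
    """Return (timestamp, node, logged_speed) for one Speed log line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("not a Speed log line: %r" % line)
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return ts, int(m.group(2)), float(m.group(3))


(t0, _, _), (t1, node, logged_speed) = [parse(line) for line in LOG.splitlines()]

# Wall-clock time between the two reports, and the throughput it implies.
elapsed = (t1 - t0).total_seconds()
derived_speed = BATCHES_PER_REPORT * PER_BATCH_SIZE / elapsed
seconds_per_batch = elapsed / BATCHES_PER_REPORT

print("Node[%d]: %.3f s for %d batches, %.2f s/batch, %.2f samples/sec (logged: %.2f)"
      % (node, elapsed, BATCHES_PER_REPORT, seconds_per_batch, derived_speed, logged_speed))
```

The same calculation can be repeated for any other node by swapping in its two log lines.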
docker ps shows 99% CPU load per python process:
UID PID PPID C STIME TTY TIME CMD
root 188045 188026 0 10:38 ? 00:00:00 sleep 365d
root 191718 188026 0 10:38 ? 00:00:00 /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "4052680704" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "insightface-softmax-launcher-vr[1:4]cq,insightface-softmax-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "4052680704.0;tcp://10.244.0.144,192.168.1.126:46328" --mca btl_tcp_if_include "ps" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
root 191734 191718 2 10:38 ? 00:00:00 /usr/local/bin/orted -mca ess env -mca ess_base_jobid 4052680704 -mca ess_base_vpid 2 -mca ess_base_num_procs 3 -mca orte_node_regex insightface-softmax-launcher-vr[1:4]cq,insightface-softmax-worker-[1:0-1]@0(3) -mca orte_hnp_uri 4052680704.0;tcp://10.244.0.144,192.168.1.126:46328 --mca btl_tcp_if_include ps -mca pml ob1 -mca btl ^openib -mca plm rsh -mca plm_rsh_agent /etc/mpi/kubexec.sh -mca orte_default_hostfile /etc/mpi/hostfile -mca hwloc_base_binding_policy none -mca rmaps_base_mapping_policy slot -mca pmix ^s1,s2,cray,isolated
root 191747 191734 99 10:38 ? 00:00:26 python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root 191749 191734 99 10:38 ? 00:00:26 python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root 191751 191734 99 10:38 ? 00:00:28 python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
root 191753 191734 99 10:38 ? 00:00:25 python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:2D:00.0 Off |                    0 |
| N/A   43C    P0    39W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   42C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   41C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:39:00.0 Off |                    0 |
| N/A   40C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  Off  | 00000000:A9:00.0 Off |                    0 |
| N/A   40C    P0    35W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  Off  | 00000000:AD:00.0 Off |                    0 |
| N/A   39C    P0    36W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   40C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  Off  | 00000000:B5:00.0 Off |                    0 |
| N/A   39C    P0    37W / 250W |  10499MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    191747      C   python                                     10489MiB |
|    1    191749      C   python                                     10489MiB |
|    2    191751      C   python                                     10489MiB |
|    3    191748      C   python                                     10489MiB |
|    4    191750      C   python                                     10489MiB |
|    5    191752      C   python                                     10489MiB |
|    6    191754      C   python                                     10489MiB |
|    7    191753      C   python                                     10489MiB |
+-----------------------------------------------------------------------------+
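What stands out in this snapshot is that every GPU holds about 10 GiB of memory while GPU-Util reads 0%. The sketch below pulls that out of the status rows. The helper name idle_but_allocated and the 1024 MiB threshold are made up for illustration, and since GPU-Util is a point-in-time sample, a single 0% reading is a hint rather than proof that the GPUs are idle.

```python
import re

# Status rows for GPU 0 and GPU 7, copied from the nvidia-smi output above
# (whitespace collapsed; the pattern below does not depend on column alignment).
ROWS = """\
| N/A 43C P0 39W / 250W | 10499MiB / 16280MiB | 0% Default |
| N/A 39C P0 37W / 250W | 10499MiB / 16280MiB | 0% Default |
"""

# Captures used MiB, total MiB and the GPU-Util percentage from a status row.
ROW_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def idle_but_allocated(text, min_used_mib=1024):
    """Return (used_mib, total_mib, util_percent) for rows that hold at
    least min_used_mib of memory while reporting 0% utilization."""
    hits = []
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if m is None:
            continue
        used, total, util = (int(g) for g in m.groups())
        if used >= min_used_mib and util == 0:
            hits.append((used, total, util))
    return hits


for used, total, util in idle_but_allocated(ROWS):
    print("%d / %d MiB in use, GPU-Util %d%%" % (used, total, util))
```

Sampling nvidia-smi several times during training would show whether utilization stays at 0% or only dips between batches.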
Minimum reproducible example
I ported train_softmax.py to work with Horovod as follows:
def train_net(args):
    ctx = []
    if args.kv_store == 'horovod':
        # Horovod: one process per GPU, selected by the local rank
        import horovod.mxnet as hvd
        kv = None
        hvd.init()
        ctx.append(mx.gpu(hvd.local_rank()))
        # logging
        head = '%(asctime)-15s Node[' + str(hvd.rank()) + '] %(message)s'
        logging.basicConfig(level=logging.DEBUG, format=head)
    else:
        # KVStore: one process drives every GPU in CUDA_VISIBLE_DEVICES
        kv = mx.kvstore.create(args.kv_store)
        cvd = os.environ['CUDA_VISIBLE_DEVICES'].strip()
        if len(cvd)>0:
            for i in xrange(len(cvd.split(','))):
                ctx.append(mx.gpu(i))
    if len(ctx)==0:
        ctx = [mx.cpu()]
        print('use cpu')
    else:
        print('gpu num:', len(ctx))
    .....
    opt = optimizer.SGD(learning_rate=base_lr, momentum=base_mom, wd=base_wd, rescale_grad=_rescale)
    if args.kv_store == 'horovod':
        # wrap the optimizer so that gradients are allreduced across workers
        opt = hvd.DistributedOptimizer(opt)
    ....
    # create initializer
    model.bind(data_shapes=train_dataiter.provide_data, label_shapes=train_dataiter.provide_label)
    model.init_params(initializer, arg_params=arg_params, aux_params=aux_params)
    (arg_params, aux_params) = model.get_params()
    if args.kv_store == 'horovod':
        # start every worker from the parameters held by rank 0
        hvd.broadcast_parameters(arg_params, root_rank=0)
        hvd.broadcast_parameters(aux_params, root_rank=0)
    model.set_params(arg_params=arg_params, aux_params=aux_params)
    model.fit(train_dataiter,
        begin_epoch        = begin_epoch,
        num_epoch          = end_epoch,
        eval_data          = val_dataiter,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        optimizer          = opt,
        # optimizer_params = optimizer_params,
        # initializer      = initializer,
        # arg_params       = arg_params,
        # aux_params       = aux_params,
        allow_missing      = True,
        batch_end_callback = _batch_callback,
        epoch_end_callback = epoch_cb )
Startup command:
mpirun --mca btl_tcp_if_include ps -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib python train_softmax.py --network r50 --loss-type 2 --margin-m 0.35 --data-dir /datasets/faces_emore --target --kv-store horovod --per-batch-size 100
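For readers unfamiliar with the term, the toy sketch below illustrates what an allreduce over gradients computes. It is purely conceptual: the function allreduce_mean and the sample gradients are invented for illustration, the "workers" are plain Python lists in a single process, and it says nothing about how Horovod actually moves data (via MPI or NCCL) or whether this Horovod branch sums or averages.

```python
def allreduce_mean(per_worker_grads):
    """Toy allreduce: combine one gradient vector per worker into their
    elementwise mean and hand every worker its own copy of the result."""
    num_workers = len(per_worker_grads)
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must contribute vectors of equal length")
    # Reduce step: elementwise mean over workers.
    reduced = [sum(g[i] for g in per_worker_grads) / float(num_workers)
               for i in range(length)]
    # "All" step: every worker ends up with the same reduced vector.
    return [list(reduced) for _ in range(num_workers)]


# Four hypothetical workers, each holding a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = allreduce_mean(grads)
print(result[0])
```

In a real run this exchange happens once per training step for every parameter tensor, so the time spent in it adds directly to the per-batch time.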