apache/mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more

Home Page: https://mxnet.apache.org


Issue when running Distributed Training with Sparse Gradients

BlakeLazarine opened this issue

Description

When running distributed training (multiple instances, each with a single GPU) with sparse gradients (produced by negative sampling), MXNet crashes. I have three implementations: one using a synchronous parameter server, one using an asynchronous parameter server, and one using Horovod. All three train successfully on datasets that do not produce sparse gradients, and the Horovod implementation also trains successfully when only one instance is used.

The async case produces the most descriptive error, but I am still unable to get it working.

Error Message

Error message from the async PS implementation:

```
mxnet.base.MXNetError: Traceback (most recent call last):
  [bt] (8) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa4c8b95133]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa4c8a5b609]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd7172) [0x7fa447ade172]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x3b) [0x7fa4813459fb]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x104) [0x7fa481348624]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x48b) [0x7fa48134694b]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::PullDefault(int, mxnet::NDArray const&, int)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x5c) [0x7fa48151382c]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::EncodeDefaultKey(int, unsigned long, int)+0x159) [0x7fa4814e86a9]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fa4812077ef]
  File "../src/kvstore/./kvstore_dist.h", line 627

MXNetError: Check failed: static_cast<size_t>(pskv.size) == pskv_size (172770864 vs. 447000596) : The value size cannot be changed 447000596. Key is 3
```

Error message from the Horovod implementation:

```
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>: what(): cudaEventSynchronize failed: an illegal memory access was encountered
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** Process received signal ***
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal: Aborted (6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal code: (-6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2add200090]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f2add20000b]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f2add1df859]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f2a5c1ad911]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f2a5c1b938c]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f2a5c1b93f7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f2a5c1b96a9]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 7] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10GPUContext4impl13WaitForEventsERSt5queueISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_5EventEESt5dequeISC_SaISC_EEERKSt6vectorINS0_16TensorTableEntryESaISJ_EERNS0_8TimelineERKSt8functionIFvvEE+0x8a1)[0x7f29f2f94b61]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 8] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x1317a7)[0x7f29f2f957a7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 9] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10ThreadPool4loopEv+0x170)[0x7f29f2f52250]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2a5c1e5de4]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [11] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2add1a2609]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2add2dc133]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** End of error message ***
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:00.918-07:00 Primary job terminated normally, but 1 process returned
2022-08-08T18:52:00.918-07:00 a non-zero exit code. Per user-direction, the job has been aborted.
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:01.918-07:00 [1,0]<stderr>:/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
2022-08-08T18:52:01.919-07:00 [1,0]<stderr>: warnings.warn('resource_tracker: There appear to be %d '
2022-08-08T18:52:02.919-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:02.919-07:00 mpirun.real noticed that process rank 1 with PID 43 on node algo-2 exited on signal 6 (Aborted).
```

The sync PS implementation does not crash, but the loss is very high.

To Reproduce


PS Async

```python
trainer = gluon.Trainer(model.collect_params(),
                        optimizer,
                        optimizer_params,
                        kvstore="dist_async",
                        update_on_kvstore=True)
```
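For the sync PS variant mentioned above, the trainer is presumably constructed the same way with `kvstore="dist_sync"` (an assumption on my part; that script is withheld like the rest of the code):

```python
# Assumed construction of the sync PS trainer, mirroring the async one above.
trainer = gluon.Trainer(model.collect_params(),
                        optimizer,
                        optimizer_params,
                        kvstore="dist_sync",
                        update_on_kvstore=True)
```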

Horovod:

The only difference is in how the trainer is created:

```python
opt = mx.optimizer.create(optimizer, **optimizer_params)

hvd.init()
assert hvd.mpi_threads_supported()

from mpi4py import MPI
comm = MPI.COMM_WORLD

# Broadcast the initial parameters from rank 0 to the other workers.
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

trainer = hvd.DistributedTrainer(params, opt)
```

The training step is performed as:

```python
grads = [i.grad(ctx) for i in model.collect_params().values()
         if i.grad_req != 'null']
trainer.step(batch_size, ignore_stale_grad=True)
model.collect_params().zero_grad()  # for dangling nodes
```

The sparsity of the data enters through the nn.Embedding block and carries through to the loss function.
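To make that concrete, here is a minimal, hypothetical sketch (toy names and sizes; the real model is withheld) of the pattern I mean: an `nn.Embedding` with `sparse_grad=True` whose weight gradient comes out `row_sparse`, which is what the distributed setups then have to push and pull:

```python
import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.cpu()  # single GPU per instance in the real setup; CPU is enough for the sketch

# Toy stand-in for the withheld model: a large-vocabulary embedding with
# sparse_grad=True makes the embedding weight's gradient row_sparse.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Embedding(input_dim=100_000, output_dim=16, sparse_grad=True))
net.add(gluon.nn.Dense(1, flatten=False))
net.initialize(ctx=ctx)

loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# One fake negative-sampling style batch: a positive id plus sampled negative ids per row.
tokens = mx.nd.array([[3, 17, 99_423], [8, 52, 43_210]], ctx=ctx)
labels = mx.nd.array([[1, 0, 0], [1, 0, 0]], ctx=ctx)

with autograd.record():
    loss = loss_fn(net(tokens), labels)
loss.backward()

# Only the rows that were looked up receive gradient; the storage type is row_sparse.
print(net[0].weight.grad(ctx).stype)  # -> 'row_sparse'
```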

I am withholding parts of the code for data-security reasons.

Steps to reproduce


  1. Use AWS SageMaker to run multi-instance training, launched with the toolkit https://github.com/aws/sagemaker-mxnet-training-toolkit/tree/2f26babd9ba72f48d2336f7817d8255b6b2a2adc/src/sagemaker_mxnet_container (a hypothetical launch sketch follows this list).
  2. Use a dataset that produces sparse gradients (very large vocabulary size). Note that training on this dataset works fine when only a single instance is used.
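As a reference for step 1, a rough sketch of the launch (all values here are placeholders; the real entry point, role, image and hyperparameters are withheld). The only settings that matter for this issue are `instance_count=2` and the chosen distribution:

```python
from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point="train.py",                        # placeholder
    role="arn:aws:iam::123456789012:role/SMRole",  # placeholder
    framework_version="1.9.0",
    py_version="py38",
    instance_count=2,                              # multi-instance is what triggers the crash
    instance_type="ml.g4dn.xlarge",
    distribution={"parameter_server": {"enabled": True}},
    # For the Horovod variant, instead:
    # distribution={"mpi": {"enabled": True, "processes_per_host": 1}},
)
estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path
```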

What have you tried to solve it?

  1. Changed MXNet versions (downgraded to 1.8).
  2. Tried 3 different approaches to distributed training.
  3. Used logging to identify the breaking point (a diagnostic sketch follows this list).
  4. Attempted to understand the back-end implementation in kvstore_dist.h, but was unable to make sense of the line referenced in the error message.
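For item 3, this is roughly the kind of logging used (a hypothetical helper, not verbatim from the withheld script; `model` and `ctx` come from the surrounding training code). It prints the storage type and dense size of every gradient right before `trainer.step`, since the kvstore error says the value size for key 3 changed:

```python
def log_grad_storage(model, ctx):
    # Print each trainable parameter's gradient storage type and dense size
    # before trainer.step(), to spot which key corresponds to a row_sparse
    # gradient whose pushed size differs between pushes/pulls.
    for name, param in model.collect_params().items():
        if param.grad_req == 'null':
            continue
        grad = param.grad(ctx)
        print(f"{name}: stype={grad.stype}, shape={grad.shape}, size={grad.size}")
```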

Environment

MXNet 1.9.0
Horovod 0.19.0
OpenMPI 4.0.1
CUDA 11.2.2
Instance type: AWS ml.g4dn.xlarge

Python package pins:

protobuf==3.20.1
h5py==2.10.0
onnx==1.8.1
"numpy<1.20"
pandas==1.3.0
"Pillow>=9.0,<10.0"
"requests<3"
scikit-learn
scipy==1.7.0
gluonnlp==0.10.0
gluoncv==0.8.0
"urllib3<2"
python-dateutil==2.8.0
tqdm==4.39.0
"PyYAML>=5.4,<5.5"
mpi4py==3.0.3
${MX_URL}
awscli
s3fs==0.4.2
opencv-python
