Training exits abnormally when the batch size is increased: cuda alloc terminate called after throwing an instance of 'dmlc::Error'

Question

cuisonghui opened this issue 5 years ago · comments

I am using the GPU image of MXNet 1.3. When the training batch size goes above 65000, the training program fails with the following error:
cuda alloc terminate called after throwing an instance of 'dmlc::Error'
what(): [10:07:06] /usr/local/lib/python2.7/dist-packages/mxnet-1.3.0-py2.7.egg/mxnet/cpp-package/include/mxnet-cpp/ndarray.hpp:

I want to make the batch size as large as possible to raise GPU utilization. At a batch size of 65000 the GPU is still not fully utilized.

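This looks related to running out of GPU memory, but my card has 32 GB and only about 12 GB is in use.
I then ran two training programs on the same card inside Docker, and memory usage reached 24 GB, which shows there is enough GPU memory.
Is there some per-worker limit on GPU memory? If so, can it be adjusted?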
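
Since the failure aborts the whole process (`terminate called`), it cannot be caught with a Python `try`/`except`. One way to pin down the exact threshold is to run each trial in a separate process and bisect on the exit code. Below is a minimal sketch, assuming that failures are monotonic in batch size (every size above the threshold fails). The script name `train.py` and its `--batch-size` flag are hypothetical placeholders, not part of my actual setup.

```python
import subprocess
import sys


def run_trial(batch_size, cmd_template):
    """Run one training trial in a child process; True if it exits with code 0.

    cmd_template is a list of argument strings; "{batch_size}" in any
    argument is replaced by the batch size under test.
    """
    cmd = [arg.format(batch_size=batch_size) for arg in cmd_template]
    return subprocess.run(cmd).returncode == 0


def largest_working_batch(works, low, high):
    """Binary-search the largest batch size in [low, high] for which works(b) is True.

    Assumes works(low) is True and that failures are monotonic:
    if a batch size fails, every larger one fails too.
    """
    if not works(low):
        raise ValueError("the lower bound itself fails")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if works(mid):
            low = mid       # mid works: the answer is mid or larger
        else:
            high = mid - 1  # mid fails: the answer is smaller than mid
    return low


if __name__ == "__main__":
    # Hypothetical command line; replace with the real training entry point.
    template = [sys.executable, "train.py", "--batch-size", "{batch_size}"]
    best = largest_working_batch(lambda b: run_trial(b, template), 1024, 200000)
    print("largest batch size that completed:", best)
```

Knowing the exact threshold would help tell whether the limit tracks GPU memory usage or sits at a fixed batch size regardless of memory.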

@zhuhan1236 @lovickie @woso @yiling-dc @songyue1104

Ruoqian Guo · Answer 1 · Fri Sep 11 2020 14:27:30 GMT+0800 (China Standard Time)

I ran into the same problem. Has it been solved?