for help

Question

VinsonXuxuxu opened this issue 6 years ago · comments

Xu Zhuang commented 6 years ago

Sorry to disturb you again. How can I solve this problem?

pengzhou1108 · Answer 1 · Thu Oct 25 2018 23:32:14 GMT+0800 (China Standard Time)

Hi,

I think this is due to the TensorFlow version. The input format of the ResNet bottleneck changed after 0.12. You can switch to 0.12 and check whether the error is still there.
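A quick way to confirm which release is installed before switching is to compare the version string against the 0.12 series. This is only a minimal sketch: the helper names `parse_version` and `matches_expected` are made up for illustration and are not part of TensorFlow or of this repository.

```python
import re


def parse_version(version):
    """Return the leading (major, minor) numbers of a string such as '0.12.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def matches_expected(version, expected=(0, 12)):
    """True when the version belongs to the 0.12 series suggested above."""
    return parse_version(version) == expected


# Example with literal strings; in a real environment pass tf.__version__ instead.
for candidate in ("0.12.1", "1.4.0"):
    print(candidate, matches_expected(candidate))
```

In a training environment the call would be `matches_expected(tf.__version__)` after `import tensorflow as tf`.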

Xu Zhuang · Answer 2 · Sun Oct 28 2018 20:20:07 GMT+0800 (China Standard Time)

Hi, thanks for your reply. During training, I often encounter a problem like this.

Do you know why?

pengzhou1108 · Answer 3 · Mon Oct 29 2018 02:49:19 GMT+0800 (China Standard Time)

I think there is not enough GPU memory. You can reduce the RPN batch size to lower the memory required.

sunglowzhang · Answer 4 · Sat Nov 03 2018 22:57:12 GMT+0800 (China Standard Time)

Hi @pengzhou1108, I have a similar issue. My GPU is a GeForce 1080 with ~8 GB of memory. After changing the RPN batch size to 1, it still doesn't work. Any ideas on how to figure this out?

Detailed Log

Issues in log:

E tensorflow/stream_executor/cuda/cuda_fft.cc:169] failed to create cuFFT batched plan:2
E tensorflow/stream_executor/cuda/cuda_fft.cc:111] failed to run cuFFT routine cufftSetStream: 1
E tensorflow/stream_executor/cuda/cuda_fft.cc:169] failed to create cuFFT batched plan:2
W tensorflow/core/framework/op_kernel.cc:975] Internal: c2c fft failed : in.shape=[3136,16384]
[[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: c2c fft failed : in.shape=[3136,16384]
[[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./tools/trainval_net.py", line 174, in <module>
max_iters=args.max_iters)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 356, in train_net
sw.train_model(sess, max_iters)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 247, in train_model
self.net.train_step(sess, blobs, train_op)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/network_fusion.py", line 456, in train_step
feed_dict=feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: c2c fft failed : in.shape=[3136,16384]
[[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op 'noise_pred/FFT', defined at:
File "./tools/trainval_net.py", line 174, in <module>
max_iters=args.max_iters)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 356, in train_net
sw.train_model(sess, max_iters)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/model/train_val.py", line 105, in train_model
anchor_ratios=cfg.ANCHOR_RATIOS)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/network_fusion.py", line 377, in create_architecture
rois, cls_prob, bbox_pred = self.build_network(sess, training)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/resnet_fusion.py", line 275, in build_network
bilinear_pool=compact_bilinear_pooling_layer(fc7,noise_fc7,2048*8,compute_size=16,sequential=False)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/compact_bilinear_pooling/compact_bilinear_pooling.py", line 137, in compact_bilinear_pooling_layer
sequential, compute_size)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/compact_bilinear_pooling/compact_bilinear_pooling.py", line 12, in _fft
return tf.fft(bottom)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 800, in fft
result = _op_def_lib.apply_op("FFT", input=input, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()

InternalError (see above for traceback): c2c fft failed : in.shape=[3136,16384]
[[Node: noise_pred/FFT = FFT_device="/job:localhost/replica:0/task:0/gpu:0"]]

pengzhou1108 · Answer 5 · Mon Nov 05 2018 05:13:00 GMT+0800 (China Standard Time)

Hi,

I guess you did not change the batch size correctly. How did you change the RPN batch size? You should change the value in the cfgs/*.yml file rather than in the files in the lib folder, because the yml file holds the parameters actually used for training and testing.
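The reason an edit in the lib folder may have no effect can be sketched with plain dictionaries: values loaded later replace the defaults. This is only an illustration of that precedence, not the repository's actual loader. `merge_config` is a hypothetical helper, and the default of 256 is a placeholder; only the key `TRAIN.RPN_BATCHSIZE` and the yml value of 64 come from this thread.

```python
import copy


def merge_config(defaults, overrides):
    """Return a copy of defaults with every value in overrides applied on top."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError("unknown config key: %s" % key)
        if isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = merge_config(merged[key], value)  # recurse into sections
        else:
            merged[key] = value  # the later source wins
    return merged


lib_defaults = {"TRAIN": {"RPN_BATCHSIZE": 256}}  # placeholder default, assumed
yml_values = {"TRAIN": {"RPN_BATCHSIZE": 64}}     # value the yml file is said to hold

effective = merge_config(lib_defaults, yml_values)
print(effective["TRAIN"]["RPN_BATCHSIZE"])
```

Under this assumption, changing `lib_defaults` alone leaves `effective` untouched, which is why the yml file is the place to edit.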

sunglowzhang · Answer 6 · Mon Nov 05 2018 10:21:35 GMT+0800 (China Standard Time)

I set TRAIN.RPN_BATCHSIZE = 1 in train_faster_rcnn.sh to override the config. In the printed log I noticed the setting worked.

One update:
I checked the log again and found that in resnet_fusion.py, when the network is built, compact_bilinear_pooling_layer is called with sequential=False. I changed this to True, and now it runs. Does sequential=True make sense?

rois, cls_prob, bbox_pred = self.build_network(sess, training)
File "/home/sunglow/Downloads/RGB-N-master/tools/../lib/nets/resnet_fusion.py", line 275, in build_network
bilinear_pool=compact_bilinear_pooling_layer(fc7,noise_fc7,2048*8,compute_size=16,sequential=False)
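A rough estimate suggests why sequential=True could help. The sketch below assumes that the sequential mode runs the FFT on sub-batches of at most compute_size rows instead of on the whole tensor at once, and that the FFT input is complex64 (8 bytes per element). Both are assumptions about the pooling layer, and the helper names are hypothetical. The numbers 3136, 2048*8, and 16 are taken from the log and the call above.

```python
def sub_batch_slices(batch_rows, compute_size):
    """Yield (start, stop) row ranges of at most compute_size rows each."""
    for start in range(0, batch_rows, compute_size):
        yield start, min(start + compute_size, batch_rows)


def complex64_bytes(rows, cols):
    """Bytes needed for a rows x cols tensor of complex64 values."""
    return rows * cols * 8


rows, cols, compute_size = 3136, 2048 * 8, 16
slices = list(sub_batch_slices(rows, compute_size))
full_mib = complex64_bytes(rows, cols) / 2.0 ** 20
chunk_mib = complex64_bytes(compute_size, cols) / 2.0 ** 20

print("sub-batches: %d" % len(slices))
print("one-shot FFT input: %.0f MiB" % full_mib)
print("per sub-batch FFT input: %.0f MiB" % chunk_mib)
```

This counts only the FFT input tensor; cuFFT plan workspace and the rest of the network need additional memory, so the real footprint is larger.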

pengzhou1108 · Answer 7 · Mon Nov 05 2018 10:35:11 GMT+0800 (China Standard Time)

The log you sent indicates that the dimensionality of the bilinear layer is [3136 (64x7x7), 16384], which means the batch size is 64 (the default RPN batch size in the yml file). In my case, sequential is set to False.
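The batch size can be read back from the logged shape with a little arithmetic. This is a minimal sketch assuming, as stated above, that the leading dimension equals batch size times a 7x7 pooled grid; `implied_batch_size` is a hypothetical helper name.

```python
def implied_batch_size(fft_rows, pooled_h=7, pooled_w=7):
    """Recover the batch size from the leading dimension of the FFT input."""
    cells = pooled_h * pooled_w
    if fft_rows % cells != 0:
        raise ValueError("%d rows is not a multiple of %d" % (fft_rows, cells))
    return fft_rows // cells


print(implied_batch_size(3136))  # leading dimension from the log
```

If an override to a batch size of 1 had reached this layer, the same relation would give a leading dimension of 49 rather than 3136.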

sunglowzhang · Answer 8 · Mon Nov 05 2018 10:56:04 GMT+0800 (China Standard Time)

Hmm... thanks! I will change the config yml directly.