endernewton / tf-faster-rcnn

Tensorflow Faster RCNN for Object Detection

Home Page: https://arxiv.org/pdf/1702.02138.pdf

NaN losses during training!

amirhfarzaneh opened this issue · comments

I'm following the exact instructions for training, but this is what I get during training with the command
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ NET=vgg16
+ array=($@)
+ len=3
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ STEPSIZE=50000
+ ITERS=70000
+ ANCHORS='[8,16,32]'
+ RATIOS='[0.5,1,2]'
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ exec
++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE 50000
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '50000'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'DATA_DIR': '/home/amirhf/Projects/tf-faster-rcnn/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'crop',
 'POOLING_SIZE': 7,
 'RESNET': {'BN_TRAIN': False, 'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/amirhf/Projects/tf-faster-rcnn',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': True,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
           'STEPSIZE': 50000,
           'SUMMARY_INTERVAL': 180,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to `/home/amirhf/Projects/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default`
TensorFlow summaries will be saved to `/home/amirhf/Projects/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default`
Loaded dataset `voc_2007_test` for training
Set proposal method: gt
Preparing training data...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
2017-05-11 18:12:37.107319: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107344: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107350: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.404484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.291
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.27GiB
2017-05-11 18:12:37.404517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-05-11 18:12:37.404523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-05-11 18:12:37.404537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Solving...
/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.ckpt
Varibles restored: vgg_16/conv1/conv1_1/biases:0
Varibles restored: vgg_16/conv1/conv1_2/weights:0
Varibles restored: vgg_16/conv1/conv1_2/biases:0
Varibles restored: vgg_16/conv2/conv2_1/weights:0
Varibles restored: vgg_16/conv2/conv2_1/biases:0
Varibles restored: vgg_16/conv2/conv2_2/weights:0
Varibles restored: vgg_16/conv2/conv2_2/biases:0
Varibles restored: vgg_16/conv3/conv3_1/weights:0
Varibles restored: vgg_16/conv3/conv3_1/biases:0
Varibles restored: vgg_16/conv3/conv3_2/weights:0
Varibles restored: vgg_16/conv3/conv3_2/biases:0
Varibles restored: vgg_16/conv3/conv3_3/weights:0
Varibles restored: vgg_16/conv3/conv3_3/biases:0
Varibles restored: vgg_16/conv4/conv4_1/weights:0
Varibles restored: vgg_16/conv4/conv4_1/biases:0
Varibles restored: vgg_16/conv4/conv4_2/weights:0
Varibles restored: vgg_16/conv4/conv4_2/biases:0
Varibles restored: vgg_16/conv4/conv4_3/weights:0
Varibles restored: vgg_16/conv4/conv4_3/biases:0
Varibles restored: vgg_16/conv5/conv5_1/weights:0
Varibles restored: vgg_16/conv5/conv5_1/biases:0
Varibles restored: vgg_16/conv5/conv5_2/weights:0
Varibles restored: vgg_16/conv5/conv5_2/biases:0
Varibles restored: vgg_16/conv5/conv5_3/weights:0
Varibles restored: vgg_16/conv5/conv5_3/biases:0
Varibles restored: vgg_16/fc6/biases:0
Varibles restored: vgg_16/fc7/biases:0
Loaded.
Fix VGG16 layers..
iter: 20 / 70000, total loss: 1.780578
 >>> rpn_loss_cls: 0.331266
 >>> rpn_loss_box: 0.058807
 >>> loss_cls: 0.851354
 >>> loss_box: 0.539151
 >>> lr: 0.001000
speed: 0.908s / iter
iter: 40 / 70000, total loss: 0.701749
 >>> rpn_loss_cls: 0.551406
 >>> rpn_loss_box: 0.128653
 >>> loss_cls: 0.021690
 >>> loss_box: 0.000000
 >>> lr: 0.001000
.
.  [REMOVED LINES TO MAKE THE POST SHORTER]
.
.
iter: 3380 / 70000, total loss: 0.616202
 >>> rpn_loss_cls: 0.100265
 >>> rpn_loss_box: 0.145635
 >>> loss_cls: 0.185931
 >>> loss_box: 0.184371
 >>> lr: 0.001000
speed: 0.433s / iter
iter: 3400 / 70000, total loss: 1.312786
 >>> rpn_loss_cls: 0.295694
 >>> rpn_loss_box: 0.017820
 >>> loss_cls: 0.452280
 >>> loss_box: 0.546992
 >>> lr: 0.001000
speed: 0.432s / iter
iter: 3420 / 70000, total loss: 0.642559
 >>> rpn_loss_cls: 0.132440
 >>> rpn_loss_box: 0.039820
 >>> loss_cls: 0.293447
 >>> loss_box: 0.176852
 >>> lr: 0.001000
speed: 0.431s / iter
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:56: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:58: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:60: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:62: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
iter: 3440 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000

There are these

RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w

warnings, and from there the losses become NaN! I haven't changed anything in the files!
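
(Note: that NumPy warning is a symptom, not the cause. By the time bbox_transform.py complains, the network's box deltas have already diverged: exp() of a huge delta overflows to inf, and inf - inf yields NaN. A minimal illustration of the failure mode, in plain NumPy rather than the repo's code:

import numpy as np

# Once training diverges, exp() of a huge box delta overflows:
pred_w = np.exp(np.array([1000.0]))   # RuntimeWarning: overflow encountered in exp -> [inf]
pred_ctr_x = np.array([np.inf])       # a diverged center prediction
x1 = pred_ctr_x - 0.5 * pred_w        # inf - inf -> RuntimeWarning: invalid value encountered in subtract
print(x1)                             # [nan]

So the real question is why the losses blow up in the first place.)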

Did you try testing? Did you get the same number?

It doesn't go through the testing phase! After all the losses become NaN, it finishes like this:

iter: 3760 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000
speed: 0.430s / iter
2017-05-11 18:39:44.836200: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.836202: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.837950: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838161: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838203: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838346: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838614: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838676: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838770: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838976: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.918997: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 381, in train_net
    sw.train_model(sess, max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 270, in train_model
    self.net.train_step_with_summary(sess, blobs, train_op)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 387, in train_step_with_summary
    feed_dict=feed_dict)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
	 [[Node: gradients/loss_default/mul_grad/Shape/_313 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1614_gradients/loss_default/mul_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'TRAIN/vgg_16/conv5/conv5_2/biases', defined at:
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 381, in train_net
    sw.train_model(sess, max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 105, in train_model
    anchor_ratios=cfg.ANCHOR_RATIOS)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 332, in create_architecture
    self._add_train_summary(var)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 71, in _add_train_summary
    tf.summary.histogram('TRAIN/' + var.op.name, var)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/summary/summary.py", line 209, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
	 [[Node: gradients/loss_default/mul_grad/Shape/_313 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1614_gradients/loss_default/mul_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Command exited with non-zero status 1
1316.02user 363.43system 27:37.30elapsed 101%CPU (0avgtext+0avgdata 3572892maxresident)k
202408inputs+33336outputs (13major+5238801minor)pagefaults 0swaps

No, I mean: did you try testing with the pre-trained model I released?

Yes, and it worked and showed the detected bounding boxes correctly.

So you can get the same number, 78.7?

I'm getting these results for pascal_voc 2007 trainval with vgg16:
AP for aeroplane = 0.6895
AP for bicycle = 0.7835
AP for bird = 0.6753
AP for boat = 0.5338
AP for bottle = 0.5864
AP for bus = 0.7863
AP for car = 0.8411
AP for cat = 0.8395
AP for chair = 0.4778
AP for cow = 0.8139
AP for diningtable = 0.6685
AP for dog = 0.8073
AP for horse = 0.8407
AP for motorbike = 0.7558
AP for person = 0.7715
AP for pottedplant = 0.4624
AP for sheep = 0.7073
AP for sofa = 0.6700
AP for train = 0.7418
AP for tvmonitor = 0.7315
Mean AP = 0.7092

Hmm, this is right... It may be the case that the 980 is not big enough to support GPU NMS and a 256 batch size during training; you may need to find some way to work around that.

Do you think disabling GPU NMS will help? How do I do that?
There are two batch sizes, if I'm not mistaken; should I change those to 128? What are the names of the variables for the batch sizes?
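
(Note: judging from the config dump printed above, the relevant knobs look like USE_GPU_NMS for the NMS path, and TRAIN.BATCH_SIZE / TRAIN.RPN_BATCHSIZE for the two 256-sized batches. An untested sketch: the same --set mechanism the training command already uses should override them without editing any files, e.g.

python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE 50000 USE_GPU_NMS False TRAIN.BATCH_SIZE 128 TRAIN.RPN_BATCHSIZE 128

Whether that actually avoids the NaNs is another question.)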

@endernewton The person in issue #8 also has the same problem, and she's using a K40!

@amirhfarzaneh I guess she figured it out later, and the error was not NaN in training.

@endernewton Could you please share your log files for training? Especially for the voc_2007_trainval dataset with the vgg16 architecture? I think this will be useful to others too. This way we can compare some statistics while training, like what the loss numbers should look like! Thank you in advance.

@amirhfarzaneh the original one is lost. Let me see if I can retrain to get a similar log.

@endernewton the link to the log file you posted appears to be broken.

I just ran the res101 model with gpu_nms and with cpu_nms. gpu_nms gave me NaNs during training; cpu_nms gave me the expected results. I am using one Titan Xp (compute capability 6.1) and configured setup.py with 'sm_61', following the README. Is this expected behavior?

Perhaps the OP would get the expected results if they used cpu_nms...

@endernewton I re-ran the res101 model with gpu_nms and configured the setup.py with 'sm_52'. No NaNs, but I only got 0.65 mAP. I am going to re-run and see what the variance is.

@dancsalo Maybe it's because you have the Xp. The code needs some modifications to work on more recent GPUs, I guess. I haven't got access to such GPUs yet, so I cannot help much.

It seems like the NaN problem occurs only on some GPUs. I have a GTX 980 Ti and the NaNs happen. I have tested the code on a Quadro M4000 and a GTX 1080 and NaNs don't appear; training goes as it should! This is my log file on a 1080 Ti: https://drive.google.com/file/d/0Bz-CTQRw0GZCeTNrcjZ0OFVXRWs/view?usp=sharing

@amirhfarzaneh Hello, my GPU is a Tesla K40c. I also hit the NaN problem; do you know how to fix it?

Hi, can anyone tell me why running tf_train_faster_rcnn.sh gives the same effect and results as tf_test_faster_rcnn.sh? It means the training script didn't work at all. Thanks much.

I had this error, and the only fix was correcting problems in my XML annotation files: some were empty, and some bboxes had negative values. After eliminating them, the error disappeared.
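
(Note: a quick way to scan Pascal-VOC-style annotations for those problems; a hedged sketch assuming the standard Annotations layout, so adjust the path for your dataset. If I remember right, the VOC loader also subtracts 1 from each coordinate, so a coordinate of 0 can end up negative too.

import os
import xml.etree.ElementTree as ET

ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'  # adjust for your dataset
for fname in sorted(os.listdir(ann_dir)):
    try:
        root = ET.parse(os.path.join(ann_dir, fname)).getroot()
    except ET.ParseError:
        print('unparseable annotation: %s' % fname)
        continue
    objs = root.findall('object')
    if not objs:
        print('no objects: %s' % fname)
    for obj in objs:
        bb = obj.find('bndbox')
        x1, y1, x2, y2 = [float(bb.find(t).text)
                          for t in ('xmin', 'ymin', 'xmax', 'ymax')]
        if x1 < 0 or y1 < 0 or x2 <= x1 or y2 <= y1:
            print('bad box in %s: %s' % (fname, (x1, y1, x2, y2)))

Anything it prints is a candidate for the empty/negative-box problem described above.)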

I had this error too, and today I fixed it!
I found there were lots of boxes extending outside my pics.
For example, my pics are 600*600, but there was a box at (550, 550, 650, 650).
When I deleted those pics from trainval.txt, it worked!
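
(Note: for the out-of-bounds case described above, the same kind of scan can compare each box against the image size recorded in the annotation — a hedged sketch that assumes the <size> element is filled in:

import os
import xml.etree.ElementTree as ET

ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'  # adjust for your dataset
for fname in sorted(os.listdir(ann_dir)):
    root = ET.parse(os.path.join(ann_dir, fname)).getroot()
    size = root.find('size')
    w = float(size.find('width').text)
    h = float(size.find('height').text)
    for obj in root.findall('object'):
        bb = obj.find('bndbox')
        x2 = float(bb.find('xmax').text)
        y2 = float(bb.find('ymax').text)
        if x2 > w or y2 > h:
            print('%s: box corner (%g, %g) exceeds %gx%g image' % (fname, x2, y2, w, h))

Rather than dropping whole images from trainval.txt, clipping the offending boxes to the image bounds may also work.)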

My pics are 1280*960. Is that too big? Does it matter? Will your Python code resize them?
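
(Note: if the defaults from the config dump above apply — TRAIN.SCALES [600], TRAIN.MAX_SIZE 1000 — the image is rescaled so its short side becomes 600, unless that would push the long side past 1000. A sketch of that standard rule for a 1280*960 image:

# Standard Faster R-CNN resize rule, using the config values printed above.
target_size, max_size = 600, 1000
h, w = 960, 1280
scale = float(target_size) / min(h, w)    # 600 / 960 = 0.625
if round(scale * max(h, w)) > max_size:   # 0.625 * 1280 = 800 <= 1000, so no cap here
    scale = float(max_size) / max(h, w)
print(scale, int(h * scale), int(w * scale))   # 0.625 600 800

So a 1280*960 image would be resized to 800*600; the size itself should not cause NaNs.)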

Hello, did you fix it?
I've hit the same error and tried all day with no luck.
If you know why it happens, please tell me; I would really appreciate it!

> Hello, did you fix it? I've hit the same error and tried all day with no luck. If you know why it happens, please tell me; I would really appreciate it!

I had this error, and the only fix was correcting problems in my XML annotation files: some were empty, and some bboxes had negative values. After eliminating them, the error disappeared.