Batch normalization --training parameter

Question

galinator9000 opened this issue 5 years ago

Hi, I wanted to use the YOLOv3-tiny model, so I downloaded the cfg and weights from the official website.

With the command below I successfully built the .pb and .meta files:
python main.py --cfg ../yolov3-tiny/yolov3-tiny.cfg --weights ../yolov3-tiny/yolov3-tiny.weights --output ../yolov3-tiny/ --prefix "YOLO/"

With the script below I could load the graph and weights.
When I tried to get the output of the last layer, convolutional13, I got an array full of nan values:

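import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.train.import_meta_graph)
import numpy as np
import cv2

# Rebuild the graph from the .meta file and restore the converted weights.
saver = tf.train.import_meta_graph("yolov3-tiny/yolov3-tiny.meta")
sess = tf.Session()
saver.restore(sess, "yolov3-tiny/yolov3-tiny.ckpt")

# Load the image as RGB, scale it to [0, 1], and add a batch dimension.
image = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB) / 255.0
image = np.expand_dims(image, axis=0)

# Fetch the output of the last convolutional layer.
print(sess.run("YOLO/convolutional13/BiasAdd:0", feed_dict={"YOLO/net1:0": image}))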

Outputs:

[[[[nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   ...
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]]

  [[nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   ...
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]
   [nan nan nan ... nan nan nan]]]]

However, when I ran the same conversion with the --training parameter:
python main.py --training --cfg ../yolov3-tiny/yolov3-tiny.cfg --weights ../yolov3-tiny/yolov3-tiny.weights --output ../yolov3-tiny/ --prefix "YOLO/"

The same script outputs:

[[[[-0.5312634   0.23449755 -0.22042923 ... -0.99058443 -0.75764066
     0.05638865]
   [-0.1264087  -0.06148954 -0.13978335 ... -0.57391363 -0.65091616
    -0.34988856]
   [-0.27005857  0.18064664 -0.1842366  ... -0.7720764  -0.63676864
    -0.22235665]
   ...
   [-0.14108022  0.12593661  0.040429   ... -0.51453155 -0.8112872
    -0.2482701 ]
   [-0.14169356  0.05826963  0.04545707 ... -0.36210614 -0.6568373
    -0.17424914]
   [-0.24074644  0.49974358 -0.17072684 ... -1.1237179  -0.8400626
    -0.20994306]]

  [[-0.37883073  0.06569445  0.07646853 ... -0.72665095 -0.5669313
     0.23495841]
   [-0.11390454  0.00512573  0.09839267 ...  0.02260823 -0.31830767
     0.00776402]
   [-0.18927872  0.14090516  0.06336813 ... -0.17192174 -0.3423958
     0.07134365]
   ...
   [-0.5374908   0.17205149  0.30092606 ... -1.299513   -0.50735444
    -0.45372528]
   [-0.44234592  0.17717186  0.11988509 ... -0.9887123  -0.25854525
    -0.40106654]
   [-0.30651295  0.32414198  0.01627261 ... -1.7556211  -0.55981153
    -0.5505434 ]]]]

I believe the difference comes from batch normalization, which the --training parameter affects. I want to use this model for transfer learning.
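
One way the two modes could diverge is sketched below. It assumes, without confirmation from this thread, that --training makes batch normalization use the statistics of the current batch, while the default uses the moving mean and variance stored in the checkpoint. Under that assumption, a corrupted stored variance (here a hypothetical negative value) produces nan only in inference mode. The function `batch_norm` and all values are illustrative and not taken from the converter.

```python
import numpy as np

def batch_norm(x, gamma, beta, moving_mean, moving_var, training, eps=1e-5):
    """Per-channel batch normalization over an NHWC tensor (illustrative only)."""
    if training:
        # Training mode: statistics come from the current batch.
        mean = x.mean(axis=(0, 1, 2))
        var = x.var(axis=(0, 1, 2))
    else:
        # Inference mode: statistics come from the stored moving averages.
        mean, var = moving_mean, moving_var
    with np.errstate(invalid="ignore"):  # silence the sqrt-of-negative warning
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 4, 2)).astype(np.float32)
gamma = np.ones(2, dtype=np.float32)
beta = np.zeros(2, dtype=np.float32)
moving_mean = np.zeros(2, dtype=np.float32)
# Hypothetical corrupted checkpoint value: a negative variance in channel 0.
bad_var = np.array([-0.3, 1.0], dtype=np.float32)

inference_out = batch_norm(x, gamma, beta, moving_mean, bad_var, training=False)
training_out = batch_norm(x, gamma, beta, moving_mean, bad_var, training=True)

print(np.isnan(inference_out).any())  # True: sqrt of a negative variance
print(np.isnan(training_out).any())   # False: the stored variance is ignored
```

If this is what happens, finite outputs with --training would not prove the checkpoint is healthy, because the stored statistics are simply never read in that mode.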

Also, when I tried to get the output of earlier layers such as convolutional2 (without the --training parameter), the values looked like this:

[[[[nan -1.4262159e+36 -1.6400952e+36 ... -1.5521092e+36
     1.1826908e+38 -1.1971094e+37]
   [           nan -5.4608188e+36           -inf ... -2.9475174e+35
    -2.9942158e+36           -inf]
   [           nan -5.4608188e+36           -inf ... -2.9475174e+35
    -2.9942158e+36           -inf]
   ...
   [           nan -5.4608188e+36           -inf ... -2.9475174e+35
    -2.9942158e+36           -inf]
   [           nan -5.4608188e+36           -inf ... -2.9475174e+35
    -2.9942158e+36           -inf]
   [           nan -4.9901782e+36 -2.4481979e+36 ...  8.4210530e+36
              -inf -1.1353102e+37]]

  [[           nan -1.3676106e+36            inf ...  1.5158864e+37
               inf -8.5954786e+36]
   [           nan -7.9527132e+36            inf ...  2.1685821e+37
     1.6828479e+37           -inf]
   [           nan -7.9527132e+36            inf ...  2.1685821e+37
     1.6828479e+37           -inf]
   ...
   [           nan -3.1938362e+36            inf ...  1.5331453e+37
     3.3975579e+37 -9.5892951e+36]
   [           nan -3.1938362e+36            inf ...  1.5331453e+37
     3.3975579e+37 -9.5892951e+36]
   [           nan -5.6393693e+36  4.6983167e+37 ...  1.0347686e+37
    -5.8164126e+36 -4.1906564e+36]]]]

Is this a problem with the code, or am I missing something, for example about the image input?

Sambhav Jain · Answer 1 · Thu Feb 28 2019 06:03:30 GMT+0800 (China Standard Time)

@fmehmetun Thanks for reporting this. After a little digging, this seems to be due to different weight offsets (16 vs 20) for different major/minor versions. So, yolov2-tiny, yolov3-tiny and yolov3 seem to require an offset of 20 instead of 16. If not set properly, this can corrupt the converted TF weights (ckpt), which likely caused the nans you reported.
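
For readers who want to check which offset a given file needs, here is a minimal sketch. It assumes the usual darknet .weights layout: three little-endian 32-bit integers (major, minor, revision) followed by a "seen" counter that takes 4 bytes in older files and 8 bytes from version 0.2 on. The function names are hypothetical, and this is not the fix that was pushed to the converter.

```python
import struct

def darknet_header_size(first_12_bytes):
    """Return the assumed byte offset of the first weight in a .weights file."""
    major, minor, _revision = struct.unpack("<3i", first_12_bytes)
    # Assumption: from version 0.2 on, the "seen" counter is stored in
    # 8 bytes instead of 4, so the header grows from 16 to 20 bytes.
    if major * 10 + minor >= 2 and major < 1000 and minor < 1000:
        return 12 + 8
    return 12 + 4

def read_header_size(path):
    """Read the version fields from a file on disk and return the offset."""
    with open(path, "rb") as f:
        return darknet_header_size(f.read(12))

# Synthetic headers, so the example runs without a real weights file.
print(darknet_header_size(struct.pack("<3i", 0, 1, 0)))  # 16
print(darknet_header_size(struct.pack("<3i", 0, 2, 0)))  # 20
```

Reading the weights from the wrong offset shifts every value by one 32-bit float, so each parameter array would be filled with numbers meant for its neighbour.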

Fortunately someone fixed this for darkflow in this PR. From a quick test, it seems to resolve your issue. I'll run some more tests and push the fix shortly.

Sambhav Jain · Answer 2 · Thu Feb 28 2019 06:40:03 GMT+0800 (China Standard Time)

@fmehmetun - give it a try and let me know if you see any other issues.

Gali · Answer 3 · Thu Feb 28 2019 18:51:00 GMT+0800 (China Standard Time)

Thanks for the fix. I tried it just now and it works without any problems. After opening the issue I also tried darkflow, and that worked without problems too. It's good to know I have another option for conversion. Thanks.