bourdakos1 / Custom-Object-Detection

Custom Object Detection with TensorFlow

Home Page: https://medium.freecodecamp.org/tracking-the-millenium-falcon-with-tensorflow-c8c86419225e


Object detection with own model fails with ...

jetzzze opened this issue

Hi Nick,
after successfully working through your tutorial "as is", I tried to create my own object detection model. I replaced your pictures with mine, used my own annotation XML files (boxes only), trained it (the run time was very short, only one global step ...), created the graph file and then tested it with the following result. Something seems to go wrong with the "aspect" parameter set to "normal" ... but I do not know what that means ;-)

jre@ibm-jre-mbp $ python object_detection/object_detection_runner.py
dyld: warning, LC_RPATH $ORIGIN/../../_solib_darwin_x86_64/_U_S_Stensorflow_Spython_C_Upywrap_Utensorflow_Uinternal.so___Utensorflow in /Users/jre/Library/Python/2.7/lib/python/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so being ignored in restricted program because it is a relative path
Loading model...
detecting...
2017-12-01 11:55:11.616214: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Traceback (most recent call last):
  File "object_detection/object_detection_runner.py", line 90, in <module>
    detect_objects(image_path)
  File "object_detection/object_detection_runner.py", line 63, in detect_objects
    plt.imshow(image_np, aspect = 'normal')
  File "/Users/jre/Library/Python/2.7/lib/python/site-packages/matplotlib/pyplot.py", line 3080, in imshow
    **kwargs)
  File "/Users/jre/Library/Python/2.7/lib/python/site-packages/matplotlib/__init__.py", line 1710, in inner
    return func(ax, *args, **kwargs)
  File "/Users/jre/Library/Python/2.7/lib/python/site-packages/matplotlib/axes/_axes.py", line 5189, in imshow
    self.set_aspect(aspect)
  File "/Users/jre/Library/Python/2.7/lib/python/site-packages/matplotlib/axes/_base.py", line 1273, in set_aspect
    self._aspect = float(aspect)  # raise ValueError if necessary
ValueError: could not convert string to float: normal
[ /Users/jre/Downloads/Watson/JRE-Object-Detection-master ]
jre@ibm-jre-mbp $

What is going wrong? The installed versions of numpy and matplotlib are listed below:

jre@ibm-jre-mbp $ pip list --format=legacy
altgraph (0.10.2)
appnope (0.1.0)
asn1crypto (0.23.0)
backports-abc (0.5)
backports.functools-lru-cache (1.4)
backports.shutil-get-terminal-size (1.0.0)
backports.weakref (1.0.post1)
bdist-mpkg (0.5.0)
bleach (2.1.1)
bonjour-py (0.3)
certifi (2017.11.5)
cffi (1.11.2)
chardet (3.0.4)
configparser (3.5.0)
cryptography (2.1.3)
cycler (0.10.0)
decorator (4.1.2)
entrypoints (0.2.3)
enum34 (1.1.6)
funcsigs (1.0.2)
functools32 (3.2.3.post2)
futures (3.1.1)
html5lib (1.0b10)
idna (2.6)
ipaddress (1.0.18)
ipykernel (4.6.1)
ipython (5.5.0)
ipython-genutils (0.2.0)
ipywidgets (7.0.5)
Jinja2 (2.10)
jsonschema (2.6.0)
jupyter (1.0.0)
jupyter-client (5.1.0)
jupyter-console (5.2.0)
jupyter-core (4.4.0)
lxml (4.1.1)
macholib (1.5.1)
Markdown (2.6.9)
MarkupSafe (1.0)
matplotlib (2.1.0)
mistune (0.8.1)
mock (2.0.0)
modulegraph (0.10.4)
nbconvert (5.3.1)
nbformat (4.4.0)
notebook (5.2.1)
numpy (1.13.3)
olefile (0.44)
pandocfilters (1.4.2)
pathlib2 (2.3.0)
pbr (3.1.1)
pexpect (4.3.0)
pickleshare (0.7.4)
Pillow (4.3.0)
pip (9.0.1)
prompt-toolkit (1.0.15)
protobuf (3.5.0.post1)
ptyprocess (0.5.2)
py2app (0.7.3)
pycparser (2.18)
Pygments (2.2.0)
pyobjc-core (2.5.1)
pyobjc-framework-Accounts (2.5.1)
pyobjc-framework-AddressBook (2.5.1)
pyobjc-framework-AppleScriptKit (2.5.1)
pyobjc-framework-AppleScriptObjC (2.5.1)
pyobjc-framework-Automator (2.5.1)
pyobjc-framework-CFNetwork (2.5.1)
pyobjc-framework-Cocoa (2.5.1)
pyobjc-framework-Collaboration (2.5.1)
pyobjc-framework-CoreData (2.5.1)
pyobjc-framework-CoreLocation (2.5.1)
pyobjc-framework-CoreText (2.5.1)
pyobjc-framework-DictionaryServices (2.5.1)
pyobjc-framework-EventKit (2.5.1)
pyobjc-framework-ExceptionHandling (2.5.1)
pyobjc-framework-FSEvents (2.5.1)
pyobjc-framework-InputMethodKit (2.5.1)
pyobjc-framework-InstallerPlugins (2.5.1)
pyobjc-framework-InstantMessage (2.5.1)
pyobjc-framework-LatentSemanticMapping (2.5.1)
pyobjc-framework-LaunchServices (2.5.1)
pyobjc-framework-Message (2.5.1)
pyobjc-framework-OpenDirectory (2.5.1)
pyobjc-framework-PreferencePanes (2.5.1)
pyobjc-framework-PubSub (2.5.1)
pyobjc-framework-QTKit (2.5.1)
pyobjc-framework-Quartz (2.5.1)
pyobjc-framework-ScreenSaver (2.5.1)
pyobjc-framework-ScriptingBridge (2.5.1)
pyobjc-framework-SearchKit (2.5.1)
pyobjc-framework-ServiceManagement (2.5.1)
pyobjc-framework-Social (2.5.1)
pyobjc-framework-SyncServices (2.5.1)
pyobjc-framework-SystemConfiguration (2.5.1)
pyobjc-framework-WebKit (2.5.1)
pyOpenSSL (17.4.0)
pyparsing (2.2.0)
pysolr (3.6.0)
python-dateutil (2.6.1)
pytz (2017.3)
pyzmq (16.0.3)
qtconsole (4.3.1)
requests (2.18.4)
scandir (1.6)
scipy (0.13.0b1)
setuptools (38.2.1)
simplegeneric (0.8.1)
singledispatch (3.4.0.3)
six (1.11.0)
subprocess32 (3.2.7)
tensorflow (1.4.0)
tensorflow-tensorboard (0.4.0rc3)
terminado (0.8)
testpath (0.3.1)
tornado (4.5.2)
traitlets (4.3.2)
urllib3 (1.22)
vboxapi (1.0)
watson-developer-cloud (1.0.0)
wcwidth (0.1.7)
webencodings (0.5.1)
Werkzeug (0.12.2)
wheel (0.30.0)
widgetsnbextension (3.0.8)
xattr (0.6.4)
zope.interface (4.1.1)
[ /Users/jre/Downloads/Watson/JRE-Object-Detection-master ]
jre@ibm-jre-mbp $

Try changing it from

plt.imshow(image_np, aspect = 'normal')

to

plt.imshow(image_np)

Looking at the documentation, ‘normal’ isn’t a valid value; not sure why it was running fine before...
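
For reference, a minimal sketch of the fix (assuming matplotlib 2.x, where imshow accepts 'equal', 'auto', or a float for aspect; if I remember right, 'normal' was an alias for 'auto' that older matplotlib versions still accepted, which would explain why it ran before). image_np is the image array already loaded by the runner script:

import matplotlib.pyplot as plt

# matplotlib 2.x accepts 'equal', 'auto', or a float for aspect;
# 'normal' is rejected. Omitting the argument entirely also works.
plt.imshow(image_np, aspect='auto')
plt.show()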

Cheers ... it runs now, but the output images show no identified objects ... I suspect the mistake is in the training, which took only a few minutes and gave this output (there was only one "global step" ...)

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From /Users/jre/Downloads/Watson/JRE-Object-Detection-master/object_detection/trainer.py:176: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From /Users/jre/Downloads/Watson/JRE-Object-Detection-master/object_detection/builders/optimizer_builder.py:105: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/Users/jre/Library/Python/2.7/lib/python/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2017-12-01 10:55:27.175007: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO:tensorflow:Restoring parameters from model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 1.9224 (140.383 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global_step/sec: 0.00850703
Killed: 9
[ /Users/jre/Downloads/Watson/JRE-Object-Detection-master ]
jre@ibm-jre-mbp $

Any idea what's gone wrong with my training? My annotation XML files look like this:

<annotation>
    <folder>images</folder>
    <filename>Strom_8.jpg</filename>
    <size>
        <width>2988</width>
        <height>5312</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Stromzaehler_ID</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>801</xmin>
            <ymin>2162</ymin>
            <xmax>1743</xmax>
            <ymax>2347</ymax>
        </bndbox>
    </object>
    <object>
        <name>Stromzaehler_CT</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1138</xmin>
            <ymin>886</ymin>
            <xmax>1852</xmax>
            <ymax>1101</ymax>
        </bndbox>
    </object>
</annotation>

I started with 20 training pictures per object, and the images I am testing all look very similar to the training pics. So it is 45 pictures in total, with two objects per picture.

I would try to get at minimum 100 images per object, although I'm not sure if that's the problem. I believe Killed: 9 means an abnormal termination of the process. Have you tried running it again? The kernel could have just killed it because it was using too many resources. Are you running this locally?

Yes, I am running locally ... it is a 2015 MacBook Pro ... the training from the tutorial ran without trouble ...

I would just try training it again, and if that doesn't work, add more training data. Just out of curiosity, how many steps did the tutorial run for, and how long did it take?

I ran into the same ValueError: could not convert string to float: normal issue about 10 minutes ago; I just used plt.imshow(image_np, aspect = 'auto') and that worked for me.

Yes, I tried this as well and it worked ... thanks for the advice :-)
But I think I have now found why my training fails ... I have only 16 GB of physical memory in my MacBook ... and once the training starts, it goes up to 65 GB of compressed memory. After the first global step the process is killed ... Probably the following UserWarning points at the reason I am running out of memory:

INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/Users/jreich/Library/Python/2.7/lib/python/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Can I change the configuration to use less memory?

What model are you using? The faster_rcnn_resnet101_coco model and the provided config file kept me around 3.5 GB for the duration, even though I got the same warnings. It's likely the memory; I had the same issue, getting Killed after the first global step, because my Docker was set up with only 2 GB. After increasing that, I was able to train on approx. 120 test images, varying in size but nothing too huge. I am just starting to use this library, so I am by no means well versed in it, but I figured I would share my limited experience.

The batch_size is set to 1 by default, so I don't think you can lower it any further. How large are the images you are trying to train on? Maybe the MobileNet model will use less memory?
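
For reference, batch_size sits in the train_config block of the pipeline config (a sketch based on the 2017-era TF Object Detection API; surrounding fields elided):

train_config: {
  # One image per step is already the minimum.
  batch_size: 1
  ...
}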

The training images are between 3 and 5 MB. I have reduced the file sizes by 85% ... now around 400-500 KB ... running again ...

Memory usage still climbs past 65 GB and the process still kills itself ... now trying to crop the images to get down to files averaging 200 KB, like in your tutorial ...

What are the average pixel dimensions of the images?

I don't think the compression of the image files will have much of an effect, because I'm pretty sure all of the pixels have to be loaded into memory when the images are read. So a reduction in pixel dimensions should have a much more dramatic effect. Just make sure you update your bounding boxes, as sketched below.
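
A minimal sketch of that rescaling, assuming Pillow and the Pascal VOC XML layout shown above (the function name and default width are illustrative):

import xml.etree.ElementTree as ET
from PIL import Image

def resize_with_boxes(image_path, xml_path, target_width=1000):
    # Shrink the image to the target width, keeping the aspect ratio.
    img = Image.open(image_path)
    scale = target_width / float(img.width)
    new_height = int(img.height * scale)
    img.resize((target_width, new_height)).save(image_path)

    # Scale the annotation's size and box coordinates by the same factor.
    tree = ET.parse(xml_path)
    size = tree.find('size')
    size.find('width').text = str(target_width)
    size.find('height').text = str(new_height)
    for box in tree.iter('bndbox'):
        for tag in ('xmin', 'ymin', 'xmax', 'ymax'):
            el = box.find(tag)
            el.text = str(int(round(int(el.text) * scale)))
    tree.write(xml_path)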

Yeah ... I thought so too ... newbie me used pics at around 5XXX x 2XXX pixels ... far too big ... I installed ImageMagick and converted them to a width of 1000 pixels ... Now the files are down to 100-150 KB ... As you say, I will need to update the annotations, as the x and y values no longer fit ... I will probably do that later tonight or over the weekend. I am still in the office and need to go home to my family ... they are waiting for Daddy. It's 6:30pm in Germany :-) time to leave ...
Thanks for guiding me to the right point ... I will make the changes, run the training, and keep you posted!

That should work much better :) Enjoy your weekend!

I'm facing the same problem, the process is killed after 1 step. How can people go home with such an issue?

I was busy last week making a lot of money for my company :-) ... As already mentioned, I reduced the picture size of the training images to 563 x 1000 and recreated the annotation files, and now it runs ... 18 to 20 secs per step ... Thinking about your hint to leverage Nimbix :-)) Thanks for your support!!!

P.S. Can you give a short hint on how I could get the marked objects in output/test_images as cropped JPGs, like in your other tutorial where you create a cropped image for each dog?
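
One possible approach, sketched under the assumption that the detector returns normalized [ymin, xmin, ymax, xmax] boxes as the TF Object Detection API does; function and file names are illustrative, not the tutorial's actual code:

from PIL import Image

def crop_detections(image_path, boxes, scores, threshold=0.5):
    # boxes are normalized [ymin, xmin, ymax, xmax] rows as returned
    # by the TF Object Detection API.
    img = Image.open(image_path)
    w, h = img.size
    for i, (ymin, xmin, ymax, xmax) in enumerate(boxes):
        if scores[i] < threshold:
            continue
        crop = img.crop((int(xmin * w), int(ymin * h),
                         int(xmax * w), int(ymax * h)))
        crop.save('crop_%d.jpg' % i)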

And now get that money back :-). I was sleeping on the office floor because of this. Anyway, I found another option you can add:
batch_queue_capacity: 50
But it is still better to reduce the image size, because as far as I can tell, if the image width and height are larger than in the config file, everything fails. I trained for about 3k steps with 150 pictures, and TensorFlow misfires everywhere, detecting all objects as trained classes. All my hundred tests failed with TensorFlow. In the end I managed to do what I needed with OpenCV functions; very disappointed in TensorFlow.
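
If it helps anyone, in the 2017-era TF Object Detection API that option belongs in the same train_config block of the pipeline config (placement assumed from that API version; surrounding fields elided):

train_config: {
  batch_size: 1
  # Smaller input queue, fewer preloaded images, less memory.
  batch_queue_capacity: 50
  ...
}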

My training fails somewhere around step 123 +/- 3 steps with the error:

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_1_grad/Shape)]]

I found this GitHub thread: tensorflow/models#1618
It suggests this could be because the number of bounding boxes per image does not match the number of classes and class_text entries in the TFRecord file (I don't know where this is :-)

But my loss at that stage is around 0.07 and 0.06.

My question is: Does it make sense to solve this issue (and what would be the right approach), or can I just ignore it, since the loss is already good enough to use the model at that stage ...?

Edit 1: It probably helps to mention that I have 4 class IDs in my label_map file. You only have 2 ...?

Edit 2: Found the num_classes parameter in the faster_rcnn_resnet101.config file and changed it to 4 ... just an educated guess :-) ... starting training again from 0 ... let's see how it works ...
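
For anyone else hitting this, a sketch of the relevant part of the config (assuming the faster_rcnn_resnet101.config layout from the TF Object Detection API; other fields elided):

model {
  faster_rcnn {
    # Must match the number of entries in your label_map file.
    num_classes: 4
    ...
  }
}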

Changing the num_classes parameter solved the issue!!!
Training now runs without errors, but I would be keen to understand at what loss level I can stop training, or what indicators there are to tell when the model has been trained well enough!

Edit 1: The loss after 120 steps is no longer decreasing as quickly; it is still around 0.5 to 0.4 and steadily falling. So it seems it will take longer now. But I would like to know when I can consider the model "solid", like you did in your tutorial.

Thanks for supporting the newbie!

You can run evaluation steps, or stop, test, continue, and test again at the end. It would be great if you manage some success. The loss should be decreasing steadily; if it is bouncing, that's not good: the model can't settle on the right weights, no matter how low the loss gets. As for the actual number you see, you have to know what goes into the loss function (regularization etc.). With a small number of classes it may be better to see a number close to 0; with a single class it could be an actual 0, because there is nothing to compare against. For me the point is not to detect random objects in an image; you can also look at hard negative mining. But in my case it's not working anyway.
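
For concreteness, a sketch of what such evaluation steps can look like with the 2017-era TF Object Detection API (script name and flags assumed from that version; paths are illustrative):

python object_detection/eval.py \
  --logtostderr \
  --pipeline_config_path=faster_rcnn_resnet101.config \
  --checkpoint_dir=train \
  --eval_dir=eval

# Then watch the mAP curves alongside the training loss:
tensorboard --logdir=.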

While training the custom object detection I got this error; can anyone help me out?

File "train.py", line 189, in
tf.app.run()
File "F:\BIN\ERMVA\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "F:\BIN\ERMVA\lib\site-packages\tensorflow\python\util\deprecation.py", line 250, in new_func
return func(*args, **kwargs)
File "train.py", line 185, in main
graph_hook_fn=graph_rewriter_fn)
File "E:\CUSTOM\models1\object_detection\trainer.py", line 299, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "E:\CUSTOM\models1\object_detection\model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "E:\CUSTOM\models1\object_detection\trainer.py", line 212, in _create_losses
prediction_dict = detection_model.predict(images, true_image_shapes)
File "E:\CUSTOM\models1\object_detection\ssd_meta_arch.py", line 575, in predict
preprocessed_inputs)
File "E:\CUSTOM\models1\object_detection\ssd_mobilenet_v1_feature_extractor.py", line 130, in extract_features
use_explicit_padding=self._use_explicit_padding,scope=scope)
TypeError: mobilenet_v1_base() got an unexpected keyword argument 'use_explicit_padding'

Hi @sandeeprddy, this repo relies on a dated version of the TensorFlow Object Detection API. We've moved to a more future-proof version here: https://github.com/cloud-annotations/training
I encourage you to try it out and reopen this issue there if you are still running into problems.