rbgirshick / py-faster-rcnn

Faster R-CNN (Python implementation) -- see https://github.com/ShaoqingRen/faster_rcnn for the official MATLAB version

How to train Faster R-CNN on my own dataset?

JohnnyY8 opened this issue

Hi everyone:
I want to train Faster R-CNN on my own dataset. Because Faster R-CNN does not use the selective search method, I commented out the code related to selective search. However, there are still some errors about roidb and so on.
Can anybody help me? I am not quite sure what I should do to train Faster R-CNN. It is a little complicated for me.
Thanks so much!

@JohnnyY8

Hi, I did the same thing. First you should work through the code and figure out where and which functions are called, and you should try demo.py. Afterwards, the readme has a section called "Beyond the demo" which explains the basic procedure.

Additionally, you should search the issues in this repo. There are actually quite a lot of similar issues that ask the same question.

Furthermore, here is a really good tutorial on how to train on your own dataset. It helped me a lot.

Finally, I'll sum up the main steps for you:

  1. Copy the structure of the pascal voc dataset into FRCN_ROOT/data/, create a symbolic link, and arrange your data in the same manner as the pascal voc dataset. That is actually the best way to avoid large code changes in the following steps.
  2. Create a FRCN_ROOT/lib/datasets/<your_dataset>.py and a <your_dataset>_eval.py corresponding to pascal_voc.py and voc_eval.py.
  3. Update FRCN_ROOT/lib/datasets/factory.py by adding a new entry for your own dataset (see the sketch after this list).
  4. Adapt the models under FRCN_ROOT/models/ by copying and changing an existing one like pascal_voc. Note that you have to take care of the paths within the solver and the number of classes in the train and test prototxts. I recommend starting with the ZF model and the end2end algorithm; alt_opt is more complex and better once you have more experience.
  5. Create a config file under FRCN_ROOT/experiments/cfgs, also by copying and updating an existing one.
  6. Create or update an experiment script under FRCN_ROOT/experiments/scripts by modifying it for your dataset.
  7. Start training and testing by running the experiment script created in the previous step.
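For step 3, a minimal sketch of the factory.py entry (hedged: the class name my_dataset and its constructor are hypothetical placeholders for whatever you create in step 2; the pattern mirrors the existing pascal_voc entries):

```python
# Inside FRCN_ROOT/lib/datasets/factory.py, next to the pascal_voc registrations.
from datasets.my_dataset import my_dataset  # hypothetical dataset class from step 2

for split in ['train', 'val', 'test']:
    name = 'my_dataset_{}'.format(split)
    # The default argument freezes the current value of `split` in each lambda.
    __sets[name] = (lambda split=split: my_dataset(split))
```

After that you can pass the registered name (e.g. --imdb my_dataset_train) to the training scripts.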

These are just the main steps I figured out during my work with the framework. It will take some time to get into it, and several problems will come up when using the framework with your own dataset. Most problems are already addressed in other issues in this repo.

It might also be very helpful to use a python IDE that supports debugging.

Hope that helps. =)

Hi @ednarb29 , thanks sincerely for your answer, I will try it now. Hope I can do it.
In addition, the VID dataset has a lot of frames, more than one million. I am not quite sure whether the code will create a cache file for the VID dataset. Will it take a long time to load the frames every time?
Thank you again!

You can easily check that; the cache file should be under FRCN_ROOT/data/cache/

Of course, if this file is huge, I guess it takes some time even just to load the cache file. Maybe you should debug that. Naively, you can delete the cache file and start training again, so you can compare the time it takes to create the dataset versus loading the cache file.
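If it helps, a tiny sketch for inspecting (and optionally clearing) the cached roidb files so they are rebuilt from the current annotations (assumes you run it from FRCN_ROOT):

```python
import glob
import os

# List the cached roidb pickles and their sizes.
for f in glob.glob('data/cache/*.pkl'):
    print('{} ({:.1f} MB)'.format(f, os.path.getsize(f) / 1e6))
    # os.remove(f)  # uncomment to delete and force a rebuild on the next run
```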

Hi @ednarb29 , I have tried the method you described. There are some errors about selective_search that I can't handle, like the following.
[screenshot of the selective_search error]
In my opinion, Faster R-CNN doesn't use selective search, so I would prefer to comment out the code about selective search, such as "self.selective_search_roidb". But maybe that is not the right way to solve it. Could you please give me some suggestions?

@JohnnyY8 : Can you paste here your configuration information, which is printed in the terminal? I guess that your configuration file still chooses selective search as the proposal method.

@tiepnh Hi! You are right. Following the tutorial "https://github.com/deboc/py-faster-rcnn/tree/master/help", I used the command ($ echo 'MODELS_DIR: "$PY_FASTER_RCNN/models"' >> config.yml) to generate config.yml. But if I change it to "experiments/cfgs/faster_rcnn_end2end.yml", it looks OK.
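For anyone checking the same thing, a small sketch (assuming you run it from FRCN_ROOT with lib/ on the PYTHONPATH) to confirm which proposal method a config actually selects after merging:

```python
from fast_rcnn.config import cfg, cfg_from_file

# Merge the end2end experiment config into the defaults.
cfg_from_file('experiments/cfgs/faster_rcnn_end2end.yml')

print(cfg.TRAIN.PROPOSAL_METHOD)  # expected 'gt' for end2end training (RPN, no selective search)
print(cfg.TRAIN.HAS_RPN)          # expected True
```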

@tiepnh @ednarb29 I can start training now, so it looks close to the right way. I will check it on the validation set after training finishes. Thanks for your help, guys!!!
Another question is about factory.py, shown below. What does the split mean? If there are ["train", "val", "test"], what are they used for? train is for training; what are val and test for?
[screenshot of factory.py]

@JohnnyY8 : This array points to your image set files. In your pasted code, there is no image set file for testing, or the same image set is used for both training and testing.
Example: for pascal_voc, the script file will call this command for training:
time ./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt \
  --weights data/prdcv_models/${NET}.v2.caffemodel \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}
TRAIN_IMDB is "voc_2007_trainval" => it will load all images listed in the image set file ".....trainval.txt".
For testing, it uses TEST_IMDB="voc_2007_test" => it loads the images listed in the image set file "....test.txt" to test the trained network.
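For reference, a rough sketch (modeled on pascal_voc.py, not a verbatim copy) of how a split name ends up pointing at an image set file; devkit_path here is just a stand-in for wherever your data lives:

```python
import os

def image_set_file(devkit_path, year, image_set):
    # e.g. <devkit_path>/VOC2007/ImageSets/Main/trainval.txt -- one image index per line
    return os.path.join(devkit_path, 'VOC' + year, 'ImageSets', 'Main',
                        image_set + '.txt')

print(image_set_file('data/VOCdevkit2007', '2007', 'trainval'))  # used by TRAIN_IMDB=voc_2007_trainval
print(image_set_file('data/VOCdevkit2007', '2007', 'test'))      # used by TEST_IMDB=voc_2007_test
```

So "train", "val", and "test" are just different lists of image indices; which one is used depends only on the imdb name you pass as TRAIN_IMDB or TEST_IMDB.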

@tiepnh Cool! Your answer is very useful and clear! Thanks so much!
That means the ground truth of the PASCAL VOC 2007 test set is under the "Annotations" folder, right? Otherwise, it couldn't compute mAP after training finishes.
But I do not have the ground truth of the VID test set and I use TEST_IMDB="VID_val"; does that mean it will test on the validation set?

@tiepnh Hi!
I used this command to start training:

  • sudo ./tools/train_net.py --gpu 0 --iters 100000 --weights data/imagenet_models/ZF.v2.caffemodel --imdb VID_train --cfg ./experiments/cfgs/faster_rcnn_end2end.yml --solver models/pascal_voc/ZF/faster_rcnn_end2end/solver.prototxt

but I still got the following errors:
Traceback (most recent call last):
File "./tools/train_net.py", line 112, in
max_iters=args.max_iters)
File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 155, in train_net
roidb = filter_roidb(roidb)
File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 145, in filter_roidb
filtered_roidb = [entry for entry in roidb if is_valid(entry)]
File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 134, in is_valid
overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

Is there something wrong?

@JohnnyY8 :

That means the ground truth of PASCAL VOC 2007 test set is under the "Annotations" folder, right?
For both the test set and the train set, the ground truth of pascal_voc is under Annotations.

For TEST_IMDB, it just points to the set of images used for testing. So, if you use the same image set for TRAIN_IMDB and TEST_IMDB, it will train and test the network on the same dataset.
Secondly, you have to write your own test function. See this tutorial: https://github.com/deboc/py-faster-rcnn/tree/master/lib/datasets

About the "max_overlaps" error: it seems that your data has no foreground or background ROIs. So please check again the .py file you use to read your dataset.

@tiepnh Thank you so much! You are so nice.
I have found some bugs and restarted training.
Let's wait for the results.
Really, thanks for your help!

@tiepnh @ednarb29 Hi!
I restarted training, but a strange problem occurred. I printed some of the paths in train.txt, like this:
[screenshot of printed paths]
When I look at the printed information in the terminal, I notice that the data has been loaded many times! My teammate and I are pretty sure it went through the whole training set at least once, but this information shows it starting from 0000 again.
[screenshot of terminal output]
Could you please help me? We have been loading training data for more than 20 hours.
Thank you so much!

First, I would suggest you start training and testing with a very small data set (100 images and 1k iterations) so that you can debug the training and testing quite quickly.

Does the problem occur during creation of the data set or during training?

@ednarb29 I am not quite sure. Several times before, loading the data took about 2~4 hours (also loading repeatedly). But this time is stranger: we did not change any code, we just restarted the training, and the time for loading data is very long!

@ednarb29 Do you load the data only once after starting training?

I am not sure about that because this kind of problem did not occur for me... If I had problems with loading the data set, I just removed the cache file, and that solved the problem in most cases, because changes to the original data set are not reflected in the cache file. Sorry dude.

Hi @JohnnyY8,
I completely agree with ednarb29's idea: you should test with a (very) small dataset at first.
Moreover, I'm pretty sure that it's a bad idea to print anything for each data input. That may be the cause of the enormous additional loading time you got.

@ednarb29 No need to be sorry, I should thank you!
I will remove the cache file and restart training! Really thanks for your help!

@deboc That is right. I will try it. Thank you!
Will printing something really cause such a huge loading time?

commented

I just bet it's not negligible.
You were saying the loading time had risen from 4h to 20h, right? What did you change besides adding this print?

@deboc Oh, I see. We only added print statements, so it is strange for us.

Did removing the print command speed up the process?

And did removing the cache file and building the database again solve your problem with the KeyError: 'max_overlaps'?

@ednarb29 I didn't try removing the print command. Because I really want to see the process, I guess the time it consumes is negligible.
And removing the cache file works; my training now gets into the iterations. Thanks a lot!

Cool, so if it works fine you can close the issue? =)

@ednarb29 Sure, thank you very much!

@deboc , I have a quick question. I got the following error when I executed the following command:

Command:
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel --imdb inria_train --cfg config.yml

Error:

.....
I0725 04:10:00.437233  3494 net.cpp:816] Ignoring source layer conv4_3
I0725 04:10:00.437252  3494 net.cpp:816] Ignoring source layer relu4_3
I0725 04:10:00.437268  3494 net.cpp:816] Ignoring source layer pool4
I0725 04:10:00.437296  3494 net.cpp:816] Ignoring source layer conv5_1
I0725 04:10:00.437314  3494 net.cpp:816] Ignoring source layer relu5_1
I0725 04:10:00.437331  3494 net.cpp:816] Ignoring source layer conv5_2
I0725 04:10:00.437350  3494 net.cpp:816] Ignoring source layer relu5_2
I0725 04:10:00.437366  3494 net.cpp:816] Ignoring source layer conv5_3
I0725 04:10:00.437384  3494 net.cpp:816] Ignoring source layer relu5_3
I0725 04:10:00.437397  3494 net.cpp:816] Ignoring source layer conv5_3_relu5_3_0_split
I0725 04:10:00.437405  3494 net.cpp:816] Ignoring source layer roi_pool5
F0725 04:10:00.737687  3494 net.cpp:829] Cannot copy param 0 weights from layer 'fc6'; shape mismatch.  Source param shape is 4096 25088 (102760448); target param shape is 4096 18432 (75497472). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
*** Check failure stack trace: ***

I read that there's basically a mismatch between the sizes in the saved weights and the sizes the network has been set up to expect. The one thing I can imagine is that it is because I am using the faster-rcnn VGG16 model (data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel). Is it possible to use this model instead of the one you mentioned (data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel)?

P.S. Thank you for that awesome tutorial !

Hi GeorgiAngelov,
I see you are using a final faster-rcnn caffemodel as the pretrained network, but that one doesn't have a compatible fc6 layer, hence your issue.
The classical way for another dataset would be to use a pretrained caffe classifier for your data and use its train.prototxt to build a faster-rcnn model.
So I suggest you investigate which classifier was used in your pretrained model, and provide that caffemodel (e.g. VGG_CNN_M_1024.v2.caffemodel) instead of the faster-rcnn one in the weights option.

@GeorgiAngelov Hi!
I think the weights should be set to an ImageNet pretrained model, not a faster rcnn final model.
Hope it can help you.

@deboc, is the VGG_CNN_M_1024.v2.caffemodel considered a pre-trained model? I am wondering if this model in itself is already capable of classifying objects. My basic idea is that I would like to start training a model with my own data, but I would like that model to already be a trained model so I can leverage the weights.

My idea is that you can pretty much start with a trained .caffemodel file such as the VGG16_faster_rcnn_final.caffemodel and then train it even further. It appears that this might not be possible with this model in particular.

My question is: What does the v2 stand for in VGG_CNN_M_1024.v2.caffemodel and can I get a final model from this model to actually use it with tools/demo.py for example?

@JohnnyY8 , thank you for clarifying that. Until now, I was assuming that a model is a model is a model; I did not differentiate between a pretrained model and a final model. I guess I am still not clear on the distinction.

@GeorgiAngelov If you want to train on a final caffemodel and go further, it may be OK. Just pay attention to the differences in network architecture.
I also do not know what v2 means. But according to the tutorial, I treat it as the pre-trained model when I train Faster R-CNN on my own dataset. And the final caffemodel can be used directly to detect objects.

Some confusion here. Every .caffemodel contains a pretrained model, with the weights of a converged neural network. The ones of faster-rcnn just also happen to be called "final" models.

Before touching faster-rcnn I suggest you start by getting more used to the caffe deep learning framework. A lot of pre-trained models can be found in the model zoo and are ready to use. Most of them are classifiers that can infer an object class from an image. VGG_CNN_M_1024.v2.caffemodel is one of those (sorry, I don't know about the v2 either, but the originals are from there).
Indeed you can finetune a classifier by removing the last layer and adapting it for another dataset. For that you can carefully change the learning rate of each layer in order to balance between a "start from scratch" policy and a "reuse the former network" policy.
Good tutorials about caffe can be found on the Berkeley Vision website

Now about faster-rcnn. It's a framework for object detection, developed by R. Girshick. It uses the convnet classifier of your choice, and the training phase learns how to detect the objects classified by the underlying classifier.
That's why you need to reuse or finetune a classifier for your data, before even considering detection (and faster-rcnn).

So :

  • If your objects are already classified by a converged model from the caffe zoo (e.g. 'aeroplane', 'bicycle', 'bird', 'person', etc. for VGG), you can directly use this model to launch a faster-rcnn training
  • If not, forget faster-rcnn for now and take a look at the caffe tutorials to build your own classifier

@JohnnyY8 : Hey, could you share how you managed to solve the "max_overlaps" issue?

@vikiboy Hi, I do not remember it clearly; it seems that there were a few ground-truth xml files that did not contain any objects. I removed them and the corresponding images. Hope it can help you.

@vikiboy In addition, please pay attention to the coordinates in the ImageNet annotations; they start from 1, not 0. I remember that there are two places that need to be modified. The first one is lib/datasets/your_dataset.py. The second one is lib/datasets/imdb.py. I am not quite sure about what I remember; please try them.
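For context, the usual place this shows up (a hedged, hypothetical helper, not code from the repo): pascal_voc.py subtracts 1 from every coordinate because VOC annotations are 1-based, and if your own annotations are already 0-based that subtraction can produce negative values and trip the assertion in imdb.append_flipped_images.

```python
import xml.etree.ElementTree as ET
import numpy as np

def load_boxes(xml_path, one_based=True):
    """Parse VOC-style bounding boxes; set one_based=False if your
    annotations already use 0-based pixel coordinates."""
    offset = 1 if one_based else 0
    objs = ET.parse(xml_path).findall('object')
    boxes = np.zeros((len(objs), 4), dtype=np.uint16)
    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        boxes[ix, :] = [float(bbox.find(tag).text) - offset
                        for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
    return boxes
```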

Hi, I carried out ednarb29's method, but when I ran ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml , I got the error below.

Output will be saved to /home/keisan/py-faster-rcnn/output/default/train
Filtered 0 roidb entries: 1228 -> 1228
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1107 12:32:17.155658 12497 io.cpp:36] Check failed: fd != -1 (-1 vs. -1) File not found: ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_optpt/stage1_rpn_solver60k80k.pt
*** Check failure stack trace: ***
The file "stage1_rpn_solver60k80k.pt" exists in ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_opt.

What should I do?

@miyamon11 Hi:
I did not try to train a model with alt_opt. But according to the error info "~/py-faster-rcnn/models/INRIA_Person/**faster_rcnn_alt_optpt/**stage1_rpn_solver60k80k.pt", is there a problem here? I mean the "optpt" part.

I followed this tutorial but got the following errors:

Traceback (most recent call last):
File "./tools/train_net.py", line 113, in
max_iters=args.max_iters)
File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 157, in train_net
pretrained_model=pretrained_model)
File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 53, in init
self.solver.net.layers[0].set_roidb(roidb)
File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 68, in set_roidb
self._shuffle_roidb_inds()
File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 35, in _shuffle_roidb_inds
inds = np.reshape(inds, (-1, 2))
File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 224, in reshape
return reshape(newshape, order=order)
ValueError: total size of new array must be unchanged

Any ideas?

inds = np.reshape(inds, (-1, 2)): because the second dimension of the reshape is 2, you should use only an even number of images in the data set.
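For context, a paraphrased sketch of what _shuffle_roidb_inds in lib/roi_data_layer/layer.py does when cfg.TRAIN.ASPECT_GROUPING is enabled; with an odd number of images the final reshape fails exactly as in the traceback above. Using an even number of images, or turning TRAIN.ASPECT_GROUPING off in your config, should both avoid it:

```python
import numpy as np

# Toy roidb with 5 images (odd count) to reproduce the failure mode.
widths  = np.array([640, 640, 480, 500, 375])
heights = np.array([480, 480, 640, 375, 500])

# Horizontal and vertical images are permuted separately, then paired two at a time.
horz_inds = np.where(widths >= heights)[0]
vert_inds = np.where(widths < heights)[0]
inds = np.hstack((np.random.permutation(horz_inds),
                  np.random.permutation(vert_inds)))

inds = np.reshape(inds, (-1, 2))  # ValueError: total size of new array must be unchanged
```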

@GeorgiAngelov The tutorial of @deboc uses the image_net model VGG_CNN_M_1024.v2.caffemodel. You can get it by following the steps here https://github.com/deboc/py-faster-rcnn#download-pre-trained-imagenet-models.

@ednarb29

First, I would suggest you start training and testing with a very small data set (100 images and 1k iterations) so that you can debug the training and testing quite quickly.

Does the problem occur during creation of the data set or during training?

Thanks, I had the same problem:

overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

I deleted the cache file and it is now running.

@ednarb29

What tool should I use to create imdb files?

@ednarb29 , removing the cache file fixed the max_overlaps problem for me.

@ArturoDeza
What tool/code did you use to make the imdb file for training?

@VanitarNordic , I don't think there's a quick recipe for that. I've been following this setup:
https://github.com/smallcorgi/Faster-RCNN_TF
You will have to modify some lines of code in factory.py, and copy the pascal_voc.py file to your own my_dataset.py file and modify the lines of code regarding the number of training classes. (Besides also annotating all your images with .xml files.)
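A small sketch of the class-count changes usually involved (hedged: the class names below are hypothetical examples, and the fragment only shows the values you would plug into a my_dataset.py copied from pascal_voc.py):

```python
# In the copied dataset class, self._classes is the main thing to change;
# '__background__' must stay at index 0.
CLASSES = ('__background__',  # always index 0
           'spider')          # your own object classes go here

NUM_CLASSES = len(CLASSES)
# The prototxts must agree with this count: cls_score num_output = NUM_CLASSES,
# bbox_pred num_output = 4 * NUM_CLASSES.
print('cls_score: {}, bbox_pred: {}'.format(NUM_CLASSES, 4 * NUM_CLASSES))
```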

@ArturoDeza
Thanks, actually I have the annotated files, but I'm stuck on the imdb creation :-(

@VanitarNordic What is the error you've been getting? You should create a new issue with the error you get when you run the end2end training script, that way we can be more helpful.

@ArturoDeza
No, but I don't understand one thing: when we have a custom dataset, how is the model trained on it? The end-to-end training does not seem to have a dataset input parameter.

Hi!
I am getting the following error:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "./tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
max_iters=max_iters)
File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
model_paths = sw.train_model(max_iters)
File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
self.solver.step(1)
File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
blobs = self._get_next_minibatch()
File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
return get_minibatch(minibatch_db, self._num_classes)
File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 22, in get_minibatch
assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

Can anyone help me with that?

I"m using INRIA Person data set. After running below command

./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml

I got an error:
File "./tools/train_faster_rcnn_alt_opt.py", line 62
print 'Loaded dataset {:s} for training'.format(imdb.name)
^
SyntaxError: invalid syntax

Can you please let me know the reason behind this error?

Do you have any solutions for this error?
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
max_iters=max_iters)
File "/home/medhani/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
model_paths = sw.train_model(max_iters)
File "/home/medhani/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
self.solver.step(1)
File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
blobs = self._get_next_minibatch()
File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
return get_minibatch(minibatch_db, self._num_classes)
File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 27, in get_minibatch
assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

Thanks

@medhani It's not finding any images, which means either the path to your images is wrong, or there are no images listed in your image set text file.
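A hedged debugging sketch along those lines (assumes lib/ is on the PYTHONPATH; 'inria_train' is the imdb name used in this thread, substitute your own):

```python
from datasets.factory import get_imdb

imdb = get_imdb('inria_train')
print('num_images: {}'.format(imdb.num_images))  # 0 here would explain the ZeroDivisionError

if imdb.num_images > 0:
    # Make sure this path actually exists on disk.
    print('first image: {}'.format(imdb.image_path_at(0)))
```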

Thanks Sean, I feel like there is a problem with my annotation file.
[screenshot of the annotation .xml file]

I'm training my network for spider detection. The annotation files are in .xml format. Is this the correct structure for the .xml file?

@Roskgp96 Have you been able to find a solution for the error below?
line 27, in get_minibatch
assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

I used another modification of Faster R-CNN in TF, and it saves the permutation into snapshots. In my case, I traced the code and found out that I was using an OLD permutation loaded with my snapshot. That means that if you modified the number of testing or training images, it is possible that you would index outside the permutation array, get a zero index back, and then load nothing from the roidb. A simple solution is to delete all snapshots or modify the permutation in your train_val.py after it is loaded. Hope it helps.

@ivalab Thanks, after I deleted all the .pyc files under "$FRCN/lib/", it trains well without the ZeroDivisionError. @medhani Have you solved the problem? You could also try this method.

@deboc Apologies for digging up an old discussion topic, but you mentioned that we have the option to reuse a pre-trained model that already classifies our objects OR train our own model from scratch. Would that put any restrictions on how we train our faster R-CNN? Would the approximate joint (end-2-end) approach be better than the alternating training method?

Hi,
I'm trying to train the net on my own dataset, which I created using video of a microphone. It seems that I did everything as ednarb29 wrote (starting from the model I got from training on VOC2007), but the results are really surprising:

  1. Testing a picture from my dataset gives me the proper region and class=microphone (the only class (+background) I kept during training) with 1.0 probability.
  2. Testing a picture not from my dataset gives me nothing. That can be explained, I think, by my dataset being too small (hundreds of pics of one mic).
  3. What really surprised me is that any picture from the voc dataset gives me bounding boxes around the voc objects with the microphone label and a lower probability.
    What have I done wrong?

Excuse me. When I finished training my own model and used it with demo.py to run detection, the results came out all white (including the image) whenever the input image was very large (5000 x 3000 pixels). If the image is not too large, there is no problem. What could be the reason?

@mantou22 sorry, I do not understand "the results were all white"?

I"m using INRIA Person data set. After running below command

./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml

I got an error:
File "./tools/train_faster_rcnn_alt_opt.py", line 62
print 'Loaded dataset {:s} for training'.format(imdb.name)
^
SyntaxError: invalid syntax

Can you please let me know the reason behind this error?

Have you fixed it?
I met the same problem.

I"m using INRIA Person data set. After running below command
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml
I got an error:
File "./tools/train_faster_rcnn_alt_opt.py", line 62
print 'Loaded dataset {:s} for training'.format(imdb.name)
^
SyntaxError: invalid syntax
Can you please let me know the reason behind this error?

Have you fixed it?
I met the same problem.

Hey, I have the same problem. Have you fixed it?