COCO working - Training own dataset, one category - process stuck at "Start training"

Question

chrieke opened this issue 7 years ago · comments

Hi, I adjusted my own dataset to fit the COCO format. It should be correct (bbox and area are dummy values, but that shouldn't matter, right?).

Training on COCO is working for me, so everything is set up.
I have only one category of polygons; the polygons are encoded as [[x, y, x1, y1, ...]]. I switched the number of categories in Datasampler from 80 to 1, as described in issue 72.
The images are 640x480 jpg.
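
For readers comparing their own files, here is a minimal sketch of what a one-category dataset in COCO instance format looks like, plus a basic sanity check on the polygons. The file name, category name, and the helper `check_polygons` are hypothetical and only for illustration; the field names follow the public COCO annotation format.

```python
# Hypothetical one-image, one-category dataset in COCO instance format.
dataset = {
    "images": [
        {"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # One polygon as a flat list: [x0, y0, x1, y1, ...]
            "segmentation": [[100.0, 120.0, 200.0, 120.0,
                              200.0, 220.0, 100.0, 220.0]],
            "area": 10000.0,
            "bbox": [100.0, 120.0, 100.0, 100.0],  # [x, y, width, height]
            "iscrowd": 0,
        },
    ],
    "categories": [{"id": 1, "name": "object", "supercategory": "none"}],
}


def check_polygons(data):
    """Return ids of annotations whose polygons are malformed or leave the image."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in data["images"]}
    bad = []
    for ann in data["annotations"]:
        width, height = sizes[ann["image_id"]]
        for poly in ann["segmentation"]:
            xs, ys = poly[0::2], poly[1::2]
            ok = (
                len(poly) >= 6          # at least three points
                and len(poly) % 2 == 0  # complete (x, y) pairs
                and all(0 <= x <= width for x in xs)
                and all(0 <= y <= height for y in ys)
            )
            if not ok:
                bad.append(ann["id"])
                break
    return bad
```

An empty list from `check_polygons(dataset)` only means the polygons are well formed and inside the image bounds; it does not prove they line up with the objects in the image.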

When I start the training, the JSON files are converted to t7, but then the process seems to be stuck at "start training". Any idea what could be wrong? Thanks!

gnc10 · Answer 1 · Fri Feb 03 2017 18:49:57 GMT+0800 (China Standard Time)

Hi, during training, one epoch of DeepMask takes around 30 minutes on an NVIDIA GTX 960, so maybe wait and see. Also, running watch nvidia-smi will show whether any processes are running on the GPU.

Christoph Rieke · Answer 2 · Mon Feb 06 2017 01:19:32 GMT+0800 (China Standard Time)

Problem fixed. It was a dumb polygon/pixel overlap issue (the y axis was inverted) and/or some images without any segments in them. Before the fix, not a single epoch finished in 8 hours on a p2 instance. Now it takes around 1 hour per epoch.
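
For anyone hitting the same inverted y axis, the following is a minimal sketch of one way to flip polygons back, assuming the source coordinates use a bottom-left origin while COCO expects a top-left origin. The helper name `flip_polygon_y` is hypothetical, and whether the offset should be `height - y` or `height - 1 - y` depends on how the original coordinates were produced, so verify the result visually.

```python
def flip_polygon_y(polygon, image_height):
    """Mirror a flat [x0, y0, x1, y1, ...] polygon vertically within the image.

    Assumes continuous coordinates, so y maps to image_height - y.
    """
    flipped = list(polygon)
    # Odd indices hold the y values; x values are left untouched.
    flipped[1::2] = [image_height - y for y in polygon[1::2]]
    return flipped


# Example for a 640x480 image: the top edge swaps with the bottom edge.
print(flip_polygon_y([10, 0, 20, 480, 30, 100], 480))
```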

Shenoy Pratik · Answer 3 · Sat Mar 04 2017 02:41:30 GMT+0800 (China Standard Time)

Hi @ChrisCKR, did you use the COCO API to check the overlap before coming to that conclusion? I am facing the same issue now.

Christoph Rieke · Answer 4 · Sat Mar 04 2017 03:03:36 GMT+0800 (China Standard Time)

@ps48 I prepared this Jupyter Notebook to visually check the exact overlay. Just replace in_json and in_folder with your own data to see if it exactly fits the COCO dataset. Hope this helps!
https://github.com/chrisckr/COCO_misc/blob/master/COCO_dataExploration.ipynb

Shenoy Pratik · Answer 5 · Mon Mar 06 2017 18:16:52 GMT+0800 (China Standard Time)

Hey @ChrisCKR, thanks for the IPython code. I did something similar using the COCO API. I just wanted to know whether a bounding box is necessary to run DeepMask, or whether the segmentation by itself is enough.

I overlaid the polygons and they were correct, but I am still getting the same issue. Thanks :)

Christoph Rieke · Answer 6 · Tue Mar 07 2017 06:51:05 GMT+0800 (China Standard Time)

@ps48 I haven't checked, but my guess is that you can replace "bbox": [69.64, 205.24, 61.16, 50.76] with "bbox": []; DeepMask should only use the segmentation. Getting bounding box coordinates from polygons is pretty easy; with the shapely library it would just be polygon.bounds.
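
If you would rather fill in real values than leave dummies, here is a minimal dependency-free sketch that derives a COCO-style bbox and area from a flat polygon. The helper names `polygon_to_bbox` and `polygon_area` are hypothetical. Note that shapely's polygon.bounds returns (minx, miny, maxx, maxy), whereas COCO stores [x, y, width, height], so a conversion is needed either way.

```python
def polygon_to_bbox(polygon):
    """Return a COCO-style [x, y, width, height] box for a flat polygon."""
    xs, ys = polygon[0::2], polygon[1::2]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def polygon_area(polygon):
    """Return the polygon area using the shoelace formula."""
    xs, ys = polygon[0::2], polygon[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


# A 100x100 square with its top-left corner at (100, 120).
square = [100.0, 120.0, 200.0, 120.0, 200.0, 220.0, 100.0, 220.0]
print(polygon_to_bbox(square), polygon_area(square))
```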

Shenoy Pratik · Answer 7 · Tue Mar 07 2017 12:30:05 GMT+0800 (China Standard Time)

@ChrisCKR Thanks, I did that using OpenCV's contour bounding rectangle (a simple Python script). But I still have no clue why the training is stuck at "start training".

~/torch/deepmask$ th train.lua
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /home/ubuntu/torch/deepmask/exps/deepmask/exp
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training

Christoph Rieke · Answer 8 · Tue Mar 07 2017 12:41:57 GMT+0800 (China Standard Time)

@ps48 Check that every image has at least one segmentation in the JSON file. I just remembered that I had another thing to fix: I threw out polygons below a certain area threshold but forgot to remove the images that were left without polygons afterwards. Also double-check the JSON formatting (brackets and so on) and the image/segmentation IDs.
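
The checks described above can be sketched roughly as follows, assuming the annotation file has already been loaded into a dict with json.load. The helper names `drop_empty_images` and `orphan_annotations` are hypothetical, not part of the COCO API or DeepMask.

```python
def drop_empty_images(data):
    """Return (cleaned dataset, ids of removed images without annotations)."""
    annotated = {ann["image_id"] for ann in data["annotations"]}
    kept = [img for img in data["images"] if img["id"] in annotated]
    removed = [img["id"] for img in data["images"] if img["id"] not in annotated]
    # Shallow copy so the original dict is left untouched.
    cleaned = dict(data, images=kept)
    return cleaned, removed


def orphan_annotations(data):
    """Return ids of annotations that point at a non-existent image id."""
    image_ids = {img["id"] for img in data["images"]}
    return [ann["id"] for ann in data["annotations"]
            if ann["image_id"] not in image_ids]


# Tiny illustrative dataset: image 2 has no annotation, annotation 11 has no image.
example = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"id": 10, "image_id": 1}, {"id": 11, "image_id": 3}],
}
cleaned, removed = drop_empty_images(example)
print(removed, orphan_annotations(example))
```

Running the same checks on both the training and the validation file is worthwhile, since either one can contain leftovers from filtering.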

Shenoy Pratik · Answer 9 · Wed Mar 08 2017 17:08:40 GMT+0800 (China Standard Time)

Thank you @ChrisCKR, I had some similar issues. 👍 It works now, but it stopped again after two epochs. I think I need to change some target value that is hard-coded for the COCO classes in one of the files, apart from the Datasampler one described in #72.

~/torch/deepmask$ th train.lua -batch 1
batch 1 32
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /home/ubuntu/torch/deepmask/exps/deepmask/exp,batch=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training
[train] | epoch 00001 | s/batch 0.07 | loss: 0.54912
[train] | epoch 00002 | s/batch 0.07 | loss: 0.69407
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/deepmask/trainMeters.lua:57: attempt to index local 'output' (a number value)
stack traceback:
/home/ubuntu/torch/deepmask/trainMeters.lua:57: in function 'add'
/home/ubuntu/torch/deepmask/TrainerDeepMask.lua:133: in function 'test'
train.lua:118: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Christoph Rieke · Answer 10 · Wed Mar 08 2017 18:40:19 GMT+0800 (China Standard Time)

The Datasampler one is the only variable I changed with regard to the number of classes. Check your validation data set; maybe it still has the same problems? After every 2 epochs DeepMask attempts to test IoU and accuracy, so the next line in your output would be [test]... It seems the error comes from that step (see "in function 'test'" in the stack trace); the 2 training epochs worked fine.

Shenoy Pratik · Answer 11 · Fri Mar 10 2017 13:07:53 GMT+0800 (China Standard Time)

@ChrisCKR No luck. The validation JSON is made by the same script as the training JSON, so the val data should be correct. I commented out the validation line in train.lua (line no. 118) to try out training, and it works. But the results are very bad.

Avilash Kumar · Answer 12 · Mon Oct 09 2017 04:20:59 GMT+0800 (China Standard Time)

@ps48
Did you get desirable results on your own dataset?
The network trained on COCO works better on my dataset than the one I fine-tuned, although I ran it for only two epochs.