BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.

Home Page: https://berkeleyautomation.github.io/gqcnn


Issue: Bug/Performance Issue

JohnsonQi opened this issue · comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Python version: 3.5.2
  • Installed using pip or ROS: pip
  • GPU model (if applicable): Nvidia 1080Ti

Describe the result you are trying to replicate
(https://berkeleyautomation.github.io/gqcnn/index.html).
I used train_dex-net_2.0.yaml to train the GQ-CNN, but I didn't get the expected results. Strangely, training took only 30 minutes for 5 epochs on the full Dex-Net 2.0 dataset you provided (https://berkeley.app.box.com/s/6mnb2bzi5zfa7qpwyn7uq5atb7vbztng/folder/25803680060).

How can I fix this problem?

Hi @JohnsonQi,

Glad to hear that you're using the GQ-CNN package, and apologies for the trouble.

The training should definitely take longer than that. Can you share models/GQCNN-2.0/training.log?

Thanks,
Vishal

Hi @visatish ,

Thanks for your reply! Here is my training log; I can't figure out what is wrong. I set "train_pct" = 0.8 and "total_pct" = 1.
training.log

Kind regards,
Johnson
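
For reference, those two settings would appear in the training config roughly as below. This is only a sketch of the split-related keys (key names taken from the message above; surrounding structure and other keys are omitted and may differ in the actual train_dex-net_2.0.yaml):

```yaml
# Dataset-split keys in the GQ-CNN training config (sketch; other keys omitted)
train_pct: 0.8   # fraction of the dataset used for training (rest is validation)
total_pct: 1.0   # fraction of the dataset used at all (train + validation)
```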

Hi @JohnsonQi,

I noticed that you're having the same issue as #99, which was resolved over email. It turned out that the benchmark we provided was actually trained on 50 epochs instead of the default 25. I will push a fix for that shortly.

It does seem like you are training on the entire dataset: 26283 steps × 64 samples/step (batch size) × 1.25 (to account for the 80% training split) = 2102640 samples. Can you try training for 50 epochs? I'm not sure where you got 5 epochs from, unless you manually lowered it.
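
The sample-count arithmetic above can be sanity-checked with a quick sketch (the step count, batch size, and 80% training split are taken from the discussion; nothing here is from the gqcnn code itself):

```python
# Check that the reported step count corresponds to one full pass over
# the Dex-Net 2.0 dataset, given the batch size and training split.
steps_per_epoch = 26283   # training steps reported in the log
batch_size = 64           # samples consumed per step
train_pct = 0.8           # training split; dividing by it scales back to the full set

total_samples = int(steps_per_epoch * batch_size / train_pct)
print(total_samples)  # 2102640, i.e. the ~2.1M datapoints in the full dataset
```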

In the meantime, I will try to replicate the result again on my end, although I did replicate it earlier this year for the other issue.

Thanks,
Vishal