BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.

Home Page: https://berkeleyautomation.github.io/gqcnn


Issue: Bug/Performance Issue

JohnsonQi opened this issue · comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Python version: 3.5.2
  • Installed using pip or ROS: pip
  • GPU model (if applicable): Nvidia 1080Ti

Describe the result you are trying to replicate
(https://berkeleyautomation.github.io/gqcnn/index.html).
I used train_dex-net_2.0.yaml to train the GQ-CNN, but I didn't get the expected results. Strangely, training took only 30 minutes for 5 epochs on the full Dex-Net 2.0 dataset you provided (https://berkeley.app.box.com/s/6mnb2bzi5zfa7qpwyn7uq5atb7vbztng/folder/25803680060).

How can I fix this problem?

Hi @JohnsonQi,

Glad to hear that you're using the GQ-CNN package, and apologies for the trouble.

The training should definitely take longer than that. Can you share models/GQCNN-2.0/training.log?

Thanks,
Vishal

Hi @visatish ,

Thanks for your reply! Here is my training log; I can't figure out what is wrong. I set "train_pct" = 0.8 and "total_pct" = 1.
training.log

Kind regards,
Johnson
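
For reference, those two settings would appear in the training config roughly as below. This is only a sketch of the split-related keys (key names taken from the message above; surrounding structure and other keys are omitted and may differ in the actual train_dex-net_2.0.yaml):

```yaml
# Dataset-split keys in the GQ-CNN training config (sketch; other keys omitted)
train_pct: 0.8   # fraction of the dataset used for training (rest is validation)
total_pct: 1.0   # fraction of the dataset used at all (train + validation)
```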

Hi @JohnsonQi,

I noticed that you're having the same issue as #99, which was resolved over email. It turned out that the benchmark we provided was actually trained on 50 epochs instead of the default 25. I will push a fix for that shortly.

It does seem like you are training on the entire dataset: 26283 steps × 64 samples/step (batch size) × 1.25 (to account for the 80% training split) = 2102640 samples. Can you try training for 50 epochs? I'm not sure where you got 5 epochs from, unless you manually lowered it.
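
The sample-count arithmetic above can be sanity-checked with a quick sketch (the step count, batch size, and 80% training split are taken from the discussion; nothing here is from the gqcnn code itself):

```python
# Check that the reported step count corresponds to one full pass over
# the Dex-Net 2.0 dataset, given the batch size and training split.
steps_per_epoch = 26283   # training steps reported in the log
batch_size = 64           # samples consumed per step
train_pct = 0.8           # training split; dividing by it scales back to the full set

total_samples = int(steps_per_epoch * batch_size / train_pct)
print(total_samples)  # 2102640, i.e. the ~2.1M datapoints in the full dataset
```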

In the meantime, I will try to replicate the result again on my end, although I did replicate it earlier this year for the other issue.

Thanks,
Vishal