BerkeleyAutomation / gqcnn

Python module for GQ-CNN training and deployment with ROS integration.

Home Page: https://berkeleyautomation.github.io/gqcnn

Issue: Bug/Performance Issue [Custom Images] - training on a dex-net-compatible dataset results in GQ-CNN being unable to predict good grasps ('Pred nonzero' is always 0)

aprath1 opened this issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Python version: 2.7.12
  • Installed using pip or ROS: pip
  • Camera: default

Describe what you are trying to do
I am trying to train a GQ-CNN from scratch on a custom dataset, and also to fine-tune a pretrained GQCNN-2.0 model on that dataset. (The datasets are created using the dex-net API.)

Describe current behavior
Training or fine-tuning (also when optimizing the base CNN layers) on this dataset produces a network that is unable to make any good-grasp predictions. In the log output, 'Pred nonzero' is always 0, even after 5 to 10 iterations in the fine-tuning case. Is this normal behavior?

Describe the expected behavior
I expect the network to correctly predict at least a few of the available good grasps. Interestingly, when I keep the layers up to fc3 or fc4 as base layers and do NOT optimize them, the network does seem to fine-tune properly and predicts some good grasps, although the error rate is still high.

Describe the input images
The input dataset is generated from a dex-net-compatible HDF5 database using the dex-net API. Source of the database: https://dougsm.github.io/egad/ (see the section on dex-net-compatible data).

Describe the physical camera setup
There is no physical camera; the images are rendered synthetically with the dex-net API.

Other info / logs
A few lines of the training log:

GQCNNTrainerTF INFO     Step took 2.304 sec.
GQCNNTrainerTF INFO     Max 0.23993634
GQCNNTrainerTF INFO     Min 0.14524038
GQCNNTrainerTF INFO     Pred nonzero 0
GQCNNTrainerTF INFO     True nonzero 15
GQCNNTrainerTF INFO     Step 27312 (epoch 1.426), 0.02 s
GQCNNTrainerTF INFO     Minibatch loss: 0.478, learning rate: 0.009025
GQCNNTrainerTF INFO     Minibatch error: 11.719
GQCNNTrainerTF INFO     Step took 2.369 sec.
GQCNNTrainerTF INFO     Max 0.23774128
GQCNNTrainerTF INFO     Min 0.19348052
GQCNNTrainerTF INFO     Pred nonzero 0
GQCNNTrainerTF INFO     True nonzero 80
GQCNNTrainerTF INFO     Step 27313 (epoch 1.426), 0.02 s
GQCNNTrainerTF INFO     Minibatch loss: 1.077, learning rate: 0.009025
GQCNNTrainerTF INFO     Minibatch error: 62.5
GQCNNTrainerTF INFO     Step took 2.158 sec.
GQCNNTrainerTF INFO     Max 0.23704815
GQCNNTrainerTF INFO     Min 0.16592737
GQCNNTrainerTF INFO     Pred nonzero 0
GQCNNTrainerTF INFO     True nonzero 45

A few lines of the fine-tuning log (fc3 set as the base layer, using the old format for layers up to fc3, and also optimizing the base layers):

10-04 11:56:09 GQCNNTrainerTF INFO     Step 191576 (epoch 9.999), 0.08 s
10-04 11:56:09 GQCNNTrainerTF INFO     Minibatch loss: 0.433, learning rate: 0.004633
10-04 11:56:09 GQCNNTrainerTF INFO     Minibatch error: 13.281
10-04 11:56:10 GQCNNTrainerTF INFO     Step took 1.242 sec.
10-04 11:56:10 GQCNNTrainerTF INFO     Max 0.25177836
10-04 11:56:10 GQCNNTrainerTF INFO     Min 0.16215596
10-04 11:56:10 GQCNNTrainerTF INFO     Pred nonzero 0
10-04 11:56:10 GQCNNTrainerTF INFO     True nonzero 34
10-04 11:56:10 GQCNNTrainerTF INFO     Step 191577 (epoch 9.999), 0.07 s
10-04 11:56:10 GQCNNTrainerTF INFO     Minibatch loss: 0.577, learning rate: 0.004633
10-04 11:56:10 GQCNNTrainerTF INFO     Minibatch error: 26.563
10-04 11:56:11 GQCNNTrainerTF INFO     Step took 1.171 sec.
10-04 11:56:11 GQCNNTrainerTF INFO     Max 0.25264603
10-04 11:56:11 GQCNNTrainerTF INFO     Min 0.18788987
10-04 11:56:11 GQCNNTrainerTF INFO     Pred nonzero 0
10-04 11:56:11 GQCNNTrainerTF INFO     True nonzero 49
10-04 11:56:11 GQCNNTrainerTF INFO     Step 191578 (epoch 10.0), 0.06 s
10-04 11:56:11 GQCNNTrainerTF INFO     Minibatch loss: 0.709, learning rate: 0.004633
10-04 11:56:11 GQCNNTrainerTF INFO     Minibatch error: 38.281
10-04 11:56:13 GQCNNTrainerTF INFO     Step took 1.36 sec.
10-04 11:56:13 GQCNNTrainerTF INFO     Max 0.25366336
10-04 11:56:13 GQCNNTrainerTF INFO     Min 0.17693533
10-04 11:56:13 GQCNNTrainerTF INFO     Pred nonzero 0
10-04 11:56:13 GQCNNTrainerTF INFO     True nonzero 16
10-04 11:56:13 GQCNNTrainerTF INFO     Step 191579 (epoch 10.0), 0.07 s
10-04 11:56:13 GQCNNTrainerTF INFO     Minibatch loss: 0.423, learning rate: 0.004633
10-04 11:56:13 GQCNNTrainerTF INFO     Minibatch error: 12.5
10-04 11:56:14 GQCNNTrainerTF INFO     Step took 1.24 sec.
10-04 11:56:14 GQCNNTrainerTF INFO     Max 0.25436333
10-04 11:56:14 GQCNNTrainerTF INFO     Min 0.1827491
10-04 11:56:14 GQCNNTrainerTF INFO     Pred nonzero 0
10-04 11:56:14 GQCNNTrainerTF INFO     True nonzero 10
10-04 11:56:14 GQCNNTrainerTF INFO     Step 191580 (epoch 10.0), 0.07 s
10-04 11:56:14 GQCNNTrainerTF INFO     Minibatch loss: 0.372, learning rate: 0.004633
10-04 11:56:14 GQCNNTrainerTF INFO     Minibatch error: 7.813

Another interesting thing is that the softmax output does not look right: of the two outputs, the first value is always around 0.7 and the second around 0.3 (the exact values vary somewhat across training runs due to the random weight initialization); see the quick check after the array below.
Sample softmax output:

array([[0.7649399 , 0.23506004],
       [0.7651925 , 0.23480749],
       [0.76295185, 0.23704815],
       [0.7643285 , 0.23567156],
       [0.7630225 , 0.23697755],
       [0.7642536 , 0.23574635],
       [0.76532423, 0.23467574],
       [0.76295376, 0.23704618],
       [0.7632632 , 0.23673679],
       [0.76498514, 0.23501493],
       [0.7632064 , 0.2367936 ],
       [0.7959242 , 0.20407586],
       [0.7641547 , 0.23584531],
       [0.76448244, 0.23551749],
       [0.76394135, 0.23605862],
       [0.7647108 , 0.23528923],
       [0.7639811 , 0.23601893],
       [0.7649897 , 0.23501036],
       [0.7647293 , 0.23527072],
       [0.7651613 , 0.23483868],
       [0.76307136, 0.23692863],
       [0.7640458 , 0.23595421],
       [0.76476514, 0.23523483],
       [0.7672727 , 0.23272723],
       [0.7630191 , 0.2369809 ],
       [0.7645683 , 0.23543172],
       [0.7641252 , 0.2358748 ],
       [0.7639672 , 0.23603278],
       [0.7635745 , 0.23642555],
       [0.79914796, 0.20085205],
       [0.7640747 , 0.23592529],
       [0.76295626, 0.23704374],
       [0.7648026 , 0.23519741],
       [0.76468086, 0.23531915],
       [0.79236376, 0.20763627],
       [0.763892  , 0.23610799],
       [0.76452196, 0.23547806],
       [0.76323694, 0.2367631 ],
       [0.76363677, 0.23636323],
       [0.7694154 , 0.23058464],
     .......
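
For reference, here is what I believe 'Pred nonzero' measures (my assumption, not confirmed from the source: it counts the minibatch samples whose argmax lands on the second, positive "good grasp" column). With outputs like the ones above the first column always wins, so the count is 0:

import numpy as np

# First few rows of the softmax output shown above.
softmax_out = np.array([[0.7649399 , 0.23506004],
                        [0.7651925 , 0.23480749],
                        [0.76295185, 0.23704815],
                        [0.7959242 , 0.20407586]])

preds = np.argmax(softmax_out, axis=1)   # predicted class per sample
print(np.count_nonzero(preds))           # 0 -> no sample is predicted as a good grasp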

Hi @visatish, could you please let me know if this is normal behavior? Any clue as to what could be causing it?

Yeah, I have the same problem. The dataset I use is the Dex-Net 2.0 dataset and the output is the same: Pred nonzero is always 0.
Did you find the reason?

After testing a little more, it seems that once you train longer (around 3 epochs), the predictions are no longer all 0.
Also, after removing all the noising params from the .yaml file, they are not all 0 at the beginning either...

Hi @elevenjiang1,
Thanks for your comments. I observed the same behavior with respect to the number of training epochs, and in addition that a batch size of 64 strangely gives comparatively better nonzero prediction counts than a batch size of 128.
I am interested in the noise parameters you deactivated. Did you remove all of the parameters below, listed under "# denoising / synthetic data params" in the training config file?

# denoising / synthetic data params
multiplicative_denoising: 1
gamma_shape: 1000.00

symmetrize: 1

gaussian_process_denoising: 1
gaussian_process_rate: 0.5
gaussian_process_scaling_factor: 4.0
gaussian_process_sigma: 0.005
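
For context, my rough understanding of what these augmentations do (the numpy sketch below only illustrates the assumed semantics and is not the library's actual implementation): multiplicative_denoising scales each depth image by a gamma-distributed factor with mean 1.0 (shape gamma_shape), symmetrize randomly reflects the images, and gaussian_process_denoising adds spatially correlated noise with std gaussian_process_sigma, sampled on a grid downscaled by gaussian_process_scaling_factor, to roughly a gaussian_process_rate fraction of the images:

import numpy as np

im = np.random.rand(32, 32).astype(np.float32)  # stand-in for a 32x32 depth image

# Multiplicative gamma noise with mean 1 (shape k, scale 1/k).
gamma_shape = 1000.0
im_mult = im * np.random.gamma(gamma_shape, 1.0 / gamma_shape)

# Gaussian-process-style noise: sample on a coarse grid, then upsample.
gp_rate, gp_scale, gp_sigma = 0.5, 4.0, 0.005
noise = np.zeros_like(im)
if np.random.rand() < gp_rate:
    coarse = gp_sigma * np.random.randn(int(32 / gp_scale), int(32 / gp_scale))
    noise = np.kron(coarse, np.ones((int(gp_scale), int(gp_scale))))  # crude upsample back to 32x32
im_gp = im + noise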

Sorry for the very late reply, @aprath1.
Actually, I found that once you train long enough this problem goes away; removing the parameters doesn't seem to make much difference.

By the way, by "remove" I mean setting them to zero.
Below is my yaml file for training on the Dex-Net 2.0 dataset, which is based on train_dex-net_2.0.yaml:

# general optimization params
train_batch_size: 64
val_batch_size: &val_batch_size 64

# logging params
num_epochs: 40        # number of epochs to train for
eval_frequency: 2    # how often to get validation error (in epochs)
save_frequency: 2    # how often to save output (in epochs)
vis_frequency: 10000  # how often to visualize filters (in epochs)
log_frequency: 300      # how often to log output (in steps)

# train / val split params
train_pct: 0.8              # percentage of the data to use for training vs validation
total_pct: 1.0              # percentage of all the files to use
eval_total_train_error: 0   # whether or not to evaluate the total training error on each validation
max_files_eval: 1000        # the number of validation files to use in each eval

# optimization params
loss: sparse
optimizer: momentum
train_l2_regularizer: 0.0005
base_lr: 0.01
decay_step_multiplier: 0.66   # number of times to go through training datapoints before stepping down decay rate (in epochs)
decay_rate: 0.95
momentum_rate: 0.9
max_training_examples_per_load: 128
drop_rate: 0.0
max_global_grad_norm: 100000000000

# input params
training_mode: classification
image_field_name: depth_ims_tf_table
pose_field_name: hand_poses

# label params
target_metric_name: robust_ferrari_canny  # name of the field to use for the labels
metric_thresh: 0.002                 # threshold for positive examples (label = 1 if grasp_metric > metric_thresh)

# preproc params
num_random_files: 10000     # the number of random files used to compute dataset statistics in preprocessing (lower this to speed up initialization)
preproc_log_frequency: 100 # how often to log preprocessing (in steps)

# denoising / synthetic data params
multiplicative_denoising: 0
gamma_shape: 1000.00

symmetrize: 0

gaussian_process_denoising: 0
gaussian_process_rate: 0.5
gaussian_process_scaling_factor: 4.0
gaussian_process_sigma: 0.005

# tensorboard
tensorboard_port: 6006

# debugging params
debug: &debug 0
debug_num_files: 10 # speeds up initialization
seed: &seed 24098

### GQCNN CONFIG ###
gqcnn:
  # basic data metrics
  im_height: 32
  im_width: 32
  im_channels: 1
  debug: *debug
  seed: *seed

  # needs to match the input data mode that was used for training; determines the pose dimensions for the network
  gripper_mode: legacy_parallel_jaw

  # prediction batch size; during training this will be overridden by the val_batch_size in the optimizer's config file
  batch_size: *val_batch_size

  # architecture
  architecture:
    im_stream:
      conv1_1:
        type: conv
        filt_dim: 7
        num_filt: 64
        pool_size: 1
        pool_stride: 1
        pad: SAME
        norm: 0
        norm_type: local_response
      conv1_2:
        type: conv
        filt_dim: 5
        num_filt: 64
        pool_size: 2
        pool_stride: 2
        pad: SAME
        norm: 1
        norm_type: local_response
      conv2_1:
        type: conv
        filt_dim: 3
        num_filt: 64
        pool_size: 1
        pool_stride: 1
        pad: SAME
        norm: 0
        norm_type: local_response
      conv2_2:
        type: conv
        filt_dim: 3
        num_filt: 64
        pool_size: 2
        pool_stride: 2
        pad: SAME
        norm: 1
        norm_type: local_response
      fc3:
        type: fc
        out_size: 1024
    pose_stream:
      pc1:
        type: pc
        out_size: 16
      pc2:
        type: pc
        out_size: 0
    merge_stream:
      fc4:
        type: fc_merge
        out_size: 1024
      fc5:
        type: fc
        out_size: 2

  # architecture normalization constants
  radius: 2
  alpha: 2.0e-05
  beta: 0.75
  bias: 1.0

  # leaky relu coefficient
  relu_coeff: 0.0
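
A note on how I read the label params in the config above (an illustrative numpy sketch of my assumption, not the library's code): in classification mode each grasp's robust_ferrari_canny value appears to be thresholded at metric_thresh to produce a binary label, and the 'True nonzero' lines in the training log then count the positive labels in a minibatch:

import numpy as np

metric_thresh = 0.002
# Hypothetical robust_ferrari_canny values for a handful of grasps.
metrics = np.array([0.0, 0.0008, 0.0025, 0.0100, 0.0015])
labels = (metrics > metric_thresh).astype(np.int64)
print(labels)                     # [0 0 1 1 0]
print(np.count_nonzero(labels))   # 2 -> "True nonzero" for this (tiny) batch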