loicland / superpoint_graph

Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs

Unstable performance on Semantic3d

zexinyang opened this issue · comments

Hi Loic,

I truly enjoyed your paper and have played with your code for quite a while (mostly training SPG models from scratch on Semantic3D). Thank you so much for sharing it :)
Recently, I've run into some weird behaviour and would appreciate some help.

1. Some trained models can't distinguish between road and grass.
I ran the partition code on Semantic3D (your 11/4 split) once and evaluated the partition quality (the "perfect prediction" upper bound) by assigning each superpoint its majority label. From the scores of model 0 and the visual partition results, we can confirm that the partition step works well.
Based on the same partition results, I trained several models from scratch using your latest source code without any modifications. Specifically, I trained 6 models without RGB using the same hyperparameter settings (s1). However, two of the trained models perform much worse (with mIoUs around 55%) than the others (with mIoUs near 70%). Both the per-class IoUs (models 2 & 3) and the visual prediction results (the misclassification appears in the sg27_4 scan) show that these two models are unable to separate road and grass points. One of the RGB models (model 8), trained from scratch with s2, also suffered from this issue. Is this caused by the multisample strategy or the random seeds? How can I avoid it?

scores on sema3d (11/4)
image

the partition and classification results of sg27_4
image

hyperparameter settings (note that the 4 validation scans are stored in the testfull folder):

s1: CUDA_VISIBLE_DEVICES=0 python learning/main.py --dataset sema3d --SEMA3D_PATH $SEMA3D_DIR --db_test_name testfull --db_train_name train --epochs 500 --lr_steps '[350, 400, 450]' --test_nth_epoch 100 --model_config 'gru_10,f_8' --ptn_nfeat_stn 8 --nworkers 2 --pc_attrib xyzelpsv --odir "results/sema3d/SPG_noRGB"
s2: CUDA_VISIBLE_DEVICES=0 python learning/main.py --dataset sema3d --SEMA3D_PATH $SEMA3D_DIR --db_test_name testfull --db_train_name train --epochs 500 --lr_steps '[350, 400, 450]' --test_nth_epoch 100 --model_config 'gru_10,f_8' --ptn_nfeat_stn 11 --nworkers 2 --pc_attrib xyzrgbelpsv --odir "results/sema3d/SPG_RGB"

2. Why do the pointwise curves (mIoU & oAcc) fluctuate wildly?
To overcome the issue mentioned above, I've tried early stopping (--use_val_set '1') on a custom split of Semantic3D: 9/3/3 training/validation/test scans. Unfortunately, it didn't help, since the pointwise mIoU curve fluctuates irregularly, which means we can't guarantee that the saved model (the best on the validation set) performs well on the test set. If I use count_predicted_batch_hard instead of count_predicted_batch to build a superpoint-wise confusion matrix, the curves become relatively stable, but then I can't evaluate the model on raw points. Any suggestions on how to save the best model? (A toy sketch of how I understand the two counting schemes follows the curves below.)

pointwise mIoU and oAcc curves (row 2&3) on your split of 11/4 scans
image

superpoint-wise curves
image
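To make the distinction concrete, here is roughly how I understand the two counting schemes (a toy sketch with made-up numbers, not the actual code in the metrics module):

import numpy as np

n_classes = 3
conf_point = np.zeros((n_classes, n_classes))   # point-weighted, count_predicted_batch-style
conf_hard  = np.zeros((n_classes, n_classes))   # one vote per superpoint, count_predicted_batch_hard-style

# one superpoint: histogram of the ground-truth point labels inside it, and its predicted class
gt_histogram = np.array([950, 40, 10])          # 1000 points, mostly class 0
predicted = 1

conf_point[:, predicted] += gt_histogram          # every point contributes, so huge superpoints dominate the curves
conf_hard[gt_histogram.argmax(), predicted] += 1  # a single count per superpoint, regardless of its size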

3. The NoEdgeFeat models perform unexpectedly well?
I was interested in the performance of the NoEdgeFeat model on Semantic3D, so I trained two models without any superedge information (one with and one without color) by setting --edge_attribs 'constant'. I was surprised by the evaluation scores (see models 9 & 10): there is no difference between the models with and without superedge features, which contradicts your ablation study on S3DIS (Table 5 in the SPG paper). Any ideas?

Many thanks!

Hi,

Great work! Very rigorous analysis. I will try to help you but I do not have access to my work machine during the confinement and therefore cannot run experiments myself.

  1. For RGB-less clouds, there is no way for the network to distinguish road from grass. If anything, I am very surprised that models 1, 4, 5 and 6 were able to figure it out!

What I think is going on is that the road/grass regions, being very planar and horizontal, end up in huge superpoints. Hence, there are probably only a dozen or so of them in the test set, and a single error can ruin the IoU.

Here are some leads to better understand what is going on:

  • check how many superpoints are road/grass, and compute a histogram of their sizes (log y axis!); a quick sketch follows this list
  • check in the confusion matrix that the faulty models indeed confuse grass and road
  • plot the IoU of grass and road over training to see if they are indeed the culprits
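For the first point, something along these lines should do (a rough sketch with placeholder data; plug in your own per-superpoint labels and sizes):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
road_sizes  = rng.lognormal(mean=9, sigma=3, size=150).astype(int)   # placeholder for the real road superpoint sizes
grass_sizes = rng.lognormal(mean=9, sigma=3, size=340).astype(int)   # placeholder for the real grass superpoint sizes

bins = np.logspace(0, 8, 40)      # superpoint sizes span many orders of magnitude
plt.hist([road_sizes, grass_sizes], bins=bins, label=['road', 'grass'])
plt.xscale('log')
plt.yscale('log')                 # log y axis, as suggested above
plt.xlabel('superpoint size (number of points)')
plt.ylabel('number of superpoints')
plt.legend()
plt.show()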

Here are some leads to improve the results:

  • remove the jittering in the augmentation function (at least on the z axis). The only chance to distinguish between road and grass would be their 'vertical geometric texture'.
  • you can limit the extent of the superpoints by concatenating 'xyz' (or here maybe just xy?) (line 169 in partition) times a user-defined parameter (I think 0.02 is a good start). This may prevent giant superpoints and improve stability; see the sketch after this list.
  • for the RGB-less problem, fuse grass and road into a single class 'ground', since they are essentially geometrically indistinguishable; asking the network to perform an impossible task will only hamper its training.
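For the second lead, the idea is roughly the following (a sketch with placeholder arrays, not the literal code around line 169 of the partition script):

import numpy as np

# placeholder stand-ins for the arrays available in the partition script
n_points = 1000
xyz  = np.random.rand(n_points, 3).astype('float32')   # point coordinates
geof = np.random.rand(n_points, 4).astype('float32')   # geometric features (linearity, planarity, ...)

spatial_weight = 0.02   # user-defined parameter; 0.02 is a reasonable starting point
# appending rescaled coordinates to the partition features penalizes spatially
# extended superpoints, which should break up the giant road/grass segments
features = np.hstack((geof, spatial_weight * xyz)).astype('float32')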
  2. The IoU and oAcc fluctuate wildly because there are massive superpoints of road/grass which are indistinguishable.
    I think you have the right idea with the validation set, provided it is chosen wisely. What are the scenes in your train/valid/test split?

Also I cannot see the validation performance in your plots. Using the val set has helped me a lot to stabilize the performance on S3DIS/vKITTI (but I haven't tried it for sema3d yet).

  3. This is very surprising to me. Can you check at line 98 of spg that the edge feats are indeed a column of 1s with the 'constant' parameter?

Note: class 7 is artifact and not necessarily pedestrians, although they do cause artifacts.

Hi Loic,

Sorry for my delay in getting back to you.
I highly appreciate all your suggestions! There is no need for you to run experiments yourself. It would be great if you could keep pointing out possible issues and giving suggestions.

  1. I think your understanding is correct. In the sg27_4 scan, there are only 153 road and 338 grass superpoints, and some of them are huge (> 100 million points). These huge road/grass superpoints are quite similar and thus very difficult to distinguish, even using RGB values. Please see below the size histogram of sg27_4, and the confusion matrix and road/grass IoUs of model 8 (with RGB).

image
image
image

I have tried all your suggestions: removing the jittering, shrinking the superpoints by concatenating xyz * 0.02, and training with RGB values. It did help, but the training curves remain unstable. Is this normal? Is it because of the random subgraph strategy?
image

  2. Here is my training/validation/test split:
  • training sets: bildstein1, bildstein5, domfountain1, untermaederbrunnen1, neugasse, sg27_1, sg27_5, sg27_9, sg28_4
  • validation sets: domfountain2, untermaederbrunnen3, sg27_2
  • test sets: bildstein_3, domfountain_3, sg27_4
    You can also find below the corresponding validation curves (from past experiments). I probably didn't make a good split, but we can still see that the validation and test IoUs move in opposite directions at some epochs, which means the model saved as the best on the validation set can perform poorly on the test set.
    image
  3. Yes, edge_feats is a column of 1.0 with the 'constant' parameter.

There is an instability due to the fact that the loss operates on superpoints and is unaware of the consequences of its decisions, i.e. the sizes of the superpoints involved, which can be enormous.

A fix (which we didn't keep in the original paper because it decreased the mIoU on S3DIS) would be to weight each term of the cross-entropy corresponding to a superpoint by its size, i.e. its number of points (normalizing by the total size, of course). It should be fairly straightforward; this size is given by segm_size_cpu. Just change the loss to something along the lines of:

(logit.index_select(-1, targets[:,0]+1e-7) * segm_size_cpu).sum() / segm_size_cpu.sum()

Disclaimer: this exact line won't work as-is, because some tensors need to be moved to the GPU.
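For illustration, a size-weighted cross-entropy could look roughly like this (a sketch only; the tensor names are placeholders rather than the exact variables in main.py):

import torch
import torch.nn.functional as F

def size_weighted_cross_entropy(logits, targets, segm_size):
    # logits:    (n_superpoints, n_classes) raw network outputs
    # targets:   (n_superpoints,) majority label of each superpoint
    # segm_size: (n_superpoints,) number of points in each superpoint
    per_sp_loss = F.cross_entropy(logits, targets, reduction='none')  # one loss term per superpoint
    weights = segm_size.to(logits.device).float()
    # weight each term by the superpoint's point count, normalized by the total number of points
    return (per_sp_loss * weights).sum() / weights.sum()

# toy usage
logits = torch.randn(5, 8)                         # 5 superpoints, 8 classes
targets = torch.randint(0, 8, (5,))
segm_size = torch.tensor([12, 400000, 35, 9, 150])
loss = size_weighted_cross_entropy(logits, targets, segm_size)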

Another thing you could do is merge the classes grass and road, especially in the RGB-less setting. Having such prominent indistinguishable classes really must make learning harder.
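A quick sketch of such a remapping (the class ids below are placeholders, not Semantic3D's actual numbering):

import numpy as np

# suppose 1 = road and 2 = grass; both are mapped to a single 'ground' class,
# and the remaining classes are re-indexed so the label set stays contiguous
merge_map = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
old_labels = np.array([0, 1, 2, 2, 3, 4, 1])    # toy per-point (or per-superpoint) labels
new_labels = np.vectorize(merge_map.get)(old_labels)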

Thanks, that makes sense! I'll revise the loss function and let you know the results.
Do you have any idea why the NoEdgeFeat models perform so well?

Hi Loic,

I get your idea of weighting each term of the cross-entropy by the size of the superpoints, but I'm not sure I understand your code. Do you mean to drop the cross_entropy and replace the loss (lines 205 and 256 in main.py) with
loss = (outputs.index_select(-1, label_mode) * segm_size).sum() / segm_size.sum()?
It seems that the index_select here raises an index-out-of-bounds exception.

Merging the road and grass classes should improve the performance considerably, but I need to distinguish them for my task and my data. After shrinking the superpoints and removing the jittering, I trained 8 (colored) models (without the loss fix) and found that 4 of them perform much worse, with mean IoUs under 58%. From the confusion matrices below, it looks like the (latter two) bad models can't correctly classify bush and scape points. Have I done something wrong?
meanIoU 72.1%, 56.2%, 52.7%
image

This is strange, I haven't observed this behaviour myself.

If you plot the test performance at each epoch, do the models end up at an unlucky epoch, or do they get stuck in a bad local minimum?

Hi Loic,

I merged the road and grass classes (= ground) and trained several RGB models using the same hyperparameters. The trained models perform better (72-78% mIoU) and more stably, but there is still the occasional (1 out of 10) bad model. These models did end up at an unlucky epoch, but it also seems that the IoUs of some classes (e.g. car) don't converge. Given the wildly and irregularly fluctuating mIoU curve, I'm not sure early stopping works, because the best model on the validation set can still perform poorly on the test set.
A good model with mIoU 73.4%
image
image
A bad one with mIoU 57.3%
image
image

I also tried your loss fix (weighting each term by the size of its superpoint), but the model isn't trainable due to the extremely imbalanced superpoint sizes. Unfortunately, using --loss_weights didn't help either. I was also wondering whether the unstable training curves are caused by the loss function. If so, why do the test curves remain stable at the same time?
image
image

Hi,

This is a really high-quality thread, thanks to both of you! I was wondering whether the effect of batch normalization might be partly responsible for the fluctuations at the end of training (the running statistics are updated independently of the learning rate). One might try putting the following after https://github.com/loicland/superpoint_graph/blob/ssp%2Bspg/learning/main.py#L179:

if epoch >= args.epochs - 10:
    # during the last 10 epochs, freeze the BatchNorm running statistics
    def set_eval(m):
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()
    model.apply(set_eval)

The snippet requires passing epoch to train(): change https://github.com/loicland/superpoint_graph/blob/ssp%2Bspg/learning/main.py#L176 to def train(epoch): and https://github.com/loicland/superpoint_graph/blob/ssp%2Bspg/learning/main.py#L329 to acc, loss, oacc, avg_iou = train(epoch).
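For clarity, here is a minimal self-contained toy version of the same idea (a dummy model and loop, not the repo's code):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 4))
total_epochs = 50

def set_eval(m):
    # switch BatchNorm layers to eval mode: use the stored running statistics and stop updating them
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.eval()

for epoch in range(total_epochs):
    model.train()                      # training mode is (re-)enabled at the start of each epoch
    if epoch >= total_epochs - 10:     # during the last 10 epochs, freeze the BN running stats
        model.apply(set_eval)
    out = model(torch.randn(32, 8))    # forward pass; BN uses the frozen statistics at the end of training
    # ... loss computation, backward pass and optimizer step would go here ...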