tinghuiz / SfMLearner

An unsupervised learning framework for depth and ego-motion estimation from monocular videos


TF 2.0 code

ezorfa opened this issue · comments

Does anybody (@tinghuiz, @ClementPinard, @Huang-Jin) have TF 2.0 code for SfMLearner? I tried to rewrite everything from scratch; however, I have a few doubts:

The model initially predicts depth only in the middle region of the image, and the prediction slowly spreads to other areas. This behaviour is not present in the original TF 1.0 code.

If anybody is willing to review the code, I am happy to share it.

Screenshot 2019-11-21 at 11 02 10

Screenshot 2019-11-21 at 11 03 14

Sounds like a problem related to borders.
Without the code, I'd focus my search on two hypotheses:

  • Inverse warping on image borders is done by interpolating color with the "out of bound" gray (the value is 0). The inverse-warped point is then incentivized to stay colored, because anything is better than gray, which makes it stick to the border. I had this problem when working on my PyTorch version. I started a thread about it on the PyTorch forums that you might find interesting: https://discuss.pytorch.org/t/spatial-transformer-networks-boundary-grid-interpolation-behaviour/8891
  • The smoothness loss is not correctly defined on borders. If border points are not discarded, the optimization might try to zero them out in order to have low contrast with the gray out-of-bound color. This can happen if you compute the gradient/Laplacian map with convolution and padding. (A sketch addressing both points follows this list.)
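A minimal TF 2.0 sketch of both fixes; `pix_coords` (the source-pixel sampling locations produced by the inverse warp) and the function names are illustrative, not from the original repo:

```python
import tensorflow as tf

def masked_photometric_loss(warped, target, pix_coords):
    """L1 photometric loss that ignores pixels whose sampling location
    fell outside the source image, so out-of-bound gray (zeros) never
    competes with real colors at the borders."""
    h = tf.cast(tf.shape(target)[1], tf.float32)
    w = tf.cast(tf.shape(target)[2], tf.float32)
    x, y = pix_coords[..., 0], pix_coords[..., 1]
    # A pixel contributes only if its sampling location is in bounds.
    valid = tf.cast(
        (x >= 0.0) & (x <= w - 1.0) & (y >= 0.0) & (y <= h - 1.0),
        tf.float32)[..., None]                       # [B, H, W, 1] mask
    diff = tf.abs(warped - target) * valid
    return tf.reduce_sum(diff) / (3.0 * tf.reduce_sum(valid) + 1e-7)

def smooth_loss(disp):
    """First-order smoothness computed by slicing rather than by a
    padded convolution, so border pixels never see artificial zeros."""
    dx = tf.abs(disp[:, :, 1:, :] - disp[:, :, :-1, :])
    dy = tf.abs(disp[:, 1:, :, :] - disp[:, :-1, :, :])
    return tf.reduce_mean(dx) + tf.reduce_mean(dy)
```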

I am not a TF 2.0 specialist, but I can have a look at a GitHub repo if you want (I don't have a PC to try the code on at the moment).

Hey @ClementPinard!
Thank you for your reply. Please look at the GitHub repo here: https://github.com/ezorfa/SfmLearner

Of course, you need to change the dataset_dir path in train_one.py; the rest remains the same.

Regards

Hey @ClementPinard!
Here is a better link to the code: https://github.com/ezorfa/sfm_tf2.git, in case the previous link does not open for you.

Let me know if you have any problems opening it!

Thank you!

I have been able to look at it. What are the differences from tinghui's code? I don't see anything suspect (or even different from the original TF 1.0 code) in the points I hinted at.

@ClementPinard Actually, there are no differences at all. Even then, the behaviour is as I showed in the photos above. I rewrote the code completely to make it compatible with TF 2.0.

For example, the changes I made are in the data loading pipeline and the training loop.
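Concretely, the TF 2.0 training loop has roughly this shape (a minimal sketch; `dispnet`, `posenet`, and `total_loss` are placeholders for the actual modules):

```python
import tensorflow as tf

# Placeholders: dispnet, posenet, and total_loss stand in for whatever
# the TF 2.0 port actually defines.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.9)

@tf.function
def train_step(tgt_image, src_image_stack, intrinsics):
    # Record the forward pass for automatic differentiation.
    with tf.GradientTape() as tape:
        disparities = dispnet(tgt_image, training=True)
        poses = posenet(tgt_image, src_image_stack, training=True)
        loss = total_loss(disparities, poses, tgt_image,
                          src_image_stack, intrinsics)
    variables = dispnet.trainable_variables + posenet.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss
```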

In my understanding, if I have changed nothing (at least in the loss functions), then the TF 1.0 and TF 2.0 behaviour should be the same. But TF 1.0 behaves much better initially. So, do you think I need to adapt the code (the loss functions) to the PyTorch version written by you?

Can you explain your statement, please: "I don't see anything suspect (or even different from the original TF 1.0 code) in the points I hinted at"?

I evaluated depth on this version of the model after 180K iterations:

| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|------|-----------|-------|-------|-------|
| 0.2190 | 3.0390 | 7.2760 | 0.2991 | 0.7078 | 0.8873 | 0.9486 |

Sadly, it is far from the actual result cited on @tinghuiz's GitHub:

| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|------|-----------|-------|-------|-------|
| 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |

Do you think it's okay for the depth results to differ this much?

@ClementPinard One more thing that is confusing me:

Different models saved at different points during training give different values of RMSE, Sq Rel, etc. For example, a model I saved at 80K iterations gave me better values than the model saved at 180K iterations. So on what basis should I choose the best model? Is this a problem you faced too?

The same TF 2.0 model at 80K iterations gave:

| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|------|-----------|-------|-------|-------|
| 0.1936 | 1.9007 | 6.8840 | 0.2764 | 0.7272 | 0.8957 | 0.9555 |

while at 180K iterations:

| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|------|-----------|-------|-------|-------|
| 0.2190 | 3.0390 | 7.2760 | 0.2991 | 0.7078 | 0.8873 | 0.9486 |

Hahaha, welcome to the wonderful world of data snooping!

This is a typical case showing that a validation set is not a test set, since you can evaluate on the validation set as much as you want. Here, early stopping is part of the hyperparameters, and the validation set helped you get it right. If you want a real test, you can try the KITTI depth evaluation benchmark, but that is only for testing: the more you evaluate on the test set (the admins won't let you anyway), the less information each evaluation actually carries, since you changed your hyperparameters in the hope of getting a better result next time, which ends up producing a method optimized for this very dataset and nothing else.
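In code terms, that amounts to treating early stopping as checkpoint selection on the validation metric only (a sketch; `train_for`, `evaluate_abs_rel`, `model`, and `val_set` are placeholder names):

```python
# Early stopping as checkpoint selection: keep the checkpoint with the
# best validation Abs Rel; the test set stays untouched until the end.
# train_for, evaluate_abs_rel, model, and val_set are placeholders.
best_abs_rel = float("inf")
for step in range(10_000, 200_001, 10_000):
    train_for(10_000)                           # train another chunk
    abs_rel = evaluate_abs_rel(model, val_set)  # validation metric only
    if abs_rel < best_abs_rel:
        best_abs_rel = abs_rel
        model.save_weights("ckpt_best")         # the model you report
```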

What this shows is that the photometric reprojection error is a good starting point for self-supervision, but it will drift away from the optimum. The loss doesn't perfectly model the problem, and thus the loss minimum is not the perfect depth, as the two are not exactly equivalent.

As such, you have to regularize your training workflow, either with a smoothness loss (there's a whole variety of them: TV loss, diffusion, gradient diffusion, texture-aware, and so on) or with a better loss model where the minimum (and the path to get to it) is indeed the perfect depth and pose. One attempt, which I am not particularly convinced by, is the SSIM loss, but there are several other photometric losses, e.g. the ones used for optimization-based optical flow, that could be used.
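For instance, the SSIM/L1 mix popularized by Monodepth (not part of the original SfMLearner loss) can be sketched in TF 2.0 as:

```python
import tensorflow as tf

def ssim_l1_loss(warped, target, alpha=0.85):
    """SSIM/L1 photometric mix (Monodepth-style); images in [0, 1].
    tf.image.ssim returns a per-image similarity in [-1, 1]."""
    l1 = tf.reduce_mean(tf.abs(warped - target))
    ssim = tf.reduce_mean(tf.image.ssim(warped, target, max_val=1.0))
    return alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1
```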

Anyway, the point of the SfMLearner repo is not to get state of the art on KITTI anymore, but rather to provide a basis for further research, because research iterations with new photometric or smoothness losses are often messy and hard to build code upon, even if they produce better results. I actually think your results are quite honourable, and your mission now is rather to write it from scratch with the same algorithms, and hopefully the same results, but with a more readable, TF 2.0-friendly syntax.

As for the different behaviour between the two versions: since the Python code is exactly the same, it's hard to say without knowing the deep internal differences between TF 1.0 and TF 2.0, which I don't. However, I can tell you that my PyTorch version, if used with the exact same hyperparameters as this repo, will fail to converge, even though I tried to apply the exact same training algorithm. It took me some time to check every little detail, and the frustrating conclusion of this investigation was that something internal to PyTorch and TensorFlow had to be different somehow, and there was nothing I could do about it, apart from tweaking the hyperparameters a little to get roughly the same results.

Thank you for such a detailed explanation. It took me some time to understand it completely. Could you clarify the point you made:

I can tell you that my PyTorch version, if used with the exact same hyperparameters as this repo, will fail to converge

By "not converging", do you mean the total loss does not converge, or that you don't see a reasonable depth map at all on TensorBoard? (Such a situation, where the depth map is never properly generated, did happen to me the first time I tried to train with the same code.)

@ClementPinard
Also, I think that to avoid such a situation, seeding would help, if we can get training to work properly with some seed. But in my case, the seed 8964 given in the original code did not work! I cannot wrap my head around this: why do the TF versions behave differently?
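For what it's worth, seeding everything in TF 2.0 looks like this (note that many GPU kernels are nondeterministic even when seeded, so runs can still differ):

```python
import random
import numpy as np
import tensorflow as tf

SEED = 8964  # the seed used in the original SfMLearner code

random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)  # TF 2.x equivalent of tf.set_random_seed
# Even with all seeds fixed, many GPU ops are nondeterministic,
# so two runs can still diverge.
```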

This is the depth map that I get after 50K iterations for one instance of training, and it stayed like this forever. So by "not converging", do you mean this?

Screenshot 2019-11-21 at 16 20 13

Yes, this is "not converging" (although it doesn't move; what I mean is that you get stuck in this very early training state, you get the point I think). Mine was even worse: all depth maps were zero everywhere. Since the output activation function is a sigmoid, you could say the network diverged to minus infinity.

As to why it's so hard to train with the current set of hyperparameters: it mostly comes down to the smoothness loss weight, which is set to 0.5, and that is actually an edge value. In TF 1.0 with the original code, it converges very slowly but eventually gets there. If you try higher values (even by a small margin), it will converge to what you currently have, and if you try lower values, it will converge faster but with more "black holes" on the road and thus lower quality.

Also, you might want to check the smoothness loss downscaling factor. The initial value was 0.5 (the loss weight is halved for every scale); I tried 1/sqrt(2) instead and got it to converge.

My understanding of the problem is that the smallest depth map should converge first, and then upper scales should follow.
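A per-scale weighting sketch, reusing the slice-based smooth_loss from the earlier snippet (decay 0.5 is the original setting; 1/sqrt(2) is what worked for me):

```python
def multi_scale_smooth_loss(disparities, weight=0.5, decay=0.5):
    """disparities ordered finest to coarsest; each scale's smoothness
    term is attenuated by decay ** s. Try decay = 1 / 2 ** 0.5 if the
    default fails to converge. smooth_loss() is sketched above."""
    loss = 0.0
    for s, disp in enumerate(disparities):
        loss += (decay ** s) * smooth_loss(disp)
    return weight * loss
```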

Hmm. Thank you for your time and effort; highly appreciated!