DeepMotionAIResearch / DenseMatchingBenchmark

Dense Matching Benchmark

About the AcfNet adaptive model results

xy-guo opened this issue

I just found there is no result for the AcfNet adaptive model in ResultOfAcfNet.md. May I ask whether the 0.06px EPE improvement still holds after updating the code? AcfNet (adaptive) is reported at 0.867 EPE in the paper, while AcfNet (uniform) in ResultOfAcfNet.md is 0.8511 EPE. It seems AcfNet (uniform) is even better than AcfNet (adaptive)?
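
(For reference, the EPE and 3px figures quoted throughout this thread follow the usual SceneFlow-style convention. Below is a minimal sketch of how these metrics are conventionally computed; the exact validity mask and max_disp used by the benchmark are assumptions here, not confirmed from the repo.)

```python
import torch

def stereo_metrics(disp_est, disp_gt, max_disp=192, threshold=3.0):
    """Compute EPE and the 3px error rate for a pair of disparity maps.

    disp_est, disp_gt: tensors of shape [H, W] (or broadcastable).
    EPE  = mean absolute disparity error over valid pixels.
    3px  = percentage of valid pixels whose error exceeds `threshold` pixels.
    Valid pixels are commonly taken as 0 < disp_gt < max_disp on SceneFlow.
    """
    mask = (disp_gt > 0) & (disp_gt < max_disp)
    err = (disp_est[mask] - disp_gt[mask]).abs()
    epe = err.mean().item()
    err_3px = (err > threshold).float().mean().item() * 100.0
    return epe, err_3px
```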

Another question: what was modified to achieve better results than those reported in the paper?

Thank you so much!

commented

Hi, @xy-guo
No modification has been made for the EPE improvement; we just refactored our original code into this new architecture. As for AcfNet (adaptive), I'll provide the checkpoint in the coming days.

Thank you for your reply. I just tried to train the uniform model, but I only got 0.8971 EPE and 4.479% 3px error. Is that normal? I used 8 GPUs with 2 images per GPU and trained the model for 10 epochs.
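
(For anyone reproducing this: the batch settings above would normally live in the mmcv-style training config. The fragment below is only a sketch of where they go; key names such as imgs_per_gpu follow mmcv conventions of that era and are assumptions, not confirmed names from DenseMatchingBenchmark.)

```python
# Hypothetical mmcv-style config fragment; key names are assumptions.
data = dict(
    imgs_per_gpu=2,     # 8 GPUs x 2 images/GPU -> global batch size of 16
    workers_per_gpu=4,  # dataloader workers per GPU
)
total_epochs = 10       # training schedule discussed above
```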

commented

I think it's normal; I also sometimes get a similar EPE when using 8 GPUs. But I still advise using 4 GPUs: with the same number of images per GPU, the smaller global batch gives more iterations per epoch, which tends to give a better result on KITTI.

Since my GPU memory is only 12 GB, I failed to train the model with 3 images per GPU. I then tried 8 GPUs with 1 image per GPU for 10 epochs, and the results were even worse: 0.99 EPE and 4.986% 3px error.

I am not sure whether this is caused by incompatibility with the latest libraries; the code is not compatible with the latest mmcv.

I made the following modifications in order to run the code with the latest mmcv:

--- a/dmb/apis/train.py
+++ b/dmb/apis/train.py
@@ -107,7 +107,8 @@ def _dist_train(
         )
 
     # put model on gpus
-    model = MMDistributedDataParallel(model)
+    model = MMDistributedDataParallel(model, device_ids=[torch.cuda.current_device()],
+                                      broadcast_buffers=False)
     # build runner
     runner = Runner(
         model, batch_processor, optimizer, cfg.work_dir, cfg.log_level, logger
@@ -120,7 +121,10 @@ def _dist_train(
         optimizer_config = DistOptimizerHook(**cfg.optimizer_config)
     logger.info("Register Optimizer Hook...")
     runner.register_training_hooks(
-        cfg.lr_config, optimizer_config, cfg.checkpoint_config, log_config=None
+        cfg.lr_config, optimizer_config, cfg.checkpoint_config, log_config={
+            "interval": cfg.log_config['interval'],
+            "hooks": []
+        }
     )
 
     # register self-defined logging hooks
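
(For context on why the first hunk is needed: recent versions of DistributedDataParallel expect each process to own a single GPU and to be told which one. Below is a rough plain-PyTorch sketch of what that wrapping amounts to, for illustration only and not the repo's actual code, assuming the process group has already been initialized.)

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_for_ddp(model):
    # Assumes torch.distributed.init_process_group(...) has already been called
    # and each process has been assigned exactly one GPU.
    device = torch.cuda.current_device()
    model = model.cuda(device)
    return DistributedDataParallel(
        model,
        device_ids=[device],      # one GPU per process
        broadcast_buffers=False,  # skip re-broadcasting buffers (e.g. BN stats) every forward
    )
```
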
commented

I'm sorry to hear that. I hadn't expected the result to be that bad. The GPUs I use are 1080Ti, so I can only run 4 GPUs with 2 images per GPU in my experiments.
And to be more helpful and informative, you can download the checkpoint I uploaded. It includes all the information (e.g., TensorBoard logs) from the training process, so you can compare your training details and results with it.

By the way, you can pin mmcv==0.2.8; it's compatible with the latest released architecture. I haven't made the code compatible with the latest mmcv yet.
Thanks for your help, though: with your modification, the latest mmcv is now supported.

Thank you so much for the information. The results of 4 GPUs x 2 images/GPU should be consistent with those of 8 GPUs x 1 image/GPU, since the global batch size is the same. Let me try 4 GPUs x 2 images/GPU with mmcv 0.2.8 and see if there is any improvement.

I just compared the loss curves: my loss seems to be slightly smaller than yours, but my EPE and 3px error are worse. The network simply converges more slowly in the first 10 epochs; by the 20th epoch the results are closer.

commented

@xy-guo
Interesting finding. I'm looking forward to your experiment results.
And could you help integrate your GwcNet into this architecture? It's really excellent work. I had planned to do it myself, but I'm afraid I couldn't reproduce your results. If you did it, it would benefit the community to study each interesting work under the same architecture and experimental environment. Thanks.

Thanks, I will consider creating a PR if time allows.

commented

Perfect! 👍

May I ask which model was used for the KITTI submission? With LocalSoftArgmin or not? Thanks!

commented

Hi, @xy-guo
Both uniform and adaptive can be used for the KITTI submission; they perform almost the same because KITTI is easy to overfit. Switching to LocalSoftArgmin for inference may give a more stable and better result; you can try it.
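
(For readers unfamiliar with the term: LocalSoftArgmin restricts the soft-argmin expectation to a small window around the per-pixel best disparity, which avoids averaging over multi-modal cost distributions. The sketch below only illustrates the general idea; the radius, temperature, and exact formulation in dmb may differ.)

```python
import torch
import torch.nn.functional as F

def local_soft_argmin(cost_volume, radius=2, temperature=1.0):
    """Disparity regression over a local window around the per-pixel peak.

    cost_volume: [B, D, H, W] matching cost, lower = better match.
    Returns a disparity map of shape [B, 1, H, W].
    """
    B, D, H, W = cost_volume.shape
    prob = F.softmax(-cost_volume / temperature, dim=1)          # [B, D, H, W]
    best = prob.argmax(dim=1, keepdim=True)                      # [B, 1, H, W]

    # Disparity indices of the (2 * radius + 1)-wide window centred on the peak.
    offsets = torch.arange(-radius, radius + 1, device=cost_volume.device)
    window = (best + offsets.view(1, -1, 1, 1)).clamp(0, D - 1)  # [B, 2r+1, H, W]

    # Renormalise the probabilities inside the window and take the expectation.
    local_prob = torch.gather(prob, 1, window)
    local_prob = local_prob / local_prob.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return (local_prob * window.float()).sum(dim=1, keepdim=True)
```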