DeepMotionAIResearch / DenseMatchingBenchmark

Dense Matching Benchmark

About the AcfNet adaptive model results

xy-guo opened this issue

I just found there is no result for the AcfNet adaptive model in ResultOfAcfNet.md. May I ask whether the 0.06px EPE improvement still holds after updating the code? AcfNet (adaptive) is reported at 0.867 EPE in the paper, while AcfNet (uniform) in ResultOfAcfNet.md is 0.8511 EPE. It seems AcfNet (uniform) is even better than AcfNet (adaptive)?
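
(For reference, the EPE and 3px figures quoted throughout this thread follow the usual SceneFlow-style convention. Below is a minimal sketch of how these metrics are conventionally computed; the exact validity mask and max_disp used by the benchmark are assumptions here, not confirmed from the repo.)

```python
import torch

def stereo_metrics(disp_est, disp_gt, max_disp=192, threshold=3.0):
    """Compute EPE and the 3px error rate for a pair of disparity maps.

    disp_est, disp_gt: tensors of shape [H, W] (or broadcastable).
    EPE  = mean absolute disparity error over valid pixels.
    3px  = percentage of valid pixels whose error exceeds `threshold` pixels.
    Valid pixels are commonly taken as 0 < disp_gt < max_disp on SceneFlow.
    """
    mask = (disp_gt > 0) & (disp_gt < max_disp)
    err = (disp_est[mask] - disp_gt[mask]).abs()
    epe = err.mean().item()
    err_3px = (err > threshold).float().mean().item() * 100.0
    return epe, err_3px
```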

Another question: what was modified to achieve better results than those reported in the paper?

Thank you so much!

commented

Hi, @xy-guo
No modification has been made for the EPE improvement; we just refactored our original code into this new architecture. As for AcfNet (adaptive), I'll provide the checkpoint in the coming days.

Thank you for your reply. I just tried to train the uniform model, but I only got 0.8971 EPE and 4.479% 3px error. Is that normal? I used 8 GPUs with 2 images per GPU and trained the model for 10 epochs.
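
(For anyone reproducing this: the batch settings above would normally live in the mmcv-style training config. The fragment below is only a sketch of where they go; key names such as imgs_per_gpu follow mmcv conventions of that era and are assumptions, not confirmed names from DenseMatchingBenchmark.)

```python
# Hypothetical mmcv-style config fragment; key names are assumptions.
data = dict(
    imgs_per_gpu=2,     # 8 GPUs x 2 images/GPU -> global batch size of 16
    workers_per_gpu=4,  # dataloader workers per GPU
)
total_epochs = 10       # training schedule discussed above
```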

commented

I think it's normal; I also sometimes get a similar EPE when using 8 GPUs. But I still advise using 4 GPUs: with the same number of images per GPU, the smaller global batch gives more iterations per epoch, which tends to give a better result on KITTI.

Since my GPU memory is only 12 GB, I failed to train the model with 3 images per GPU. I then tried 8 GPUs with 1 image per GPU for 10 epochs, and the results were even worse: 0.99 EPE and 4.986% 3px error.

I am not sure whether this is caused by incompatibility with the latest libraries; the code is not compatible with the latest mmcv.

I made the following modifications in order to run the code with the latest mmcv:

--- a/dmb/apis/train.py
+++ b/dmb/apis/train.py
@@ -107,7 +107,8 @@ def _dist_train(
         )
 
     # put model on gpus
-    model = MMDistributedDataParallel(model)
+    model = MMDistributedDataParallel(model, device_ids=[torch.cuda.current_device()],
+                                      broadcast_buffers=False)
     # build runner
     runner = Runner(
         model, batch_processor, optimizer, cfg.work_dir, cfg.log_level, logger
@@ -120,7 +121,10 @@ def _dist_train(
         optimizer_config = DistOptimizerHook(**cfg.optimizer_config)
     logger.info("Register Optimizer Hook...")
     runner.register_training_hooks(
-        cfg.lr_config, optimizer_config, cfg.checkpoint_config, log_config=None
+        cfg.lr_config, optimizer_config, cfg.checkpoint_config, log_config={
+            "interval": cfg.log_config['interval'],
+            "hooks": []
+        }
     )
 
     # register self-defined logging hooks
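
(For context on why the first hunk is needed: recent versions of DistributedDataParallel expect each process to own a single GPU and to be told which one. Below is a rough plain-PyTorch sketch of what that wrapping amounts to, for illustration only and not the repo's actual code, assuming the process group has already been initialized.)

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_for_ddp(model):
    # Assumes torch.distributed.init_process_group(...) has already been called
    # and each process has been assigned exactly one GPU.
    device = torch.cuda.current_device()
    model = model.cuda(device)
    return DistributedDataParallel(
        model,
        device_ids=[device],      # one GPU per process
        broadcast_buffers=False,  # skip re-broadcasting buffers (e.g. BN stats) every forward
    )
```
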
commented

I'm sorry to hear that. I hadn't expected the result to be that bad. The GPUs I use are 1080Ti, so I can only run 4 GPUs with 2 images per GPU in my experiments.
And to be more helpful and informative, you can download the checkpoint I uploaded. It includes all the information (e.g., TensorBoard logs) from the training process, so you can compare your training details and results with it.

By the way, you can pin mmcv==0.2.8; it's compatible with the latest released architecture. I haven't made the code compatible with the latest mmcv yet.
Thanks for your help, though: with your modification, the latest mmcv is now supported.

Thank you so much for the information. The results of 4 GPUs x 2 images/GPU should be consistent with those of 8 GPUs x 1 image/GPU, since the global batch size is the same. Let me try 4 GPUs x 2 images/GPU with mmcv 0.2.8 and see if there is any improvement.

I just compared the loss curves: my loss seems to be slightly smaller than yours, but my EPE and 3px error are worse. The network simply converges more slowly in the first 10 epochs; by the 20th epoch the results are closer.

commented

@xy-guo
Interesting finding. I'm looking forward to your experiment results.
And could you help integrate your GwcNet into this architecture? It's really excellent work. I had planned to do it myself, but I'm afraid I couldn't reproduce your results. If you did it, it would benefit the community to study each interesting work under the same architecture and experimental environment. Thanks.

Thanks, I will consider creating a PR if time allows.

commented

Perfect! 👍

May I ask which model was used for the KITTI submission? With LocalSoftArgmin or not? Thanks!

commented

Hi, @xy-guo
Both uniform and adaptive can be used for the KITTI submission; they perform almost the same because KITTI is easy to overfit. Switching to LocalSoftArgmin for inference may give a more stable and better result; you can try it.
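
(For readers unfamiliar with the term: LocalSoftArgmin restricts the soft-argmin expectation to a small window around the per-pixel best disparity, which avoids averaging over multi-modal cost distributions. The sketch below only illustrates the general idea; the radius, temperature, and exact formulation in dmb may differ.)

```python
import torch
import torch.nn.functional as F

def local_soft_argmin(cost_volume, radius=2, temperature=1.0):
    """Disparity regression over a local window around the per-pixel peak.

    cost_volume: [B, D, H, W] matching cost, lower = better match.
    Returns a disparity map of shape [B, 1, H, W].
    """
    B, D, H, W = cost_volume.shape
    prob = F.softmax(-cost_volume / temperature, dim=1)          # [B, D, H, W]
    best = prob.argmax(dim=1, keepdim=True)                      # [B, 1, H, W]

    # Disparity indices of the (2 * radius + 1)-wide window centred on the peak.
    offsets = torch.arange(-radius, radius + 1, device=cost_volume.device)
    window = (best + offsets.view(1, -1, 1, 1)).clamp(0, D - 1)  # [B, 2r+1, H, W]

    # Renormalise the probabilities inside the window and take the expectation.
    local_prob = torch.gather(prob, 1, window)
    local_prob = local_prob / local_prob.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return (local_prob * window.float()).sum(dim=1, keepdim=True)
```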