songqi-github / AttaNet

AttaNet for real-time semantic segmentation.

Quantitative results of AttaNet

wondervictor opened this issue

Hi, thanks for your great work on AttaNet; I'm pretty interested in your research.
After reading the paper and reviewing the code, I'm confused about the inference speed and the evaluation results of the method.

AttaNet is tested with a 512x1024 input and achieves 130 FPS with the ResNet-18 backbone, while the mIoU is evaluated with crop_eval and a flip test;
see:

def crop_eval(self, im):
Therefore, the mIoU (78.5 with ResNet-18) is evaluated with crop and flip, while the inference time is measured on a single 512x1024 input. The inference time and the evaluation results might not be consistent with each other.
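
For context, here is a rough sketch of what crop-and-flip evaluation typically does (my own illustration, not the repo's crop_eval; the crop size, stride, and class count are assumptions). The point is that each image needs several forward passes, a cost that a single 512x1024 forward pass does not reflect:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def crop_flip_eval(model, im, crop=(1024, 1024), stride=512, n_classes=19):
    """Illustrative sliding-crop + horizontal-flip evaluation.

    Every overlapping crop is forwarded twice (original and flipped),
    so a single 1024x2048 image costs several network passes.
    """
    _, _, H, W = im.shape
    ch, cw = crop
    prob = torch.zeros(1, n_classes, H, W, device=im.device)
    hits = torch.zeros(1, 1, H, W, device=im.device)
    for top in range(0, H - ch + 1, stride):
        for left in range(0, W - cw + 1, stride):
            patch = im[:, :, top:top + ch, left:left + cw]
            logits = model(patch)
            # add the prediction of the horizontally flipped patch
            logits = logits + torch.flip(model(torch.flip(patch, dims=[3])), dims=[3])
            prob[:, :, top:top + ch, left:left + cw] += F.softmax(logits, dim=1)
            hits[:, :, top:top + ch, left:left + cw] += 1
    return (prob / hits).argmax(dim=1)  # per-pixel class prediction
```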

Besides, reporting the inference time and the corresponding accuracy with a single-scale input, without cropping or flipping, would make the comparison with other methods fairer.

Further, I've downloaded the code & models and evaluated the speed and accuracy on my local machine.
Specs: GPU: NVIDIA Titan Xp, CPU: 2 Intel Xeon E5-2620 v3.

Model: AttaNet w/ ResNet-18

Speed: 1024x2048 input size

inference.py outputs:

load resnet
start warm up
warm up done
=======================================
FPS: 24.972443
Inference time 40.044140 ms

Accuracy: 1024x2048 input size, *w/o crop and flip*

evaluate.py outputs:

================================================================================
evaluating the model ...

setup and restore model
load resnet
compute the mIOU
100%|██████████| 500/500 [06:09<00:00, 1.35it/s]
[0.98095101 0.84992488 0.91837538 0.64723809 0.65401004 0.5827008
 0.61811279 0.75313067 0.91223381 0.68902523 0.92999311 0.77462518
 0.5425117  0.94219012 0.85731529 0.87147848 0.79053572 0.52289506
 0.739424  ]
0.7671932295569451
mIOU is: 0.767193
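
For reference, the final number is simply the mean of the 19 per-class IoUs printed above:

```python
import numpy as np

# Per-class IoUs from the evaluate.py run above (19 Cityscapes classes)
ious = np.array([
    0.98095101, 0.84992488, 0.91837538, 0.64723809, 0.65401004, 0.5827008,
    0.61811279, 0.75313067, 0.91223381, 0.68902523, 0.92999311, 0.77462518,
    0.5425117,  0.94219012, 0.85731529, 0.87147848, 0.79053572, 0.52289506,
    0.739424,
])
print(ious.mean())  # 0.7671932295569451, i.e. 76.7 mIoU
```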

Hi, thanks for your attention to our paper. There must be some problem with your FPS testing. We have tested several times on our GPU, and we can achieve at least 120 FPS even when the GPU is much slower than other cards of the same type. Please check your code and environment. As for the accuracy testing, we mainly follow the evaluation method of BiSeNetV2 and SFNet to ensure fairness. We use this file for multi-scale testing of ResNet-50/101, not for the real-time accuracy testing in our paper. The inference time and the corresponding accuracy use the same testing settings in our paper. We will upload the script used for real-time accuracy testing as soon as possible.

Hi @songqi-github
Indeed, AttaNet is a great work with some efficient designs.
To compare with SFNet, FANet, etc., which adopt the 1024x2048 input, I modified the script inference.py by changing the input size to 1024x2048 and removing the downsample operation; the model achieved 25 FPS with 76.7 mIoU.
Besides, BiSeNetV2 adopts 512x1024 input to evaluate mIoU and inference speed without evaluation tricks.

We do not adopt any evaluation tricks, e.g., sliding-window evaluation and multi-scale testing, which can improve accuracy but are time-consuming. With an input of 2048 × 1024 resolution, we first resize it to 1024 × 512 resolution for inference and then resize the prediction to the original size of the input. We measure the inference time with only one GPU card and repeat 5000 iterations to eliminate the error fluctuation. We note that the time of resizing is included in the inference time measurement. In other words, when measuring the inference time, the practical input size is 2048 × 1024.
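
Based on this description, the measurement loop would look roughly like the sketch below (an illustration of the stated protocol, not the repo's actual inference.py; `model`, the iteration count, and the resize modes are assumptions):

```python
import time
import torch
import torch.nn.functional as F

@torch.no_grad()
def measure_fps(model, iters=5000, full=(1024, 2048), net=(512, 1024)):
    """Sketch of the stated protocol: take a 2048x1024 input, downsample to
    1024x512 for the network, upsample the prediction back to full size,
    and include both resizes in the measured time."""
    model.eval().cuda()
    im = torch.randn(1, 3, *full).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x = F.interpolate(im, size=net, mode='bilinear', align_corners=False)
        out = model(x)
        out = F.interpolate(out, size=full, mode='bilinear', align_corners=False)
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    elapsed = time.time() - start
    print('FPS: %.2f  |  %.3f ms per image' % (iters / elapsed, 1000 * elapsed / iters))
```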

In my opinion, reporting the inference time and the mIoU without test-time augmentations is more convincing. Otherwise, the time spent inferring the cropped or flipped patches for each input should be added to the reported inference time.

I have re-trained and re-evaluated the code on my own machine without any changes.
My environment:
GPU: GeForce RTX 2080ti, CPU: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz

inference.py output: (screenshot)
evaluate.py output: (screenshot)

After changing inference.py to use a 1024x2048 input size, the result:
(screenshot)

Hi @liuzhidemaomao, your results (76.7 mIoU and 55.2 FPS with the 1024x2048 input) are consistent with mine regardless of the GPU. (The results in Table 1 of the original paper are 78.5 mIoU and 130 FPS on a 1080 Ti, which is much slower than a 2080 Ti.)
In my opinion, reporting the speed under the same setting (single scale or test-time augmentation) as the performance evaluation is more convincing and reasonable.
However, using test-time augmentation (crop and flip in evaluate.py) to reach higher accuracy while providing the speed under a different setting (512x1024 input) will mislead the community.
Moreover, the other methods cited in Table 1 and Figure 1 adopt the same setting for both performance evaluation and speed evaluation, as far as I know.

@wondervictor

I agree with you. I cannot reproduce this work using my own codebase: with a 1024x2048 input I obtain 76.8 mIoU, and with a 512x1024 input the result is very poor.
What are your results with a 512x1024 input?

I find that the speed test code does not use torch.cuda.synchronize().

Hi @ydhongHIT, interesting. Did you test the speed using torch.cuda.synchronize()?
wutianyiRosun/CGNet#2

I didn't test the speed, but I think it may explain why your measured speed differs from the authors'.

In the previous reply, we already said that we used the same settings in both speed testing and performance evaluation. The given evaluate.py is used for multi-scale testing for heavy models. We are still working on this repo, and we'll try to release the full code soon. Please check how to implement SAM and AFM first.

Just to be clear, SFNet adopts single-scale testing with a 1024x2048 input, and BiSeNet adopts a downsampled 1024x2048 input. Neither of them adopts multi-scale testing, sliding-window evaluation, or flipping at test time. Notably, we fixed the AttaNet input size to 1024x2048 or 512x1024 but failed to reach the results reported in the paper (you can see the details of the discussions above; I'm not alone). Moreover, evaluating speed without torch.cuda.synchronize() is a serious bug and leads to wrong inference times (time w/ synchronize >> time w/o synchronize).
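
For illustration, here is a minimal timing helper (my own snippet, not the repo's code) showing where the synchronization matters. CUDA kernel launches are asynchronous, so without a final torch.cuda.synchronize() the CPU timer can stop while kernels are still running and the latency looks far smaller than it really is:

```python
import time
import torch

@torch.no_grad()
def avg_latency_ms(model, x, iters=100, sync=True):
    """Average forward latency in milliseconds over `iters` runs."""
    for _ in range(10):               # warm up
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    if sync:
        torch.cuda.synchronize()      # wait for queued kernels to finish
    return 1000.0 * (time.time() - start) / iters
```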

The actual speed and accuracy of the proposed AttaNet are what matter most here. Providing correct evaluation scripts is urgent, since the repo has been open-sourced for several months.
Thanks.

  • MS/Flip
    To be clear, SFNet uses MS/Flip when testing on ADE20K (see Table 5 in SFNet). In Table 5 of our paper, nearly all the comparison methods use MS/Flip; to compare with those methods, we also use MS/Flip on ADE20K.

  • w/ synchronize
    In our code, measuring with or without synchronize does not change the inference speed.

  • Real-time evaluate
    We will upload the weights and the evaluation file for real-time testing soon, please wait for that.

@wondervictor

Have you reproduced this work?