SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arXiv 2022 / CVPR 2023

Home Page: https://praeclarumjj3.github.io/oneformer

Reproduction gap for Cityscapes

momo1986 opened this issue

Script:
python train_net.py --num-gpus 1 --config-file configs/cityscapes/swin/oneformer_swin_large_bs16_90k.yaml --eval-only MODEL.IS_TRAIN False MODEL.WEIGHTS 250_16_swin_l_oneformer_cityscapes_90k.pth MODEL.TEST.TASK semantic
CUDA: 11.1
PyTorch: 1.10.1

The expected result is:

OneFormer Swin-L† [38] 219M 543G 250 512×1024 90k 67.2 45.6 83.0 84.4

However, my result is very weird:

categories       IoU      nIoU
--------------------------------
flat          : 0.561      nan
construction  : 0.385      nan
object        : 0.140      nan
nature        : 0.565      nan
sky           : 0.072      nan
human         : 0.250    0.313
vehicle       : 0.133    0.362
--------------------------------
Score Average : 0.301    0.338
--------------------------------

[06/13 13:47:55 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 13:47:55 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 13:47:55 d2.evaluation.testing]: copypaste: 11.7792,8.6192,30.0917,33.7640

Thank you for sharing this work. I am glad to be able to use OneFormer. However, I need to resolve this reproduction gap.

I am sorry to bother you.

Thanks & Regards!
Momo

Hi, @momo1986, thanks for your interest in our work. Could you share the complete log from your evaluation? That should help me better understand the issue.

We evaluate our models on 8 GPUs, and you used 1 GPU. The number of GPUs should not be the issue, but could you still try evaluating with 8 GPUs if possible?

Hi @momo1986, I tried evaluating our Swin-L OneFormer on a single GPU (--num-gpus 1) and it gives the expected result. You can find my evaluation log here.

classes          IoU      nIoU
--------------------------------
road          : 0.985      nan
sidewalk      : 0.869      nan
building      : 0.940      nan
wall          : 0.668      nan
fence         : 0.695      nan
pole          : 0.723      nan
traffic light : 0.767      nan
traffic sign  : 0.854      nan
vegetation    : 0.933      nan
terrain       : 0.659      nan
sky           : 0.959      nan
person        : 0.870    0.738
rider         : 0.728    0.621
car           : 0.965    0.885
truck         : 0.903    0.640
bus           : 0.931    0.772
train         : 0.847    0.692
motorcycle    : 0.697    0.616
bicycle       : 0.773    0.689
--------------------------------
Score Average : 0.830    0.707
--------------------------------


categories       IoU      nIoU
--------------------------------
flat          : 0.988      nan
construction  : 0.943      nan
object        : 0.781      nan
nature        : 0.936      nan
sky           : 0.959      nan
human         : 0.876    0.764
vehicle       : 0.950    0.876
--------------------------------
Score Average : 0.919    0.820
--------------------------------

[06/13 12:57:42 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 12:57:42 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 12:57:42 d2.evaluation.testing]: copypaste: 82.9802,70.6712,91.9019,81.9933
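
To put a number on the gap between the two runs, the "copypaste" lines from both logs can be parsed and subtracted. The following is only a rough sketch: the helper name parse_copypaste is hypothetical (it is not part of detectron2 or OneFormer), and it assumes the log always prints a "Task:" line, then a line of metric names, then a line of values, as in the excerpts in this thread.

```python
import re

# Captures everything after "copypaste:" on a log line.
_COPYPASTE = re.compile(r"copypaste:\s*(.+)$")


def parse_copypaste(log_text):
    """Hypothetical helper: return {task: {metric: value}} from 'copypaste' lines."""
    results = {}
    task, names = None, None
    for line in log_text.splitlines():
        match = _COPYPASTE.search(line.strip())
        if match is None:
            continue
        payload = match.group(1).strip()
        if payload.startswith("Task:"):
            # A new task block starts; the next line holds the metric names.
            task = payload.split(":", 1)[1].strip()
            names = None
        elif names is None:
            names = [name.strip() for name in payload.split(",")]
        else:
            values = [float(value) for value in payload.split(",")]
            results[task] = dict(zip(names, values))
            names = None
    return results


# Log excerpts copied from this thread.
reported_log = """\
[06/13 13:47:55 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 13:47:55 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 13:47:55 d2.evaluation.testing]: copypaste: 11.7792,8.6192,30.0917,33.7640
"""
reference_log = """\
[06/13 12:57:42 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 12:57:42 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 12:57:42 d2.evaluation.testing]: copypaste: 82.9802,70.6712,91.9019,81.9933
"""

reported = parse_copypaste(reported_log)["sem_seg"]
reference = parse_copypaste(reference_log)["sem_seg"]

# Per-metric difference (reference minus reported), in percentage points.
gap = {name: reference[name] - reported[name] for name in reference}
print(gap)
```

A difference this large on every metric points to a broken environment or operator rather than ordinary run-to-run variation.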

Hi @praeclarumjj3.

Thanks for your kind reply.

I am currently working on this issue. It always reports the error log "error in ms_deformable_im2col_cuda".

I suspect that this error causes the performance gap.

Here is the evaluation log.

https://drive.google.com/file/d/1Kgf_NYITtZTkpx_6EilNjWZFO_s2hEpO/view?usp=sharing

I work on an NVIDIA 3090 machine. Its default CUDA toolkit is 11.1. However, I installed the PyTorch version and CUDA toolkit following the official OneFormer installation guide.

Thanks & Regards!
Momo

Hi, @momo1986, thanks for the log. You have installed a PyTorch build for CUDA 11.3, but the CUDA version on your local machine is 11.1. I suggest installing a PyTorch build for CUDA <= 11.1.
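
A quick way to spot this kind of mismatch is to compare the CUDA version PyTorch was built against with the version reported by the local toolkit. The sketch below is illustrative only: the helper names (major_minor, build_exceeds_toolkit, local_versions) are hypothetical, and the rule it encodes is just the suggestion above (PyTorch CUDA build <= local toolkit), not a general compatibility guarantee. local_versions() assumes that torch is installed and that nvcc is on the PATH.

```python
import re
import subprocess


def major_minor(version):
    """Extract (major, minor) from text such as '11.3' or 'release 11.1, V11.1.105'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no CUDA version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def build_exceeds_toolkit(torch_cuda, toolkit_cuda):
    """True if the PyTorch CUDA build is newer than the local toolkit."""
    return major_minor(torch_cuda) > major_minor(toolkit_cuda)


def local_versions():
    """Query the live environment (assumes torch is installed and nvcc is on PATH)."""
    import torch  # imported lazily so the helpers above work without torch

    nvcc_output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    toolkit = re.search(r"release\s+(\d+\.\d+)", nvcc_output).group(1)
    # torch.version.cuda is None for CPU-only builds.
    return torch.version.cuda, toolkit


# The combination described in this thread: PyTorch built for CUDA 11.3,
# local toolkit 11.1.
if build_exceeds_toolkit("11.3", "11.1"):
    print("PyTorch CUDA build is newer than the local toolkit")
```

On a real machine, the two arguments would come from local_versions() instead of the hard-coded strings.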

I noticed you already opened an issue about this in #67. I am closing this issue. Let's have a further conversation about this under that issue. Feel free to re-open this if you face any other issues.