Reproduction gap for Cityscapes
momo1986 opened this issue
Script:
```bash
python train_net.py --num-gpus 1 \
  --config-file configs/cityscapes/swin/oneformer_swin_large_bs16_90k.yaml \
  --eval-only MODEL.IS_TRAIN False \
  MODEL.WEIGHTS 250_16_swin_l_oneformer_cityscapes_90k.pth \
  MODEL.TEST.TASK semantic
```
CUDA: 11.1
PyTorch: 1.10.1
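For reference, a quick way to cross-check the local CUDA toolkit against the CUDA build of the installed PyTorch (a sketch; assumes `nvcc` is on the PATH):

```bash
# Local CUDA toolkit version (here: 11.1).
nvcc --version | grep release
# CUDA version PyTorch was built against; a mismatch with the local toolkit
# can break custom CUDA extensions such as the deformable-attention ops.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```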
The ideal result (the Cityscapes row from the paper) is:
OneFormer Swin-L† [38] | 219M params | 543G FLOPs | 250 queries | 512×1024 crop | 90k iters | 67.2 PQ | 45.6 AP | 83.0 mIoU | 84.4 mIoU (ms+flip)
However, my result is very weird:
categories IoU nIoU
--------------------------------
flat : 0.561 nan
construction : 0.385 nan
object : 0.140 nan
nature : 0.565 nan
sky : 0.072 nan
human : 0.250 0.313
vehicle : 0.133 0.362
--------------------------------
Score Average : 0.301 0.338
--------------------------------
[06/13 13:47:55 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 13:47:55 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 13:47:55 d2.evaluation.testing]: copypaste: 11.7792,8.6192,30.0917,33.7640
Thank you for sharing this great work; it is my honor to apply OneFormer. However, this reproduction gap is an issue I need to address.
Sorry to bother you guys.
Thanks & Regards!
Momo
Hi, @momo1986, thanks for your interest in our work. Could you share the complete log from your evaluation? That should help me better understand the issue.
We evaluate our models on 8 GPUs, while you used 1 GPU. A different number of GPUs should not cause this, but could you still try evaluating with 8 GPUs if possible? For example, with the command below.
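A minimal sketch of the 8-GPU run, reusing the exact config and weights from your script:

```bash
# Same evaluation as the original script, distributed across 8 GPUs;
# the metrics should match the single-GPU run.
python train_net.py --num-gpus 8 \
  --config-file configs/cityscapes/swin/oneformer_swin_large_bs16_90k.yaml \
  --eval-only MODEL.IS_TRAIN False \
  MODEL.WEIGHTS 250_16_swin_l_oneformer_cityscapes_90k.pth \
  MODEL.TEST.TASK semantic
```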
Hi @momo1986, I tried evaluating our Swin-L OneFormer on a single GPU (`--num-gpus 1`) and it gives the expected result. You can find my evaluation log here.
classes IoU nIoU
--------------------------------
road : 0.985 nan
sidewalk : 0.869 nan
building : 0.940 nan
wall : 0.668 nan
fence : 0.695 nan
pole : 0.723 nan
traffic light : 0.767 nan
traffic sign : 0.854 nan
vegetation : 0.933 nan
terrain : 0.659 nan
sky : 0.959 nan
person : 0.870 0.738
rider : 0.728 0.621
car : 0.965 0.885
truck : 0.903 0.640
bus : 0.931 0.772
train : 0.847 0.692
motorcycle : 0.697 0.616
bicycle : 0.773 0.689
--------------------------------
Score Average : 0.830 0.707
--------------------------------
categories IoU nIoU
--------------------------------
flat : 0.988 nan
construction : 0.943 nan
object : 0.781 nan
nature : 0.936 nan
sky : 0.959 nan
human : 0.876 0.764
vehicle : 0.950 0.876
--------------------------------
Score Average : 0.919 0.820
--------------------------------
[06/13 12:57:42 d2.evaluation.testing]: copypaste: Task: sem_seg
[06/13 12:57:42 d2.evaluation.testing]: copypaste: IoU,iIoU,IoU_sup,iIoU_sup
[06/13 12:57:42 d2.evaluation.testing]: copypaste: 82.9802,70.6712,91.9019,81.9933
Hi @praeclarumjj3.
Thanks for your kind reply.
I am still working on this issue. The evaluation always reports "error in ms_deformable_im2col_cuda" in the log.
I suspect this error is what causes the performance gap.
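A quick way to check whether the compiled MSDeformAttn op itself is failing, assuming OneFormer keeps the Mask2Former-style ops layout with a bundled test script (the path below is an assumption):

```bash
# Run the gradient-check test bundled with the deformable-attention extension
# (path assumed from the Mask2Former-style repo layout).
cd oneformer/modeling/pixel_decoder/ops
python test.py
```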
Here is the evaluation log.
https://drive.google.com/file/d/1Kgf_NYITtZTkpx_6EilNjWZFO_s2hEpO/view?usp=sharing
I work on an NVIDIA RTX 3090 machine. Its default CUDA toolkit is 11.1. However, I installed PyTorch and the CUDA toolkit following OneFormer's official installation guidance.
Thanks & Regards!
Momo
Hi, @momo1986, thanks for the log. You have installed a PyTorch build compiled against CUDA 11.3. However, the CUDA version on your local machine is 11.1. I suggest you install a PyTorch build with CUDA <= 11.1, for example as sketched below.
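A sketch of one way to do this, assuming pip is used; after reinstalling, the deformable-attention ops need to be recompiled against the new build (the ops path is taken from the repo layout):

```bash
# Install a PyTorch build compiled against CUDA 11.1 to match the local toolkit.
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 \
  -f https://download.pytorch.org/whl/torch_stable.html

# Recompile the custom CUDA ops (MSDeformAttn) against the new PyTorch build.
cd oneformer/modeling/pixel_decoder/ops
sh make.sh
```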
(Screenshots: 2023-06-15 at 3 06 56 PM; 2023-06-15 at 3 04 57 PM.)
I noticed you already opened an issue about this in #67, so I am closing this one. Let's continue the conversation under that issue. Feel free to re-open this if you face any other problems.