Cannot reproduce the AP on the COCO dataset
HuYanchen-hub opened this issue · comments
When I ran inference with the pre-trained COCO model you provided, I found that the instance segmentation accuracy consistently differs by 0.2 AP. The following are my experimental results (the second row for each backbone is my reproduction).
| Method | Backbone | PQ | PQ_th | PQ_st | AP | mIoU | #Params | Config | Checkpoint |
|---|---|---|---|---|---|---|---|---|---|
| OneFormer (reported) | Swin-L† | 57.9 | 64.4 | 48.0 | 49.0 | 67.4 | 219M | config | model |
| OneFormer (reproduced) | Swin-L† | 57.9 | 64.4 | 48.0 | 48.8 | 67.4 | | | |
| OneFormer (reported) | DiNAT-L† | 58.0 | 64.3 | 48.4 | 49.2 | 68.1 | 223M | [config] | [model] |
| OneFormer (reproduced) | DiNAT-L† | 58.0 | 64.3 | 48.3 | 49.0 | 68.1 | | | |
The following is my experimental environment.
Environment info:
------------------------------- ------------------------------------------------------------------------------------------------
sys.platform linux
Python 3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]
numpy 1.24.3
detectron2 0.6 @/home/bingxing2/gpuuser206/OneFormer/detectron2/detectron2
Compiler GCC 6.3
CUDA compiler CUDA 11.3
detectron2 arch flags 7.0
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.10.1 @/home/bingxing2/gpuuser206/.conda/envs/oneformer/lib/python3.8/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 NVIDIA A100-PCIE-40GB (arch=8.0)
Driver version 510.47.03
CUDA_HOME /usr/local/cuda
Pillow 9.5.0
torchvision 0.11.2 @/home/bingxing2/gpuuser206/.conda/envs/oneformer/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.7.0
------------------------------- ------------------------------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
Hi @HuYanchen-hub, thanks for your interest.
Did you set `task=instance` when running the evaluation script? The numbers you shared appear to correspond to `task=panoptic`. We mention this in the instructions here.
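For reference, the per-task evaluation invocations can be sketched as below. This is a hedged sketch: the config/weight paths are placeholders, and the `train_net.py` entry point and `MODEL.TEST.TASK` override are assumptions based on OneFormer's evaluation instructions.

```python
# Build the evaluation command for each task head (paths are placeholders).
CONFIG = "configs/coco/oneformer_dinat_large.yaml"   # placeholder path
WEIGHTS = "checkpoints/oneformer_dinat_l_coco.pth"   # placeholder path

def eval_command(task):
    # One evaluation run per task; the reported AP comes from task=instance,
    # not task=panoptic.
    return ["python", "train_net.py", "--config-file", CONFIG, "--eval-only",
            "MODEL.IS_TRAIN", "False", "MODEL.WEIGHTS", WEIGHTS,
            "MODEL.TEST.TASK", task]

for task in ("panoptic", "instance", "semantic"):
    print(" ".join(eval_command(task)))
```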
I just ran the evaluation on an A100 myself and obtained the following results for the DiNAT-L backbone:
#### DiNAT-L OneFormer
[07/05 05:47:28 d2.evaluation.testing]: copypaste: Task: segm
[07/05 05:47:28 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[07/05 05:47:28 d2.evaluation.testing]: copypaste: 49.2071,73.8117,53.6113,29.4197,53.7316,70.9744
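As an aside, detectron2's paired `copypaste` lines (a header row followed by a values row) can be turned into a metric dict with a small helper; a quick sketch, nothing OneFormer-specific:

```python
# Parse a pair of detectron2 "copypaste" log lines into {metric: value}.
def parse_copypaste(header_line, values_line):
    header = header_line.split("copypaste:")[-1].strip().split(",")
    values = values_line.split("copypaste:")[-1].strip().split(",")
    return {k: float(v) for k, v in zip(header, values)}

h = "[07/05 05:47:28 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl"
v = "[07/05 05:47:28 d2.evaluation.testing]: copypaste: 49.2071,73.8117,53.6113,29.4197,53.7316,70.9744"
print(parse_copypaste(h, v)["AP"])  # 49.2071
```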
You might experience a variance of 0.1-0.2 units when running evaluations on different machines (I remember noticing something like that while experimenting).
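Under that assumption, a reproduction can be sanity-checked mechanically. A minimal sketch, using the DiNAT-L numbers from this thread and an assumed 0.2-point tolerance:

```python
# Flag any metric whose reported-vs-reproduced gap exceeds the tolerance.
REPORTED   = {"PQ": 58.0, "AP": 49.2, "mIoU": 68.1}  # DiNAT-L table row
REPRODUCED = {"PQ": 58.0, "AP": 49.0, "mIoU": 68.1}  # numbers from this issue

def outside_variance(reported, reproduced, tol=0.2):
    # Small epsilon absorbs floating-point noise at the tolerance boundary.
    return {k: round(abs(reported[k] - reproduced[k]), 4)
            for k in reported
            if abs(reported[k] - reproduced[k]) > tol + 1e-9}

print(outside_variance(REPORTED, REPRODUCED))  # {} -> within machine variance
```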
Thanks for your reply! When I set `task=instance`, I got the correct result. However, when I evaluate the Swin-L backbone on the semantic segmentation task with `task=semantic`, the mIoU I get is lower than the one from `task=panoptic`. May I ask whether the mIoU you report is the higher of the two, or whether the difference is just variance from running on different hardware?
copypaste: Task: sem_seg
[07/05 21:06:35 d2.evaluation.testing]: copypaste: mIoU,fwIoU,mACC,pACC
[07/05 21:06:35 d2.evaluation.testing]: copypaste: 67.2288,72.4984,78.5884,82.9312
Also, when I evaluate the DiNAT-L backbone with `task=panoptic`, the PQ_st result differs from yours by 0.1 as well.
Task: panoptic_seg
[07/05 21:14:12 d2.evaluation.testing]: copypaste: PQ,SQ,RQ,PQ_th,SQ_th,RQ_th,PQ_st,SQ_st,RQ_st
[07/05 21:14:12 d2.evaluation.testing]: copypaste: 57.9436,83.7602,68.4097,64.3089,84.9244,75.2713,48.3356,82.0030,58.0525
OneFormer is excellent work, and we want to support it in the open-source object detection toolbox MMDetection, so we need to understand more of the experimental details. Thank you for your help.
Hi @HuYanchen-hub, thanks for working on adding OneFormer support to MMDetection!
We report each metric's score from the run with the matching task flag, so we report mIoU with `task=semantic`.
I believe the differences you see are within the variance range for both mIoU and PQ_st.
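The reporting convention described above can be summarized as a small lookup; a sketch, with the task names following the repo's `task=` flag and the pairing taken from this thread:

```python
# Each headline metric is taken from the evaluation run whose task flag matches it.
PRIMARY_METRIC = {"panoptic": "PQ", "instance": "AP", "semantic": "mIoU"}

print(PRIMARY_METRIC["semantic"])  # mIoU
```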
Thanks for your reply!
Thank you very much, got it!