SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022

Cannot reproduce the results of Mask2Former+DiNAT-Large on ADE20K

Linwei-Chen opened this issue

Thank you for your interesting work! I am attempting to replicate the results of Mask2Former+DiNAT-Large on ADE20K. I have not made any changes to the settings, but the current training results show a relatively large gap compared to the reported results. I suspect that the pretrained weights have not been processed correctly.

I downloaded the weights from https://shi-labs.com/projects/dinat/checkpoints/imagenet22k/dinat_large_in22k_224_11x11interp.pth and used the following script for pretrained weight conversion:

import pickle as pkl
import sys

import torch

if __name__ == "__main__":
    input = sys.argv[1]

    # Load the checkpoint and unwrap the raw state dict if it is nested.
    obj = torch.load(input, map_location="cpu")
    if "model" in obj.keys():
        obj = obj["model"]
    if "state_dict" in obj.keys():
        obj = obj["state_dict"]

    # Wrap the weights the way detectron2 expects third-party checkpoints.
    res = {"model": obj, "__author__": "third_party", "matching_heuristics": True}

    with open(sys.argv[2], "wb") as f:
        # https://discuss.pytorch.org/t/runtime-error-when-loading-pytorch-model-from-pkl/38330
        # pkl.dump(res, f)
        torch.save(res, f)
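# For reference, the script above is invoked with the source checkpoint and a
# destination path; the script and output file names below are only placeholders:
#   python convert_weights.py dinat_large_in22k_224_11x11interp.pth dinat_large_converted.pth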

Is there anything wrong with this approach? I am looking forward to hearing from you soon.

Thanks for your interest.

To be clear, are you trying to train the model on ADE20K, or are you running inference?

I'm also confused why weights need any preprocessing.

P.S. Could you kindly remove your last message and send logs, particularly ones this long, as attached files, so they're easy to search and don't make it difficult to interact with the issue?

I am trying to train the model on ADE20K. Perhaps I used the wrong pretrained weights. Could you tell me which ImageNet-22K weights are the correct ones for Mask2Former?
I downloaded the weights from https://shi-labs.com/projects/dinat/checkpoints/imagenet22k/dinat_large_in22k_224_11x11interp.pth, which cannot be loaded directly for Mask2Former training because its keys do not carry the 'backbone.' prefix.
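
A quick way to see this is a minimal sketch like the following, pointed at the same checkpoint file:

import torch

# Inspect the top-level key prefixes of the downloaded checkpoint.
ckpt = torch.load("dinat_large_in22k_224_11x11interp.pth", map_location="cpu")
state = ckpt.get("model", ckpt.get("state_dict", ckpt))
prefixes = sorted({key.split(".")[0] for key in state})
print(prefixes)  # none of these start with 'backbone'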

I see, you're right; Mask2Former requires converting the standard pth pickle file into one compatible with detectron2, and we provide that under mask2former/tools, which I think is what you're using.

However, I'd note that, following Swin, we used the checkpoint that was fine-tuned on ImageNet at 384x384 (with the extended kernel size), not the plain pre-trained checkpoint with interpolated (lerped) RPB.
You can find that checkpoint in the DiNAT table under classification, or directly through this link.

Thanks a lot! Just one more detail: The paper (Table 15) states that all backbones were pre-trained on ImageNet-22K. However, the checkpoint seems to be pre-trained on ImageNet-22K and fine-tuned on ImageNet-1k. Is this the correct weight to achieve 58.1 mIoU on ADE20K?

Yes it is.

I have cited your work in my new paper; thank you for your excellent work~

Thank you; good luck with your paper!

Hello! After switching to the pretrained weights you suggested, the training results still do not look good enough. The run reached ~54.0% accuracy at iteration 75,000, and based on my experience with training Mask2Former it seems very difficult to reach 57.1% even after 160,000 iterations.

  1. The frozen stages are set to -1, but config.yaml shows FREEZE_AT = 1. Is this correct? There is a discussion stating that the frozen stages should be 5. I am confused.
  2. Could you provide the training log.txt? It could be very helpful.
  3. BTW, I only changed the code to use torch.utils.checkpoint to save memory, but it should not affect the results:
import torch.nn as nn
import torch.utils.checkpoint as checkpoint

class BasicLayer(nn.Module):
    ....
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpoint:
                # Recompute the block in the backward pass to save activation memory.
                x = checkpoint.checkpoint(blk, x)
            else:
                x = blk(x)
        if self.downsample is not None:
            x_down = self.downsample(x)
            return x, x_down
        else:
            return x, x

Would you mind sharing how many GPUs you're training with?

In our experience, ADE20K is overall very difficult to reproduce, even with identical settings, as opposed to COCO and Cityscapes, and one key parameter to look out for is the number of GPUs, because it heavily affects batch statistics.
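
To make that dependence concrete, here is a trivial sketch (the numbers are only illustrative; detectron2's SOLVER.IMS_PER_BATCH is the batch size summed across all processes):

# The per-process batch shrinks as GPUs are added for a fixed global batch size.
ims_per_batch = 16          # global batch size from the config (illustrative)
for num_gpus in (4, 8):
    print(num_gpus, "GPUs ->", ims_per_batch // num_gpus, "images per GPU")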

For your reference, I've attached a log from one of our last ADE20K runs, which finishes at 57.5 single-scale mIoU and evaluates at 58.2 multi-scale mIoU.

dinat-l-ade20k-semseg.log

I'm not sure which config you're referring to; FREEZE_AT is 0 in the base config and is not overridden.

Thanks for your help! I am currently training with 4 GPUs (4 images per GPU). Maybe I should increase it to 8?

Regarding frozen stages, the default value at this link is -1, and the default for FREEZE_AT is 0. I have verified that changing FREEZE_AT does not affect the frozen stages (I printed self.frozen_stages at this link). It seems that FREEZE_AT does not work as intended. But I think frozen_stages=1 is correct, am I right?
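
As a further sanity check, a small helper along these lines (just a sketch; `model` stands for the built Mask2Former model) could list anything that actually ended up frozen:

def frozen_parameters(model):
    # Names of parameters that will not receive gradient updates.
    return [name for name, p in model.named_parameters() if not p.requires_grad]

# With frozen_stages=-1 (the default), this list should be empty for the backbone.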

Yes, it's usually a good idea to try and replicate results with the same number of GPUs (or processes, not necessarily GPUs).

And thanks for bringing the FREEZE_AT issue to our attention. It doesn't break anything, because our intention was to freeze nothing and allow optimization throughout the model, and that is what happens: the freeze_at argument is not passed from the config to the model here. As a result, regardless of the config, frozen_stages always stays at its default of -1, meaning no weight freezing.

Closing this due to inactivity.