MzeroMiko / VMamba

VMamba: Visual State Space Models, code is based on Mamba

The speed reported in the paper cannot be reproduced

LMMMEng opened this issue · comments

Hello,

I'm trying to reproduce the speed results mentioned in the paper. The paper states that VMamba is faster than ConvNeXt and Swin during inference. However, when I used the timm versions of Swin and ConvNeXt, I found that they were much faster (>1.3x) than VMamba-v05noz on a single 3090 GPU (bs=128). The original implementation of Swin shows the same behavior. Although the paper's results were obtained on an A100 GPU, I don't think such a large gap would be closed by a better GPU alone. Could you please share the versions of Swin and ConvNeXt used in your speed analysis experiments?
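For reference, the timm baselines mentioned here can be created as follows (a minimal sketch; the model names are the standard timm identifiers, and pretrained weights are irrelevant for a speed test):

import timm

# standard timm implementations used as the speed baselines
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False).cuda().eval()
convnext = timm.create_model("convnext_tiny", pretrained=False).cuda().eval()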

Thank you so much in advance!

May I ask which versions of torch and CUDA you are using? Are you on torch 2?

Yes, I'm using torch 2.2.1+cu118.

Then the key factor now is the GPU.

Swin and ConvNeXt benefit from the speed-up of the bhwc (channels-last) layout, while for mamba-based models, bchw is the default layout of S6.

And that is the key to the inference speed difference.

You can try changing the norm layer config from ln2d to ln and test again to verify what I am saying (though that cannot change the layout of S6 and introduces extra permutations).
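To illustrate the layout point, here is a minimal sketch (not code from the repo): a plain nn.LayerNorm normalizes the last dimension, which is natural for the bhwc layout used by Swin/ConvNeXt blocks, while a bchw tensor, as used by S6, needs extra permutes around it.

import torch
import torch.nn as nn

B, C, H, W = 128, 96, 56, 56
ln = nn.LayerNorm(C)                       # normalizes over the last (channel) dimension

x_bhwc = torch.randn(B, H, W, C)           # layout Swin / ConvNeXt blocks work in
y_bhwc = ln(x_bhwc)                        # applies directly, no layout change needed

x_bchw = torch.randn(B, C, H, W)           # default layout of the S6 path
y_bchw = ln(x_bchw.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # two extra permutes per norm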

Thank you for your kind reply. I'll check that.

I apologize for reaching out once more, but I'm still unable to replicate the speed results in Table 1 of the paper. In my implementation, ConvNeXt is faster than Swin, which in turn is faster than VMamba. I noticed that your speed test loads real data, which can be influenced by factors unrelated to the model and therefore may not fully reflect its actual speed. To ensure accurate speed testing, I recommend tests that eliminate extra factors such as CPU bottlenecks. If possible, you could rerun the speed test with timm (https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py). timm already provides implementations of many models, such as Swin and ConvNeXt, and their speed can be tested easily with this script. Thank you for your attention to this matter.
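To make the suggestion concrete, here is a minimal sketch of an isolated throughput test (synthetic inputs, explicit CUDA synchronization, no data loading); it is not the timm benchmark.py script itself, just the same idea:

import time
import torch

@torch.no_grad()
def throughput(model, batch_size=128, img_size=224, warmup=50, iters=30):
    model = model.cuda().eval()
    # synthetic input: removes dataloader / CPU bottlenecks from the measurement
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters * batch_size / (time.time() - start)   # images per second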

Nice issue!

The conclusion is that torch.backends.cuda.matmul.allow_tf32 matters.
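For reference, these are the relevant PyTorch switches (the matmul one is False by default in recent PyTorch releases, the cuDNN one is True by default):

import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for fp32 matmuls, off by default since PyTorch 1.12
torch.backends.cudnn.allow_tf32 = True         # TF32 inside cuDNN convolutions, on by default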

I ran the code following benchmark.py and found that, on an A100 with an Intel Xeon Gold 5320, the FPS of convnext-tiny is 2998, swin_tiny_patch4_window7_224 is 1902, the original convnext-tiny is 2160, the original swin-tiny is 1985, and VMamba-tiny (0230) is 1243.

Note that I hacked the code of InferenceBenchmarkRunner to make sure the input is in (3, 224, 224), rather than the default (3, 288, 288) for convnext-tiny.

# InferenceBenchmarkRunner comes from timm's benchmark.py (run in the same directory)
from benchmark import InferenceBenchmarkRunner

class IBR(InferenceBenchmarkRunner):
    def __init__(self, batch_size=128, num_warm_iter=50, num_bench_iter=30, **kwargs):
        # "vgg16" is only a placeholder name; the real model is swapped in by run()
        super().__init__("vgg16", batch_size=batch_size,
                         num_warm_iter=num_warm_iter, num_bench_iter=num_bench_iter, **kwargs)

    def run(self, model):
        # benchmark the given model instead of the one created from the placeholder name
        self.model = model.cuda()
        self.model.eval()
        self.compiled = True  # to avoid timm's profiling/compile step
        return super().run()
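Usage is roughly as follows (a sketch; run() should return timm's usual benchmark result dict, including samples per second):

runner = IBR(batch_size=128)
results = runner.run(model)      # model: any constructed nn.Module, e.g. VMamba / Swin / ConvNeXt
print(results)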

Then I replaced some parts of benchmark.py with the corresponding code from swin-transformer, and finally found that although the placement of synchronize and the use of random vs. real data do have some influence, it is torch.backends.cuda.matmul.allow_tf32 (False by default) that really matters.

When torch.backends.cuda.matmul.allow_tf32=False, the FPS of convnext-tiny is 1306, swin_tiny_patch4_window7_224 is 1104, the original convnext-tiny is 1117, the original swin-tiny is 1142, and VMamba-tiny (0230) is 1245. That is what happens when directly using the code from swin-transformer.

Thank you for your prompt response. You have reached a well-founded conclusion!

Could you kindly let me know whether the results in the paper will be updated? Typically, torch.backends.cuda.matmul.allow_tf32 would be set to True.

Hard to say. I tend to say no. Because:

  1. TF32 actually uses 19 bits, not 32 bits, while we want to test speed in fp32, not "fp19" (actually, the model works well in fp16, which is even faster, and many models, including VMamba, are trained with torch.amp; see the sketch after this list).
  2. torch.backends.cuda.matmul.allow_tf32 is currently set to False by default, which is why timm's benchmark.py sets it to True manually.
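As a minimal sketch of the fp16 / torch.amp point in item 1 (autocast around the forward pass for inference; not the training setup itself):

import torch

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)   # model and x as in the throughput sketch above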

Very sorry for the confusion. In my case, torch.backends.cuda.matmul.allow_tf32 is already set to False, since the timm version I'm using is 0.6.13. Therefore, Swin-T and ConvNeXt-T are still much faster than VMamba (v05_noz + ln2d) in my tests. Below is the registration part of the VMamba model, which gives the same params and FLOPs as the reported results. I'm not sure what's wrong :(. Many thanks in advance!

# assumes register_model comes from timm's registry and VSSM is the VMamba backbone class from this repo
@register_model
def vmambav2_t(**kwargs):
    model = VSSM(
        patch_size=4,
        depths=[2, 2, 5, 2],
        dims=[96, 192, 384, 768],
        # =========================
        ssm_d_state=1,
        ssm_ratio=2.0,
        ssm_dt_rank="auto",
        ssm_act_layer="silu",
        ssm_conv=3,
        ssm_conv_bias=False,
        ssm_drop_rate=0.0,
        ssm_init="v0",
        forward_type="v05_noz",
        # =========================
        mlp_ratio=4.0,
        mlp_act_layer="gelu",
        mlp_drop_rate=0.0,
        gmlp=False,
        # =========================
        patch_norm=True,
        norm_layer="LN2D",          # "BN", "LN2D"
        downsample_version="v3",    # "v1", "v2", "v3"
        patchembed_version="v2",    # "v1", "v2"
        use_checkpoint=False,
        **kwargs,
    )
    return model
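A quick, hypothetical sanity check of the function above (it only instantiates the model and counts parameters; FLOPs would need an extra tool):

model = vmambav2_t()
n_params = sum(p.numel() for p in model.parameters())
print(f"params: {n_params / 1e6:.1f} M")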