bowang-lab / U-Mamba

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

Home Page: https://arxiv.org/abs/2401.04722


How to avoid a very large sequence length?

IceClear opened this issue · comments

Hi, @JunMa11 . Thanks for your great work.
I have a small question related to the network setting.
Since the paper defines the sequence length L as the product of C, H, and W of the image patch, consider a 320x320 patch: if C is 32 (as I understand from the code), then at the first U-Net scale after the first pooling, L = 160x160x32 = 819,200 (~819.2K), which can be quite large.
Do I misunderstand some details? Or are there strategies to avoid this?
Thanks again and look forward to your help :)


Hi, @IceClear

We followed the common practice in vision transformers: there is a transpose operation, so C becomes the feature dimension and the sequence length is H×W (n_tokens), not C×H×W.

middle_feature_flat = middle_feature.view(B, C, n_tokens).transpose(-1, -2)
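To make the shape bookkeeping concrete, here is a minimal NumPy stand-in for the PyTorch line above (the sizes B=1, C=32, H=W=160 are illustrative values taken from the discussion, not fixed by the repository):

```python
import numpy as np

# Illustrative feature map after the first pooling of a 320x320 patch:
# batch B=1, C=32 channels, spatial size 160x160.
B, C, H, W = 1, 32, 160, 160
middle_feature = np.zeros((B, C, H, W))

# Flatten the spatial dims into tokens and swap the last two axes,
# mirroring view(B, C, n_tokens).transpose(-1, -2) in the repo code.
n_tokens = H * W
middle_feature_flat = middle_feature.reshape(B, C, n_tokens).transpose(0, 2, 1)

# The Mamba block then sees (B, L, D) with sequence length L = H*W = 25,600
# and feature dimension D = C = 32 -- not a single sequence of C*H*W = 819,200.
print(middle_feature_flat.shape)  # (1, 25600, 32)
```

So the channel axis never contributes to the sequence length; only the spatial positions become tokens.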