Clarification of speed
zehongs opened this issue
Hi, thanks for the great work!
I'm curious about how the 0.3s per image is calculated. Is this the overall throughput with a batch size of 256?
I noticed that the diffusion MLP is still taking quite a bit of time, while the MAE encoder and decoder transformers are relatively fast. To improve speed, would it be possible or recommended to further reduce the size of this MLP?
Thanks for the interest! Yes, it is the overall throughput with a batch size of 256. The problem with the diffusion MLP is that it is too small to fully utilize the GPU, so reducing the MLP size (especially its width) will actually not help much. A large batch size is one way to alleviate this issue.
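For concreteness, here is a minimal timing sketch of how that per-image number is obtained (assuming a hypothetical `model.sample(bsz)` entry point; the actual sampling call in the repo may differ):

```python
import time
import torch

@torch.no_grad()
def per_image_time(model, bsz=256, warmup=1):
    """Rough per-image generation time: time one batch and divide by the batch size.

    `model.sample(bsz)` is a hypothetical stand-in for the repo's actual
    sampling entry point; swap in the real call.
    """
    for _ in range(warmup):
        model.sample(bsz)            # warm up CUDA kernels and allocator
    torch.cuda.synchronize()         # flush pending GPU work before timing
    start = time.time()
    model.sample(bsz)                # generate one full batch
    torch.cuda.synchronize()         # wait for async GPU work to finish
    return (time.time() - start) / bsz

# per_image_time(model, bsz=256) -> ~0.3s per image, while bsz=8 gives a much
# larger per-image number because the small batch underutilizes the GPU.
```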
Thanks for the prompt reply!!
I'm also curious about the necessity of a MAGE-like encoder and decoder. Since only the MSE loss on the next set of tokens is used during training and no contrastive training like in MAGE is involved, is it still necessary to use such a masked encoding + decoding approach for the unmasked tokens? Any insight would be helpful!
We use this kind of sparse encoder to save computation: in this way, the FLOPs in the encoder will be just 10% of those in the decoder (if we don't consider the buffer tokens). Using a single transformer (i.e., the decoder only) is also totally fine (similar to MaskGIT).
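To make the FLOPs argument concrete, here is a simplified sketch of this kind of sparse encoding (a rough illustration under stated assumptions; `encoder`, `decoder`, and `mask_token` are generic placeholders, not the actual modules in this repo):

```python
import torch

def sparse_encode_then_decode(tokens, mask, encoder, decoder, mask_token):
    """MAE/MAGE-style sparse encoding: the encoder only processes the visible
    (unmasked) tokens, so its cost scales with the keep ratio; the decoder
    then runs on the full sequence with mask tokens filled in.

    tokens: [B, L, D]; mask: [B, L] bool (True = masked); the number of
    masked tokens is assumed to be the same for every sample in the batch.
    """
    B, L, D = tokens.shape
    visible = tokens[~mask].reshape(B, -1, D)    # gather unmasked tokens only
    enc = encoder(visible)                       # cheap: short sequence

    full = mask_token.expand(B, L, D).clone()    # start from the learnable mask token
    full[~mask] = enc.reshape(-1, D)             # scatter encoded tokens back
    return decoder(full)                         # decoder attends over all L tokens
```

With a keep ratio of about 10%, the encoder's sequence length is roughly a tenth of the decoder's, which is where the ~10% FLOPs figure comes from.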
I have some doubts about the 0.3s per image as well. Does this refer to the time it takes for 256 tokens to go through the model once? For reference, I tested generating 8 images, which took 10.4 seconds on NVIDIA A100 GPUs.
@maxin-cn 0.3s per image is the wall-clock time to generate a batch of 256 images, divided by 256. Generating a small batch is typically much less efficient due to suboptimal utilization of the GPU.