Question about PTv3 Time and Memory Complexity
yxchng opened this issue
Hi, the constant time and memory cost, regardless of patch size, comes from FlashAttention, which fully utilizes the GPU's on-chip memory (L1/L2 cache) to compute attention; more details are available in their paper. Note that PTv3 is still efficient without FlashAttention, but without it we cannot scale up the patch size at a constant time and memory cost.
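As a rough illustration (not PTv3's actual code), the sketch below contrasts naive attention, which materializes the full n × n score matrix, with the fused path exposed through PyTorch's `scaled_dot_product_attention` (which dispatches to a FlashAttention kernel on CUDA with fp16 inputs). The tensor shapes and sizes are assumptions for demonstration only:

```python
# Minimal sketch: peak-memory comparison of naive vs. fused attention.
# Requires a CUDA GPU; shapes are illustrative, not PTv3's.
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (n x n) attention matrix: O(n^2) memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def peak_mem_mib(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

device, dtype = "cuda", torch.float16
for n in (1024, 4096, 8192):  # stand-in for growing patch size
    q, k, v = (torch.randn(1, 8, n, 64, device=device, dtype=dtype)
               for _ in range(3))
    fused = peak_mem_mib(F.scaled_dot_product_attention, q, k, v)
    naive = peak_mem_mib(naive_attention, q, k, v)
    print(f"n={n:5d}  fused={fused:8.1f} MiB  naive={naive:8.1f} MiB")
```

The fused kernel never writes the n × n score matrix to global memory, so its footprint grows only linearly with n, while the naive version grows quadratically.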
In NLP, sequence length means the number of tokens fed to the network in each forward pass, which is close to the concept of the number of points in a 3D point cloud. It would be great if you could run our code and ablate the parameter yourself; a rough stand-in for that ablation is sketched below.
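For a quick sense of the ablation before running the repo, here is a hedged stand-in: it fixes a total point budget, varies the patch size, and times fused attention over all patches. The values of `N`, `heads`, `dim`, and the patch sizes are illustrative assumptions, not Pointcept config values:

```python
# Hedged stand-in for the patch-size ablation (the real one would edit
# PTv3's patch size in the Pointcept configs). Requires a CUDA GPU.
import time
import torch
import torch.nn.functional as F

N, heads, dim = 2**17, 8, 64  # assumed total "points" per forward
for patch in (256, 1024, 4096):
    n_patches = N // patch
    # Larger patch size means fewer, longer sequences for the same N.
    q, k, v = (torch.randn(n_patches, heads, patch, dim,
                           device="cuda", dtype=torch.float16)
               for _ in range(3))
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    avg_ms = (time.perf_counter() - t0) / 10 * 1e3
    print(f"patch={patch:5d}  avg attention forward: {avg_ms:.2f} ms")
```

This only probes the attention kernel in isolation; for the actual PTv3 numbers, changing the patch size in our configs and re-running the benchmark is the authoritative test.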