Question about PTv3 Time and Memory Complexity
yxchng opened this issue
Hi, the constant time and memory cost, regardless of patch size, comes from FlashAttention, which fully utilizes the GPU's on-chip memory (L1/L2 cache) to compute attention; more details are available in their paper. Note that PTv3 is still efficient without FlashAttention, but without it we cannot scale up the patch size at a constant time and memory cost.
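As a rough illustration (not PTv3's actual code), the sketch below contrasts naive attention, which materializes the full n × n score matrix, with the fused path exposed through PyTorch's `scaled_dot_product_attention` (which dispatches to a FlashAttention kernel on CUDA with fp16 inputs). The tensor shapes and sizes are assumptions for demonstration only:

```python
# Minimal sketch: peak-memory comparison of naive vs. fused attention.
# Requires a CUDA GPU; shapes are illustrative, not PTv3's.
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (n x n) attention matrix: O(n^2) memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def peak_mem_mib(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

device, dtype = "cuda", torch.float16
for n in (1024, 4096, 8192):  # stand-in for growing patch size
    q, k, v = (torch.randn(1, 8, n, 64, device=device, dtype=dtype)
               for _ in range(3))
    fused = peak_mem_mib(F.scaled_dot_product_attention, q, k, v)
    naive = peak_mem_mib(naive_attention, q, k, v)
    print(f"n={n:5d}  fused={fused:8.1f} MiB  naive={naive:8.1f} MiB")
```

The fused kernel never writes the n × n score matrix to global memory, so its footprint grows only linearly with n, while the naive version grows quadratically.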
In NLP, sequence length means the number of tokens fed to the network in each forward pass, which is close to the concept of the number of points in a 3D point cloud. It would be great if you could run our code and ablate the parameter yourself; a rough stand-in for that ablation is sketched below.
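For a quick sense of the ablation before running the repo, here is a hedged stand-in: it fixes a total point budget, varies the patch size, and times fused attention over all patches. The values of `N`, `heads`, `dim`, and the patch sizes are illustrative assumptions, not Pointcept config values:

```python
# Hedged stand-in for the patch-size ablation (the real one would edit
# PTv3's patch size in the Pointcept configs). Requires a CUDA GPU.
import time
import torch
import torch.nn.functional as F

N, heads, dim = 2**17, 8, 64  # assumed total "points" per forward
for patch in (256, 1024, 4096):
    n_patches = N // patch
    # Larger patch size means fewer, longer sequences for the same N.
    q, k, v = (torch.randn(n_patches, heads, patch, dim,
                           device="cuda", dtype=torch.float16)
               for _ in range(3))
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    avg_ms = (time.perf_counter() - t0) / 10 * 1e3
    print(f"patch={patch:5d}  avg attention forward: {avg_ms:.2f} ms")
```

This only probes the attention kernel in isolation; for the actual PTv3 numbers, changing the patch size in our configs and re-running the benchmark is the authoritative test.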