feifeibear / long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference

The impact of head number

GeneZC opened this issue

Why could the number of attention heads impact the performance of zigzag ring attention and striped ring attention so significantly?

That is, the reported results show that striped ring attention is much better than zigzag ring attention when #heads=2, while the reverse holds when #heads=8.
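For context on what distinguishes the two variants: both try to balance the causal-attention workload across ranks by reordering the sequence before sharding it, but they partition it differently. The snippet below is a standalone sketch of my understanding of the two layouts, not code from this repository; the function names, rank count, and sequence length are purely illustrative.

```python
import torch


def zigzag_shard(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Zigzag layout (as I understand it): split the sequence into 2*world_size
    chunks and give rank i chunk i plus chunk (2*world_size - 1 - i), pairing a
    cheap early chunk with an expensive late chunk under a causal mask."""
    chunks = x.chunk(2 * world_size, dim=0)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=0)


def striped_shard(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Striped layout (as I understand it): assign tokens round-robin, so token j
    goes to rank j % world_size and every rank holds an even spread of positions."""
    return x[rank::world_size]


if __name__ == "__main__":
    seq = torch.arange(16)  # token positions 0..15
    world_size = 4
    for rank in range(world_size):
        print(
            f"rank {rank}: zigzag={zigzag_shard(seq, world_size, rank).tolist()}, "
            f"striped={striped_shard(seq, world_size, rank).tolist()}"
        )
```

With 4 ranks and 16 tokens, rank 0 gets positions [0, 1, 14, 15] under zigzag and [0, 4, 8, 12] under striped; how these layouts interact with the number of heads being split across the other parallel dimension is exactly the open question here.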

Thank you for your insightful observation. I have not yet conducted a thorough analysis of the impact of varying head numbers on the performance of different ring-attention variants.

Additionally, LongContextAttention currently uses only ring_flash_attn_qkvpacked_func. Moving forward, I plan to support the other ring-attention variants in LongCtxAttn as well; this is straightforward and a work in progress (WIP).
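One way this could look is a simple dispatch table over the qkv-packed ring-attention functions. This is only a sketch: the import names below are assumptions based on the upstream ring-flash-attention package and may not match the actual exports, and `ring_impl_type` is a hypothetical knob, not necessarily the option LongContextAttention will expose.

```python
# Assumed imports from zhuzilin/ring-flash-attention; names may differ in practice.
from ring_flash_attn import (
    ring_flash_attn_qkvpacked_func,          # basic ring attention
    zigzag_ring_flash_attn_qkvpacked_func,   # zigzag variant (assumed name)
    stripe_flash_attn_qkvpacked_func,        # striped variant (assumed name)
)

# Map a config string to the corresponding ring-attention implementation.
RING_IMPL = {
    "basic": ring_flash_attn_qkvpacked_func,
    "zigzag": zigzag_ring_flash_attn_qkvpacked_func,
    "strip": stripe_flash_attn_qkvpacked_func,
}


def ring_attn(qkv, ring_impl_type="basic", causal=True, group=None):
    """Dispatch to the selected ring-attention variant for a packed qkv tensor.

    `group` is the process group that forms the ring; keyword names follow the
    flash-attn-style qkvpacked interface and are assumptions, not a confirmed API.
    """
    fn = RING_IMPL[ring_impl_type]
    return fn(qkv, causal=causal, group=group)
```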

I see, thanks for your clarification.