feifeibear / long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference

The impact of head number

GeneZC opened this issue

Why could the number of attention heads impact the performance of zigzag ring attention and striped ring attention so significantly?

That is, the reported results show that striped ring attention is much better than zigzag ring attention when #heads=2, while the reverse holds when #heads=8.
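For context on what distinguishes the two variants: both try to balance the causal-attention workload across ranks by reordering the sequence before sharding it, but they partition it differently. The snippet below is a standalone sketch of my understanding of the two layouts, not code from this repository; the function names, rank count, and sequence length are purely illustrative.

```python
import torch


def zigzag_shard(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Zigzag layout (as I understand it): split the sequence into 2*world_size
    chunks and give rank i chunk i plus chunk (2*world_size - 1 - i), pairing a
    cheap early chunk with an expensive late chunk under a causal mask."""
    chunks = x.chunk(2 * world_size, dim=0)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=0)


def striped_shard(x: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Striped layout (as I understand it): assign tokens round-robin, so token j
    goes to rank j % world_size and every rank holds an even spread of positions."""
    return x[rank::world_size]


if __name__ == "__main__":
    seq = torch.arange(16)  # token positions 0..15
    world_size = 4
    for rank in range(world_size):
        print(
            f"rank {rank}: zigzag={zigzag_shard(seq, world_size, rank).tolist()}, "
            f"striped={striped_shard(seq, world_size, rank).tolist()}"
        )
```

With 4 ranks and 16 tokens, rank 0 gets positions [0, 1, 14, 15] under zigzag and [0, 4, 8, 12] under striped; how these layouts interact with the number of heads being split across the other parallel dimension is exactly the open question here.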

Thank you for your insightful observation. I have not yet conducted a thorough analysis of the impact of varying head numbers on the performance of different ring-attention variants.

Additionally, LongContextAttention currently uses only ring_flash_attn_qkvpacked_func. Moving forward, I plan to support the other ring-attention variants in LongCtxAttn as well; this is straightforward and a work in progress (WIP).
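One way this could look is a simple dispatch table over the qkv-packed ring-attention functions. This is only a sketch: the import names below are assumptions based on the upstream ring-flash-attention package and may not match the actual exports, and `ring_impl_type` is a hypothetical knob, not necessarily the option LongContextAttention will expose.

```python
# Assumed imports from zhuzilin/ring-flash-attention; names may differ in practice.
from ring_flash_attn import (
    ring_flash_attn_qkvpacked_func,          # basic ring attention
    zigzag_ring_flash_attn_qkvpacked_func,   # zigzag variant (assumed name)
    stripe_flash_attn_qkvpacked_func,        # striped variant (assumed name)
)

# Map a config string to the corresponding ring-attention implementation.
RING_IMPL = {
    "basic": ring_flash_attn_qkvpacked_func,
    "zigzag": zigzag_ring_flash_attn_qkvpacked_func,
    "strip": stripe_flash_attn_qkvpacked_func,
}


def ring_attn(qkv, ring_impl_type="basic", causal=True, group=None):
    """Dispatch to the selected ring-attention variant for a packed qkv tensor.

    `group` is the process group that forms the ring; keyword names follow the
    flash-attn-style qkvpacked interface and are assumptions, not a confirmed API.
    """
    fn = RING_IMPL[ring_impl_type]
    return fn(qkv, causal=causal, group=group)
```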

I see, thanks for your clarification.