microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

In megatron/model/transformer.py, repeat_kv under GQA can be applied after the rotary embeddings, which can give a performance gain for RoPE

puneeshkhanna opened this issue

Currently, repeat_kv is applied before the rotary position embeddings are applied to the query and key states.
In contrast, https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py applies RoPE first and then repeat_kv, so the rotary embedding only has to be computed for the n_kv_heads key heads rather than for the full, expanded set of heads.
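For illustration, a minimal sketch of the proposed ordering. The `repeat_kv` helper follows the usual GQA expansion pattern, and `apply_rotary_pos_emb` is passed in as a callable stand-in; neither is the exact Megatron-DeepSpeed API:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand KV heads so each group of query heads sees its shared KV head.
    # x: [batch, n_kv_heads, seq_len, head_dim] -> [batch, n_kv_heads * n_rep, seq_len, head_dim]
    if n_rep == 1:
        return x
    b, n_kv, s, d = x.shape
    return x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)

def rope_then_repeat(q, k, v, cos, sin, n_rep, apply_rotary_pos_emb):
    # Proposed order: apply RoPE while k still has only n_kv_heads,
    # then expand k and v to the full head count for the attention matmul.
    q, k = apply_rotary_pos_emb(q, k, cos, sin)  # RoPE over n_kv_heads keys, not n_heads
    k = repeat_kv(k, n_rep)
    v = repeat_kv(v, n_rep)
    return q, k, v
```

With this ordering, the rotary embedding work on the key states scales with n_kv_heads instead of n_heads, while the attention computation itself is unchanged.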

@puneeshkhanna Feel free to submit a PR about what you suggested. I can review and merge after verifying with pretraining experiments.

Hello @conglongli,

I'm currently exploring Megatron-DeepSpeed for the first time, specifically aiming to implement SFT with Llama2.

Could you please share the versions of PyTorch, CUDA, and DeepSpeed that you've found to work well together? A former colleague of mine, who used the BLOOM version of Megatron-DeepSpeed, had success with CUDA 11.1, PyTorch 1.10.1, and DeepSpeed 0.6.5. I'm curious whether these versions are still recommended or if there are newer combinations you would suggest.

Thank you for your advice!