microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

In megatron/model/transformer.py, repeat_kv under GQA can be applied after the rotary embeddings, which can give a performance gain for RoPE

puneeshkhanna opened this issue

Currently, repeat_kv is applied before the rotary position embeddings are applied to the query and key states.
In contrast, https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py applies RoPE first and then repeat_kv, so the rotary embedding only has to be computed for the n_kv_heads key heads rather than for the full, expanded set of heads.
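For illustration, a minimal sketch of the proposed ordering. The `repeat_kv` helper follows the usual GQA expansion pattern, and `apply_rotary_pos_emb` is passed in as a callable stand-in; neither is the exact Megatron-DeepSpeed API:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand KV heads so each group of query heads sees its shared KV head.
    # x: [batch, n_kv_heads, seq_len, head_dim] -> [batch, n_kv_heads * n_rep, seq_len, head_dim]
    if n_rep == 1:
        return x
    b, n_kv, s, d = x.shape
    return x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)

def rope_then_repeat(q, k, v, cos, sin, n_rep, apply_rotary_pos_emb):
    # Proposed order: apply RoPE while k still has only n_kv_heads,
    # then expand k and v to the full head count for the attention matmul.
    q, k = apply_rotary_pos_emb(q, k, cos, sin)  # RoPE over n_kv_heads keys, not n_heads
    k = repeat_kv(k, n_rep)
    v = repeat_kv(v, n_rep)
    return q, k, v
```

With this ordering, the rotary embedding work on the key states scales with n_kv_heads instead of n_heads, while the attention computation itself is unchanged.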

@puneeshkhanna Feel free to submit a PR about what you suggested. I can review and merge after verifying with pretraining experiments.

Hello @conglongli,

I'm currently exploring Megatron-DeepSpeed for the first time, specifically aiming to implement SFT with Llama2.

Could you please share the versions of PyTorch, CUDA, and DeepSpeed that you've found to work well together? A former colleague of mine, who used the BLOOM version of Megatron-DeepSpeed, had success with CUDA 11.1, PyTorch 1.10.1, and DeepSpeed 0.6.5. I'm curious whether these versions are still recommended or if there are newer combinations you would suggest.

Thank you for your advice!