EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Home Page: https://www.eleuther.ai/


Support for custom model architecture

itsnamgyu opened this issue

I'm building a custom architecture involving multiple existing architectures as sub-components (Pythia, RoBERTa, T5, etc.).

Does this library support custom architectures? If not, could someone give me some pointers on how to approach it? (e.g., use a different library, re-build the architecture using provided model components)

I'm planning to run pre-training from scratch up to 7B params. I'm mainly interested in using this library for its FlashAttention support and ease of multi-node training.

Hey there! Yes, I think this is doable, but it would take some effort to add the new architectures, given that only GPT-style architectures are supported here right now.

In terms of approach: since we're a Megatron-based framework, and these architectures have already been added to other Megatron-based frameworks, I'd recommend porting those implementations into our https://github.com/EleutherAI/gpt-neox/tree/main/megatron/model directory.

For T5, for example, there was a gpt-neox effort at https://github.com/EleutherAI/gpt-neox/tree/t5-shared-params that you could start from. T5 is now also implemented in upstream Megatron (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/t5_model.py).
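To give a concrete (if greatly simplified) picture of the pieces involved, here is a plain-PyTorch sketch of a T5-style encoder-decoder LM. It is not the gpt-neox/Megatron API; it only illustrates which components (shared embeddings, bidirectional encoder stack, causal decoder stack with cross-attention, LM head) would have to be mapped onto the parallel layers under megatron/model when porting:

```python
import torch
import torch.nn as nn


class TinyEncoderDecoderLM(nn.Module):
    """Illustrative only: a T5-style encoder-decoder built from stock PyTorch modules."""

    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared input embedding
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, src_ids, tgt_ids):
        # Decoder side is causal; encoder side attends bidirectionally.
        tgt_len = tgt_ids.size(1)
        causal_mask = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf"), device=tgt_ids.device),
            diagonal=1,
        )
        hidden = self.transformer(
            self.embed(src_ids), self.embed(tgt_ids), tgt_mask=causal_mask
        )
        return self.lm_head(hidden)


# Smoke test on random token ids.
model = TinyEncoderDecoderLM()
logits = model(torch.randint(0, 32000, (2, 16)), torch.randint(0, 32000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 32000])
```

In a NeoX port, each of these stock modules would be replaced by the corresponding tensor/pipeline-parallel building block, which is where most of the effort in the t5-shared-params branch went.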

I would be happy to discuss this with you along the way and help on the effort if you go for it!

Thanks, I'll check them out!

@itsnamgyu This might be helpful... here's an example where I use lm-eval in another, unrelated repo with custom models:
foundation-model-stack/foundation-model-stack#154
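As a rough sketch of what such an integration can look like with the current lm-evaluation-harness (v0.4+) API: subclass lm_eval.api.model.LM and register it. The class name, registry key, and placeholder return values below are hypothetical, not taken from the linked repo:

```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-custom-lm")  # hypothetical registry key
class MyCustomLM(LM):
    def __init__(self, my_model=None, **kwargs):
        super().__init__()
        self.model = my_model  # your custom PyTorch module (placeholder)

    def loglikelihood(self, requests):
        # Each request's .args is (context, continuation); return a
        # (logprob, is_greedy) tuple per request, scored by your model.
        results = []
        for req in requests:
            context, continuation = req.args
            # TODO: replace this placeholder with your model's actual scoring.
            results.append((0.0, True))
        return results

    def loglikelihood_rolling(self, requests):
        # Full-sequence log-likelihoods (used by perplexity-style tasks).
        return [0.0 for _ in requests]  # TODO: replace placeholder

    def generate_until(self, requests):
        # Each request's .args is (context, gen_kwargs); return generated strings.
        return ["" for _ in requests]  # TODO: replace placeholder
```

With the class registered and importable, `lm_eval.simple_evaluate(model=MyCustomLM(), tasks=["lambada_openai"])` should run the standard tasks against it; the linked issue shows a fuller, real-world version of the same idea.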

@nairbv Thanks a lot!

Hey, this sounds interesting. I'm planning to recreate a model that's written in PyTorch using this library. Given that it's a custom architecture, what do I need to consider and plan for so that I can take advantage of GPT-NeoX for distributed training? Any pointers or guidance would help.

I also looked at the T5 model implementation on the t5-shared-params branch. Is it only required to create a model file similar to gpt2_model.py in the models directory, or do I need to make changes to the Megatron code as well? It would be helpful if you could give me an idea of what changes are required to incorporate a custom model architecture.

@JDRanpariya I've actually decided to use the Hugging Face implementation of GPTNeoX with DeepSpeed and FlashAttention-2 for now. I'm not working with T5 or RoBERTa at the moment.
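For reference, a minimal sketch of that route: pre-training a GPT-NeoX-style model from scratch with Hugging Face Transformers, DeepSpeed, and FlashAttention-2. The model sizes, dataset, and "ds_config.json" path are placeholders, not from this thread, and the attn_implementation kwarg assumes a recent transformers release with flash-attn installed:

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPTNeoXConfig,
    Trainer,
    TrainingArguments,
)

# Reuse an existing NeoX tokenizer rather than training one from scratch.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
tokenizer.pad_token = tokenizer.eos_token

config = GPTNeoXConfig(
    vocab_size=len(tokenizer),
    hidden_size=2048,            # placeholder sizes; scale toward 7B as needed
    num_hidden_layers=16,
    num_attention_heads=16,
    intermediate_size=8192,
    max_position_embeddings=2048,
)
model = AutoModelForCausalLM.from_config(
    config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # assumes flash-attn + recent transformers
)

# Tiny placeholder corpus just to keep the sketch self-contained.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=dataset.column_names,
).filter(lambda ex: len(ex["input_ids"]) > 1)

args = TrainingArguments(
    output_dir="neox-scratch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_config.json",  # placeholder path to a ZeRO config
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

For multi-node runs, the same script would typically be launched with the deepspeed or torchrun launcher and a hostfile; the ZeRO stage and batch sizes in ds_config.json are where most of the tuning happens.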

OP has decided to pursue a different approach rather than modifying this library.

Yep, got it! I guess people who want to do this will do it anyhow, but I think this issue is a good starting point. Is it possible to move it to Discussions? It might help people who want to do something similar in the future.