hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI


There may be a bug in train_gpt.py (https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py)

lambda7xx opened this issue · comments

🐛 Describe the bug

I tried to run a config using train_gpt.py. I added a model to gpt.py:


def gpt2_test4gpu350M(**kwargs):
    model_kwargs = dict(hidden_size=1024, depth=24, num_heads=16, max_position_embeddings=2048, **kwargs)
    return _create_gpt_model(**model_kwargs)
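
For reference, hidden_size=1024 with depth=24 and 16 heads matches the GPT-2 medium shape, so the "350M" in the name roughly checks out; a back-of-the-envelope parameter count (the vocabulary size of 50257 is an assumption, the standard GPT-2 vocabulary, not taken from this issue):

# Rough parameter estimate for the gpt2_test4gpu350M shape above.
hidden, depth, vocab, max_pos = 1024, 24, 50257, 2048   # vocab size is assumed
transformer_params = 12 * depth * hidden ** 2           # attention + MLP weights per block
embedding_params = (vocab + max_pos) * hidden           # token + position embeddings
print((transformer_params + embedding_params) / 1e6)    # ~355M parameters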

And I changed my webtext dataset to this:


import os

import torch
from torch.utils.data import Dataset

from colossalai.registry import DATASETS  # registry import path may differ by ColossalAI version


@DATASETS.register_module
class WebtextDataset(Dataset):

    def __init__(self, path=None, seq_len=1024, mbs=4) -> None:
        super().__init__()
        if path is not None:
            root = os.path.dirname(path)
            encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
        else:
            encoded_data_cache_path = f'gpt_webtext_{seq_len}.pt'

        # Synthetic data: random token ids and a random 0/1 attention mask, kept on CPU
        self.data = torch.randint(0, 10000, (seq_len,), requires_grad=False, device=torch.device('cpu')).long()
        self.attention_mask = torch.rand((seq_len, seq_len), requires_grad=False, device=torch.device('cpu'))
        self.attention_mask = torch.where(self.attention_mask < 0.5, 0, 1)
        print("self.attention_mask:", self.attention_mask[:20])

        self.mbs = mbs
        print("self.mbs:", self.mbs)

        torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)

    def __len__(self):
        print("WebtextDataset __len__, self.mbs:", self.mbs)  # len(train_loader) returns len(dataset) / batch_size
        return self.mbs * 5

    def __getitem__(self, index):
        return {'input_ids': self.data,
                'attention_mask': self.attention_mask[0]}, self.data
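
As a quick standalone sanity check (outside of train_gpt.py, no ColossalAI engine involved), the synthetic dataset can be wrapped in a plain PyTorch DataLoader; the shapes follow directly from the definitions above:

# Standalone sanity check of the synthetic WebtextDataset.
from torch.utils.data import DataLoader

ds = WebtextDataset(seq_len=1024, mbs=4)   # 20 samples in total (mbs * 5)
loader = DataLoader(ds, batch_size=4)
inputs, labels = next(iter(loader))
print(inputs['input_ids'].shape)           # torch.Size([4, 1024])
print(inputs['attention_mask'].shape)      # torch.Size([4, 1024]); only row 0 of the mask is returned
print(labels.shape)                        # torch.Size([4, 1024])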

When I run this model, ColossalAI spends only about 1 s per iteration, but running the same model on Megatron-LM takes about 100 s per iteration.
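
When comparing per-iteration wall-clock time across frameworks, it can help to synchronize CUDA around the step so that asynchronous kernel launches are not mistaken for a completed iteration; a minimal, framework-agnostic sketch (step_fn is a hypothetical callable wrapping one forward/backward/optimizer step, not part of train_gpt.py):

import time
import torch

def timed_iteration(step_fn):
    # Drain any previously queued GPU work before starting the clock.
    torch.cuda.synchronize()
    start = time.time()
    step_fn()                  # one forward + backward + optimizer step
    torch.cuda.synchronize()   # wait for all queued kernels before stopping the clock
    return time.time() - start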

Environment

No response

My config is below.

from colossalai.amp import AMP_TYPE
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_1_3B, gpt2_test4gpu350M
from torch.optim import Adam

BATCH_SIZE = 4
SEQ_LEN = 2048  # here max_position_embeddings is equal to the seq_len
NUM_EPOCHS = 1
NUM_MICRO_BATCHES = 4
TENSOR_PARALLEL = 2

optimizer = dict(
    type=Adam,
    lr=0.00015,
    weight_decay=1e-2,
)

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
)

loss = dict(
    type=GPTLMLoss,
)

model = dict(
    type=gpt2_1_3B,
    checkpoint=True,
)

parallel = dict(
    pipeline=2,
    data=1,
    tensor=dict(size=TENSOR_PARALLEL, mode='1d'),
)
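
For reference, a rough reading of this config, assuming the usual ColossalAI convention that the world size is pipeline x tensor x data:

# Rough resource arithmetic for the config above (assumption, not taken from the docs).
PIPELINE, TENSOR, DATA = 2, 2, 1
gpus_needed = PIPELINE * TENSOR * DATA     # 4 GPUs in total for one run
micro_batch_size = 4 // 4                  # BATCH_SIZE / NUM_MICRO_BATCHES = 1 sample per micro-batch
print(gpus_needed, micro_batch_size)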

Currently, a config file is not necessary. See this for the latest CAI GPT example:
https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/README.md

If I want to add pipeline parallelism and sequence parallelism (SP) to GPT, how should I run the code? I am confused by the different code and the different documents, and your code seems to use tensor parallelism. BTW, I talked with your colleague and he told me that SP cannot be combined with TP.

He is right, SP and TP cannot work together :(