There may be a bug in train_gpt.py (https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py)
lambda7xx opened this issue
🐛 Describe the bug
I am trying to run a config with train_gpt.py. I added a model to gpt.py:
def gpt2_test4gpu350M(**kwargs):
    model_kwargs = dict(hidden_size=1024, depth=24, num_heads=16, max_position_embeddings=2048, **kwargs)
    return _create_gpt_model(**model_kwargs)
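As a sanity check on the "350M" in the function name, the parameter count can be estimated from the three sizes above. This is a rough sketch, not the titans implementation: it assumes a standard GPT-2 block layout (biases on every linear layer, two LayerNorms per block, output projection tied to the token embedding) and a vocabulary of 50257 tokens, none of which is stated in the issue. The helper name `estimate_gpt2_params` is hypothetical.

```python
def estimate_gpt2_params(hidden_size: int, depth: int, max_position_embeddings: int,
                         vocab_size: int = 50257) -> int:
    """Rough parameter count for a standard GPT-2 style model (assumed layout)."""
    h = hidden_size
    attention = 4 * h * h + 4 * h   # QKV projection (3h*h + 3h) plus output projection (h*h + h)
    mlp = 8 * h * h + 5 * h         # h -> 4h (4h*h + 4h) and 4h -> h (4h*h + h)
    layer_norms = 4 * h             # two LayerNorms per block, weight and bias each
    per_block = attention + mlp + layer_norms
    embeddings = vocab_size * h + max_position_embeddings * h
    final_layer_norm = 2 * h
    return depth * per_block + embeddings + final_layer_norm


# Sizes taken from gpt2_test4gpu350M above; vocab_size is an assumption.
n_params = estimate_gpt2_params(hidden_size=1024, depth=24, max_position_embeddings=2048)
print(f"estimated parameters: {n_params / 1e6:.0f}M")
```

Under these assumptions the estimate lands in the mid-350M range, so the name matches the model sizes. A padded vocabulary or untied output layer would shift the number somewhat.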
I also changed my WebText dataset to this:
@DATASETS.register_module
class WebtextDataset(Dataset):
    def __init__(self, path=None, seq_len=1024, mbs=4) -> None:
        super().__init__()
        if path is not None:
            root = os.path.dirname(path)
            encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
        else:
            encoded_data_cache_path = f'gpt_webtext_{seq_len}.pt'
        self.data = torch.randint(0, 10000, (seq_len, ), requires_grad=False, device=torch.device('cpu')).long()
        self.attention_mask = (torch.rand((seq_len, seq_len), requires_grad=False, device=torch.device('cpu')))
        self.attention_mask = torch.where(self.attention_mask < 0.5, 0, 1)
        print("self.atttntion_mask:", self.attention_mask[:20])
        self.mbs = mbs
        print("self.mbs:", self.mbs)
        torch.save((seq_len, self.data, self.attention_mask), encoded_data_cache_path)

    def __len__(self):
        # len(train_loader) returns len(dataset) / batch_size
        print("WebtextDataset,self.mbs*3:", self.mbs)
        return self.mbs * 5

    def __getitem__(self, index):
        return {'input_ids': self.data,
                'attention_mask': self.attention_mask[0]}, self.data
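The comment in `__len__` relates the dataset length to the number of iterations. A minimal sketch of that arithmetic follows, assuming the loader uses a batch size of 4 (the `BATCH_SIZE` from the config) and standard `DataLoader` rounding rules. The helper `steps_per_epoch` is hypothetical, and whether train_gpt.py sets `drop_last` is not stated in the issue.

```python
import math


def steps_per_epoch(dataset_len: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields per epoch for a map-style dataset."""
    if drop_last:
        return dataset_len // batch_size          # incomplete final batch is discarded
    return math.ceil(dataset_len / batch_size)    # incomplete final batch is kept


mbs = 4                    # default in WebtextDataset.__init__
dataset_len = mbs * 5      # what WebtextDataset.__len__ returns
steps = steps_per_epoch(dataset_len, batch_size=4)  # batch size of 4 is an assumption
print(f"{dataset_len} samples -> {steps} iterations per epoch")
```

With these numbers an epoch is only a handful of iterations, and because `__getitem__` ignores `index`, every sample in every batch is identical.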
When I run this model, ColossalAI takes only about 1 s per iteration. But when I run the same model on Megatron-LM, one iteration takes about 100 s.
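When comparing per-iteration times across frameworks, one thing worth ruling out is the measurement itself: CUDA kernels are launched asynchronously, so a wall-clock timer that is not preceded by a device synchronization can under-report the step time. The sketch below is a generic timing helper, not code from either framework. `time_iterations` and `step_fn` are hypothetical names, and the caller is assumed to pass something like `torch.cuda.synchronize` as `sync` when running on GPU.

```python
import time
from typing import Callable, Optional


def time_iterations(step_fn: Callable[[], None], n_iters: int = 5, warmup: int = 1,
                    sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock seconds per call of step_fn.

    sync, if given, is called before starting and before stopping the timer so that
    asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):      # warm-up calls are excluded from the measurement
        step_fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / n_iters


# Stand-in workload; in practice step_fn would run one full training iteration.
mean_seconds = time_iterations(lambda: time.sleep(0.01), n_iters=3)
print(f"mean step time: {mean_seconds:.4f} s")
```

Measuring both frameworks with the same helper, the same model type, and the same sequence length makes the two numbers directly comparable.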
Environment
No response
My config is below.
from colossalai.amp import AMP_TYPE
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_1_3B, gpt2_test4gpu350M
from torch.optim import Adam
BATCH_SIZE = 4
SEQ_LEN = 2048  # here num_embeddings is equal to seq_len
NUM_EPOCHS = 1
NUM_MICRO_BATCHES=4
TENSOR_PARALLEL = 2
optimizer = dict(
    type=Adam,
    lr=0.00015,
    weight_decay=1e-2,
)

fp16 = dict(
    mode=AMP_TYPE.NAIVE
)

loss = dict(
    type=GPTLMLoss,
)

model = dict(
    type=gpt2_1_3B,
    checkpoint=True,
)

parallel = dict(
    pipeline=2,
    data=1,
    tensor=dict(size=TENSOR_PARALLEL, mode='1d'),
)
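The `parallel` dict above implies a specific number of processes. As a minimal sketch, assuming the data, pipeline, and tensor degrees multiply to the world size (the usual convention for this kind of layout), the launch size can be checked like this. The helper `required_world_size` is hypothetical and not part of ColossalAI.

```python
def required_world_size(parallel: dict) -> int:
    """Product of the data, pipeline, and tensor parallel degrees in a config dict."""
    tensor = parallel.get("tensor", 1)
    # The tensor entry may be a plain int or a dict with a "size" key, as in the config above.
    tensor_size = tensor["size"] if isinstance(tensor, dict) else tensor
    return parallel.get("data", 1) * parallel.get("pipeline", 1) * tensor_size


# Values copied from the config above (TENSOR_PARALLEL = 2).
parallel = dict(pipeline=2, data=1, tensor=dict(size=2, mode="1d"))
world_size = required_world_size(parallel)
print(f"processes needed: {world_size}")
```

For this config the product is 4, which is consistent with the "4gpu" in the model function name.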
A config file is no longer necessary. See the latest Colossal-AI (CAI) GPT example:
https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/README.md
If I want to use pipeline parallelism and sequence parallelism (SP) with GPT, how should I run the code? I am confused by the different code versions and the different documents. Also, your code seems to use tensor parallelism (TP). By the way, I talked with your colleague, and he told me that SP cannot be combined with TP.
He is right: SP and TP cannot work together :(