Explain composer logs emitted during training + Replicate Benchmark Results
geodra opened this issue
❓ Question
Hello, I am training an mpt-3B model on AWS SageMaker using an ml.p4d.24xlarge instance and trying to replicate the results displayed in this table: link.
Specifically, I am trying to replicate the mpt-3b result for the following configuration: max_seq_len=2048, global_train_batch_size=320, device_train_microbatch_size=5, on 8 a100_40gb GPUs. According to the table, this setup should process 39 sequences per second. Since one batch contains 320 sequences, each batch should ideally finish in about 8.2 seconds (320 / 39). However, when I run it, each batch takes around 10 seconds (screenshot attached).
I am also looking for an explanation of the logs that Composer emits before the start of every batch. I have checked the documentation but couldn't find anything specific. In particular, I would like to understand the meaning of the following log entries:
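To make the arithmetic behind this comparison explicit, here is a minimal sketch. The function names are hypothetical helpers written for this post, not part of Composer or llm-foundry, and the 10-second figure is only the rough value read off the screenshot. The microbatch calculation assumes the per-device batch is the global batch divided evenly across GPUs and then split into microbatches of device_train_microbatch_size.

```python
def expected_batch_seconds(global_batch_size: int, sequences_per_second: float) -> float:
    """Time one batch should take if the benchmark throughput were reached."""
    return global_batch_size / sequences_per_second


def observed_sequences_per_second(global_batch_size: int, batch_seconds: float) -> float:
    """Throughput implied by a measured wall-clock time per batch."""
    return global_batch_size / batch_seconds


def microbatches_per_device(global_batch_size: int, n_gpus: int, microbatch_size: int) -> int:
    """Number of microbatches each GPU runs per optimizer step (assumes even division)."""
    per_device_batch = global_batch_size // n_gpus
    return per_device_batch // microbatch_size


if __name__ == "__main__":
    # Values from the configuration in this issue.
    target = expected_batch_seconds(320, 39)            # benchmark table: 39 seq/s
    observed = observed_sequences_per_second(320, 10)   # ~10 s per batch (approximate)
    steps = microbatches_per_device(320, 8, 5)
    print(f"expected batch time: {target:.1f} s")
    print(f"observed throughput: {observed:.1f} seq/s ({observed / 39:.0%} of the table value)")
    print(f"microbatches per device per step: {steps}")
```

Under these assumptions, the run reaches roughly 32 sequences per second instead of 39, so the gap to explain is about 18%.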
- Train memory/allocated_mem: 6.8051
- Train memory/active_mem: 6.8051
- Train memory/inactive_mem: 1.9065
- Train memory/reserved_mem: 14.6740
- Train memory/alloc_retries: 0
- Train loss/train/total: 11.6525
- Train metrics/train/LanguageCrossEntropy: 11.6525
- Train metrics/train/LanguagePerplexity: 114977.1562
- Train time/train: 0.0081
- Train time/val: 0.0000
- Train time/total: 0.0081
- Train lr-DecoupledAdamW/group0: 0.0000
- Train time/remaining_estimate: 0.0225
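One relationship among these values can be checked directly: perplexity is conventionally defined as the exponential of the cross-entropy loss (in nats), so the logged LanguagePerplexity should be close to exp(LanguageCrossEntropy). The sketch below checks that against the numbers above. The small mismatch is expected because the logged loss is rounded to four decimal places.

```python
import math

# Values copied from the log lines above.
logged_cross_entropy = 11.6525
logged_perplexity = 114977.1562

# Perplexity as the exponential of the cross-entropy loss.
recomputed_perplexity = math.exp(logged_cross_entropy)

# Relative difference between the recomputed and the logged value.
relative_error = abs(recomputed_perplexity - logged_perplexity) / logged_perplexity

if __name__ == "__main__":
    print(f"recomputed perplexity: {recomputed_perplexity:.1f}")
    print(f"relative difference from logged value: {relative_error:.2e}")
```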
Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
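As far as I can tell, the log lines listed above do not include a throughput value, so a TFLOP/s estimate also needs the wall-clock time per batch. A rough sketch follows, using the common approximation of 6 FLOPs per parameter per token for a combined forward and backward pass. It ignores the attention FLOPs, so the benchmark table may use a slightly different formula. The parameter count of 2.7e9 for mpt-3b and the 312 TFLOP/s bf16 peak for an A100 are assumptions supplied for illustration, and the function name is hypothetical.

```python
def model_tflops_per_gpu(
    n_params: float,
    global_batch_size: int,
    seq_len: int,
    batch_seconds: float,
    n_gpus: int,
) -> float:
    """Approximate achieved TFLOP/s per GPU using the 6 * params * tokens rule."""
    tokens_per_batch = global_batch_size * seq_len
    flops_per_batch = 6 * n_params * tokens_per_batch  # forward + backward, no attention term
    flops_per_second = flops_per_batch / batch_seconds
    return flops_per_second / n_gpus / 1e12


if __name__ == "__main__":
    # ASSUMPTIONS: ~2.7e9 parameters for mpt-3b, ~10 s per batch from the screenshot.
    achieved = model_tflops_per_gpu(
        n_params=2.7e9,
        global_batch_size=320,
        seq_len=2048,
        batch_seconds=10.0,
        n_gpus=8,
    )
    a100_peak_tflops = 312.0  # ASSUMPTION: dense bf16 peak of an A100
    print(f"approx. TFLOP/s per GPU: {achieved:.1f}")
    print(f"approx. utilization: {achieved / a100_peak_tflops:.1%}")
```

The estimate is only as good as the batch time fed into it, so averaging the time over several batches after warm-up should give a steadier number than a single reading.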
Here is the bash command that I am running:
composer train/train.py \
train/yamls/pretrain/mpt-3b.yaml \
data_local=my-copy-c4 \
train_loader.dataset.split=train_small \
eval_loader.dataset.split=val_small \
max_duration=10ba \
eval_interval=30ba \
save_folder=mpt-3b \
max_seq_len=2048 \
global_train_batch_size=320 \
device_train_microbatch_size=5
Closing as a duplicate of mosaicml/llm-foundry#444