mosaicml / examples

Fast and flexible reference benchmarks

Explain composer logs emitted during training + Replicate Benchmark Results

geodra opened this issue

❓ Question

Hello, I am training an mpt-3b model on AWS SageMaker on an ml.p4d.24xlarge instance and trying to replicate the results displayed in this table: link.

Specifically, I am focusing on replicating the result for the mpt-3b model with the following configuration: max_seq_len: 2048, global_train_batch_size=320, device_train_microbatch_size=5, and 8 a100_40gb GPUs. According to the table, this configuration should process 39 sequences per second, so a 320-sequence batch should ideally finish in about 8.2 seconds. However, when I run it, each batch takes around 10 seconds (see the screenshot attached below).
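For concreteness, here is the arithmetic behind that expectation (just a sanity check on my side, not something composer reports directly):

```python
batch_size = 320                 # global_train_batch_size
table_throughput = 39            # sequences/sec, from the benchmark table
expected_batch_time = batch_size / table_throughput     # ~8.2 seconds

observed_batch_time = 10         # seconds, from my run
observed_throughput = batch_size / observed_batch_time  # 32 sequences/sec
```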

I am also looking for an explanation of the logs that composer emits before the start of every batch. I have checked the documentation but couldn't find anything specific. I am particularly interested in the meaning of the following logs (see also the sketch after this list):

  • Train memory/allocated_mem: 6.8051
  • Train memory/active_mem: 6.8051
  • Train memory/inactive_mem: 1.9065
  • Train memory/reserved_mem: 14.6740
  • Train memory/alloc_retries: 0
  • Train loss/train/total: 11.6525
  • Train metrics/train/LanguageCrossEntropy: 11.6525
  • Train metrics/train/LanguagePerplexity: 114977.1562
  • Train time/train: 0.0081
  • Train time/val: 0.0000
  • Train time/total: 0.0081
  • Train lr-DecoupledAdamW/group0: 0.0000
  • Train time/remaining_estimate: 0.0225
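My working assumption, which I would like confirmed, is that the memory/* values come from Composer's MemoryMonitor callback reading torch.cuda.memory_stats() and converting bytes to GiB. A minimal sketch of that mapping (the torch stat keys are real PyTorch allocator stats; the correspondence to composer's metric names is my guess):

```python
import torch

def current_memory_gib() -> dict:
    """Sketch of how I assume the memory/* metrics are derived: read
    torch.cuda.memory_stats() and convert bytes to GiB. The stat keys
    below exist in PyTorch; the mapping to composer's metric names is
    my assumption, not confirmed from the composer source."""
    stats = torch.cuda.memory_stats()
    gib = 1024 ** 3
    return {
        "allocated_mem": stats["allocated_bytes.all.current"] / gib,      # live tensor memory
        "active_mem": stats["active_bytes.all.current"] / gib,            # blocks the allocator considers in use
        "inactive_mem": stats["inactive_split_bytes.all.current"] / gib,  # cached split blocks not in use
        "reserved_mem": stats["reserved_bytes.all.current"] / gib,        # total memory reserved from CUDA
        "alloc_retries": stats["num_alloc_retries"],                      # cudaMalloc retries after cache flush
    }
```

One relationship is visible directly in the numbers above: LanguagePerplexity is exp(LanguageCrossEntropy), since exp(11.6525) ≈ 114977.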

Lastly, I would like to know if there is an easy way to calculate TFLOP/s using the above logs.
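For what it's worth, a rough estimate seems possible from throughput alone using the common ~6 FLOPs per parameter per token approximation (plus an attention term), as in the PaLM paper appendix. A minimal sketch, where every model-shape number is an assumption to be read from the mpt-3b YAML config rather than from these logs, and which may differ from whatever llm-foundry's speed monitor actually computes:

```python
def estimate_tflops_per_sec(seqs_per_sec: float, seq_len: int,
                            n_params: float, n_layers: int, d_model: int) -> float:
    """Rough training throughput in TFLOP/s.

    Uses ~6 FLOPs per parameter per token for forward + backward, plus
    12 * n_layers * d_model * seq_len FLOPs per token for attention.
    All model-shape arguments are assumptions: read them from the
    mpt-3b YAML; they are not in the composer logs."""
    tokens_per_sec = seqs_per_sec * seq_len
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return tokens_per_sec * flops_per_token / 1e12

# Example with the observed ~32 seq/s (model-shape numbers are placeholders):
print(estimate_tflops_per_sec(seqs_per_sec=32, seq_len=2048,
                              n_params=2.7e9, n_layers=32, d_model=2560))
```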

Here is the bash command that I am running:

composer train/train.py \
  train/yamls/pretrain/mpt-3b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train_small \
  eval_loader.dataset.split=val_small \
  max_duration=10ba \
  eval_interval=30ba \
  save_folder=mpt-3b \
  max_seq_len=2048 \
  global_train_batch_size=320 \
  device_train_microbatch_size=5

[screenshot: composer training log output]

Closing as a duplicate of mosaicml/llm-foundry#444