mlfoundations / open_lm

A repository for research on medium-sized language models.

"Number of shards requested for a single epoch is more than the number of shards available" in the middle of a training run

afang-story opened this issue

ERROR:root:Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.

2024-01-03,15:18:27 | ERROR | Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
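
For intuition, here is a minimal sketch of the rounding effect the error message describes. This is not open_lm's actual code; the function name and the assumption that shards are split evenly across all dataloader workers are illustrative:

```python
import math

def estimated_shards_consumed(samples_requested, samples_per_shard, world_size, workers_per_gpu):
    """Illustrative only: estimate how many shards one epoch consumes when
    every dataloader worker must be handed a whole number of shards."""
    total_workers = world_size * workers_per_gpu
    # Raw requirement, assuming every shard holds samples_per_shard samples.
    raw_shards = math.ceil(samples_requested / samples_per_shard)
    # Round up so the shards split evenly across all workers. With many
    # workers and unevenly sized shards, this rounding can push the
    # requirement past the number of shards that actually exist.
    return math.ceil(raw_shards / total_workers) * total_workers
```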

Traceback (most recent call last):
  File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 841, in <module>
    main(sys.argv[1:])
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 717, in main
    ) = get_string_for_epoch(
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 293, in get_string_for_epoch
    return _single_epoch_string(num_samples, starting_points, paths, weights, num_workers_per_gpu, world_size)
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 424, in _single_epoch_string
    raise e
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 405, in _single_epoch_string
    shard_name = manifests[i][next_shard_per_source[i]]["shard"]
IndexError: list index out of range

I have tried decreasing the number of workers.

How many tokens is this training run for, and how many tokens/shards are there total?

These are two different runs. The goal is 138B tokens.
The datasets have 231B tokens across 10031 tars and 151B tokens across 6613 tars.

Training on 128 GPUs with 4 workers per GPU.
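
For reference, a rough back-of-the-envelope check with these numbers (purely illustrative: token counts stand in for sample counts, and open_lm's real accounting also subtracts the shards already consumed earlier in the run, which is why this can surface mid-training):

```python
import math

total_workers = 128 * 4  # 128 GPUs x 4 dataloader workers per GPU = 512 workers

# Second dataset: 151B tokens across 6613 tars, ~22.8M tokens per tar on average.
avg_tokens_per_shard = 151e9 / 6613
raw_shards = math.ceil(138e9 / avg_tokens_per_shard)                    # ~6044
rounded_shards = math.ceil(raw_shards / total_workers) * total_workers  # 6144
print(f"{rounded_shards} of 6613 shards")  # ~93% of the dataset on average,
# so uneven shard sizes and shards skipped earlier in the run can easily
# exhaust what remains.
```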

We've fixed this now, right @GeorgiosSmyrnis @afang-story?

I think it's fine to close this for now unless @afang-story disagrees. With the flag that allows multiple passes over the data, plus the improvements in tokenization, I believe this is no longer an issue.