mlfoundations / open_lm

A repository for research on medium-sized language models.

"Number of shards requested for a single epoch is more than the number of shards available" in the middle of a training run

afang-story opened this issue

ERROR:root:Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.

2024-01-03,15:18:27 | ERROR | Number of shards requested for a single epoch is more than the number of shards available. This means that the amount of data requested to train on is more than the dataloader can serve. This can either happen because there are not enough data to begin with, or data being skipped due to rounding errors. To alleviate the latter, consider making more uniform shards, and using less workers/GPUs. This will allow for better use of the dataset.
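
For intuition, here is a minimal sketch of the rounding effect the error message describes. This is not open_lm's actual code; the function name and the assumption that shards are split evenly across all dataloader workers are illustrative:

```python
import math

def estimated_shards_consumed(samples_requested, samples_per_shard, world_size, workers_per_gpu):
    """Illustrative only: estimate how many shards one epoch consumes when
    every dataloader worker must be handed a whole number of shards."""
    total_workers = world_size * workers_per_gpu
    # Raw requirement, assuming every shard holds samples_per_shard samples.
    raw_shards = math.ceil(samples_requested / samples_per_shard)
    # Round up so the shards split evenly across all workers. With many
    # workers and unevenly sized shards, this rounding can push the
    # requirement past the number of shards that actually exist.
    return math.ceil(raw_shards / total_workers) * total_workers
```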

Traceback (most recent call last):
  File "/miniconda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 841, in <module>
    main(sys.argv[1:])
  File "/mnt/task_runtime/open_lm/open_lm/main.py", line 717, in main
    ) = get_string_for_epoch(
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 293, in get_string_for_epoch
    return _single_epoch_string(num_samples, starting_points, paths, weights, num_workers_per_gpu, world_size)
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 424, in _single_epoch_string
    raise e
  File "/mnt/task_runtime/open_lm/open_lm/file_utils.py", line 405, in _single_epoch_string
    shard_name = manifests[i][next_shard_per_source[i]]["shard"]
IndexError: list index out of range

I have tried decreasing the number of workers.

How many tokens is this training run for, and how many tokens/shards are there total?

These are two different runs. The goal is 138B tokens.
The datasets have 231B tokens across 10031 tars and 151B tokens across 6613 tars.

Training on 128 GPUs with 4 workers per GPU.
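
For reference, a rough back-of-the-envelope check with these numbers (purely illustrative: token counts stand in for sample counts, and open_lm's real accounting also subtracts the shards already consumed earlier in the run, which is why this can surface mid-training):

```python
import math

total_workers = 128 * 4  # 128 GPUs x 4 dataloader workers per GPU = 512 workers

# Second dataset: 151B tokens across 6613 tars, ~22.8M tokens per tar on average.
avg_tokens_per_shard = 151e9 / 6613
raw_shards = math.ceil(138e9 / avg_tokens_per_shard)                    # ~6044
rounded_shards = math.ceil(raw_shards / total_workers) * total_workers  # 6144
print(f"{rounded_shards} of 6613 shards")  # ~93% of the dataset on average,
# so uneven shard sizes and shards skipped earlier in the run can easily
# exhaust what remains.
```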

We've fixed this now, right @GeorgiosSmyrnis @afang-story?

I think it's fine to close this for now unless @afang-story disagrees. With the flag that allows multiple passes over the data, plus the improvements in tokenization, I believe this is no longer an issue.