EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

accelerate doesn't work with auto:(>1)

ozgurcelik opened this issue

Hi. I noticed that accelerate launch works perfectly when I set batch_size = "auto" but gets stuck at the very end when I use batch_size = "auto:2". The problem persists whether I go through evaluator.simple_evaluate or call accelerate launch -m lm_eval from the terminal. This is a problem since different tasks may have different optimal batch sizes.

```
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:   0%|          | 0/1644 [00:00<?, ?it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  40%|████      |  657/1644 [00:27<00:26,  37.06it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  41%|████      |  673/1644 [00:28<00:26,  37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  42%|████▏     |  689/1644 [00:28<00:25,  37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  90%|█████████ | 1478/1644 [00:41<00:00, 209.33it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  97%|█████████▋| 1598/1644 [00:41<00:00, 234.70it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests: 100%|██████████| 1644/1644 [00:41<00:00,  39.17it/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3971.36 examples/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3402.27 examples/s]
```

This is what it looks like when it gets stuck with auto:2. It unnecessarily re-runs the batch-size search near the very end and never finishes the task.
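For concreteness, the Python-API path I mean looks roughly like this (a minimal sketch run under accelerate launch; the model and task names are just placeholders, not the exact setup):

```python
# Minimal sketch of the simple_evaluate path; model/task names are
# placeholders. Run under `accelerate launch` on multiple GPUs.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["hellaswag"],
    batch_size="auto",      # works fine
    # batch_size="auto:2",  # hangs near the end of the run
)
```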

You can just use auto; see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md for details.
auto:2 means the batch-size search is run twice 😅
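Schematically, auto:N re-runs the search N times over the length-sorted request list (a rough sketch of the idea, not the harness's actual code; detect_batch_size is a hypothetical stand-in for the internal search):

```python
# Rough sketch of the auto:N idea; `detect_batch_size` is a hypothetical
# stand-in for the harness's internal batch-size search, not a real API.
from typing import Callable, List

def run_with_auto_n(requests: List[str], n_searches: int,
                    detect_batch_size: Callable[[str], int]) -> None:
    # Requests run longest-first, so later chunks hold shorter samples
    # and may fit a larger batch.
    requests = sorted(requests, key=len, reverse=True)
    chunk = max(1, len(requests) // n_searches)
    for start in range(0, len(requests), chunk):
        bs = detect_batch_size(requests[start])  # search re-runs once per chunk
        for i in range(start, min(start + chunk, len(requests)), bs):
            batch = requests[i:i + bs]
            ...  # forward pass over `batch`
```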

Correct me if I'm wrong, but as the evaluation proceeds, the samples may get shorter, so more of them might fit in a batch later on. I was using auto:2 for exactly such cases, precisely because I want the maximum batch size to be searched again.

I also ran into this problem. After all the loglikelihood requests finish, the process hangs with no further output while CPU/GPU utilization stays pegged.
Mistral-7B-v0.1 on MMLU with auto:4 hits this problem, while hellaswag with auto:4 does not. Replacing auto:4 with auto solves it.
I believe there is a bug.

Hi! I'll look into this. I suspect the padding across ranks is slightly off somewhere, or that the batch sizes get out of sync between ranks.
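If it's the latter, one conventional way to keep ranks in lockstep would be to all-reduce the detected size with MIN so every rank settles on the same value and runs the same number of collective steps (just a sketch of the idea, not a committed fix):

```python
# Sketch: agree on one batch size across ranks so no rank executes a
# different number of collective steps. Uses torch.distributed directly.
import torch
import torch.distributed as dist

def synced_batch_size(local_bs: int, device: torch.device) -> int:
    if not (dist.is_available() and dist.is_initialized()):
        return local_bs
    t = torch.tensor([local_bs], dtype=torch.long, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MIN)  # smallest size found anywhere
    return int(t.item())
```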

@ozgurcelik, do you have a sample command that exhibits this problem?