huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

DROP Evaluation with Llama3 (vs. lm-evaluation-harness)

vipulraheja opened this issue

Evaluating Llama-3-8B on DROP with the standard configuration (3-shot) throws a warning indicating that the input size is greater than the maximum context size allowed by the model:

The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.
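For what it's worth, the overflow is easy to confirm outside lighteval by tokenizing a full 3-shot DROP prompt and comparing it to the model's context window. A minimal sketch (not lighteval code; it assumes access to the gated Llama-3 checkpoint):

from transformers import AutoTokenizer

# Minimal sketch: check whether a few-shot prompt fits Llama-3's context window.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
max_ctx = 8192  # Llama-3-8B maximum context size

prompt = "..."  # paste the full 3-shot DROP prompt (passages, questions, answers)
n_tokens = len(tokenizer(prompt)["input_ids"])
if n_tokens > max_ctx:
    print(f"Prompt is {n_tokens} tokens, larger than the {max_ctx}-token window")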

Here is the command I use:

accelerate launch --num_processes=1 run_evals_accelerate.py \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B" \
    --tasks "lighteval|drop|3|0" \
    --override_batch_size 16 \
    --output_dir "./log/"

I am able to reproduce this even after progressively reducing the batch size to 1.

Log:

WARNING:lighteval.logging.hierarchical_logger:    Model info: ModelInfo(model_name='meta-llama/Meta-Llama-3-8B', model_sha='561487d18c41c76bcb5fc6cfb73a324982f04f47', model_dtype='torch.bfloat16', model_size='15.08 GB')
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:15.762582]
WARNING:lighteval.logging.hierarchical_logger:  Tasks loading {
WARNING:lighteval.logging.hierarchical_logger:    If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger:    lighteval/drop_harness default
WARNING:lighteval.logging.hierarchical_logger:    Loading documents, and requests
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:34.926653]
WARNING:lighteval.logging.hierarchical_logger:  Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger:    setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:00.000371]
WARNING:lighteval.logging.hierarchical_logger:  Evaluation {
WARNING:lighteval.logging.hierarchical_logger:    Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger:    Running RequestType.GREEDY_UNTIL requests
Splits:   0%| | 0/4 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (10010) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.  0/38 [00:00<?, ?it/s]

The process then either stays stuck indefinitely until manually killed, or crashes as follows:

Note: the following traceback occurred even after reducing the batch size to 1 with --override_batch_size.

WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9262) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.00:39<10:55,  3.48it/s]
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9192) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.00:39<10:54,  3.48it/s]
WARNING:lighteval.logging.hierarchical_logger:    The smallest context of your batch (9538) is bigger than the maximum context size allowed by the model (8192) for a task in{'lighteval|drop|3'}. This is likely to lead to some errors.00:39<10:54,  3.48it/s]
Splits:   0%| | 0/4 [00:40<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger:  } [0:01:02.769634]
WARNING:lighteval.logging.hierarchical_logger:} [0:01:49.892104]
Traceback (most recent call last):
  File "/home/vipul.raheja/lighteval/run_evals_accelerate.py", line 82, in <module>
    main(args)
  File "/home/vipul.raheja/lighteval/src/lighteval/logging/hierarchical_logger.py", line 166, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/main_accelerate.py", line 111, in main
    evaluation_tracker = evaluate(
                         ^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/evaluator.py", line 86, in evaluate
    full_resps = lm.greedy_until(requests, override_bs=override_bs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vipul.raheja/lighteval/src/lighteval/models/base_model.py", line 570, in greedy_until
    max_new_tokens = min(self.max_length - biggest_context, max_new_tokens)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
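For reference, the line above clamps the generation budget to whatever room is left in the context window, and with the prompt sizes reported in the warning that budget is negative. A rough sketch of the arithmetic, where the 256-token request is an arbitrary placeholder rather than lighteval's actual default:

# Sketch of the arithmetic at base_model.py line 570, using the values from the log.
max_length = 8192           # model's maximum context size
biggest_context = 10010     # prompt length reported in the warning
requested_new_tokens = 256  # arbitrary placeholder for the task's generation budget

max_new_tokens = min(max_length - biggest_context, requested_new_tokens)
print(max_new_tokens)  # -1818: a negative generation budget, so generation cannot proceed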

Running the same evaluation directly in lm-evaluation-harness does not throw any such warning and proceeds at a reasonable speed.

~/lm-evaluation-harness$ lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks drop --device cuda:0 --batch_size 16
2024-04-21:20:19:29,714 INFO     [__main__.py:251] Verbosity set to INFO
2024-04-21:20:19:33,062 INFO     [__main__.py:335] Selected Tasks: ['drop']
2024-04-21:20:19:33,063 INFO     [evaluator.py:131] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-04-21:20:19:33,064 INFO     [evaluator.py:177] Initializing hf model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3-8B'}
2024-04-21:20:19:33,164 INFO     [huggingface.py:164] Using device 'cuda:0'
Loading checkpoint shards: 100%|█████████████████████████████| 4/4 [00:06<00:00,  1.62s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading builder script: 100%|█████████████████████████████| 7.46k/7.46k [00:00<00:00, 35.8MB/s]
Downloading readme: 100%|█████████████████████████████| 26.0/26.0 [00:00<00:00, 384kB/s]
Downloading data: 100%|█████████████████████████████| 8.31M/8.31M [00:00<00:00, 8.66MB/s]
Generating train split: 77409 examples [00:05, 13452.43 examples/s]
Generating validation split: 9536 examples [00:00, 11649.32 examples/s]
Map: 100%|█████████████████████████████| 77409/77409 [00:10<00:00, 7060.41 examples/s]
Map: 100%|█████████████████████████████| 9536/9536 [00:01<00:00, 4788.74 examples/s]
2024-04-21:20:20:11,675 INFO     [task.py:395] Building contexts for drop on rank 0...
100%|█████████████████████████████| 9536/9536 [00:03<00:00, 2793.13it/s]
2024-04-21:20:20:16,260 INFO     [evaluator.py:379] Running generate_until requests
Running generate_until requests:   9%|█████▊                             | 833/9536 [07:44<1:06:05,  2.19it/s]

Env:
transformers version: 4.39.3
Platform: Ubuntu 20.04.6 LTS
Python version: 3.11.9
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.2
Accelerate version: 0.29.2
Lighteval version: 0.4.0.dev0

IIRC, the harness just does not check whether the context fits within the max length of the model (the few-shot context is built here and used there; only the gold prediction must fit within the max length).
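Roughly, the idea there is to left-truncate an over-long prompt instead of erroring out. A minimal sketch of that behaviour (not the harness's actual code; the 32-token generation budget is an assumption):

from transformers import AutoTokenizer

# Hedged sketch: keep only the rightmost tokens so that the prompt plus the
# generation budget still fits in the model's context window.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
max_ctx, gen_budget = 8192, 32  # gen_budget is an assumed value, not a harness default

ids = tokenizer("...full 3-shot DROP prompt...")["input_ids"]
ids = ids[-(max_ctx - gen_budget):]  # left-truncate: drop tokens from the start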

We decided to print a warning when the context is too long for the max length, since that usually means the model will run into non-trivial issues. However, the bugs you're getting are not normal; I'll look into them.