huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

`.compute()` not compatible with `perplexity` for `text-generation`

hans-ekbrand opened this issue · comments

I'm trying to evaluate a model that does not fit on a single GPU and therefore needs to be loaded with device_map="auto" and torch_dtype=torch.float16. As far as I understand, the evaluator's compute() does not support passing arguments through to AutoModelForCausalLM.from_pretrained(), nor does it accept an already-instantiated model, so I had to resort to using a pipeline.

$ python3.11
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import evaluate, torch, accelerate
[2023-08-07 17:06:13,359] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
>>> from datasets import load_dataset
>>> model_id = "TheBloke/Llama-2-7b-chat-fp16"
>>> my_pipe = pipeline(model=model_id, device_map="auto", torch_dtype=torch.float16, max_length=256)
Loading checkpoint shards: 100%|██████████| 2/2 [00:37<00:00, 18.81s/it]
>>> my_data = load_dataset("Abirate/english_quotes", split='train').select([1])
>>> my_metric=evaluate.load("perplexity")
>>> my_evaluator = evaluate.evaluator(task="text-generation")
>>> result = my_evaluator.compute(model_or_pipeline=my_pipe,
...                               data=my_data, input_column = "quote",
...                               metric=my_metric,
...                               tokenizer=AutoTokenizer.from_pretrained(model_id))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hans/.local/lib/python3.11/site-packages/evaluate/evaluator/base.py", line 261, in compute
    metric_results = self.compute_metric(
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/hans/.local/lib/python3.11/site-packages/evaluate/evaluator/base.py", line 467, in compute_metric
    result = metric.compute(**metric_inputs, **self.METRIC_KWARGS)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 433, in compute
    self._finalize()
  File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 385, in _finalize
    file_paths, filelocks = self._get_all_cache_files()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 302, in _get_all_cache_files
    raise ValueError(
ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.

How can one compute perplexity with a custom pipeline?

As a user, I found a workaround: carefully using a customized metric. It works for me, so I am sharing it here.

Step 1: Somewhere on disk (say /path/to/somewhere), create a folder my_perplexity and, inside it, a Python file my_perplexity.py. Copy the entire content of the officially defined perplexity.py (e.g., here) into it.
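
For reference, here is a minimal sketch of Step 1 in code. It assumes the official implementation is hosted in the evaluate-metric/perplexity Space on the Hub (which is where evaluate.load("perplexity") resolves the module); adjust the destination path to your own setup.

import pathlib, shutil
from huggingface_hub import hf_hub_download

# Download the official perplexity.py as a starting point (repo id and
# filename are assumptions based on where evaluate resolves the metric).
src = hf_hub_download(
    repo_id="evaluate-metric/perplexity",
    filename="perplexity.py",
    repo_type="space",
)

# Place the copy where evaluate.load() will later find it.
dst_dir = pathlib.Path("/path/to/somewhere/my_perplexity")
dst_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst_dir / "my_perplexity.py")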

Step 2: Comment out the following two lines in the definition of _compute() in my_perplexity.py:

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

and add something like this:

model, tokenizer = model_id

This way, the customized model and tokenizer are piggybacked on the model_id argument. Note that model_id is the only argument you can repurpose without side effects in this context.

Step 3: In the original evaluation code, we can now use a customized model and tokenizer in this way:

perplexity = evaluate.load("/path/to/somewhere/my_perplexity", module_type="metric")
results = perplexity.compute(
    model_id=(model, tokenizer),  # the instantiated model and tokenizer
    predictions=predictions,      # a list of input texts
)
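
Putting it all together, here is a minimal end-to-end sketch, assuming my_perplexity.py has been patched as in Step 2 (depending on your evaluate version, you may also need to adjust the device handling in the copied script, since a model loaded with device_map="auto" is already dispatched):

import torch
import evaluate
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7b-chat-fp16"

# Load the model sharded across the available GPUs in half precision,
# i.e. the configuration the evaluator could not be given directly.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The same single quote used in the original report.
predictions = load_dataset("Abirate/english_quotes", split="train").select([1])["quote"]

# Load the patched local metric and piggyback the instantiated objects via model_id.
perplexity = evaluate.load("/path/to/somewhere/my_perplexity", module_type="metric")
results = perplexity.compute(model_id=(model, tokenizer), predictions=predictions)

# The official module returns "perplexities" (per input) and "mean_perplexity".
print(results["mean_perplexity"])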

Still, I hope the developers can support this directly; it should not be intrinsically difficult.