`.compute()` not compatible with `perplexity` for `text-generation`
hans-ekbrand opened this issue · comments
I'm trying to assess a model that does not fit on a single GPU and therefore needs to be loaded with `device_map="auto"` and `torch_dtype=torch.float16`. As far as I understand, `evaluate.evaluator.compute()` supports neither passing arguments through to `AutoModelForCausalLM.from_pretrained()` nor an already-instantiated model, so I had to resort to using a pipeline.
$ python3.11
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import evaluate, torch, accelerate
[2023-08-07 17:06:13,359] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
>>> from datasets import load_dataset
>>> model_id = "TheBloke/Llama-2-7b-chat-fp16"
>>> my_pipe = pipeline(model=model_id, device_map="auto", torch_dtype=torch.float16, max_length=256)
Loading checkpoint shards: 100%|██████████| 2/2 [00:37<00:00, 18.81s/it]
>>> my_data = load_dataset("Abirate/english_quotes", split='train').select([1])
>>> my_metric=evaluate.load("perplexity")
>>> my_evaluator = evaluate.evaluator(task="text-generation")
>>> result = my_evaluator.compute(model_or_pipeline=my_pipe,
... data=my_data, input_column = "quote",
... metric=my_metric,
... tokenizer=AutoTokenizer.from_pretrained(model_id))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/hans/.local/lib/python3.11/site-packages/evaluate/evaluator/base.py", line 261, in compute
metric_results = self.compute_metric(
^^^^^^^^^^^^^^^^^^^^
File "/home/hans/.local/lib/python3.11/site-packages/evaluate/evaluator/base.py", line 467, in compute_metric
result = metric.compute(**metric_inputs, **self.METRIC_KWARGS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 433, in compute
self._finalize()
File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 385, in _finalize
file_paths, filelocks = self._get_all_cache_files()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hans/.local/lib/python3.11/site-packages/evaluate/module.py", line 302, in _get_all_cache_files
raise ValueError(
ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.
How can one compute perplexity with a custom pipe?
As a user, I found a workaround: a customized metric. It works for me, so I am sharing it here.
Step 1: At some `/path/to/somewhere`, create a folder `my_perplexity`, and inside it create a Python file `my_perplexity.py`. Copy the entire content of the officially defined `perplexity.py` (e.g., here) into it.
Step 2: Comment out the following two lines in the definition of `_compute()` in `my_perplexity.py`:
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
and add something like this:
model, tokenizer = model_id
This way, the customized model and tokenizer are smuggled in through the `model_id` argument. Note that `model_id` is the only argument you can override without side effects in this context.
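To make the trick concrete: the edited `_compute()` simply unpacks whatever tuple arrives as `model_id`, instead of treating it as a string to load from the Hub. A minimal stand-in illustration (no transformers needed; `DummyModel` and `DummyTokenizer` are placeholders, and the real `_compute()` would go on to run the perplexity loop unchanged):

```python
def _compute(predictions, model_id):
    # Originally: model = AutoModelForCausalLM.from_pretrained(model_id)
    #             tokenizer = AutoTokenizer.from_pretrained(model_id)
    # After the edit, model_id carries a pre-built (model, tokenizer) pair:
    model, tokenizer = model_id
    # ... the rest of perplexity.py's _compute() runs unchanged ...
    return {"model": type(model).__name__, "tokenizer": type(tokenizer).__name__}

class DummyModel: pass
class DummyTokenizer: pass

result = _compute(["some text"], model_id=(DummyModel(), DummyTokenizer()))
print(result)  # {'model': 'DummyModel', 'tokenizer': 'DummyTokenizer'}
```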
Step 3: In the original evaluation code, we can now pass the customized model and tokenizer like this:
perplexity = evaluate.load("/path/to/somewhere/my_perplexity", module_type="metric")
results = perplexity.compute(
    model_id=(model, tokenizer),
    predictions=predictions  # a list of input texts
)
Still, I hope the developers can support this directly; there is no intrinsic difficulty in doing so.