confident-ai / deepeval

The LLM Evaluation Framework

Home Page: https://docs.confident-ai.com/


Can't use HuggingFace models for evaluation

Kraebs opened this issue · comments

When I follow the example on this page:
https://docs.confident-ai.com/docs/metrics-introduction

and try to use Mistral-7B as the evaluation model, I always get the error below when running the exact code from the tutorial.
It seems there is a mistake in the code when HuggingFace models are used for evaluation instead of ChatGPT.

Error:

JSONDecodeError Traceback (most recent call last)
File ~/.conda/envs/evaluation/lib/python3.12/site-packages/deepeval/metrics/utils.py:58, in trimAndLoadJson(input_string, metric)
57 try:
---> 58 return json.loads(jsonStr)
59 except json.JSONDecodeError:

File ~/.conda/envs/evaluation/lib/python3.12/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
--> 346 return _default_decoder.decode(s)
347 if cls is None:

File ~/.conda/envs/evaluation/lib/python3.12/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj

JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[4], line 18
...
---> 63 raise ValueError(error_str)
64 except Exception as e:
65 raise Exception(f"An unexpected error occurred: {str(e)}")

ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
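
For context, json.loads raises exactly this "Extra data" error whenever anything follows the first JSON value in the string, which is what happens when the decoded generation contains more than a bare JSON object (for example the echoed prompt, special tokens, or trailing text). A minimal stand-alone illustration, using only the standard json module:

import json

# A single JSON object parses fine.
json.loads('{"statements": ["We offer a 30-day full refund."]}')

# The same object followed by any extra text reproduces the error above.
json.loads('{"statements": ["We offer a 30-day full refund."]}\ntrailing text')
# -> json.decoder.JSONDecodeError: Extra data: ...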

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM
import asyncio

class Mistral7B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        output = self.tokenizer.batch_decode(generated_ids)[0]
        # result = f"{{ {output} }}"
        return output

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=mistral_7b,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Thanks for the help in advance and all the best!

Hey @Kraebs can you try running the model outside of any metric to see if there are any errors?

I encounter the same problem when using Mistral-7B-Instruct-v0.2.
Also, I'm wondering whether I need to add the special tokens like [INST] and [/INST] that the Mistral-Instruct models expect to the implementation.
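
If it is the prompt format, a possible adjustment (untested, and assuming a transformers version that supports apply_chat_template) would be to wrap the metric's prompt in the tokenizer's chat template inside generate, so the [INST] ... [/INST] markers are added automatically:

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        device = "cuda"

        # apply_chat_template wraps the prompt in the [INST] ... [/INST] format
        # defined by the Mistral-Instruct tokenizer.
        chat_prompt = self.tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
        )

        # add_special_tokens=False because the rendered template already starts
        # with the BOS token.
        model_inputs = self.tokenizer(
            [chat_prompt], return_tensors="pt", add_special_tokens=False
        ).to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        # skip_special_tokens drops markers like <s> and </s> from the decoded text.
        return self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]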

Same issue for me with another model.

@hyusterr @TheDominus Try running it outside of any metric. If you can't run model.generate() as shown in the docs, you know where the problem is.
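
For example, something like this (using the wrapper from the original post) to check whether the raw call works and what the output actually looks like:

# Call the custom wrapper directly, outside any metric, and inspect the raw text.
raw = mistral_7b.generate("Reply with a JSON object containing a single key 'statements'.")
print(repr(raw))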

The same happens to me using SummarizationMetric with default values.

Hi, I'm facing the same issue: outside the metrics the model is able to generate via model.generate(), but not with the metrics.

Facing the same issue: model.generate works but metric.measure doesn't. Here somebody provided a solution, but I couldn't understand it. Does anybody else?

I also encountered this error. I just followed the instructions on the official website (https://docs.confident-ai.com/docs/metrics-introduction). Has anyone been able to solve it?

@akashlp27 @FaizaQamar @MINJIK01 Can you show the error messages?

My error is here.

============================================================================================================================ ERRORS =============================================================================================================================
______________________________________________________________________________________________________________ ERROR collecting test_mistral7b.py _______________________________________________________________________________________________________________
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:63: in trimAndLoadJson
return json.loads(jsonStr)
../../anaconda3/envs/graph_llm/lib/python3.10/json/__init__.py:346: in loads
return _default_decoder.decode(s)
../../anaconda3/envs/graph_llm/lib/python3.10/json/decoder.py:340: in decode
raise JSONDecodeError("Extra data", s, end)
E json.decoder.JSONDecodeError: Extra data: line 4 column 1 (char 110)

During handling of the above exception, another exception occurred:
test_mistral7b.py:64: in <module>
metric.measure(test_case)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:67: in measure
self.statements: List[str] = self._generate_statements(
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/answer_relevancy/answer_relevancy.py:229: in _generate_statements
data = trimAndLoadJson(res, self)
../../anaconda3/envs/graph_llm/lib/python3.10/site-packages/deepeval/metrics/utils.py:68: in trimAndLoadJson
raise ValueError(error_str)
E ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
==================================================================================================================== short test summary info ====================================================================================================================
ERROR test_mistral7b.py - ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================================================================= 4 warnings, 1 error in 17.68s =================================================================================================================
No test cases found, please try again.


I am using the "prometheus-eval/prometheus-7b-v2.0" model and am encountering the same problem. When I run the model without the metrics, it works fine. However, when I run it with the metrics, it throws ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
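
For anyone hitting this, one workaround (a rough sketch, not an official deepeval fix) is to decode only the newly generated tokens and return just the first balanced JSON object, so the metric's json.loads call never sees the echoed prompt or any trailing text:

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        device = "cuda"

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        # Decode only the tokens generated after the prompt, not the echoed prompt itself.
        new_tokens = generated_ids[0][model_inputs["input_ids"].shape[1]:]
        output = self.tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Naively extract the first balanced {...} block so json.loads does not
        # trip over any remaining text before or after it ("Extra data").
        # (Ignores braces inside strings; good enough for a quick test.)
        start = output.find("{")
        if start != -1:
            depth = 0
            for i, ch in enumerate(output[start:], start):
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        return output[start:i + 1]
        return output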