confident-ai / deepeval

The LLM Evaluation Framework

Home Page: https://docs.confident-ai.com/

Implement an AnswerCorrectness Metric

AndresPrez opened this issue

Is your feature request related to a problem? Please describe.
When dealing with evolving Q&A applications, it is important to have a way of evaluating whether and how the answers to a golden set of questions change over time. For this, an Answer Correctness metric that lets an LLM compare an expected answer with an actual answer would be a good way to detect when the actual answers for the golden set questions are changing or breaking.

Describe the solution you'd like
Add support for a deepeval metric that performs the evaluation described above.

Describe alternatives you've considered
Ragas implements an AnswerCorrectness metric that uses both LLMs and embedding similarity. But it does not give you a reason, and the fact that it includes embedding similarity skews the final score. Embedding similarity is not a reliable way to check that two differently worded answers say the same thing.

@AndresPrez Agreed with the embedding + similarity bit. Have you tried using GEval for this with strict=true? I'm sure it will give good results.
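For reference, a minimal sketch of what a GEval-based answer correctness check could look like. The criteria wording and example inputs here are made up, and the exact parameter name (`strict_mode` vs. `strict`) may differ between deepeval versions:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# An answer-correctness style metric built on GEval: the LLM judge compares
# the actual output against the expected (golden) output.
correctness_metric = GEval(
    name="Answer Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent with "
        "the expected output for the given input."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    strict_mode=True,  # binary pass/fail scoring (assumed parameter name)
)

test_case = LLMTestCase(
    input="What is the capital of France?",              # hypothetical example
    actual_output="Paris is the capital of France.",
    expected_output="The capital of France is Paris.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score)   # 0 or 1 in strict mode
print(correctness_metric.reason)  # LLM-generated explanation of the verdict
```

Unlike the ragas AnswerCorrectness metric, this relies purely on LLM judgment (no embedding similarity) and surfaces a reason alongside the score.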

@penguine-ip g-eval is actually pretty nice! thanks for the advice ❤️

@AndresPrez No problem! I'm actually going to push out more tutorials soon on the documentation page, I feel like the full capabilities of deepeval mostly go undiscovered 😅 I'll be including Answer Correctness for sure

@penguine-ip - thanks for the explanation on this - I too got a bit fooled because the docs state that ragas metrics are supported, and AnswerCorrectness is one of the ragas metrics - so there was a bit of internal translation to do to get there.

@gtmtech Yes, but we eventually removed most of them except for the flagship ragas metrics, since they weren't very good or well maintained