confident-ai / deepeval

The LLM Evaluation Framework

Home Page: https://docs.confident-ai.com/

Implement an AnswerCorrectness Metric

AndresPrez opened this issue

Is your feature request related to a problem? Please describe.
When dealing with evolving Q&A applications, it is important to have a way of evaluating whether and how the answers to a golden set of questions change over time. For this, an Answer Correctness metric that lets an LLM compare an expected answer with an actual answer would be a good way to detect when the actual answers for the golden set questions are changing or breaking.

Describe the solution you'd like
Add support for a deepeval metric that performs the evaluation described above.

Describe alternatives you've considered
Ragas implements an AnswerCorrectness metric that uses both LLMs and embedding similarity. But it does not give you a reason, and the fact that it includes embedding similarity skews the final score. Embedding similarity is not a reliable way to check that two differently worded answers say the same thing.

@AndresPrez Agreed with the embedding + similarity bit. Have you tried using GEval for this with strict=true? I'm sure it will give good results.
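For reference, a minimal sketch of what a GEval-based answer correctness check could look like. The criteria wording and example inputs here are made up, and the exact parameter name (`strict_mode` vs. `strict`) may differ between deepeval versions:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# An answer-correctness style metric built on GEval: the LLM judge compares
# the actual output against the expected (golden) output.
correctness_metric = GEval(
    name="Answer Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent with "
        "the expected output for the given input."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    strict_mode=True,  # binary pass/fail scoring (assumed parameter name)
)

test_case = LLMTestCase(
    input="What is the capital of France?",              # hypothetical example
    actual_output="Paris is the capital of France.",
    expected_output="The capital of France is Paris.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score)   # 0 or 1 in strict mode
print(correctness_metric.reason)  # LLM-generated explanation of the verdict
```

Unlike the ragas AnswerCorrectness metric, this relies purely on LLM judgment (no embedding similarity) and surfaces a reason alongside the score.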

@penguine-ip g-eval is actually pretty nice! thanks for the advice ❤️

@AndresPrez No problem! I'm actually going to push out more tutorials soon on the documentation page, I feel like the full capabilities of deepeval mostly go undiscovered 😅 I'll be including Answer Correctness for sure

@penguine-ip - thanks for the explanation on this - I too got a bit fooled because the docs state that ragas metrics are supported, and AnswerCorrectness is one of the ragas metrics - so there was a bit of internal translation to do to get there.

@gtmtech Yes, but we eventually removed most of them except for the flagship ragas metrics, since they weren't very good or well maintained