confident-ai / deepeval

The LLM Evaluation Framework

Home Page: https://docs.confident-ai.com/

Hallucination test is really a test of faithful summarization, not of hallucination-absence

prescod opened this issue

Describe the bug
My definition of a hallucination is a fact that occurs in the output but does not appear in the context.

As the docs say, a "contradicted context".

But DeepEval's prompt is steering the evaluation LLM into treating MISSING facts as contradictions.

So if the context lists 10 facts about Paris and the output repeats only 5 of them, that counts as a contradiction.

This part of the prompt is being ignored by GPT-4:

You should FORGIVE cases where the actual output is lacking in detail, you should ONLY provide a 'no' answer if IT IS A CONTRADICTION.
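For contrast, here is a minimal sketch of the scoring I would expect if that instruction were actually followed: only outright contradictions count against the output, and merely missing detail is forgiven. The verdict labels and helper below are illustrative only, not deepeval internals.

from typing import List


def contradiction_only_score(verdicts: List[str]) -> float:
    """Fraction of per-statement verdicts that are outright contradictions.

    'missing' (a context fact the output merely omits) is forgiven;
    only 'contradiction' counts against the output.
    """
    if not verdicts:
        return 0.0
    return sum(v == "contradiction" for v in verdicts) / len(verdicts)


# test_subsetting_succeeds below: 2 facts repeated, 8 omitted, none contradicted
print(contradiction_only_score(["aligned"] * 2 + ["missing"] * 8))  # 0.0

# test_hallucination_fails below: 10 facts repeated plus 1 fabricated claim
print(contradiction_only_score(["aligned"] * 10 + ["contradiction"]))  # ~0.09

Note that even under this scheme, plain averaging dilutes a single blatant fabrication down to ~0.09, which squeaks under a 0.1 threshold; arguably one outright contradiction should dominate the score rather than be averaged away.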

To Reproduce

from deepeval import assert_test
import pytest
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase


context = [
    "Paris is the capital of France.",
    "The city has a population of over 2.1 million people.",
    "The Eiffel Tower is 324 meters (1,063 feet) tall.",
    "Paris is home to over 150 museums.",
    "The Louvre Museum attracts over 10 million visitors annually.",
    "The River Seine runs for 13 kilometers (8 miles) through the city.",
    "Paris has 20 arrondissements (districts).",
    "The city is home to the famous Notre-Dame Cathedral.",
    "The Champs-Élysées is one of the most famous avenues in the world.",
    "Paris is known as the City of Love and the City of Light.",
]
llm_response = "\n".join(context)
hallucinatory_response = f"{llm_response}\nParis is the bowling capital of the world"


def test_hallucination_fails():
    "This test should fail because there is a hallucination added. 10 correct facts and 1 hallucination."
    test_case = LLMTestCase(
        input="",
        actual_output=hallucinatory_response,
        context=context,
    )
    with pytest.raises(AssertionError):
        assert_test(test_case, [HallucinationMetric(threshold=0.1)])


def test_subsetting_succeeds():
    "This test should succeed because there is no hallucination. 2 of the 10 facts are included."
    good_response = "\n".join(context[0:2])
    test_case = LLMTestCase(
        input="", actual_output=good_response, context=context
    )
    assert_test(test_case, [HallucinationMetric(threshold=0.1)])
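Running this file with plain pytest reproduces the table below: assert_test raises an AssertionError whenever a metric score misses its threshold, which is why pytest.raises is used above. deepeval's own runner (deepeval test run <file>) should behave the same way.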

Expected behavior
The test named test_hallucination_fails should get a high hallucination score, because there is a blatant hallucination in the output. But the metric passes (0.09), because the one fabrication is averaged away against the many correct facts.

The test named test_subsetting_succeeds should get a low hallucination score, because every fact in the output appears in the context, but it gets a very high score (0.8).

┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test case                ┃ Metric        ┃ Score                                                                                ┃ Status ┃ Overall Success Rate ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ test_hallucination_fails │               │                                                                                      │        │ 100.0%               │
│                          │ Hallucination │ 0.09 (threshold=0.1, evaluation model=gpt-4-turbo, reason=The hallucination score is │ PASSED │                      │
│                          │               │ 0.09 because the actual output largely aligns with the provided factual context      │        │                      │
│                          │               │ regarding multiple widely recognized aspects of Paris, such as its landmarks,        │        │                      │
│                          │               │ geography, and cultural attributes, with only one noted contradiction about Paris    │        │                      │
│                          │               │ being the 'bowling capital of the world,' which is not supported by the given        │        │                      │
│                          │               │ context. This minor inaccuracy contributes to a low but not zero hallucination       │        │                      │
│                          │               │ score., error=None)                                                                  │        │                      │
│                          │               │                                                                                      │        │                      │
│ test_subsetting_succeeds │               │                                                                                      │        │ 0.0%                 │
│                          │ Hallucination │ 0.8 (threshold=0.1, evaluation model=gpt-4-turbo, reason=The score is 0.80 because   │ FAILED │                      │
│                          │               │ the actual output includes significant factual omissions regarding key details about │        │                      │
│                          │               │ Paris, such as landmarks and cultural statistics, which indicates a high level of    │        │                      │
│                          │               │ factual incompleteness relative to the context provided. Despite the output being    │        │                      │
│                          │               │ factually correct on some basic aspects like the capital status and population, the  │        │                      │
│                          │               │ numerous omissions contribute to a high hallucination score., error=None)            │        │                      │
└──────────────────────────┴───────────────┴──────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────┘
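
For what it's worth, both reported scores are consistent with the evaluator simply averaging per-statement verdicts. This is inferred from the numbers alone, not confirmed against deepeval's source:

# Inferred from the reported scores, not verified in deepeval's source:
# both numbers fall out of plain verdict averaging.

# test_subsetting_succeeds: 8 of the 10 context facts are absent
print(8 / 10)            # 0.8, the reported score

# test_hallucination_fails: 1 fabricated claim among 11 output statements
print(round(1 / 11, 2))  # 0.09, the reported score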

Context (please complete the following information):
Default model: GPT-4
deepeval 0.21.36