princeton-nlp / tree-of-thought-llm

[NeurIPS 2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Home Page: https://arxiv.org/abs/2305.10601

Requirement check missed in the evaluation of the text task

wangruicn opened this issue · comments

When assessing output quality for the text task, only the coherency score is taken into account.
However, the generated passage does not always satisfy the requirement that each paragraph must end with one of the 4 given sentences.

tree-of-thought-llm/src/tot/tasks/text.py

import re  # needed for the score-parsing regex below

def test_output(self, idx: int, output: str):
    # Extract the generated passage and ask GPT-4 to score it 5 times
    output = output.split('Passage:\n')[-1]
    prompt = score_prompt + output
    score_outputs = gpt(prompt, n=5, model='gpt-4')
    scores = []
    for score_output in score_outputs:
        # Only the coherency score is parsed; the end-sentence
        # requirement of the task is never checked here.
        pattern = r".*coherency score is (\d+).*"
        match = re.match(pattern, score_output, re.DOTALL)
        if match:
            score = int(match.groups()[0])
            scores.append(score)
        else:
            print(f'------------------score no match: {[score_output]}')
    print(scores)
    info = {'rs': scores, 'r': sum(scores) / len(scores) if scores else 0}
    return info
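A format check could verify the missing requirement directly, without an extra model call. The sketch below is hypothetical (not part of the repo): it assumes the passage separates paragraphs with blank lines and that the 4 required end sentences are available in order.

```python
def check_ends(passage: str, end_sentences: list) -> bool:
    """Check that each paragraph ends with its required sentence.

    Hypothetical helper: `end_sentences` is assumed to hold the required
    final sentences, one per paragraph, in order.
    """
    paragraphs = [p.strip() for p in passage.split('\n\n') if p.strip()]
    if len(paragraphs) != len(end_sentences):
        return False
    return all(p.endswith(s.strip()) for p, s in zip(paragraphs, end_sentences))
```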

You are right: the evaluation focuses only on coherency. Empirically, I think the format requirement is satisfied most of the time. It should be easy to add new metrics and adapt ToT to perform such self-evaluation.
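One way such a metric could be folded into the returned `info` dict is sketched below. This is an assumption about how one might combine the scores, not the repo's implementation: it gates the averaged coherency score on a boolean format check.

```python
def combine(scores: list, format_ok: bool) -> dict:
    # Hypothetical combination: keep the raw coherency scores, but zero out
    # the aggregate reward 'r' when the end-sentence requirement fails.
    avg = sum(scores) / len(scores) if scores else 0
    return {'rs': scores, 'r': avg if format_ok else 0, 'format_ok': format_ok}
```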