Requirement check missed in the evaluation of the text task
wangruicn opened this issue
WangRui commented
When the quality of outputs for the text task is assessed, only the coherency score is taken into account.
However, the generated passage does not always satisfy the task requirement that each of the 4 paragraphs must end with the corresponding given sentence, and the evaluation does not check this.
tree-of-thought-llm/src/tot/tasks/text.py
    def test_output(self, idx: int, output: str):
        output = output.split('Passage:\n')[-1]
        prompt = score_prompt + output
        score_outputs = gpt(prompt, n=5, model='gpt-4')
        scores = []
        for score_output in score_outputs:
            # print(score_output)
            pattern = r".*coherency score is (\d+).*"
            match = re.match(pattern, score_output, re.DOTALL)
            if match:
                score = int(match.groups()[0])
                scores.append(score)
            else:
                print(f'------------------score no match: {[score_output]}')
        print(scores)
        # print('------------')
        info = {'rs': scores, 'r': sum(scores) / len(scores) if scores else 0}
        return info
Shunyu Yao commented
You are right, the evaluation focuses only on coherency. Empirically, I think the format is satisfied most of the time. It should be easy to add new metrics and to adapt ToT to include such a self-evaluation.
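For anyone who wants to try this, here is a minimal sketch of such a format metric. It is not part of the repository: the helper names `ends_with_sentences` and `format_score` are hypothetical, and it assumes that paragraphs in the generated passage are separated by newlines (with no line breaks inside a paragraph) and that the required ending sentences are available as a list of strings.

```python
from typing import List


def ends_with_sentences(passage: str, sentences: List[str]) -> List[bool]:
    """For each required sentence, report whether the matching paragraph ends with it."""
    # Assumption: one paragraph per non-empty line of the passage.
    paragraphs = [p.strip() for p in passage.split('\n') if p.strip()]
    # A wrong paragraph count means the format is violated for every sentence.
    if len(paragraphs) != len(sentences):
        return [False] * len(sentences)
    return [p.endswith(s.strip()) for p, s in zip(paragraphs, sentences)]


def format_score(passage: str, sentences: List[str]) -> float:
    """Fraction of paragraphs that end with their required sentence (0.0 to 1.0)."""
    checks = ends_with_sentences(passage, sentences)
    return sum(checks) / len(checks) if checks else 0.0


# Small illustration with two made-up sentences instead of four.
sentences = ["It was raining.", "She smiled."]
passage = (
    "The sky darkened over the town. It was raining.\n\n"
    "He handed her the umbrella. She frowned."
)
print(format_score(passage, sentences))  # 0.5
```

The result could be reported next to the coherency scores, for example as an extra key in the `info` dict returned by `test_output`. How the given sentences for a particular `idx` are retrieved depends on how the task stores its inputs, so that step is left out here.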