manyoso / haltt4llm

This project is an attempt to create a common metric for testing LLMs' progress in eliminating hallucinations, which is currently the most serious obstacle to widespread adoption of LLMs for many real purposes.

Automated reporting script?

AngainorDev opened this issue

I'm a bit confused by the metrics the test reports (score and number of wrong answers)
vs. what is in the comparative table (different metrics).

Is there a ready-made script to parse the test outputs and compute the various metrics?

The score is output at the end of the run, and you can see the total score in the file. To calculate the percentage, take total_score / (number_of_questions * 2). The number of correct answers and the number of uncertain answers are also in the files. That's how you reconcile with the comparative table.
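A minimal sketch of that calculation in Python, just to make the formula concrete; the variable names are illustrative and not taken from the repo's code:

```python
# Minimal sketch of the percentage calculation described above.
# total_score and number_of_questions are assumed to be read from the
# test run's output; the names are illustrative, not from the repo.

def score_percentage(total_score: int, number_of_questions: int) -> float:
    # Each question is worth a maximum of 2 points, so the top score is 2 * N.
    return total_score / (number_of_questions * 2) * 100


# Example: 1200 points over 700 questions -> 1200 / 1400 * 100 = ~85.7%
print(f"{score_percentage(1200, 700):.1f}%")
```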

Oh, my bad.
I had to check the code to see that they were indeed there.

I was hoping for metrics in the header, rather than interleaved with the list of incorrect/idk questions.

Neither the number of correct answers nor the dataset length is in the output file, btw.
I'll edit my fork for clarity before running more tests, thanks!

https://github.com/manyoso/haltt4llm/blob/main/results/test_results_fake_trivia_questions.json_alpaca-lora-4bit.txt shows the total score, whose maximum is dataset length * 2, and also shows the number of correct answers as well as the number of incorrect and unknown answers. The latest code does this.
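For anyone who still wants a small reporting helper, here is a hypothetical sketch that scrapes those summary numbers out of a results file. The label patterns are assumptions about the file layout, not the repo's documented format, so check the actual file and adjust them:

```python
# Hypothetical sketch: pull the summary numbers out of a results file.
# The label patterns below are assumptions about the file layout, not the
# repo's documented format; adjust them to match the real output.
import re
from pathlib import Path


def parse_results(path: str) -> dict:
    text = Path(path).read_text()
    patterns = {
        "total_score": r"\btotal score\b\D*(\d+)",
        "correct": r"\bcorrect\b\D*(\d+)",
        "incorrect": r"\bincorrect\b\D*(\d+)",
        "unknown": r"\bunknown\b\D*(\d+)",
    }
    summary = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        summary[name] = int(match.group(1)) if match else None
    return summary


# Example usage (path from the linked results directory):
# print(parse_results("results/test_results_fake_trivia_questions.json_alpaca-lora-4bit.txt"))
```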