allenai / strategyqa-evaluator

Evaluator for the StrategyQA dataset (AI2 Israel, Aristo)

StrategyQA Evaluator

This repo hosts the evaluator for the StrategyQA leaderboard. You can read about StrategyQA on the dataset page.

The evaluator scores predictions provided in JSON format and writes the resulting scores to a JSON file.

Testing the evaluator

Run test.sh to build and test the evaluator.

The test will score the prediction files answers_file_small.json, decomps_file_small.json and paras_file_small.json against the gold annotations in gold_small.json. If everything is okay, then the test will pass.

(These gold and prediction JSON files are representative of the real gold and prediction files, but they contain only 10 examples each, hence the name "small".)

Running the evaluator locally

You can follow the steps in test.sh to build and run the evaluator yourself using Docker.

If you want to run the evaluator outside of Docker, look in the evaluator directory and first install the dependencies specified in requirements.txt. Then run eval.py as shown in the test.sh script.

Submitting to the Leaderboard

The file predictions_dummy.json is a valid dummy submission file for the StrategyQA leaderboard. It contains predictions for 490 questions. If you submit it, you'll get this dummy score:

  • Accuracy: 0.46122448979591835
  • SARI: 0.42750331591054463
  • Recall@10: 0.0
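To give a feel for what two of these numbers mean, here is a minimal sketch of accuracy and Recall@10 in Python. It is not the code in eval.py: the function names, the question-ID-to-answer mapping, and the paragraph-ID lists are all assumptions made for illustration, and the evaluator's own definitions (in particular how it aggregates Recall@10 across questions) may differ. SARI is left out because it is considerably more involved.

```python
from typing import Dict, List, Sequence


def accuracy(gold: Dict[str, bool], predicted: Dict[str, bool]) -> float:
    """Fraction of gold questions whose predicted answer matches the gold answer.

    Both arguments map a (hypothetical) question ID to a yes/no answer.
    A question with no prediction counts as wrong.
    """
    if not gold:
        return 0.0
    correct = sum(1 for qid, answer in gold.items() if predicted.get(qid) == answer)
    return correct / len(gold)


def recall_at_k(gold_ids: Sequence[str], ranked_ids: List[str], k: int = 10) -> float:
    """Fraction of gold paragraph IDs that appear among the first k predicted IDs."""
    if not gold_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return sum(1 for pid in gold_ids if pid in top_k) / len(gold_ids)
```

Under this definition, the dummy accuracy above is what you would get from 226 correct answers out of 490 questions (226 / 490).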

To submit your own predictions to the StrategyQA leaderboard, produce a JSON file like predictions_dummy.json with your predictions, and submit it.
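Before writing your own file, it can help to look at how predictions_dummy.json is laid out. The sketch below only assumes that the file's top level is a JSON object or array; it does not assume any field names, and the helper summarize_predictions is hypothetical, not part of this repo.

```python
import json
from typing import Any, Tuple


def summarize_predictions(path: str) -> Tuple[int, Any]:
    """Return the number of top-level entries and the first entry of a JSON file.

    Assumes the top level is a JSON object or array (an assumption, not
    something this README states).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        # For an object, the first entry is a (key, value) pair.
        first = next(iter(data.items()), None)
    else:
        first = data[0] if data else None
    return len(data), first


if __name__ == "__main__":
    count, first = summarize_predictions("predictions_dummy.json")
    print(count)
    print(first)
```

If each top-level entry corresponds to one question, the count for the dummy file should be 490, and the first entry shows the shape your own predictions need to follow.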
