
ZeroSCROLLS

This repository contains code to run inference on the ZeroSCROLLS benchmark.

Setup

  • Install torch
  • Install transformers 4.30.2
  • pip install -r requirements.txt
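
After installing, a quick optional sanity check that the pinned version is in place:

import torch
import transformers

# The repo targets transformers 4.30.2; fail loudly if another version is installed.
assert transformers.__version__ == "4.30.2", f"found transformers {transformers.__version__}"
print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")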

Load the data

from datasets import load_dataset

gov_report = load_dataset("tau/zero_scrolls", "gov_report", split="test")
"""
Options are: ["gov_report", "summ_screen_fd", "qmsum", "squality", "qasper", "narrative_qa", "quality", "musique", "space_digest", "book_sum_sort"]
There is also a small "validation" split (~20 examples per task), meant for eyeballing purposes
"""

Inference with Hugging Face models

python experiments/hf/run_hf_model.py --model-name=google/flan-t5-small

Supported models:

  • google/flan-t5-small
  • google/flan-t5-base
  • google/flan-t5-large
  • google/flan-t5-xl
  • google/flan-t5-xxl
  • google/flan-ul2
  • bigscience/T0pp

To add new models:

  • Add them to model_to_max_input_tokens in experiments/hf/run_hf_model.py
  • Make sure to load them with the appropriate architecture, i.e. modify the model initialization from T5ForConditionalGeneration in the same file if needed (see the sketch after this list)
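
As a rough illustration only (the real dict and loading code live in experiments/hf/run_hf_model.py; the model name and token limit below are placeholders):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_to_max_input_tokens = {
    # ... existing entries ...
    "my-org/my-long-model": 16384,  # hypothetical new entry: model name -> max input tokens
}

model_name = "my-org/my-long-model"  # placeholder; use a real Hub checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# If the new model is not a T5-style encoder-decoder, swap in the matching class
# (e.g. AutoModelForCausalLM) in place of T5ForConditionalGeneration.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)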

Inference with APIs

To run with models used in the paper*:

# if you want to use openai models
export OPENAI_API_KEY=<insert token here> 
export OPENAI_ORG=<insert org here>

# if you want to use anthropic models
export ANTHROPIC_API_KEY=<insert token here>

# if you want to limit the number of examples to run per task
export MAX_EXAMPLES=10

python experiments/api/run_api_model.py --model_name=gpt-3.5-turbo --limit_to_n_examples=$MAX_EXAMPLES

*These models and APIs tend to update, see the paper for the versions used in the baselines.

Models supported:

  • text-davinci-003
  • gpt-3.5-turbo
  • gpt-4
  • claude-v1
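
As a rough sketch only (run_api_model.py wraps this logic; the prompt below is a placeholder for a ZeroSCROLLS input), a single call with the pre-1.0 openai Python client looks like this:

import os
import openai  # pre-1.0 client; the interface changed in openai>=1.0

openai.api_key = os.environ["OPENAI_API_KEY"]
openai.organization = os.environ.get("OPENAI_ORG")

prompt = "..."  # placeholder: put a ZeroSCROLLS example's input text here

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])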

To add a new API:

When using a prompt that includes an opening XML tag (e.g. "... Assistant: <answer>"), make sure to post-process the generations so that only the text generated before the model's closing XML tag is kept before submitting.
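
For example, a minimal post-processing sketch (the <answer> tag follows the example above; the helper name is ours):

def strip_after_closing_tag(generation: str, tag: str = "answer") -> str:
    """Keep only the text the model generated before its closing XML tag."""
    closing = f"</{tag}>"
    return generation.split(closing, 1)[0].strip()

print(strip_after_closing_tag("The answer is 42.</answer> Some trailing text"))  # -> "The answer is 42."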

Prepare submission

To create a CSV file in the correct format for a leaderboard submission, we recommend using our conversion script, prepare_submission.py.

Its inputs:

For each task, the predictions should be in a JSON file that is a mapping from an ID to a textual prediction:

{
    "example_id1": "prediction1",
    "example_id2": "prediction2",
    ...
}
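
For instance, a small sketch for writing predictions in this format (the IDs and file name are placeholders; use the example IDs from the test split as keys):

import json

# placeholder predictions keyed by example ID, in the format shown above
predictions = {
    "example_id1": "prediction1",
    "example_id2": "prediction2",
}

with open("gov_report_preds.json", "w") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)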

Please set:

  • {dataset_name}_PREDS_FILE to be the path to a JSON file in the format above containing your predictions for {dataset_name}.
  • OUTPUT_DIR to be the path where you want the submission file to be saved.

Run:

python submission/prepare_submission.py \
--gov_report_file GOV_REPORT_PREDS_FILE \
--summ_screen_file SUMM_SCREEN_FD_PREDS_FILE \
--qmsum_file QMSUM_PREDS_FILE \
--squality_file SQUALITY_PREDS_FILE \
--qasper_file QASPER_PREDS_FILE \
--narrative_qa_file NARRATIVE_QA_PREDS_FILE \
--quality_file QUALITY_PREDS_FILE \
--musique_file MUSIQUE_PREDS_FILE \
--space_digest_file SPACE_DIGEST_PREDS_FILE \
--book_sum_sort_file BOOK_SUM_SORT_PREDS_FILE \
--output_dir OUTPUT_DIR

Verify your submission file is valid

Run:

python submission/verify_submission.py \
--all_predictions SUBMISSION_FILE \
--output_dir OUTPUT_DIR

A valid submission file will result in the following line being printed:

The verification was successful.

Please fix any errors before making your submission.

Leaderboard

The live leaderboard is available on the ZeroSCROLLS website.

Citation

@misc{shaham2023zeroscrolls,
      title={ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding}, 
      author={Uri Shaham and Maor Ivgi and Avia Efrat and Jonathan Berant and Omer Levy},
      year={2023},
      eprint={2305.14196},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you find the ZeroSCROLLS data useful, please also make sure to cite the original dataset papers: [bibtex]


License: MIT

