ZeroSCROLLS

This repository contains code to run inference on the ZeroSCROLLS benchmark.

Setup

Install torch
Install transformers 4.30.2
pip install -r requirements.txt

Load the data

via the 🤗 Datasets (huggingface/datasets) library (recommended):

from datasets import load_dataset

gov_report = load_dataset("tau/zero_scrolls", "gov_report", split="test")
"""
Options are: ["gov_report", "summ_screen_fd", "qmsum", "squality", "qasper", "narrative_qa", "quality", "musique", "space_digest", "book_sum_sort"]
There is also a small number of examples (~20 per task) in a "validation" split, meant for eyeballing purposes
"""

via ZIP files, where each split is in a JSONL file:
- GovReport
- SummScreenFD
- QMSum
- SQuALITY
- Qasper
- NarrativeQA
- QuALITY
- MuSiQue
- SpaceDigest
- BookSumSort
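If you download a ZIP file instead, the JSONL splits can be read with the standard library alone. The sketch below assumes that the archive member for a split ends with the split name followed by `.jsonl` (for example `test.jsonl`); that naming, and the helper name `read_jsonl_from_zip`, are assumptions and should be checked against the actual archive contents.

```python
import io
import json
import zipfile


def read_jsonl_from_zip(zip_path, member_suffix="test.jsonl"):
    """Yield one dict per JSONL line from the first member ending in member_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the split name appears at the end of the member file name.
        matches = [name for name in archive.namelist() if name.endswith(member_suffix)]
        if not matches:
            raise FileNotFoundError(
                f"no member ending in {member_suffix!r} in {zip_path}"
            )
        with archive.open(matches[0]) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

Because the function is a generator, examples are decoded one at a time, which helps with the long inputs in this benchmark.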

Inference with Huggingface models

python experiments/hf/run_hf_model.py --model-name=google/flan-t5-small

Supported models:

google/flan-t5-small
google/flan-t5-base
google/flan-t5-large
google/flan-t5-xl
google/flan-t5-xxl
google/flan-ul2
bigscience/T0pp

To add new models:

Add them to model_to_max_input_tokens in experiments/hf/run_hf_model.py
Make sure to load them with the appropriate architecture (i.e. modify the model initialization from T5ForConditionalGeneration in the same file, if needed)

Inference with APIs

To run with models used in the paper*:

# if you want to use openai models
export OPENAI_API_KEY=<insert token here> 
export OPENAI_ORG=<insert org here>

# if you want to use anthropic models
export ANTHROPIC_API_KEY=<insert token here>

# if you want to limit the number of examples to run per task
export MAX_EXAMPLES=10

python experiments/api/run_api_model.py --model_name=gpt-3.5-turbo --limit_to_n_examples=$MAX_EXAMPLES

*These models and APIs are updated over time; see the paper for the versions used in the baselines.

Models supported:

text-davinci-003
gpt-3.5-turbo
gpt-4
claude-v1

To add a new API, you need to:

Implement a new class that inherits from APIRunner.
Working examples for the OpenAI and Anthropic APIs can be found in openai_api.py and anthropic_api.py.

When using a prompt that includes opening XML tags (e.g. "... Assistant: <answer>"), post-process the generations before submitting: keep only the text that precedes the closing XML tag generated by the model.
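The post-processing step described above can be sketched as follows. This is an illustration, not the repository's code: the function name is hypothetical, the default closing tag `</answer>` is inferred from the example prompt, and stripping surrounding whitespace is an extra assumption.

```python
def keep_prefix_before_closing_tag(generation, closing_tag="</answer>"):
    """Return the text before the first closing tag, or all of it if the tag is absent."""
    # str.partition returns the whole string as the prefix when the tag is missing.
    prefix, _, _ = generation.partition(closing_tag)
    # Assumption: surrounding whitespace is not part of the answer.
    return prefix.strip()
```

For example, a generation of `"42</answer>\n\nHuman: next question"` is reduced to `"42"`, while a generation without the closing tag is returned unchanged apart from the whitespace stripping.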

Prepare submission

To create a CSV file in the correct format for a leaderboard submission, we recommend using our conversion script, prepare_submission.py.

Its inputs:

For each task, the predictions should be in a JSON file that is a mapping from an ID to a textual prediction:

{
    "example_id1": "prediction1",
    "example_id2": "prediction2",
    ...
}
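A predictions file in this format can be written with the standard `json` module. The helper below is a hypothetical sketch: it only checks that IDs and predictions are strings and then writes the mapping as a JSON object.

```python
import json


def save_predictions(predictions, path):
    """Write an {example_id: prediction} mapping to `path` as a JSON object."""
    # Validate first so that a bad entry never produces a partial file.
    for example_id, prediction in predictions.items():
        if not isinstance(example_id, str) or not isinstance(prediction, str):
            raise TypeError(
                f"expected str -> str, got {example_id!r} -> {prediction!r}"
            )
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(predictions, f, ensure_ascii=False, indent=4)
```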

Please set:

{dataset_name}_PREDS_FILE to be the path to a JSON file in the format above containing your predictions for {dataset_name}.
OUTPUT_DIR to be the path where the submission file will be saved.

Run:

python submission/prepare_submission.py \
--gov_report_file GOV_REPORT_PREDS_FILE \
--summ_screen_fd_file SUMM_SCREEN_FD_PREDS_FILE \
--qmsum_file QMSUM_PREDS_FILE \
--squality_file SQUALITY_PREDS_FILE \
--qasper_file QASPER_PREDS_FILE \
--narrative_qa_file NARRATIVE_QA_PREDS_FILE \
--quality_file QUALITY_PREDS_FILE \
--musique_file MUSIQUE_PREDS_FILE \
--space_digest_file SPACE_DIGEST_PREDS_FILE \
--book_sum_sort_file BOOK_SUM_SORT_PREDS_FILE \
--output_dir OUTPUT_DIR
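Since every per-task flag follows the same `--{task}_file` pattern, the command above can also be assembled programmatically. The sketch below assumes, purely for illustration, that all predictions live in one directory as `<task>.json`; that layout and the function name are not part of the repository.

```python
import os
import sys

# Task names as they appear in the flags of the command above.
TASKS = [
    "gov_report", "summ_screen_fd", "qmsum", "squality", "qasper",
    "narrative_qa", "quality", "musique", "space_digest", "book_sum_sort",
]


def build_prepare_submission_command(preds_dir, output_dir):
    """Return the argument list for submission/prepare_submission.py."""
    command = [sys.executable, "submission/prepare_submission.py"]
    for task in TASKS:
        # Assumption: predictions are stored as <preds_dir>/<task>.json.
        command += [f"--{task}_file", os.path.join(preds_dir, f"{task}.json")]
    command += ["--output_dir", output_dir]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.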

Verify your submission file

Run:

python submission/verify_submission.py \
--all_predictions SUBMISSION_FILE \
--output_dir OUTPUT_DIR

A valid submission file will result in the following line printed:

The verification was successful.

Please fix any errors before making your submission.

Leaderboard

The live leaderboard is here.

Citation

@inproceedings{shaham-etal-2023-zeroscrolls,
    title = "{Z}ero{SCROLLS}: A Zero-Shot Benchmark for Long Text Understanding",
    author = "Shaham, Uri  and
      Ivgi, Maor  and
      Efrat, Avia  and
      Berant, Jonathan  and
      Levy, Omer",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.536",
    doi = "10.18653/v1/2023.findings-emnlp.536",
    pages = "7977--7989"
}

If you find the ZeroSCROLLS data useful, please also cite the original dataset papers: [bibtex]
