chenxwh / AVeriTeC



This repository maintains the dataset and baseline described in our paper AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web.


The dataset is formatted as JSON, with each split stored as a separate file in the data folder. Each claim is an object of the following form:

    "claim": "The claim text itself",
    "required_reannotation": "True or False. Denotes that the claim received a second round of QG-QA and quality control annotation.",
    "label": "The annotated verdict for the claim",
    "justification": "A textual justification explaining how the verdict was reached from the question-answer pairs.",
    "claim_date": "Our best estimate for the date the claim first appeared",
    "speaker": "The person or organization that made the claim, e.g. Barrack Obama, The Onion.",
    "original_claim_url": "If the claim first appeared on the internet, a url to the original location",
    "cached_original_claim_url": "Where possible, an link to the original claim url",
    "fact_checking_article": "The fact-checking article we extracted the claim from",
    "reporting_source": "The website or organization that first published the claim, e.g. Facebook, CNN.",
    "location_ISO_code": "The location most relevant for the claim. Highly useful for search.",
    "claim_types": [
            "The types of the claim",
    "fact_checking_strategies": [
        "The strategies employed in the fact-checking article",
    "questions": [
            "question": "A fact-checking question for the claim",
            "answers": [
                    "answer": "The answer to the question",
                    "answer_type": "Whether the answer was abstractive, extractive, boolean, or unanswerable",
                    "source_url": "The source url for the answer",
                    "cached_source_url": "An link for the source url"
                    "source_medium": "The medium the answer appeared in, e.g. web text, a pdf, or an image.",


The official evaluation script is included in this repository. To run the script, please use:

python --predictions your_predictions.json --references data/dev.json


Our baseline is a pipeline consisting of multiple steps. The first step is coarse retrieval, using Google search. Many pages will be downloaded, so please ensure access to a folder (e.g. "your_dataset_store") with plenty of space. Before running the script, you will need access to the Google Search API. You can obtain an API key and a CSE ID here. Please add these to the appropriate variables in coarse_retrieval/

python coarse_retrieval/ --target_file data/dev.json > data/dev.generated_questions.json
python coarse_retrieval/ --averitec_file data/dev.generated_questions.json --store_folder your_dataset_store > search_results.tsv
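
For orientation, the Google Custom Search JSON API takes the API key and the CSE ID as the `key` and `cx` query parameters. The sketch below only builds a request URL and makes no network call. The function name `build_search_url` and the placeholder credentials are illustrative, and the baseline script may issue its queries differently (for example through a client library).

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint of the Google Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, cse_id, start=1, num=10):
    """Build a Custom Search request URL for one page of results."""
    if not 1 <= num <= 10:
        raise ValueError("the API returns at most 10 results per request")
    params = {"key": api_key, "cx": cse_id, "q": query, "start": start, "num": num}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder credentials; substitute your own API key and CSE ID.
    url = build_search_url("example claim text", "YOUR_API_KEY", "YOUR_CSE_ID")
    print(url)
    # Round-trip check: the query survives URL encoding.
    print(parse_qs(urlparse(url).query)["q"])
```

Further result pages are requested by raising `start` in steps of `num`, which is why many pages can accumulate in the store folder.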

The next step is question generation for each retrieved document, followed by reranking. This can be done with:

python retrieval_reranking/ --url_file search_results.tsv --store_folder your_dataset_store > data/dev.bm25_questions.json
python --averitec_file data/dev.bm25_questions.json > data/dev.reranked_questions.json
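
The intermediate file name `dev.bm25_questions.json` suggests that a BM25 scoring pass precedes the final reranking. As an illustration only, here is a self-contained Okapi BM25 scorer in plain Python. The whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the helper names are assumptions, and this is not the baseline's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalisation penalises documents longer than average.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query, documents, top_k=3):
    """Return the top_k documents ordered by descending BM25 score."""
    scores = bm25_scores(query, documents)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in order[:top_k]]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "stock markets fell sharply today",
        "a cat and a dog",
    ]
    print(rerank("stock markets", docs, top_k=1))
```

In a claim-verification setting the query would typically be the claim text and the documents the candidate questions or passages, though the exact pairing used by the baseline is not stated here.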

Then, predict verdicts:

python veracity_prediction/ --averitec_file data/dev.reranked_questions.json > data/dev.verdicts.json

Finally, predict justifications:

python justification_production/ --averitec_file data/dev.verdicts.json > data/dev.verdicts_and_justifications.json
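
Each step above writes its result to standard output, which the shell redirects into the input file of the next step. A minimal sketch of the same chaining from Python follows. `run_stage` and `run_pipeline` are hypothetical helpers, and the stand-in commands only print a line, so substitute the script invocations shown above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_stage(command, output_path):
    """Run one command and write its stdout to output_path, like `cmd > file`."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        # check=True raises CalledProcessError as soon as a stage fails.
        subprocess.run(command, stdout=handle, check=True)
    return output_path


def run_pipeline(stages):
    """Run (command, output_path) pairs in order; return the last output path."""
    last_output = None
    for command, output_path in stages:
        last_output = run_stage(command, output_path)
    return last_output


if __name__ == "__main__":
    # Stand-in stages that only print a line each.
    with tempfile.TemporaryDirectory() as folder:
        stages = [
            ([sys.executable, "-c", "print('stage one')"], Path(folder) / "one.txt"),
            ([sys.executable, "-c", "print('stage two')"], Path(folder) / "two.txt"),
        ]
        final = run_pipeline(stages)
        print(final.read_text(encoding="utf-8"))
```

Stopping on the first failing stage avoids feeding an empty or truncated file into the next step, which plain shell redirection would not prevent.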


If you used our dataset or code, please cite our paper as:

    @article{schlichtkrull2023averitec,
      title={AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web},
      author={Michael Schlichtkrull and Zhijiang Guo and Andreas Vlachos},
      journal={arXiv preprint arXiv:2305.13117},
      year={2023}
    }


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


