This repository contains a script designed to parse the Natural Questions dataset. The Natural Questions dataset comprises entries annotated with start and end byte offsets for each answer. Our script, main.py
, performs two primary functions:
- Extraction: It extracts content from the HTML page associated with each dataset entry.
- Transformation: It transforms the annotations into a text format for easier accessibility and understanding.
The script converts entries in the dataset from their original format to a more friendly format. For example, the entry in example.json
is transformed into the one presented below:
{
"id": "4549465242785278785",
"question": "when is the last episode of season 8 of the walking dead",
"document": "The Walking Dead (season 8) - Wikipedia The Walking Dead (...)",
"candidates": [
"The Walking Dead (season 8)",
"Promotional poster",
"(...)"
],
"long_answer": "No. overall No. in season (...) March 18, 2018 (...)",
"short_answers": [["March 18, 2018"]],
"yes_no_answer": [-1]
}
- Clone the repository to your local machine.
- Create a virtual environment:
python3.11 -m venv .venv
- Activate the environment:
source .venv/bin/activate
- Install dependencies:
pip install datasets==2.15.0
- Run the
main.py
script:python main.py
- Runtime: Please note that the script takes approximately one day to run, depending on your system's capabilities.
- Output: The results of the parsing are published on the HuggingFace Hub.
Contributions to this project are welcome! If you have suggestions for improvement or have identified issues, please submit a pull request or open an issue in this repository.