Linked-DocRED – Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines
This repository contains the dataset, the source code for the entity-linking annotation, the baseline, and the metrics for the paper Linked-DocRED – Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines.
The dataset is located in the Linked-DocRED/ folder. We hope Linked-DocRED will help the community discover and develop better-performing IE pipelines.
We also propose the alternative dataset Linked-Re-DocRED, located in the Linked-Re-DocRED/ folder. It is based on Re-DocRED [1], which is an improved version of DocRED.
Information Extraction (IE) pipelines aim to extract meaningful entities and relations from documents and structure them into a knowledge graph that can then be used in downstream applications. Training and evaluating such pipelines requires a dataset annotated with entities, coreferences, relations, and entity-linking. However, existing datasets lack entity-linking labels, are too small, are not diverse enough, or are automatically annotated (that is, without a strong guarantee that the annotations are correct).
Therefore, we propose Linked-DocRED, to the best of our knowledge the first manually annotated, large-scale, document-level IE dataset. We enhance the existing and widely used DocRED [2] dataset with entity-linking labels generated by a semi-automatic process that guarantees high-quality annotations. In particular, we use hyperlinks in Wikipedia articles to provide disambiguation candidates. The dataset is located in the Linked-DocRED folder. The source code for the disambiguation is accessible at Disambiguation Process.
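To give an intuition for how Wikipedia hyperlinks can supply disambiguation candidates, here is a minimal sketch. It is not the code from this repository: the function names `build_anchor_index` and `candidates_for`, the toy wikitext, and the simple regular expression are all assumptions made for illustration, and the actual process is documented in Disambiguation Process.

```python
import re
from collections import defaultdict

# Matches MediaWiki internal links: [[Target]] or [[Target|anchor text]].
# Simplified on purpose: it ignores templates, files, and nested links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def build_anchor_index(wikitext):
    """Map each lower-cased anchor text to the set of page titles it links to."""
    index = defaultdict(set)
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # When no explicit anchor is given, the page title is the anchor.
        anchor = (match.group(2) or target).strip()
        index[anchor.lower()].add(target)
    return index


def candidates_for(mention, index):
    """Return the sorted candidate page titles for a mention (hypothetical helper)."""
    return sorted(index.get(mention.lower(), set()))


# Toy wikitext, invented for this example.
sample = (
    "[[Paris]] is the capital of [[France]]. "
    "The film is set in [[Paris, Texas|Paris]], not in the [[Paris|French capital]]."
)
index = build_anchor_index(sample)
print(candidates_for("Paris", index))
```

In this sketch, an ambiguous mention such as "Paris" ends up with several candidate pages, which is the situation a human annotator or a later ranking step would then have to resolve.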
We also propose a complete framework of metrics to benchmark end-to-end IE pipelines, and we define an entity-centric metric to evaluate entity-linking (see Metrics).
The evaluation of a baseline shows promising results while highlighting the challenges of an end-to-end IE pipeline (see Baseline).
- Linked-DocRED data and format
- Metrics
- Baseline
- DocRED disambiguation process
- Linked-Re-DocRED data and format
Linked-DocRED, the source code for the baseline, the metrics, and the disambiguation process are licensed under the GPLv3 License. For more details, please refer to the LICENSE.md file.
Linked-DocRED
Copyright (C) 2023 Alteca.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
If you have questions about using Linked-DocRED, please e-mail us at pygenest@alteca.fr.
If you make use of Linked-DocRED or this code in your work, please cite the following paper:
Genest, Pierre-Yves, Pierre-Edouard Portier, Előd Egyed-Zsigmond, and Martino Lovisetto. “Linked-DocRED – Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines.” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’23), 11. Taipei, Taiwan: Association for Computing Machinery, 2023. https://doi.org/10.1145/3539618.3591912.
@inproceedings{10.1145/3539618.3591912,
title = {{Linked-DocRED – Enhancing DocRED with Entity-Linking to Evaluate End-To-End Document-Level Information Extraction Pipelines}},
booktitle = {{Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'23)}},
author = {Genest, Pierre-Yves and Portier, Pierre-Edouard and Egyed-Zsigmond, El\H{o}d and Lovisetto, Martino},
year = {2023},
pages = {11},
publisher = {{Association for Computing Machinery}},
address = {{Taipei, Taiwan}},
doi = {10.1145/3539618.3591912},
isbn = {978-1-4503-9408-6},
}