hmosousa / professor_heideltime

Create a multilingual corpus weakly labeled with HeidelTime.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Professor HeidelTime

Paper DOI License Download

Professor HeidelTime is a project to create a multilingual corpus weakly labeled with HeidelTime, a temporal tagger.

Getting Started

Download the Annotated Data

To download the Professor HeidelTime corpus, click on the following link: Professor HeidelTime corpus.

The downloaded archive contains six folders, each representing a different language. Inside each folder, there is one .json file for each annotated news article. The English, Italian, German, and French files contain text, dct, and timexs keys. However, due to licensing issues, the Portuguese and Spanish corpus files currently lack the text key. We are actively working with news sources to license these datasets for redistribution.

In the meantime, you can access the texts by running the following scrapping scripts: Spanish and Portuguese.

Corpus Details

The weak labeling was performed in six languages. Here are the specifics of the corpus for each language:

Dataset Language Documents From To Tokens Timexs
[All the News 2.0] EN 24,642 2016-01-01 2020-04-02 18,755,616 254,803
[Italian Crime News] IT 9,619 2011-01-01 2021-12-31 3,296,898 58,823
[ElMundo News] ES 33,266 2003-01-01 2022-12-31 21,617,888 348,011
[German News Dataset] DE 19,095 2005-12-02 2021-10-18 12,515,410 194,043
[French Financial News] FR 27,154 2017-10-19 2021-03-19 1,673,053 83,431
[Público News] PT 24,293 2000-11-14 2002-03-20 5,929,377 111,810

Running Annotations

Set up Development Environment

To start with, set up a virtual environment and activate it. Then, install the necessary packages from the requirements file:

virtualenv venv --python=python3.10
source venv/bin/activate
pip install -r requirements.txt

Run pytest to ensure that everything is working correctly: python -m pytest tests

Kaggle API Key

To add the Kaggle API keys to your machine, follow the instructions provided on kaggle-api.

Download Raw Data

You can download the raw data by executing the following command:

sh data/download.sh

Execute the Annotation

To run the annotation, use the following command (replace 'english' with the language you want to annotate):

python src/run.py --language english

Contact

For more information, reach out to Hugo Sousa at hugo.o.sousa@inesctec.pt.

This framework is a part of the Text2Story project. This project is financed by the ERDF – European Regional Development Fund through the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 and by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within project PTDC/CCI-COM/31857/2017 (NORTE-01-0145-FEDER-03185).

Cite

If you use this work, please cite the following paper:

@inproceedings{10.1145/3583780.3615130,
    author = {Sousa, Hugo and Campos, Ricardo and Jorge, Al\'{\i}pio},
    title = {TEI2GO: A Multilingual Approach for Fast Temporal Expression Identification},
    year = {2023},
    isbn = {9798400701245},
    publisher = {Association for Computing Machinery},
    url = {https://doi.org/10.1145/3583780.3615130},
    doi = {10.1145/3583780.3615130},
    booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
    pages = {5401–5406},
    numpages = {6},
    keywords = {temporal expression identification, multilingual corpus, weak label},
    location = {Birmingham, United Kingdom},
    series = {CIKM '23}
}

About

Create a multilingual corpus weakly labeled with HeidelTime.

License:MIT License


Languages

Language:Python 90.7%Language:Shell 9.3%