qanastek / MORFITT

MORFITT: A multi-label topic classification for French Biomedical literature

MORFITT

Data (Zenodo) | Publication (arXiv / HAL / ACL Anthology)

Yanis LABRAK, Richard DUFOUR, Mickaël ROUVIER

We introduce MORFITT, the first multi-label corpus in French for the classification of medical specialties. MORFITT is composed of 3,624 abstracts of scientific articles from PubMed, annotated with 12 specialties. The paper details the corpus, the experiments, and the preliminary results obtained with a classifier based on the pre-trained language model CamemBERT.

For more details, please refer to our paper:

MORFITT: A multi-label topic classification for French Biomedical literature (arXiv / HAL / ACL Anthology)

Key Features

Documents distribution

Train    Dev      Test
1,514    1,022    1,088

Multi-label distribution

Specialty        Train    Dev      Test     Total
Vétérinaire      320      250      254      824
Étiologie        317      202      222      741
Psychologie      255      175      179      609
Chirurgie        223      169      157      549
Génétique        207      139      159      505
Physiologie      217      125      148      490
Pharmacologie    112      84       103      299
Microbiologie    115      72       86       273
Immunologie      106      86       70       262
Chimie           94       53       65       212
Virologie        76       57       67       200
Parasitologie    68       34       50       152
Total            2,110    1,446    1,560    5,116
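
As a quick sanity check on the two tables above, the sketch below recomputes the per-split label totals and derives the average number of labels per document for each split. All figures are copied from the tables; the variable names are illustrative and are not part of the MORFITT code.

```python
# Per-specialty label counts copied from the table above: (train, dev, test).
label_counts = {
    "Vétérinaire": (320, 250, 254),
    "Étiologie": (317, 202, 222),
    "Psychologie": (255, 175, 179),
    "Chirurgie": (223, 169, 157),
    "Génétique": (207, 139, 159),
    "Physiologie": (217, 125, 148),
    "Pharmacologie": (112, 84, 103),
    "Microbiologie": (115, 72, 86),
    "Immunologie": (106, 86, 70),
    "Chimie": (94, 53, 65),
    "Virologie": (76, 57, 67),
    "Parasitologie": (68, 34, 50),
}

# Number of documents per split, from the documents distribution table.
documents = {"train": 1514, "dev": 1022, "test": 1088}

# Sum each column of the label table to get the label total per split.
splits = list(documents)
label_totals = {
    split: sum(counts[i] for counts in label_counts.values())
    for i, split in enumerate(splits)
}

# Average number of labels per document in each split.
for split in splits:
    average = label_totals[split] / documents[split]
    print(f"{split}: {label_totals[split]} labels over "
          f"{documents[split]} documents = {average:.2f} labels per document")
```

Across all splits this gives 5,116 labels over 3,624 documents, or roughly 1.4 labels per document.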

Number of labels per document distribution

[Figure: distribution of the number of labels per document]

Co-occurrences distribution

[Figure: distribution of label co-occurrences]
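
The two distributions named in this section can be computed from per-document label sets along the following lines. This is a minimal sketch: the toy annotations are hypothetical and are not taken from MORFITT, and the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: one set of specialties per document.
# These are NOT MORFITT records; they only illustrate the computation.
toy_documents = [
    {"Virologie", "Immunologie"},
    {"Vétérinaire"},
    {"Virologie", "Immunologie", "Microbiologie"},
    {"Chirurgie", "Physiologie"},
]


def labels_per_document(docs):
    """Count how many documents carry 1, 2, 3, ... labels."""
    return Counter(len(labels) for labels in docs)


def label_cooccurrences(docs):
    """Count how often each unordered pair of labels appears together."""
    pairs = Counter()
    for labels in docs:
        # Sorting makes each pair order-independent, e.g. (A, B) never (B, A).
        pairs.update(combinations(sorted(labels), 2))
    return pairs


print(labels_per_document(toy_documents))
print(label_cooccurrences(toy_documents).most_common(3))
```

To apply this to the real corpus, replace toy_documents with the label sets read from the dataset splits.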

If you use the HuggingFace Datasets library

from datasets import load_dataset
dataset = load_dataset("qanastek/MORFITT")
print(dataset)

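or, to load the splits from local TSV files:

from datasets import load_dataset

# Load the three splits from tab-separated files in the current directory.
dataset_base = load_dataset(
    "csv",
    data_files={
        "train": "./train.tsv",
        "validation": "./dev.tsv",
        "test": "./test.tsv",
    },
    delimiter="\t",
)
print(dataset_base)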
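
Once the data is loaded, a multi-label setup typically needs each document's specialties as a fixed-length 0/1 vector, and needs model outputs turned back into a set of specialties. The sketch below shows one common way to do both, under stated assumptions: the specialty names are taken from the table in this README and may not match the label strings stored in the dataset files, the helper names are hypothetical, and the 0.5 decision threshold is an arbitrary default. It does not reproduce the CamemBERT-based classifier from the paper.

```python
import math

# The 12 specialties, named as in the README table. ASSUMPTION: the label
# strings in the actual dataset files may differ from these.
SPECIALTIES = [
    "Vétérinaire", "Étiologie", "Psychologie", "Chirurgie",
    "Génétique", "Physiologie", "Pharmacologie", "Microbiologie",
    "Immunologie", "Chimie", "Virologie", "Parasitologie",
]
INDEX = {name: i for i, name in enumerate(SPECIALTIES)}


def to_multi_hot(labels):
    """Encode a collection of specialty names as a 12-dimensional 0/1 vector."""
    vector = [0.0] * len(SPECIALTIES)
    for name in labels:
        vector[INDEX[name]] = 1.0
    return vector


def decode(logits, threshold=0.5):
    """Apply a sigmoid to each logit and keep the labels at or above threshold."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [name for name, p in zip(SPECIALTIES, probabilities) if p >= threshold]


target = to_multi_hot(["Vétérinaire", "Virologie"])
print(target)

# Hypothetical logits: strongly negative everywhere except "Virologie".
logits = [-5.0] * len(SPECIALTIES)
logits[INDEX["Virologie"]] = 5.0
print(decode(logits))
```

Each label gets its own independent sigmoid, so a document can receive zero, one, or several specialties, which is what distinguishes this from single-label (softmax) classification.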

License and Citation

The code is released under the Apache-2.0 License.

The MORFITT dataset is licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). If you find this project useful in your research, please cite the following paper:

Yanis Labrak, Mickaël Rouvier, Richard Dufour. MORFITT : A multi-label corpus of French scientific articles in the biomedical domain. 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) Atelier sur l'Analyse et la Recherche de Textes Scientifiques, Florian Boudin, Jun 2023, Paris, France. ⟨hal-04125879⟩

or use the BibTeX entry:

@inproceedings{labrak:hal-04125879,
  TITLE = {{MORFITT : A multi-label corpus of French scientific articles in the biomedical domain}},
  AUTHOR = {Labrak, Yanis and Rouvier, Micka{\"e}l and Dufour, Richard},
  URL = {https://hal.science/hal-04125879},
  BOOKTITLE = {{30e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN) Atelier sur l'Analyse et la Recherche de Textes Scientifiques}},
  ADDRESS = {Paris, France},
  ORGANIZATION = {{Florian Boudin}},
  YEAR = {2023},
  MONTH = Jun,
  KEYWORDS = {BERT ; RoBERTa ; Transformers ; Biomedical ; Clinical ; Topics ; multi-labels ; BERT ; RoBERTa ; Transformers ; Biom{\'e}dical ; Clinique ; Sp{\'e}cialit{\'e}s ; multi-labels},
  PDF = {https://hal.science/hal-04125879/file/_ARTS___TALN_RECITAL_2023__MORFITT__Multi_label_topic_classification_for_French_Biomedical_literature%20%285%29.pdf},
  HAL_ID = {hal-04125879},
  HAL_VERSION = {v1},
}
