This project aims to:
- fine-tune pre-trained BERT models for named-entity recognition (NER) on Italian data from the WikiNEuRal dataset
- test the fine-tuned BERT models for NER on:
  - sentences from the validation set of the Italian WikiNEuRal dataset
  - sentences transcribed by the pre-trained wav2vec2-xls-r model fine-tuned on Italian data for ASR (see the sketch after this list)
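
As a hedged sketch of the second test path: transcribe Italian speech with a wav2vec2-xls-r ASR checkpoint, then tag the transcript with the fine-tuned NER model. The model identifiers and the audio file below are placeholders, not checkpoints confirmed by this project; note that CTC-based ASR output is typically uncased and unpunctuated, unlike the cased, punctuated WikiNEuRal sentences.

```python
# Hedged sketch of the ASR -> NER test path; model identifiers and audio
# path are placeholders, not checkpoints confirmed by this project.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="path/to/wav2vec2-xls-r-italian")       # placeholder
ner = pipeline("token-classification",
               model="path/to/fine-tuned-bert-ner-italian",  # placeholder
               aggregation_strategy="simple")

transcript = asr("sample_it.wav")["text"]  # placeholder audio file
print(ner(transcript))
```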
WikiNEuRal IT comprises 111k sentences from Wikipedia, tokenized and NER-tagged. The dataset is organized into three splits: train, validation, and test. The sentences are cased and contain punctuation. The entity categories are encoded as follows:
```python
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```
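
A minimal loading sketch, assuming the dataset is the Babelscape/wikineural release on the Hugging Face Hub; the split names (`train_it`, `val_it`, `test_it`) follow that release and are an assumption, not something stated in this README:

```python
# Minimal loading sketch, assuming the Babelscape/wikineural release on the
# Hugging Face Hub; split names like "train_it" / "val_it" are an assumption.
from datasets import load_dataset

dataset = load_dataset("Babelscape/wikineural")

# The label encoding shown above, as id/label mappings for the model config.
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4,
            "B-LOC": 5, "I-LOC": 6, "B-MISC": 7, "I-MISC": 8}
id2label = {i: label for label, i in label2id.items()}

# Each example carries pre-tokenized words and their integer NER tags.
example = dataset["train_it"][0]
print(example["tokens"], [id2label[t] for t in example["ner_tags"]])
```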
The pre-trained BERT models:
- bert-base-multilingual-cased, pre-trained on the 104 languages with the largest Wikipedias.
- bert-base-italian-cased, pre-trained on Italian Wikipedia texts and OPUS corpora, for a total corpus size of 13 GB.
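
A minimal fine-tuning sketch with the Hugging Face Trainer, reusing the dataset and label mappings loaded above; the model choice and hyperparameters are illustrative, not the project's actual settings:

```python
# Fine-tuning sketch; hyperparameters are illustrative, not the project's.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

model_name = "bert-base-multilingual-cased"  # or "dbmdz/bert-base-italian-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=9, id2label=id2label, label2id=label2id)

def tokenize_and_align(batch):
    # Re-tokenize the pre-split words and align NER tags to sub-word pieces;
    # special tokens and continuation pieces get the ignored label -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        labels.append(row)
    enc["labels"] = labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(output_dir="bert-ner-it", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train_it"],
                  eval_dataset=tokenized["val_it"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```

The -100 label on special tokens and sub-word continuations is the standard convention for token classification: it tells the cross-entropy loss to ignore those positions, so each word is scored exactly once.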