This project aims to:
- fine-tune pre-trained BERT models for named-entity recognition (NER) on Italian data from the WikiNEuRal dataset
- test the fine-tuned BERT models for NER on:
  - sentences from the validation set of the Italian WikiNEuRal dataset
  - sentences transcribed by the pre-trained wav2vec2-xls-r model fine-tuned on Italian data for ASR (see the sketch after this list)
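
As a hedged sketch of the second test path: transcribe Italian speech with a wav2vec2-xls-r ASR checkpoint, then tag the transcript with the fine-tuned NER model. The model identifiers and the audio file below are placeholders, not checkpoints confirmed by this project; note that CTC-based ASR output is typically uncased and unpunctuated, unlike the cased, punctuated WikiNEuRal sentences.

```python
# Hedged sketch of the ASR -> NER test path; model identifiers and audio
# path are placeholders, not checkpoints confirmed by this project.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="path/to/wav2vec2-xls-r-italian")       # placeholder
ner = pipeline("token-classification",
               model="path/to/fine-tuned-bert-ner-italian",  # placeholder
               aggregation_strategy="simple")

transcript = asr("sample_it.wav")["text"]  # placeholder audio file
print(ner(transcript))
```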
WikiNEuRal IT comprises 111k sentences from Wikipedia, tokenized and NER-tagged. The dataset is organized into three splits: train, validation, and test. The sentences are cased and contain punctuation. The entity categories are encoded as follows:
```python
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```
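
A minimal loading sketch, assuming the dataset is the Babelscape/wikineural release on the Hugging Face Hub; the split names (`train_it`, `val_it`, `test_it`) follow that release and are an assumption, not something stated in this README:

```python
# Minimal loading sketch, assuming the Babelscape/wikineural release on the
# Hugging Face Hub; split names like "train_it" / "val_it" are an assumption.
from datasets import load_dataset

dataset = load_dataset("Babelscape/wikineural")

# The label encoding shown above, as id/label mappings for the model config.
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4,
            "B-LOC": 5, "I-LOC": 6, "B-MISC": 7, "I-MISC": 8}
id2label = {i: label for label, i in label2id.items()}

# Each example carries pre-tokenized words and their integer NER tags.
example = dataset["train_it"][0]
print(example["tokens"], [id2label[t] for t in example["ner_tags"]])
```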
The pre-trained BERT models:
- bert-base-multilingual-cased, pre-trained on the 104 languages with the largest Wikipedias.
- bert-base-italian-cased, pre-trained on Italian Wikipedia texts and OPUS corpora, for a total corpus size of 13 GB.
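
A minimal fine-tuning sketch with the Hugging Face Trainer, reusing the dataset and label mappings loaded above; the model choice and hyperparameters are illustrative, not the project's actual settings:

```python
# Fine-tuning sketch; hyperparameters are illustrative, not the project's.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

model_name = "bert-base-multilingual-cased"  # or "dbmdz/bert-base-italian-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=9, id2label=id2label, label2id=label2id)

def tokenize_and_align(batch):
    # Re-tokenize the pre-split words and align NER tags to sub-word pieces;
    # special tokens and continuation pieces get the ignored label -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        labels.append(row)
    enc["labels"] = labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True)

args = TrainingArguments(output_dir="bert-ner-it", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train_it"],
                  eval_dataset=tokenized["val_it"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```

The -100 label on special tokens and sub-word continuations is the standard convention for token classification: it tells the cross-entropy loss to ignore those positions, so each word is scored exactly once.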