Alberto-00/Estrazione-Automatica-di-Informazioni-da-Testi

english english-language english-learning natural-language-processing nlp nlp-machine-learning roberta-large roberta-model roberta-tokenizer spacy

1 Introduction

1.1 Problem

More and more people exchange text messages through social media, and analyzing this information makes it possible to derive statistics about people's behavior and psychology. Using Natural Language Processing (NLP), we can extract the keywords from each message that serve these goals. This document describes the development of an Automatic Information Extraction system for English-language text messages, built with the spaCy library, which provides a set of pre-trained models for Named Entity Recognition (NER). In this case the model considered is RoBERTa, which is analyzed in the following paragraphs.

1.2 Workflow

The first step was to identify a dataset for the task introduced in the previous paragraph. The dataset used is SMS-NER-Dataset-165-Annotations, available on Kaggle. Next, the data was cleaned to ensure a uniform representation. The cleaned dataset was then divided into training and test sets and converted to the .spacy format so that it could be processed by the chosen model. After that, the config.cfg file was generated: a configuration file containing all the hyperparameters and settings the model must follow. The training set was then given as input to the pre-trained model, and two models were saved as output:

model-last: the model from the last training iteration (it can be used to resume training later);
model-best: the model that scored highest on the test dataset.

Finally, the Precision, Recall and F1-Score metrics were reported. To find the best-performing setup for the information extraction task, three pre-trained models were trained and compared on tag-prediction accuracy. The models used were: 1. en_core_web_sm; 2. en_core_web_md; 3. en_core_web

2 Approach

This section covers the implementation. In particular, it discusses the structure of the dataset and the configuration files.

2.1 Dataset

The dataset is in JSON format and is structured as follows:

"classes": contains the list of tags to be identified within the messages: "MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE";
"annotations": contains the list of messages and the entities for each message;

"entities": each entity list is an array of tuples, where each tuple holds two integers and a tag (the integers are the start and end positions of the phrase the tag applies to, e.g. [19, 26, "TRANSAC"]).

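As a rough illustration of this layout, the sketch below parses a tiny in-memory sample and prints the text covered by each entity. The sample message is invented, and the shape of each annotation (a [text, {"entities": [...]}] pair with character offsets where the end is exclusive) is an assumption, not something confirmed by the dataset description above.

```python
import json

# Hypothetical sample mirroring the structure described above; the message
# text and the offsets are invented for illustration only.
SAMPLE = """
{
  "classes": ["MONEY", "TITLE", "OTP", "TRANSAC", "TIME", "PURPOSE"],
  "annotations": [
    ["Your OTP is 482913 for payment of Rs 500",
     {"entities": [[12, 18, "OTP"], [34, 40, "MONEY"]]}]
  ]
}
"""


def entity_texts(text, entities):
    """Return (label, substring) pairs, assuming [start, end) character offsets."""
    return [(label, text[start:end]) for start, end, label in entities]


data = json.loads(SAMPLE)
for text, info in data["annotations"]:
    for label, span in entity_texts(text, info["entities"]):
        print(f"{label}: {span!r}")
```

Printing the covered substring next to its tag is a quick way to spot offsets that are shifted by one character, a common problem in hand-annotated data.
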
Next, the dataset is divided into two parts: a training set and a test set. If a message has an empty entity list, the list is filled with the tuple [(0, 0, 'PEARSON')].
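
The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names fill_empty and train_test_split, the 80/20 ratio, the fixed seed, and the sample messages are all hypothetical, since the split ratio actually used is not stated here.

```python
import random

PLACEHOLDER = (0, 0, "PEARSON")  # placeholder tuple quoted in the text above


def fill_empty(annotations):
    """Give every message without entities the placeholder entity."""
    filled = []
    for text, info in annotations:
        entities = [tuple(e) for e in info.get("entities", [])] or [PLACEHOLDER]
        filled.append((text, {"entities": entities}))
    return filled


def train_test_split(annotations, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it; ratio and seed are assumptions."""
    items = list(annotations)
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_ratio))
    return items[n_test:], items[:n_test]


# Invented messages, only to exercise the two helpers.
messages = [
    ("Paid Rs 500 to Amazon", {"entities": [[5, 11, "MONEY"]]}),
    ("See you tomorrow", {"entities": []}),
    ("Your OTP is 482913", {"entities": [[12, 18, "OTP"]]}),
    ("Call me back", {"entities": []}),
    ("Meeting at 5 pm", {"entities": [[11, 15, "TIME"]]}),
]
train, test = train_test_split(fill_empty(messages))
print(len(train), "training messages,", len(test), "test messages")
```

Fixing the seed keeps the split reproducible, so that models trained in different runs are evaluated on the same test messages.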

2.2 Configuration File

The SMS-NER-Dataset-165-Annotations folder contains the base_config.cfg configuration file, used to set up the model that will be trained on the dataset described above. To fill in the complete model configuration, we run the command:

python -m spacy init fill-config dataset/SMS-NER-Dataset-165-Annotations/base_config.cfg config.cfg

After that, training and then testing are started by running the command:

python -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev test.spacy --gpu-id 0
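
The train.spacy and test.spacy files passed to the training command can be produced with spaCy's DocBin class. The following is a minimal sketch, not the authors' actual conversion code: the helper names usable_spans and write_docbin are hypothetical, and annotations are assumed to be (text, {"entities": [...]}) pairs with character offsets. Zero-length spans (such as the placeholder tuple) and overlapping spans are simply skipped here, since spaCy does not accept overlapping entities; how the original notebook treats them is not stated.

```python
def usable_spans(text, entities):
    """Keep spans that are non-empty, inside the text, and non-overlapping."""
    kept, last_end = [], 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        if 0 <= start < end <= len(text) and start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return kept


def write_docbin(annotations, path, lang="en"):
    """Serialize (text, {"entities": [...]}) pairs to a .spacy file."""
    # Imported here so the pure helper above can be used without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, info in annotations:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in usable_spans(text, info["entities"]):
            span = doc.char_span(start, end, label=label)
            if span is not None:  # offsets that do not align with tokens are skipped
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    doc_bin.to_disk(path)


# Example call (requires spaCy); `train` is a hypothetical list of annotations:
# write_docbin(train, "train.spacy")
```

Because char_span returns None for offsets that do not fall on token boundaries, it can be worth counting the skipped spans to see how much annotation is lost in the conversion.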

To conclude, we print the metrics produced by the best model by running the command:

python -m spacy benchmark accuracy model/large/model-best model/large/test.spacy --output --code --gold-preproc --gpu-id 0 --displacy-path model/large
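
To make the reported numbers concrete, the sketch below computes precision, recall and F1 for a single message by exact match on (start, end, label) spans. It is an illustration only: the function name span_prf and the spans are hypothetical, and spaCy's own scorer aggregates counts over the whole test set rather than per message.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented spans for one message: two of the three predictions match the gold spans.
gold = [(12, 18, "OTP"), (34, 40, "MONEY"), (45, 50, "TIME")]
pred = [(12, 18, "OTP"), (34, 40, "MONEY"), (45, 50, "PURPOSE")]
p, r, f = span_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

A prediction with the right boundaries but the wrong tag counts as both a false positive and a false negative, which is why a single mislabeled span lowers precision and recall together.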

3 Report

The report can be found at the following link: Report.

4 Author & Contacts

Name	Description
Alberto Montefusco	Developer: Alberto-00; Email: a.montefusco28@studenti.unisa.it; LinkedIn: Alberto Montefusco; Website: alberto-00.github.io
Alessandro Aquino	Developer: AlessandroUnisa; Email: a.aquino33@studenti.unisa.it; LinkedIn: Alessandro Aquino
Mattia d'Argenio	Developer: mattiadarg; Email: m.dargenio5@studenti.unisa.it; LinkedIn: Mattia d'Argenio
