Concept Tagging Module for Movie Domain using the NL-SPARQL Data Set

Concept Tagging for the Movie Domain by using Entity Recognition Tools

Brief Description

This work focuses on the well-known task of concept tagging of sentences. Concept tagging is the starting point for more complex techniques and represents an important challenge when building Spoken Language Understanding (SLU) applications. This repository contains the mid-term project for the Language Understanding System course at the University of Trento: the code for building two SLU modules that use Entity Recognition tools to tag phrases from the movie domain with concepts. Code for evaluating the various models is also provided.
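
As an illustration (this example is not taken from the data set, and the tag names are only indicative of the NL2SparQL4NLU inventory), a concept-tagged sentence in the movie domain has one token per line with its IOB-style concept label in the second column:

who	O
directed	O
the	B-movie.name
godfather	I-movie.name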

You can find a detailed report about this project here.

Requirements and Install

This project was developed on Ubuntu 18.04 with Python 3.6, conda, and the OpenFST and OpenGRM libraries. Please make sure everything is installed before using the various scripts. I suggest setting up a virtual environment to develop and test this code more freely.

To create the environment and install the dependencies, please follow these steps:

git clone https://github.com/geektoni/concept_tagging_NLP
cd concept_tagging_NLP
conda create --name ctnlp --file requirements.txt
conda activate ctnlp
python -m spacy download en_core_web_sm
git submodule update --init
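
To check that the native dependencies are visible from the environment (a quick, optional sanity check that assumes OpenFST and OpenGRM were installed system-wide and expose their standard command-line binaries), you can run something like:

for tool in fstcompile ngramcount; do command -v "$tool" >/dev/null || echo "$tool not found in PATH"; done
python -c "import spacy, nltk; print('Python dependencies OK')"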

Usage

To run the experiments we use the make utility. More specifically, to run the SLU model with the default settings, just run:

cd concept_tagging_NLP
conda activate ctnlp
make evaluate

Inside the evaluation_results directory you will find the evaluation results over the test set.

To run the complete hyperparameter search procedure, follow these instructions:

cd concept_tagging_NLP
conda activate ctnlp
bash concept_tagging_tuning.sh

This will generate a file called complete_results.txt containing all the evaluations. Be aware that the entire procedure takes quite some time.
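
The script essentially performs a grid search over the Makefile options described in the Advanced Usage section below. A minimal sketch of that kind of loop (illustrative only; the actual concept_tagging_tuning.sh may use different parameter ranges and output handling) looks like this:

# Illustrative grid search; parameter ranges and output handling are assumptions.
for METHOD in kneser_ney witten_bell absolute katz; do
  for NC in 2 3 4; do
    for ER in none spacy nltk; do
      echo "METHOD=$METHOD NC=$NC ER=$ER" >> complete_results.txt
      make METHOD=$METHOD NC=$NC ER=$ER \
           OUTPUT_DIR=./evaluation_results/$METHOD-$NC-$ER evaluate >> complete_results.txt 2>&1
    done
  done
done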

By calling make help, a help message will be printed.

Advanced Usage

Several options can be changed in order to run the model with different hyperparameters. More specifically, the available options are:

Option | Description | Possible values
NC | N-gram order. | Integer greater than 1; 4 (default).
ER | Entity Recognition tool to use. | spacy, nltk, none (default).
METHOD | Smoothing method used. | witten_bell, absolute, katz, kneser_ney (default), presmoothed, unsmoothed.
PRUNE_TRESH | Pruning threshold. | Integer greater than 1; 5 (default).
OUTPUT_DIR | Path to the directory where the final results are saved. | ./evaluation_results (default).
REPLACE | Method used to replace the "O" concepts. | word, lemma, stem, keep (default).
TRAIN_DATASET | Path to the training data set. | NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.conll.txt (default).
TEST_DATASET | Path to the test data set. | NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.conll.txt (default).
VERBOSE | Emit more messages when running the model. | ON (default), OFF.

As an example, imagine we want to run the model using spaCy as the Entity Recognition tool and the Witten-Bell smoothing method, replacing the "O" concepts with the lemma of the corresponding token, and making the entire procedure more verbose. The command would be:

cd concept_tagging_NLP
conda activate ctnlp
make ER=spacy METHOD=witten_bell REPLACE=lemma VERBOSE=ON evaluate

License

This work is released under the MIT License. Please have a look at the License file.

Author(s)

Giovanni De Toni - giovanni.detoni@studenti.unitn.it
