RaffaeleGalliera / pytorch-cnn-text-classification

Convolutional Neural Network (CNN) for text classification implemented with PyTorch and TorchText

CNNs applied to text classification

Work-in-progress repository that implements multi-class text classification with a CNN (Convolutional Neural Network) for a Deep Learning university exam, using PyTorch 1.3, TorchText 0.5 and Python 3.7. It also integrates TensorboardX, a module that writes PyTorch data to TensorFlow's TensorBoard (a web server that serves visualizations of a neural network's training progress). TensorboardX makes it possible to visualize embeddings, PR curves and Loss/Accuracy curves.

I started this project following this awesome tutorial, which shows very clearly how to perform sentiment analysis with PyTorch. The convolutional approach to sentence classification takes inspiration from Yoon Kim's paper.

The goal of this particular application of convolutional neural networks to text classification is to categorize Italian Math/Calculus exercises into 7 different classes, using a small but balanced dataset. For example, given a 4D optimization Calculus exercise as input, the network should categorize that exercise as a 4D optimization problem.

Features

  • Export datasets with a folder hierarchy (e.g. test/label_1/point.txt) from txt to JSON files (with TEXT and LABEL as fields)
  • Import custom datasets
  • Train, evaluate and save the model's state
  • Make a prediction from user's input (see the sketch after this list)
  • Print info about the dataset and the model
  • Test NLPAug (NLP Data Augmentation)
  • Plot Accuracy, Loss and PR curves - TensorboardX
  • Visualize the embedding space projection - TensorboardX
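
As a rough idea of how the prediction feature can work, here is a minimal sketch (not the repository's actual code): it assumes a trained model instance, the TEXT/LABEL Fields built elsewhere in the project, a batch-first model and the spaCy Italian tokenizer, and it pads very short inputs so the widest convolutional filter fits.

import torch
import spacy

nlp = spacy.load('it_core_news_sm')

def predict_class(model, sentence, TEXT, LABEL, device, min_len=4):
    """Tokenize a raw sentence, numericalize it with the vocab and return the predicted label."""
    model.eval()
    tokens = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokens) < min_len:                      # pad so the widest filter (size 4) fits
        tokens += ['<pad>'] * (min_len - len(tokens))
    indexed = [TEXT.vocab.stoi[t] for t in tokens]
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)   # [1, seq_len], batch-first
    with torch.no_grad():
        prediction = model(tensor).argmax(dim=1).item()
    return LABEL.vocab.itos[prediction]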

Model summary

Embedding dimension: 100
N. of filters: 400
Vocab dimension: 4363
Filter sizes: [2, 3, 4]
Batch size: 32
Categories:
   3D geometric figures in spatial diagrams
   arithmetic
   crypto-arithmetic
   numbers in spatial diagrams
   temporal reasoning
   spatial reasoning
   geometric figures in spatial diagrams OR puzzle
===================================================
Layer (type)         Output Shape         Param #
===================================================
Embedding-1          [-1, 32, 100]        436,300
Conv2d-2           [-1, 400, 31, 1]         80,400
Conv2d-3           [-1, 400, 30, 1]        120,400
Conv2d-4           [-1, 400, 29, 1]        160,400
Dropout-5                [-1, 1200]              0
Linear-6                    [-1, 7]          8,407
===================================================
Total params: 805,907
Trainable params: 805,907
Non-trainable params: 0

Best Test Loss achieved: 0.654
Best Test Accuracy achieved: 83.33%
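
For reference, a minimal Kim-style model consistent with the summary above would look roughly like the sketch below; the repository's actual model.py may differ, and the padding index and dropout probability here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embedding -> parallel Conv2d filters -> max-over-time pooling -> dropout -> linear."""
    def __init__(self, vocab_size=4363, embedding_dim=100, n_filters=400,
                 filter_sizes=(2, 3, 4), output_dim=7, dropout=0.5, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.convs = nn.ModuleList([
            nn.Conv2d(1, n_filters, (fs, embedding_dim)) for fs in filter_sizes
        ])
        self.fc = nn.Linear(n_filters * len(filter_sizes), output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):                          # text: [batch, seq_len]
        emb = self.embedding(text).unsqueeze(1)       # [batch, 1, seq_len, emb_dim]
        conved = [F.relu(conv(emb)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))  # [batch, 1200]
        return self.fc(cat)                           # [batch, 7]

With these hyperparameters the parameter counts match the table: 436,300 for the embedding, 80,400/120,400/160,400 for the three convolutions and 8,407 for the linear layer, for a total of 805,907.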

Getting Started

To install PyTorch, see installation instructions on the PyTorch website.

Install TorchText, Inquirer, TorchSummary* and TensorboardX:

pip install inquirer
pip install tensorboardX
pip install torchsummary
pip install torchtext

spaCy is required to tokenize our data. To install spaCy, follow the instructions here, making sure to install the Italian (or another language) model with:

python -m spacy download it_core_news_sm
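
A minimal way to wire the Italian tokenizer into TorchText 0.5 Fields might look like the sketch below; the field arguments are assumptions, and dataset.py may be set up differently.

import spacy
from torchtext import data

spacy_it = spacy.load('it_core_news_sm')

def tokenize_it(text):
    return [tok.text for tok in spacy_it.tokenizer(text)]

TEXT = data.Field(tokenize=tokenize_it, batch_first=True)
LABEL = data.LabelField()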

I've tried two different Italian embeddings to build the vocab and load the pre-trained word embeddings; you can download them here:

You should extract one of them into the vector_cache folder and load it from dataset.py. For example:

vectors = vocab.Vectors(name='model.txt', cache='vector_cache/word2vec_CoNLL17')
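
From there, building the vocab on the training split and copying the pre-trained weights into the embedding layer could be done roughly as follows (train_data and model are assumed to come from the surrounding scripts):

TEXT.build_vocab(train_data, vectors=vectors)   # attaches the pre-trained vectors to the vocab
LABEL.build_vocab(train_data)

# initialize the embedding layer with the pre-trained weights
model.embedding.weight.data.copy_(TEXT.vocab.vectors)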

I'd suggest using Word2Vec models anyway, as I've found them easier to integrate with libraries such as nlpaug - Data Augmentation for NLP.
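
For instance, a word-embedding-based substitution augmenter with nlpaug might be set up along these lines (the path and parameters are illustrative, not the repository's actual configuration):

import nlpaug.augmenter.word as naw

aug = naw.WordEmbsAug(
    model_type='word2vec',
    model_path='vector_cache/word2vec_CoNLL17/model.txt',   # plain-text Word2Vec model
    action='substitute',        # replace some tokens with embedding-space neighbours
    aug_p=0.3,                  # fraction of tokens to augment
)
augmented = aug.augment("Trova il massimo della funzione nel dominio assegnato")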

After performing any TensorboardX-related operation, remember to run

 tensorboard --logdir=tensorboard    
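
The curves and embedding projections come from a tensorboardX SummaryWriter; a rough sketch of the logging calls (variable names here are placeholders) is:

from tensorboardX import SummaryWriter

writer = SummaryWriter('tensorboard')            # same folder passed to --logdir above

# scalar curves, typically one call per epoch
writer.add_scalar('Loss/valid', valid_loss, epoch)
writer.add_scalar('Accuracy/valid', valid_acc, epoch)

# PR curve for one class: 0/1 labels vs. predicted probabilities
writer.add_pr_curve('PR/arithmetic', binary_labels, class_probabilities, epoch)

# embedding projection: one vector per word in the vocab
writer.add_embedding(model.embedding.weight.data, metadata=TEXT.vocab.itos)

writer.close()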

Due to a Torchsummary issue with embeddings, you should change the dtype from FloatTensor to LongTensor in its source file in order to get a Keras-like summary of the model.

Dataset

This project works on a custom private dataset. You can import your own .txt dataset by adopting the following folder pattern.

/ezmath/
...
...
|-- dataset_folder
|   |-- test
|   |-- train
|   |-- validation
|   |   |-- whatever_label_1
|   |   |-- whatever_label_2
|   |   |-- ...
|   |   |-- ...
|   |   |-- whatever_label_X
|   |   |   |-- whatever_1.txt
|   |   |   |-- whatever_2.txt
|   |   |   |-- ...
|   |   |   |-- ...
|   |   |   |-- whatever_Y.txt

load_dataset() will create a data folder with three JSON files: test.json, train.json and validation.json. Each JSON file will contain Y entries, and each entry will have 2 fields: text and label.

For example:

{"text": ["This", "was", "a", ".txt"], "label": "whatever_label_X"}

TensorboardX screenshots

Embeddings

Embedding Space

Accuracy and Loss curves

Acc/Loss Curves

PR curves

PR Curves

References