RaffaeleGalliera / pytorch-cnn-text-classification

Convolutional Neural Network (CNN) for text classification implemented with PyTorch and TorchText

CNNs applied to text classification

Work-in-progress repository that implements multi-class text classification with a CNN (Convolutional Neural Network) for a Deep Learning university exam, using PyTorch 1.3, TorchText 0.5 and Python 3.7. It also integrates TensorboardX, a module that writes PyTorch data to TensorFlow's TensorBoard (a web server that serves visualizations of a neural network's training progress). TensorboardX makes it possible to visualize embeddings, PR curves and Loss/Accuracy curves.

I started this project following this awesome tutorial, which shows very clearly how to perform sentiment analysis with PyTorch. The convolutional approach to sentence classification takes inspiration from Yoon Kim's paper.

The goal of this particular application of convolutional neural networks to text classification is to categorize Italian Math/Calculus exercises into 7 different classes, using a small but balanced dataset. For example, given a 4D optimization Calculus exercise as input, the network should categorize that exercise as a 4D optimization problem.

Features

  • Export datasets with a folder hierarchy (e.g. test/label_1/point.txt) from txt to JSON files (with TEXT and LABEL as fields)
  • Import custom datasets
  • Train, evaluate and save the model's state
  • Make a prediction from user's input (see the sketch after this list)
  • Print info about the dataset and the model
  • Test NLPAug (NLP Data Augmentation)
  • Plot Accuracy, Loss and PR curves - TensorboardX
  • Visualize the embedding space projection - TensorboardX
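
As a rough idea of how the prediction feature can work, here is a minimal sketch (not the repository's actual code): it assumes a trained model instance, the TEXT/LABEL Fields built elsewhere in the project, a batch-first model and the spaCy Italian tokenizer, and it pads very short inputs so the widest convolutional filter fits.

import torch
import spacy

nlp = spacy.load('it_core_news_sm')

def predict_class(model, sentence, TEXT, LABEL, device, min_len=4):
    """Tokenize a raw sentence, numericalize it with the vocab and return the predicted label."""
    model.eval()
    tokens = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokens) < min_len:                      # pad so the widest filter (size 4) fits
        tokens += ['<pad>'] * (min_len - len(tokens))
    indexed = [TEXT.vocab.stoi[t] for t in tokens]
    tensor = torch.LongTensor(indexed).unsqueeze(0).to(device)   # [1, seq_len], batch-first
    with torch.no_grad():
        prediction = model(tensor).argmax(dim=1).item()
    return LABEL.vocab.itos[prediction]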

Model summary

Embedding dimension: 100
N. of filters: 400
Vocab dimension: 4363
Filter sizes: [2, 3, 4]
Batch size: 32
Categories:
   3D geometric figures in spatial diagrams
   arithmetic
   crypto-arithmetic
   numbers in spatial diagrams
   temporal reasoning
   spatial reasoning
   geometric figures in spatial diagrams OR puzzle
===================================================
Layer (type)         Output Shape         Param #
===================================================
Embedding-1          [-1, 32, 100]        436,300
Conv2d-2           [-1, 400, 31, 1]         80,400
Conv2d-3           [-1, 400, 30, 1]        120,400
Conv2d-4           [-1, 400, 29, 1]        160,400
Dropout-5                [-1, 1200]              0
Linear-6                    [-1, 7]          8,407
===================================================
Total params: 805,907
Trainable params: 805,907
Non-trainable params: 0

Best Test Loss achieved: 0.654
Best Test Accuracy achieved: 83.33%
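
For reference, a minimal Kim-style model consistent with the summary above would look roughly like the sketch below; the repository's actual model.py may differ, and the padding index and dropout probability here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embedding -> parallel Conv2d filters -> max-over-time pooling -> dropout -> linear."""
    def __init__(self, vocab_size=4363, embedding_dim=100, n_filters=400,
                 filter_sizes=(2, 3, 4), output_dim=7, dropout=0.5, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.convs = nn.ModuleList([
            nn.Conv2d(1, n_filters, (fs, embedding_dim)) for fs in filter_sizes
        ])
        self.fc = nn.Linear(n_filters * len(filter_sizes), output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):                          # text: [batch, seq_len]
        emb = self.embedding(text).unsqueeze(1)       # [batch, 1, seq_len, emb_dim]
        conved = [F.relu(conv(emb)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(c, c.shape[2]).squeeze(2) for c in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))  # [batch, 1200]
        return self.fc(cat)                           # [batch, 7]

With these hyperparameters the parameter counts match the table: 436,300 for the embedding, 80,400/120,400/160,400 for the three convolutions and 8,407 for the linear layer, for a total of 805,907.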

Getting Started

To install PyTorch, see installation instructions on the PyTorch website.

Install TorchText, Inquirer, TorchSummary* and TensorboardX:

pip install inquirer
pip install tensorboardX
pip install torchsummary
pip install torchtext

spaCy is required to tokenize our data. To install spaCy, follow the instructions here, making sure to install the Italian (or another language) model with:

python -m spacy download it_core_news_sm
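
A minimal way to wire the Italian tokenizer into TorchText 0.5 Fields might look like the sketch below; the field arguments are assumptions, and dataset.py may be set up differently.

import spacy
from torchtext import data

spacy_it = spacy.load('it_core_news_sm')

def tokenize_it(text):
    return [tok.text for tok in spacy_it.tokenizer(text)]

TEXT = data.Field(tokenize=tokenize_it, batch_first=True)
LABEL = data.LabelField()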

I've tried two different Italian embeddings to build the vocab and load the pre-trained word embeddings; you can download them here:

You should extract one of them into the vector_cache folder and load it from dataset.py. For example:

vectors = vocab.Vectors(name='model.txt', cache='vector_cache/word2vec_CoNLL17')
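
From there, building the vocab on the training split and copying the pre-trained weights into the embedding layer could be done roughly as follows (train_data and model are assumed to come from the surrounding scripts):

TEXT.build_vocab(train_data, vectors=vectors)   # attaches the pre-trained vectors to the vocab
LABEL.build_vocab(train_data)

# initialize the embedding layer with the pre-trained weights
model.embedding.weight.data.copy_(TEXT.vocab.vectors)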

I'd suggest using Word2Vec models anyway, as I've found them easier to integrate with libraries such as nlpaug - Data Augmentation for NLP.
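
For instance, a word-embedding-based substitution augmenter with nlpaug might be set up along these lines (the path and parameters are illustrative, not the repository's actual configuration):

import nlpaug.augmenter.word as naw

aug = naw.WordEmbsAug(
    model_type='word2vec',
    model_path='vector_cache/word2vec_CoNLL17/model.txt',   # plain-text Word2Vec model
    action='substitute',        # replace some tokens with embedding-space neighbours
    aug_p=0.3,                  # fraction of tokens to augment
)
augmented = aug.augment("Trova il massimo della funzione nel dominio assegnato")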

After performing any TensorboardX-related operation, remember to run

 tensorboard --logdir=tensorboard    
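
The curves and embedding projections come from a tensorboardX SummaryWriter; a rough sketch of the logging calls (variable names here are placeholders) is:

from tensorboardX import SummaryWriter

writer = SummaryWriter('tensorboard')            # same folder passed to --logdir above

# scalar curves, typically one call per epoch
writer.add_scalar('Loss/valid', valid_loss, epoch)
writer.add_scalar('Accuracy/valid', valid_acc, epoch)

# PR curve for one class: 0/1 labels vs. predicted probabilities
writer.add_pr_curve('PR/arithmetic', binary_labels, class_probabilities, epoch)

# embedding projection: one vector per word in the vocab
writer.add_embedding(model.embedding.weight.data, metadata=TEXT.vocab.itos)

writer.close()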

Due to a Torchsummary issue with embeddings, you should change the dtype from FloatTensor to LongTensor in its source file in order to get a Keras-like summary of the model.

Dataset

This project works on a custom private dataset. You can import your own .txt dataset by adopting the following folder pattern.

/ezmath/
...
...
|-- dataset_folder
|   |-- test
|   |-- train
|   |-- validation
|   |   |-- whatever_label_1
|   |   |-- whatever_label_2
|   |   |-- ...
|   |   |-- ...
|   |   |-- whatever_label_X
|   |   |   |-- whatever_1.txt
|   |   |   |-- whatever_2.txt
|   |   |   |-- ...
|   |   |   |-- ...
|   |   |   |-- whatever_Y.txt

load_dataset() will create a data folder with three JSON files: test.json, train.json and validation.json. Each JSON file will contain Y entries, and each entry will have 2 fields: text and label.

For example:

{"text": ["This", "was", "a", ".txt"], "label": "whatever_label_X"}

TensorboardX screenshots

Embeddings

Embedding Space

Accuracy and Loss curves

Acc/Loss Curves

PR curves

PR Curves

References