walter090 / toxic_comments

CNN and LSTM multi-label text classification

Detecting Toxic Comments

Introduction

CNN and LSTM models for text classification. The models are tested on a multi-label classification task with the Wikimedia comments dataset. With randomly initialized word embeddings, the model achieves an AUROC of 0.896; with pre-trained FastText embeddings, the AUROC is 0.972 for Kim Yoon's CNN and 0.983 for a stacked LSTM with attention.

Usage

Training

To train with the default layer configuration:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode cnn

where the vocab flag specifies the vocabulary size and the embedding flag the embedding size; in this example, the actual vocabulary size will be 30002, since tokens for unknown words and padding are added. There are four modes: 'cnn' trains a CNN classifier, 'lstm' trains an LSTM classifier, 'emb' trains word embeddings, and 'test' tests a trained model.

To train with a pre-trained word vector file, use the 'vector' flag:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode lstm --vector fasttext.vec

You can also optionally add a TSV metadata file for the TensorBoard projector using the 'metadata' flag.
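For example, assuming the flag is spelled --metadata and the metadata file is named metadata.tsv (both illustrative here, not confirmed by the repo), the call might look like:

python training/train.py --data dataset.csv --vocab 30000 --embedding 300 --mode lstm --vector fasttext.vec --metadata metadata.tsv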

Use the deployed example model trained on the Wikimedia dataset

Make requests to the deployed saved model:

python training/client.py --server 35.227.88.30:9000 -d "metadata/word2id.pickle" -t "Enter your potential abusive text here."

The output is the model's prediction response:

outputs {
  key: "output"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 1
      }
      dim {
        size: 6
      }
    }
    float_val: 1.0
    float_val: 0.0
    float_val: 1.0
    float_val: 0.0
    float_val: 0.0
    float_val: 0.0
  }
}

Each of the six float_vals corresponds to one of the labels toxic, severe_toxic, obscene, threat, insult, and identity_hate, in that order.
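A minimal sketch of pairing the returned scores with their labels; the float_vals list below simply copies the example output above by hand rather than parsing a real server response:

# Sketch: map the six returned scores to their labels.
# Label order follows the README; the score values are the example output above.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

float_vals = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]

predictions = dict(zip(LABELS, float_vals))
print(predictions)
# {'toxic': 1.0, 'severe_toxic': 0.0, 'obscene': 1.0, 'threat': 0.0, 'insult': 0.0, 'identity_hate': 0.0}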

Custom CNN layers

You can also change the layer configuration if you write your own training and testing code, by providing values for the layer_config and fully_conn_config attributes of the ToxicityCNN object. layer_config is a nested list with the following structure:

[
    [
        # Parallel layer 1
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
    [
        # Parallel layer 2
        [ksize, stride, out_channels, pool_ksize, pool_stride],
    ],
]

For example, a configuration like this:

[
    # Convolution layer configuration
    # ksize, stride, out_channels, pool_ksize, pool_stride
    [
        [2, 1, 256, 59, 1],
    ],
    [
        [3, 1, 256, 58, 1],
    ],
    [
        [4, 1, 256, 57, 1],
    ],
    [
        [5, 1, 256, 56, 1],
    ],
]

represents a structure like the one in the repository's configuration diagram: four parallel convolution branches with kernel sizes 2, 3, 4, and 5, each producing 256 output channels.
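A minimal sketch of passing such a configuration to the model; the import path, the ToxicityCNN constructor signature, and the fully_conn_config format are assumptions for illustration, not taken from the repo:

# Hypothetical import path; adjust to the actual module layout of the repo.
# from training.toxicity_cnn import ToxicityCNN

layer_config = [
    # ksize, stride, out_channels, pool_ksize, pool_stride
    [[2, 1, 256, 59, 1]],
    [[3, 1, 256, 58, 1]],
    [[4, 1, 256, 57, 1]],
    [[5, 1, 256, 56, 1]],
]

# Hypothetical fully connected configuration: hidden layer sizes ending in the
# six output labels. Check fully_conn_config in the code for the real format.
fully_conn_config = [1024, 6]

# Hypothetical constructor call; verify the actual signature in the ToxicityCNN class.
model = ToxicityCNN(layer_config=layer_config, fully_conn_config=fully_conn_config)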
