MSWon / Transformer-Encoder-with-Char

Transformer Encoder with Char information for text classification


Transformer-Encoder-with-Char

  1. Transformer Encoder with Char information for text classification
  2. This code was written with reference to code by carpedm20 and DongjunLee

1. Model structure

(Figure: model architecture)

  1. Input words are represented by concatenating Char-CNN and Word2vec embeddings (64 dimensions each)

  2. The standard Transformer encoder from "Attention Is All You Need" is used

  3. The model is composed of 7 Transformer encoder layers, each with 4 attention heads

  4. A Global Average Pooling layer with softmax is used at the end to predict the class (a model sketch follows this list)
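
A minimal PyTorch sketch of the architecture described above. All names and hyperparameters not stated in this README are illustrative assumptions, and the repo's actual implementation (and framework) may differ:

```python
import torch
import torch.nn as nn

class CharWordTransformerClassifier(nn.Module):
    """Hypothetical sketch of the model described above; not the repo's actual code."""
    def __init__(self, word_vocab, char_vocab, num_classes=4,
                 word_dim=64, char_dim=64, num_layers=7, num_heads=4):
        super().__init__()
        d_model = word_dim + char_dim                       # 128-dim token representation
        self.word_emb = nn.Embedding(word_vocab, word_dim)  # word2vec-initialized in the repo
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Single-width stand-in for the character encoder; a Kim-style
        # multi-width Char CNN is sketched in section 2 below.
        self.char_conv = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, num_classes)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, W), W = max chars per word
        B, T, W = char_ids.shape
        chars = self.char_emb(char_ids).view(B * T, W, -1).transpose(1, 2)
        char_feat = torch.relu(self.char_conv(chars)).max(dim=-1).values  # max over time
        x = torch.cat([self.word_emb(word_ids), char_feat.view(B, T, -1)], dim=-1)
        h = self.encoder(x)              # 7 encoder layers, 4 attention heads each
        return self.out(h.mean(dim=1))   # global average pooling; softmax applied in the loss
```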

2. Char CNN

(Figure: Char CNN architecture)

  1. The Char CNN architecture proposed by Yoon Kim is used (a sketch follows)
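
A minimal PyTorch sketch of a Kim-style Char CNN. The embedding dimension, filter widths, and names here are illustrative assumptions, not the repo's settings:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Kim-style Char CNN: embed characters, convolve with several filter
    widths, max-pool over time, then concatenate the pooled features."""
    def __init__(self, char_vocab, char_emb_dim=16, out_dim=64, widths=(2, 3, 4, 5)):
        super().__init__()
        assert out_dim % len(widths) == 0
        n_filters = out_dim // len(widths)       # filters per width (assumption)
        self.emb = nn.Embedding(char_vocab, char_emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_emb_dim, n_filters, kernel_size=w) for w in widths)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len), zero-padded; max_word_len >= max(widths)
        x = self.emb(char_ids).transpose(1, 2)   # (num_words, emb_dim, len)
        pooled = [torch.relu(conv(x)).max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)         # (num_words, out_dim) word vectors
```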

3. Prerequisites

4. Training

  1. Clone the repository
$ git clone https://github.com/MSWon/Transformer-Encoder-with-Char.git
  2. Unzip data.zip and embedding.zip
$ unzip data.zip
$ unzip embedding.zip
  3. Train with user settings (char_mode: char_cnn, char_lstm, or no_char); a sketch of the matching argument parsing follows
$ python train.py --batch_size 128 --training_epochs 12 --char_mode char_cnn
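
A plausible argparse setup matching the flags above. This is a sketch only; the actual train.py may define these options differently:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train the Transformer encoder classifier")
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--training_epochs", type=int, default=12)
    parser.add_argument("--char_mode", choices=["char_cnn", "char_lstm", "no_char"],
                        default="char_cnn",
                        help="how to build character-level word features")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.batch_size, args.training_epochs, args.char_mode)
```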

5. Experiments

5-1. Datasets

  1. The AG’s News topic classification dataset is constructed by choosing the 4 largest classes from the original news corpus
  2. The 4 classes are ‘World’, ‘Sports’, ‘Business’ and ‘Science/Technology’
  3. Each class contains 30,000 training samples and 1,900 test samples
  4. In total there are 120,000 training samples and 7,600 test samples (a data-loading sketch follows this list)
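
A hypothetical loader assuming the common AG's News CSV layout (label, title, description); the layout inside the repo's data.zip may differ, as may the file path used below:

```python
import csv
from collections import Counter

def load_ag_news(path):
    labels, texts = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for label, title, description in csv.reader(f):
            labels.append(int(label))            # 1..4 -> World, Sports, Business, Sci/Tech
            texts.append(title + " " + description)
    return labels, texts

labels, texts = load_ag_news("data/train.csv")   # hypothetical path
print(Counter(labels))                           # expect 30,000 samples per class (120,000 total)
```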

5-2. Test loss graph

(Figure: test loss curves)

5-3. Performance table

(Table: performance comparison)
