Recurrent Convolutional Neural Networks for Chinese Question Classification on BQuLD

Architecture Overview

For more details Click Here.

Bilingual Question Labelling Dataset (BQuLD)

This dataset is a bilingual (traditional Chinese & English) question labelling dataset designed for NLP researchers.
It originally consists of 1216 pairs of question and question label, which first published by the author of this GitHub tim5go
There are 9 question types in total, namely:

NUMBER
PERSON
LOCATION
ORGANIZATION
ARTIFACT
TIME
PROCEDURE
AFFIRMATION
CAUSALITY

Embedding Preparation

In my experiment, I built a word2vec model on 全網新聞數據(SogouCA) Sogou Labs

For example, in Linux:

clean XML tag

$ cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" 
  | sed 's\<content>\\' | sed 's\</content>\\' > corpus.txt

word segmentation using LTP command line

$ cws_cmdline --threads 4 --input corpus.txt --segmentor-model cws.model > corpus.seg.txt

simplified to traditional Chinese conversion using OpenCC

$ opencc -i corpus.seg.txt -o corpus_trad.txt -c s2t.json

word2Vec training using Google Word2vec

$ nohup ./word2vec -train corpus_trad.txt -output sogou_vectors.bin -cbow 0 
  -size 200 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 24 -binary 1 -iter 20 -min-count 1 &

Result

Training Loss	Training Accuracy	Validation Loss	Validation Accuracy
0.7000	87.11%	0.8945	77.87%

About

Chinese Question Classifier (Keras Implementation) on BQuLD

https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745

Languages

Language:Python 100.0%