parshakova / cnn-q-class

Chinese Question Classifier (Keras Implementation) on BQuLD

Home Page: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745

To classify arbitrary spans of words:

  1. Run python main_qc_bigg.py to train the model and save a checkpoint.
  2. Then, with the saved checkpoint, run python test_qc_bigg.py.

Recurrent Convolutional Neural Networks for Chinese Question Classification on BQuLD

Architecture Overview

For more details Click Here.

Bilingual Question Labelling Dataset (BQuLD)

BQuLD is a bilingual (Traditional Chinese and English) question labelling dataset designed for NLP researchers.
It originally consisted of 1216 question/label pairs and was first published by tim5go, the author of this GitHub repository.
There are 9 question types in total:

  1. NUMBER
  2. PERSON
  3. LOCATION
  4. ORGANIZATION
  5. ARTIFACT
  6. TIME
  7. PROCEDURE
  8. AFFIRMATION
  9. CAUSALITY
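
A classifier needs these nine types as numeric targets. The sketch below shows one common way to map them to indices and one-hot vectors. The names `QUESTION_TYPES`, `LABEL_TO_INDEX`, and `one_hot` are hypothetical, and using the list order above as the index order is an assumption; the repository's own encoding may differ.

```python
# Question types in the order listed above (index order is an assumption).
QUESTION_TYPES = [
    "NUMBER", "PERSON", "LOCATION", "ORGANIZATION", "ARTIFACT",
    "TIME", "PROCEDURE", "AFFIRMATION", "CAUSALITY",
]

# Map each label name to its integer class index.
LABEL_TO_INDEX = {name: i for i, name in enumerate(QUESTION_TYPES)}


def one_hot(label):
    """Return a one-hot list of length 9 for the given label name."""
    vec = [0] * len(QUESTION_TYPES)
    vec[LABEL_TO_INDEX[label]] = 1
    return vec


print(one_hot("TIME"))  # [0, 0, 0, 0, 0, 1, 0, 0, 0]
```

One-hot targets of this shape pair with a 9-way softmax output layer and a categorical cross-entropy loss.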

Embedding Preparation

In my experiment, I built a word2vec model on the Full-Web News Data corpus (全網新聞數據, SogouCA) from Sogou Labs.

For example, on Linux:

  1. Strip the XML tags, keeping only the <content> lines:
$ cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" \
  | sed 's|<content>||' | sed 's|</content>||' > corpus.txt
  2. Segment words with the LTP command-line tool:
$ cws_cmdline --threads 4 --input corpus.txt --segmentor-model cws.model > corpus.seg.txt
  3. Convert Simplified Chinese to Traditional Chinese with OpenCC:
$ opencc -i corpus.seg.txt -o corpus_trad.txt -c s2t.json
  4. Train the embeddings with Google's word2vec tool:
$ nohup ./word2vec -train corpus_trad.txt -output sogou_vectors.bin -cbow 0 \
  -size 200 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 24 -binary 1 -iter 20 -min-count 1 &
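
The file produced by the last step, sogou_vectors.bin, is in the word2vec binary layout: an ASCII header line with the vocabulary size and vector size, then for each word its bytes, a space, and the raw 32-bit floats. As a minimal sketch of how those vectors might be read back in Python, the hypothetical helper `read_word2vec_binary` below parses that layout with the standard library only. It assumes little-endian floats (as written on x86 machines) and is not taken from this repository; the demo values are made up.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read {word: vector} from a binary stream in the word2vec `-binary 1` layout."""
    # Header line: "<vocab_size> <vector_size>\n" in ASCII.
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    vectors = {}
    for _ in range(vocab_size):
        # Each word is stored as raw bytes terminated by a single space.
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous record
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        # Then come `dim` 32-bit floats (little-endian assumed).
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


# Tiny in-memory example: two 3-dimensional vectors with made-up values.
demo = io.BytesIO(
    b"2 3\n"
    + "問題 ".encode("utf-8") + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
    + "答案 ".encode("utf-8") + struct.pack("<3f", 1.5, 0.25, -0.75) + b"\n"
)
demo_vectors = read_word2vec_binary(demo)
print(demo_vectors["問題"])  # (0.5, -1.0, 2.0)
```

For the real file, the same function could be called on `open("sogou_vectors.bin", "rb")`. With `-size 200` each vector would have 200 components.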

Result

| Training Loss | Training Accuracy | Validation Loss | Validation Accuracy |
|---------------|-------------------|-----------------|---------------------|
| 0.7000        | 87.11%            | 0.8945          | 77.87%              |
