Overview

The code in this repository implements ideas from 'Convolutional Neural Networks for Sentence Classification' by Yoon Kim from the New York University (https://arxiv.org/pdf/1404.2188.pdf) in context of detecting question duplicates on Quora. The implementation relies on pre-trained word2vec embeddings (https://code.google.com/archive/p/word2vec/) and adds a convolutional neural network to detect words in certain sequences. Specifically, there is a single embedding layer followed by one convolution - max-pooling - ReLu layer with dropout followed by a softmax classification layer.

The implementation is done in Tensorflow and therefore easily scalable.

Requirements

The computer this is run on should have at least 16Gb of RAM.

Install python 3 and wget
Install tensorflow for python 3 (https://www.tensorflow.org/install/)
Install gensim (pip install gensim)
Run ./prepare.sh

To train the model, use

python train.py

The resultant network is dumped every epoch. To use the trained model to predict the outcome for the submission set, run

python evaluate.py

About

Tensorflow implementation of natural language processing task - detecting duplicate questions from Quora

GNU General Public License v3.0

Languages

Language:Python 97.5%Language:Shell 2.5%