Bilingual Word Embeddings with Bucketed CNN for Parallel Sentence Extraction

A TensorFlow implementation of our recent paper in ACL 2017(Student track) (Bilingual Word Embeddings with Bucketed CNN for Parallel Sentence Extraction.)

Two sentences are said to be aligned or semantically similar if they convey the same semantics in both the languages. Our code makes use of the Bilingual Word Embeddings for capturing the semantic relatedness of two words across languages. A similarity matrix is constructed between the words of two sentences, which is dynamically pooled to a fixed size dimension 'dim' for classification tasks. We split the data into different bucket sizes as one fixed size representation would not work effectively for all sentence-pair sizes. Separate CNN's were trained on each data split.

Pre-requisites

Python 2.7
TensorFlow
Numpy
Scikit Learn
Matplotlib

Usage

To replicate the results from our paper, use the testing command below.

$ python main.py

The results would be appended at the end of corresponding files in the results folder. If you want to retrain the model, then uncomment the lines specified in the main.py

Attribution / Thanks

Bilingual Word Representations with Monolingual Quality in Mind Link
BUCC 2017 dataset Link

License

MIT

About

Code for our paper in ACL 2017

cnn deep-learning sentence-alignment

MIT License

Languages

Language:Python 100.0%