Detecting Semantic Equivalence of Quora Question Pairs

Lakshya A Agrawal (lakshya18242@iiitd.ac.in)
Shrikant Garg (shrikant18312@iiitd.ac.in)
Bhavya Chopra (bhavya18333@iiitd.ac.in)

Link to dataset: Kaggle Quora Question Pairs

Problem Statement

Popular Q&A websites such as StackExchange, Quora, and Yahoo Answers are visited by millions of users every day, so different users often ask separate questions with the same intent. This makes it harder for users to find the best answer, and such ‘duplicate’ questions must be manually moderated, either taken down or merged with similar threads. Repeated questions can also lead to poor recommendations for users. This motivates us to identify duplicate question pairs on such forums. The learning task is to train a binary classifier that predicts whether a given question pair is semantically equivalent, with the classes “equivalent” and “not equivalent”.
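To make the input/output contract of this task concrete, here is a minimal sketch of a pair classifier. It is not the method used in this project: it is a naive word-overlap (Jaccard) baseline, and the function names (`tokenize`, `jaccard`, `predict`) and the 0.5 threshold are assumptions chosen purely for illustration.

```python
import re


def tokenize(question):
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(q1, q2):
    """Jaccard similarity between the token sets of two questions."""
    a, b = tokenize(q1), tokenize(q2)
    if not a and not b:
        return 0.0  # two empty questions: treat as having no overlap
    return len(a & b) / len(a | b)


def predict(q1, q2, threshold=0.5):
    """Map a question pair to one of the two class labels.

    The threshold is a hypothetical value, not one tuned on the dataset.
    """
    return "equivalent" if jaccard(q1, q2) >= threshold else "not equivalent"


if __name__ == "__main__":
    print(predict("How do I learn Python?", "How do I learn Python"))
    print(predict("What is the capital of France?", "How do I bake bread?"))
```

A lexical baseline like this cannot capture paraphrases that share few words, which is why the task calls for a learned classifier over semantic features.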

Dataset

The chosen dataset is manually annotated and contains a large number of training samples drawn from Quora. The table below shows the training, validation, and test set splits.