DarekarA / Quora-Question-Pair-Similarity

- Identify which questions asked on Quora are duplicates of questions that have already been asked. - This could be useful to instantly provide answers to questions that have already been answered. - We are tasked with predicting whether a pair of questions are duplicates or not.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Quora-Question-Pair-Similarity

  • Identify which questions asked on Quora are duplicates of questions that have already been asked. - This could be useful to instantly provide answers to questions that have already been answered. - We are tasked with predicting whether a pair of questions are duplicates or not.

  • Source : https://www.kaggle.com/c/quora-question-pairs

Real world/Business Objectives and Constraints

  1. The cost of a mis-classification can be very high. 2. You would want a probability of a pair of questions to be duplicates so that you can choose any threshold of choice. 3. No strict latency concerns. 4. Interpretability is partially important. Machine Learning Probelm
  • Data will be in a file Train.csv
  • Train.csv contains 5 columns : qid1, qid2, question1, question2, is_duplicate
  • Size of Train.csv - 60MB
  • Number of rows in Train.csv = 404,290

Mapping the real world problem to an ML problem It is a binary classification problem, for a given pair of questions we need to predict if they are duplicate or not.

Performance Metric Source: https://www.kaggle.com/c/quora-question-pairs#evaluation

Metric(s):

log-loss : https://www.kaggle.com/wiki/LogarithmicLoss Binary Confusion Matrix We build train and test by randomly splitting in the ratio of 70:30 or 80:20 whatever we choose as we have sufficient points to work with.

About

- Identify which questions asked on Quora are duplicates of questions that have already been asked. - This could be useful to instantly provide answers to questions that have already been answered. - We are tasked with predicting whether a pair of questions are duplicates or not.


Languages

Language:Jupyter Notebook 100.0%