Quora Duplicate Question Pairs using Text Mining, Logistic Regression, Random Forests and XGBoost

An important product principle for Quora is that there should be a single question page for each logically distinct question. As a simple example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical. The goal of this competition is to predict which of the provided question pairs contain two questions with the same meaning. The data set consists of over 400,000 lines of potential duplicate question pairs. Each line contains the IDs of the two questions in the pair, the full text of each question, and a binary value that indicates whether the pair is truly a duplicate.
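
To make the row layout described above concrete, here is a minimal sketch in base R. The column names (`qid1`, `qid2`, `question1`, `question2`, `is_duplicate`), the second toy row, and the helper functions `tokenize` and `word_overlap` are assumptions made for illustration, not taken from the project's code. The word-overlap score is only one simple example of a text-mining feature and is not claimed to be the feature set this repository uses.

```r
# Toy rows in the layout the description gives: two question IDs,
# both question texts, and a 0/1 duplicate label.
# Column names are assumed; the second row is made up for illustration.
pairs <- data.frame(
  qid1 = c(1, 3),
  qid2 = c(2, 4),
  question1 = c("What is the most populous state in the USA?",
                "How do I learn R?"),
  question2 = c("Which state in the United States has the most people?",
                "What is the capital of France?"),
  is_duplicate = c(1, 0),
  stringsAsFactors = FALSE
)

# Lowercase the text, split on runs of non-alphanumeric characters,
# and keep the distinct non-empty words.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  unique(words[nzchar(words)])
}

# Jaccard overlap: shared words divided by all distinct words in the pair.
word_overlap <- function(q1, q2) {
  a <- tokenize(q1)
  b <- tokenize(q2)
  all_words <- union(a, b)
  if (length(all_words) == 0) return(0)
  length(intersect(a, b)) / length(all_words)
}

# One feature value per question pair.
pairs$overlap <- mapply(word_overlap, pairs$question1, pairs$question2,
                        USE.NAMES = FALSE)
print(pairs[, c("qid1", "qid2", "is_duplicate", "overlap")])
```

On the full data set, a column like `overlap` could serve as one predictor for a classifier, for example `glm(is_duplicate ~ overlap, family = binomial)` for logistic regression. Two rows are far too few to fit a model, so the sketch stops at the feature.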

Data set Reference: Quora (website)
