aman-sawarn / Quora-Question-Pair-Similarity

Identifying whether two questions are similar in meaning even when they are worded differently

Quora is a question-and-answer website where questions are asked, answered, edited, and organized by its community of users in the form of opinions.

BOW and TF-IDF are two of the most common methods people use in information retrieval. Generally speaking, SVMs and Naive Bayes are more common for classification problems; however, accuracy depends on the training data, and XGBoost provided the best accuracy on this particular data set. XGBoost is a gradient boosting framework that has become very popular, especially in the Kaggle community. I therefore decided to use it as the baseline model, because it is simple to set up, easy to understand, and has a reasonable chance of providing decent results. The baseline model gives us a quick performance benchmark. If its performance turns out to be insufficient, inspecting what the simple model struggles with can help us choose the next approach.
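To make the BOW and TF-IDF representations concrete, here is a minimal sketch in plain Python. The function names, the whitespace tokenizer, the smoothed IDF formula, and the three toy questions are all illustrative assumptions, not code or data from this repository. In practice a library vectorizer would normally produce these features, which could then be fed to a classifier such as XGBoost.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bow_vector(tokens):
    """Bag of words: map each token to its raw count."""
    return Counter(tokens)


def idf_weights(corpus_tokens):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight each term count by its IDF; terms missing from idf get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy questions, made up for illustration only.
questions = [
    "how do i learn python",
    "how can i learn python quickly",
    "what is the capital of france",
]
corpus = [tokenize(q) for q in questions]
idf = idf_weights(corpus)
vectors = [tfidf_vector(tokens, idf) for tokens in corpus]

sim_related = cosine(vectors[0], vectors[1])    # questions sharing several words
sim_unrelated = cosine(vectors[0], vectors[2])  # questions sharing no words
```

The pair that shares words gets a positive similarity, while the pair with no words in common scores zero. That zero is also the weakness of purely lexical features: two questions with the same meaning but different words look unrelated.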

The classes are not perfectly balanced, but the imbalance is mild, so we are not going to rebalance them.
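A quick way to judge the balance is to compute the share of each label. The sketch below assumes the labels are 0/1 flags (1 for a duplicate pair); the helper name `class_shares` and the toy label list are made up for illustration and are not taken from the data set.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples that fall into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up labels, not real data: 1 = duplicate pair, 0 = non-duplicate pair.
labels = [0, 0, 0, 1, 1, 0, 1, 0]
shares = class_shares(labels)
```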

There is clearly a lot of work to do in terms of text cleaning. After some inspection and a few tries, I took the following ideas from https://www.kaggle.com/currie32/the-importance-of-cleaning-text:

Do not remove stop words, because words like “what”, “which” and “how” may carry strong signals.

Do not stem words.

Remove punctuation.

Correct typos.

Change abbreviations to their original terms.

Remove commas between numbers.

Change special chars to words, and so on.
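The cleaning steps above can be sketched as a single function. This is a minimal illustration, not the notebook's actual code: the lookup tables are tiny hypothetical examples, lowercasing is an extra assumption not listed above, and "commas between numbers" is read as the thousands separator in figures such as 1,000.

```python
import re

# Hypothetical lookup tables; a real pipeline would need much larger ones.
ABBREVIATIONS = {"what's": "what is", "can't": "cannot", "i'm": "i am"}
SPECIAL_CHARS = {"%": " percent ", "$": " dollar ", "&": " and "}
TYPOS = {"programing": "programming"}


def clean_question(text):
    """Clean one question; stop words are kept and no stemming is applied."""
    # Assumption: lowercase so that the lookup tables match regardless of case.
    text = text.lower()
    # Remove commas between digits: "1,000" -> "1000".
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Change abbreviations to their original terms.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Change special characters to words.
    for char, word in SPECIAL_CHARS.items():
        text = text.replace(char, word)
    # Remove the remaining punctuation (anything that is not a word char or space).
    text = re.sub(r"[^\w\s]", " ", text)
    # Correct known typos token by token and collapse repeated whitespace.
    tokens = [TYPOS.get(token, token) for token in text.split()]
    return " ".join(tokens)


cleaned = clean_question("What's the price of 1,000 shares & is 5% programing good?")
# cleaned == "what is the price of 1000 shares and is 5 percent programming good"
```

The order matters: abbreviations and special characters are expanded before punctuation is stripped, because stripping first would destroy the apostrophes and symbols the lookups rely on.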

Languages: Jupyter Notebook 100.0%