aman-sawarn / Quora-Question-Pair-Similarity

Identifying if two questions are similar, when two questions mean something similar, but have different words

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Quora-Question-Pair-Similarity

Identifying if two questions are similar, when two questions mean something similar, but have different words
Quora is a question-and-answer website where questions are asked, answered, edited, and organized by its community of users
in the form of opinions.
BOW and TF-IDF are two of the most common methods people use in information retrieval. Generally speaking, SVMs and Naive Bayes
are more common for classification problem, however, because their accuracy is dependent of the training data, Xgboost provided
the best accuracy in this particular data set. XGBoost is a gradient boosting framework that has become massively popular, especially
in the Kaggle community. Therefore, I decided to use this model as a baseline model, because it is simple to set up, easy to understand
and has a reasonable chance of providing decent results. Our baseline model will allow us to get a quick performance benchmark. If we
find that the performance it provides is not sufficient, then inspecting what the simple model is struggling with can help us choose
a next approach.
The classes are not perfectly balanced, but it is not bad, we are not going to balance them.
It can be noticed that we have off a lot work to do in terms of text cleaning. After some inspections, a few tries I took ideas from
Not to remove stop words, because words like “what”, “which” and “how” may have strong signals.
Not to stem words.
Remove punctuation.
Correct typos.
Change abbreviations to its original terms.
Remove comma between numbers.
Change special chars to words. And so on.

About

Identifying if two questions are similar, when two questions mean something similar, but have different words


Languages

Language:Jupyter Notebook 100.0%