smadha / MlTrio

CSCI-567 course project

Classification

smadha opened this issue

For every (user, question) pair (U_i, Q_i) in the test data, find all the users in the training data who have already answered or ignored question Q_i. Call these sets U_a and U_ig respectively.
From here we can treat it as a binary classification problem where users who ignored the question are class 0 and users who answered are class 1.

  • We can then calculate P(Answered Q_i | U) and P(Ignored Q_i | U) using naive Bayes, since P(U | Answered Q_i) and P(U | Ignored Q_i) can be estimated from the training data and the features of the users in U_a and U_ig.
  • Or build a KNN classifier and check whether U_i is closer to the users in U_a or to those in U_ig.
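
A minimal sketch of both options for a single question Q_i, assuming each user is described by a tuple of categorical features for naive Bayes and by a numeric vector for KNN. The function names, the Laplace smoothing constant `alpha`, and the choice of Euclidean distance are illustrative assumptions, not something fixed in this thread.

```python
import math
from collections import Counter

def naive_bayes_answer_prob(u_features, answered, ignored, alpha=1.0):
    """Estimate P(Answered Q_i | U) from the users who already saw Q_i.

    answered / ignored: lists of categorical feature tuples for U_a / U_ig.
    alpha: Laplace smoothing constant (an assumption, not from the thread).
    """
    groups = {1: answered, 0: ignored}
    total = len(answered) + len(ignored)
    log_joint = {}
    for label, users in groups.items():
        # Smoothed log prior P(class).
        log_p = math.log((len(users) + alpha) / (total + 2 * alpha))
        for j, value in enumerate(u_features):
            counts = Counter(u[j] for u in users)
            # Distinct values of feature j seen in either class (plus the query).
            n_values = len({u[j] for u in answered + ignored} | {value})
            # Smoothed log likelihood P(feature_j = value | class).
            log_p += math.log((counts[value] + alpha) / (len(users) + alpha * n_values))
        log_joint[label] = log_p
    # Normalise the two joint scores into a posterior for class 1.
    m = max(log_joint.values())
    e = {k: math.exp(v - m) for k, v in log_joint.items()}
    return e[1] / (e[0] + e[1])

def knn_answer_vote(u_vector, answered, ignored, k=3):
    """Return 1 if most of the k nearest users answered Q_i, else 0."""
    labelled = [(math.dist(u_vector, v), 1) for v in answered]
    labelled += [(math.dist(u_vector, v), 0) for v in ignored]
    nearest = sorted(labelled)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0
```

Both functions only look at users who were shown Q_i, so a question with very few responses in the training data will give unreliable estimates.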

Similarly, we can find all the questions in the training data that user U_i has already answered or ignored and build a classifier using the same method.

Cluster IDs can be added as features in this model.

Actually, we can try a much simpler way: expand the training pairs with the features of the question and the user.

invited_info_train.txt
Q_1    U_1    0
Q_2    U_2    1
..
..
Q_n    U_n    0

Final training data -

Q_1F_1    Q_1F_2    Q_1F_3    U_1F_1    U_1F_2    U_1F_3    U_1F_4    0
Q_2F_1    Q_2F_2    Q_2F_3    U_2F_1    U_2F_2    U_2F_3    U_2F_4    1
..
..
Q_nF_1    Q_nF_2    Q_nF_3    U_nF_1    U_nF_2    U_nF_3    U_nF_4    0
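
The expansion above could be sketched roughly as follows. It assumes the question and user features have already been loaded into dictionaries keyed by ID, and that each line of invited_info_train.txt holds three whitespace-separated fields; `parse_pairs` and `build_rows` are hypothetical helper names.

```python
def parse_pairs(lines):
    """Parse 'Q_id U_id label' lines (whitespace-separated layout is assumed)."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pairs.append((parts[0], parts[1], int(parts[2])))
    return pairs

def build_rows(pairs, question_features, user_features):
    """Expand (question, user, label) triples into flat feature rows."""
    rows = []
    for q_id, u_id, label in pairs:
        # Question features first, then user features, then the label.
        rows.append(list(question_features[q_id]) + list(user_features[u_id]) + [label])
    return rows

# Toy feature tables standing in for question_info.txt and user_info.txt.
question_features = {"Q_1": [0.1, 0.2, 0.3], "Q_2": [0.4, 0.5, 0.6]}
user_features = {"U_1": [1, 2, 3, 4], "U_2": [5, 6, 7, 8]}
train_rows = build_rows(parse_pairs(["Q_1\tU_1\t0", "Q_2\tU_2\t1"]),
                        question_features, user_features)
```

A pair whose question or user ID is missing from the feature tables raises a `KeyError` here, which makes gaps in the feature files visible early.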

We can now train any classifier, such as a BDT or an SVM, to get a model.

Test data can be formed the same way, without the label column:

Q_iF_1    Q_iF_2    Q_iF_3    U_iF_1    U_iF_2    U_iF_3    U_iF_4 
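
As one possible illustration of the train-then-score flow, here is a small linear SVM trained with sub-gradient steps on the regularised hinge loss, written in plain Python. This is a stand-in for whatever BDT or SVM library ends up being used; the learning rate `eta`, the regulariser `lam`, and the epoch count are arbitrary assumptions, and real features would probably need scaling first.

```python
def train_linear_svm(rows, epochs=50, eta=0.1, lam=0.01):
    """Fit weights w and bias b on rows of [features..., label in {0, 1}]."""
    n_features = len(rows[0]) - 1
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for row in rows:  # rows are visited in the given order
            x = row[:-1]
            y = 1 if row[-1] == 1 else -1  # map labels {0, 1} to {-1, +1}
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Weight decay from the L2 regulariser.
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                # Hinge loss is active: step towards the correct side.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, features):
    """Score a test row (features only, no label) and return class 0 or 1."""
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1 if score > 0 else 0
```

A test row built from Q_i and U_i features is passed to `predict` directly, since it has the same column order as a training row minus the label.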

Once we create clusters, we can remove the bag-of-words features and replace them with cluster IDs.
We will create separate clusterings based on the word ID sequences and the character ID sequences in user_info.txt and question_info.txt.
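
A rough sketch of how a cluster ID could replace the bag-of-words columns, assuming each word (or character) ID sequence has already been parsed into a list of integers below a known vocabulary size. It uses plain k-means seeded with the first k vectors; the clustering algorithm, the seeding, and the number of clusters are all assumptions to be tuned.

```python
from collections import Counter

def bag_of_words(id_sequence, vocab_size):
    """Turn a sequence of integer IDs into a count vector."""
    counts = Counter(id_sequence)
    return [counts[i] for i in range(vocab_size)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Lloyd's k-means; returns one cluster ID per input vector."""
    centroids = [list(v) for v in vectors[:k]]  # seed with the first k vectors
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        for i, v in enumerate(vectors):
            assignment[i] = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy word ID sequences standing in for rows of user_info.txt.
sequences = [[0, 1], [2, 3], [0, 1, 1], [2, 3, 3]]
vectors = [bag_of_words(s, vocab_size=4) for s in sequences]
cluster_ids = kmeans(vectors, k=2)
```

Each entry of `cluster_ids` would then take the place of the corresponding row's bag-of-words columns, once for the word ID clustering and once for the character ID clustering.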