smadha / MlTrio

CSCI-567 course project

Data analysis

smadha opened this issue · comments

245752 LABELED SAMPLES
8095 questions
28763 users

27,324 questions answered
218,428 questions not answered
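
The two counts above imply that only about 11% of the labeled samples are positive, i.e. roughly eight unanswered pairs for every answered one. A minimal sketch of that arithmetic, using only the figures reported above (the variable names are illustrative, not from the project code):

```python
# Counts copied from the figures reported in this issue.
answered = 27_324
not_answered = 218_428

# Sanity check: the two classes should add up to the labeled sample count.
total = answered + not_answered

positive_rate = answered / total                  # fraction of label == 1
negatives_per_positive = not_answered / answered  # class imbalance ratio

print(f"total labeled samples: {total}")
print(f"positive rate: {positive_rate:.4f}")
print(f"negatives per positive: {negatives_per_positive:.2f}")
```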

6182 users answered at least one question
23 users answered more than 50 questions
690 users answered more than 10 questions

5877 questions answered at least once
28 questions answered more than 30 times
705 questions answered more than 10 times
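
Counts like the ones above could be reproduced with a simple per-key tally. The sketch below assumes each labeled sample is a `(question_id, user_id, label)` tuple with `label == 1` meaning "answered"; that layout, the helper names `answered_counts` and `more_than`, and the toy rows are assumptions for illustration, not the project's actual code or data.

```python
from collections import Counter

def answered_counts(rows, key_index):
    """Count answered (label == 1) rows per key.

    rows: iterable of (question_id, user_id, label) tuples (assumed layout).
    key_index: 0 to group by question, 1 to group by user.
    """
    counts = Counter()
    for row in rows:
        if row[2] == 1:
            counts[row[key_index]] += 1
    return counts

def more_than(counts, threshold):
    """Number of keys whose answered count is strictly above threshold."""
    return sum(1 for c in counts.values() if c > threshold)

# Toy rows standing in for the real training file.
rows = [
    ("q1", "u1", 1),
    ("q1", "u2", 0),
    ("q2", "u1", 1),
    ("q2", "u3", 1),
    ("q3", "u2", 0),
]

by_user = answered_counts(rows, key_index=1)
by_question = answered_counts(rows, key_index=0)

print("users with at least one answer:", len(by_user))
print("users with more than 1 answer:", more_than(by_user, 1))
print("questions answered at least once:", len(by_question))
print("questions answered more than 1 time:", more_than(by_question, 1))
```

On the real data the thresholds would be 10, 30 and 50 rather than 1.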

30467 TEST SAMPLES

Probability that a user answers a question on a later occurrence, given that they did not answer it the first time: 0.029131121643
Probability that a user does not answer a question on a later occurrence, given that they did not answer it the first time: 0.970868878357

There is no case in which a user answers the same question a second time.

Sample Data stats:

Number of distinct users in the samples, whether or not the question was answered: 27127 / 28763
Number of distinct questions in the samples, whether or not the question was answered: 7708 / 8095

Most common question, whether or not it was answered: '8cc470e1c655b5bbf6e8684509b44205' (asked 1016 times in the given sample)
Most common user: 'd66397df46f4e33cb608c322f751d884' (110 entries in the given sample)
Least common user: '09d89cf0a43005b22b015b24fe8b29ad' (1 entry in the given sample)
Least common question: '09698971cfdcca1b0eb9fd444edc596f' (1 entry in the given sample)
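
The frequency figures above (distinct users and questions, most and least common entries) can be obtained with `collections.Counter`. This is a sketch under the same assumed `(question_id, user_id, label)` row layout; the toy rows are made up, and when several keys tie for the minimum, `min` simply returns the first one encountered.

```python
from collections import Counter

# Toy rows standing in for the real training file (assumed layout).
rows = [
    ("q1", "u1", 1),
    ("q1", "u2", 0),
    ("q1", "u3", 0),
    ("q2", "u1", 1),
    ("q3", "u1", 0),
]

# Count every occurrence, regardless of the label.
question_counts = Counter(q for q, _user, _label in rows)
user_counts = Counter(u for _question, u, _label in rows)

# Coverage: how many distinct ids appear in the samples at all.
distinct_questions = len(question_counts)
distinct_users = len(user_counts)

# most_common(1) returns a list with one (key, count) pair.
most_common_question = question_counts.most_common(1)[0]
most_common_user = user_counts.most_common(1)[0]

# Least common entries; ties are broken by first appearance.
least_common_question = min(question_counts.items(), key=lambda kv: kv[1])
least_common_user = min(user_counts.items(), key=lambda kv: kv[1])

print(distinct_questions, distinct_users)
print(most_common_question, most_common_user)
print(least_common_question, least_common_user)
```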

The training sample appears to be skewed: 218,428 of the 245,752 labels are 0 (not answered). Adding features derived from these labels (1/0) may increase the skew in our features.