Data analysis

Question

smadha opened this issue 8 years ago · comments

245752 LABELED SAMPLES
8095 questions
28763 users

27324 samples answered (label 1)
218428 samples not answered (label 0)

6182 users answered at least one question
23 users answered more than 50 questions
690 users answered more than 10 questions

5877 questions answered at least once
28 questions answered more than 30 times
705 questions answered more than 10 times

30467 TEST SAMPLES
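
For anyone who wants to reproduce counts like the ones above, here is a minimal sketch. It assumes the labeled samples are already loaded as `(question_id, user_id, label)` tuples with label 1 meaning "answered"; the function name `summarize` and that tuple layout are my own assumptions, not the actual file format.

```python
from collections import Counter


def summarize(samples):
    """Summarize (question_id, user_id, label) tuples; label 1 = answered (assumed layout)."""
    samples = list(samples)
    questions = {q for q, _, _ in samples}
    users = {u for _, u, _ in samples}
    # Keep only the (question, user) pairs that carry a positive label.
    answered = [(q, u) for q, u, label in samples if label == 1]
    per_user = Counter(u for _, u in answered)
    per_question = Counter(q for q, _ in answered)
    return {
        "samples": len(samples),
        "questions": len(questions),
        "users": len(users),
        "answered": len(answered),
        "not_answered": len(samples) - len(answered),
        "users_with_answer": len(per_user),
        "users_over_10": sum(1 for n in per_user.values() if n > 10),
        "questions_with_answer": len(per_question),
        "questions_over_10": sum(1 for n in per_question.values() if n > 10),
    }
```

Changing the `> 10` threshold to 30 or 50 gives the other buckets listed in the post.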

Kushaan Kumar · Answer 1 · Mon Oct 24 2016 06:34:12 GMT+0800 (China Standard Time)

Probability of user answering question again if they didn't answer it the first time: 0.029131121643
Probability of user not answering the question again if they didn't answer it the first time: 0.970868878357

Kushaan Kumar · Answer 2 · Mon Oct 24 2016 06:34:28 GMT+0800 (China Standard Time)

There isn't a case where the user answers the same question again

Arpita Agrawal · Answer 3 · Wed Oct 26 2016 07:03:18 GMT+0800 (China Standard Time)

Sample Data stats:

Users appearing in the sample, whether or not they answered: 27127 / 28763
Questions appearing in the sample, whether or not they were answered: 7708 / 8095

Most common question, whether or not it was answered: '8cc470e1c655b5bbf6e8684509b44205' (asked 1016 times in the sample)
Most common user: 'd66397df46f4e33cb608c322f751d884' (110 entries)
Least common user: '09d89cf0a43005b22b015b24fe8b29ad' (1 entry)
Least common question: '09698971cfdcca1b0eb9fd444edc596f' (1 entry)
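
The most and least common IDs can be pulled out with `collections.Counter`. This is only a sketch: `extremes` is a hypothetical helper, and it expects a plain list of question IDs or user IDs, one per sample row.

```python
from collections import Counter


def extremes(ids):
    """Return ((most_common_id, count), (least_common_id, count)) for a list of IDs."""
    counts = Counter(ids)
    # most_common() sorts by count, highest first; the last entry is a least common ID.
    ranked = counts.most_common()
    return ranked[0], ranked[-1]
```

Many IDs may be tied at a single entry, so the least common ID returned is just one of them.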

Arpita Agrawal · Answer 4 · Wed Oct 26 2016 07:06:04 GMT+0800 (China Standard Time)

The training sample seems to be skewed: only 27324 of the 245752 labeled samples (about 11%) are answered. Features that we derive from these labels (1/0) can carry that skew over and increase the skewness in our features.
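
To put a number on the skew, here is a small sketch using the label counts from the opening post (27324 answered, 218428 not answered). The helper name `label_balance` is hypothetical.

```python
def label_balance(positives, negatives):
    """Return (positive_rate, negatives_per_positive) for a binary-labeled sample."""
    total = positives + negatives
    return positives / total, negatives / positives


# Counts taken from the opening post of this thread.
rate, ratio = label_balance(27324, 218428)
print(f"positive rate: {rate:.3f}, negatives per positive: {ratio:.2f}")
```

With those counts, roughly one sample in nine is positive, which is worth keeping in mind when building any feature from the labels.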