AliMorty / Text-Classification

In this project, we used three metrics (Information Gain, Mutual Information, Chi-Squared) to find important words, then used those words for the classification task and compared the results.



un-lock-me opened this issue · comments


Why did you treat each line of a text file as a document?


Hi! It is because in my dataset the documents were separated by '\n', so each line is one document.
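
For readers following along, here is a minimal sketch of that one-document-per-line convention. The function name `split_documents` and the sample text are hypothetical and are not taken from the repository; the sketch also assumes blank lines should be skipped, which the thread does not confirm.

```python
def split_documents(raw_text):
    """Treat each non-empty line of raw_text as one document."""
    # Assumption: blank lines are separators only and carry no document.
    return [line.strip() for line in raw_text.split("\n") if line.strip()]


# Hypothetical sample: two documents separated by '\n'.
docs = split_documents("first document text\nsecond document text\n")
```

With this layout, `len(docs)` would be the total document count used in the later probability estimates.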


Thanks for replying. It would be nice if you could share a sample of your dataset.


I am having difficulty making sense of this part:
p_class_condition_on_not_w[i] = (count_of_that_class[i]-tmp[i])/(number_of_docs-word_occurance_frequency)
Would you mind explaining why you calculated (1-p)log(1-p) this way?
Why did you bring in the number of documents?

What confuses me is that I have 20 classes with 1000 documents each. As I understand it, I should not need the number of documents, because the only thing that matters here is the frequency of a word in each class versus its frequency across all classes.
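
One possible reading of the quoted line, offered as a sketch rather than as the author's confirmed intent: if `tmp[i]` is the number of class-i documents that contain the word (an assumption; the thread does not say), then the numerator counts class-i documents without the word and the denominator counts all documents without the word. The ratio would then estimate P(class i | word absent), which is why `number_of_docs` appears. The function below reuses the variable names from the quoted line; `entropy` and `information_gain` are hypothetical helper names.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(count_of_that_class, tmp):
    """Information gain of one word from document-level presence/absence.

    count_of_that_class[i]: number of documents in class i
    tmp[i]: ASSUMED to be the number of class-i documents containing the word
    """
    number_of_docs = sum(count_of_that_class)
    word_occurance_frequency = sum(tmp)  # documents that contain the word
    docs_without_word = number_of_docs - word_occurance_frequency

    p_class = [c / number_of_docs for c in count_of_that_class]
    p_w = word_occurance_frequency / number_of_docs

    # P(class i | word present); empty if the word never occurs.
    p_class_condition_on_w = (
        [t / word_occurance_frequency for t in tmp]
        if word_occurance_frequency else []
    )
    # P(class i | word absent): the expression quoted in the question.
    p_class_condition_on_not_w = (
        [(c - t) / docs_without_word
         for c, t in zip(count_of_that_class, tmp)]
        if docs_without_word else []
    )

    return (entropy(p_class)
            - p_w * entropy(p_class_condition_on_w)
            - (1 - p_w) * entropy(p_class_condition_on_not_w))
```

Under this reading, a word that appears in every document of one class and in no other document gets a high score, while a word spread evenly across classes scores zero. An estimate based on raw word frequencies per class, as the question suggests, is a different event model and would give different numbers.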
