AliMorty / Text-Classification

In this project, we used three different metrics (Information Gain, Mutual Information, and Chi-Squared) to find important words, then used those words for the classification task. We compared the results at the end.

processing

un-lock-me opened this issue

commented

Why did you treat each line of a text file as a document?

Thanks.

Hi! It is because in my dataset the documents were separated by '\n', so each line is one document.
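
For readers without access to the dataset, here is a minimal sketch of that convention. The function name `split_into_documents` and the sample text are hypothetical and are not taken from the repository; the only thing assumed from the thread is that one line equals one document.

```python
def split_into_documents(raw_text):
    """Treat every non-empty line of raw_text as one document."""
    # Blank lines are skipped so they do not become empty documents.
    return [line.strip() for line in raw_text.split("\n") if line.strip()]


# Hypothetical corpus: three documents separated by '\n', plus one blank line.
sample = "the cat sat\nstocks fell sharply\n\nrain is expected\n"
documents = split_into_documents(sample)
```

If a document in your own data can span several lines, this split would not apply and you would need a different delimiter.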

commented

Thanks for replying. It would be nice if you could share a sample of your dataset.

commented

I am having difficulty making sense of this part:
p_class_condition_on_not_w[i] = (count_of_that_class[i]-tmp[i])/(number_of_docs-word_occurance_frequency)
Do you mind explaining why you calculated (1-p)log(1-p) in this way?
Why does the number of documents enter the calculation?

What confuses me is that I have 20 classes with 1000 documents each. As I understand it, I should not need the number of documents, because the only thing that matters here is the frequency of a word in each class versus its frequency across all classes.

Thanks.
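
One possible reading of the quoted line, offered as a sketch and not as the author's confirmed intent: if `tmp[i]` is the number of documents of class i that contain the word and `word_occurance_frequency` is the number of documents that contain it at all, then the expression is the document-level estimate of P(class i | word absent). Its numerator counts the class-i documents without the word and its denominator counts all documents without the word, which is why `number_of_docs` appears. The function and variable names below are hypothetical, and the counts are invented for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(docs_per_class, docs_with_word_per_class):
    """Information gain of one word, estimated from document counts.

    docs_per_class[i]           -- number of documents in class i
    docs_with_word_per_class[i] -- documents in class i that contain the word
    """
    n_docs = sum(docs_per_class)
    n_with = sum(docs_with_word_per_class)
    n_without = n_docs - n_with  # documents that do not contain the word

    p_class = [c / n_docs for c in docs_per_class]
    # P(class | word present): restrict attention to documents with the word.
    p_class_given_w = (
        [k / n_with for k in docs_with_word_per_class] if n_with else []
    )
    # P(class | word absent): same form as the line quoted in the question.
    p_class_given_not_w = (
        [(c - k) / n_without
         for c, k in zip(docs_per_class, docs_with_word_per_class)]
        if n_without else []
    )

    p_w = n_with / n_docs
    return (
        entropy(p_class)
        - p_w * entropy(p_class_given_w)
        - (1 - p_w) * entropy(p_class_given_not_w)
    )


# Invented counts for the setup in the question: 20 classes, 1000 documents each.
docs_per_class = [1000] * 20
docs_with_word = [800] + [10] * 19  # the word is concentrated in class 0
gain = information_gain(docs_per_class, docs_with_word)
```

Under this reading, the document counts matter even when every class has the same size: the absence of a word is also evidence about the class, and the probability of each class among the documents lacking the word can only be estimated by dividing by how many such documents there are.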