processing

Question

un-lock-me opened this issue 6 years ago · comments

mg commented 6 years ago

Why did you treat each line of a text file as a document?

Thanks.

Ali Mortazavi · Answer 1 · Wed Nov 14 2018 18:23:14 GMT+0800 (China Standard Time)

Hi! Actually it is because in my dataset every document was separated by '\n'.

mg · Answer 2 · Thu Nov 15 2018 00:30:59 GMT+0800 (China Standard Time)

Thanks for replying. It would be nice if you could share a sample of your dataset.

mg · Answer 3 · Fri Nov 16 2018 00:32:01 GMT+0800 (China Standard Time)

I am having difficulty making sense of this part:
p_class_condition_on_not_w[i] = (count_of_that_class[i]-tmp[i])/(number_of_docs-word_occurance_frequency)
Would you mind explaining why you calculated (1-p)log(1-p) this way?
Why did you bring the number of documents into it?

What confuses me is that I have 20 classes with 1000 documents each, but as I understand it I do not need to consider the number of documents, because the only thing that matters here is the frequency of words in each class versus the frequency of words in all classes.

Thanks.

Ali Mortazavi · Answer 4 · Fri Nov 16 2018 23:34:28 GMT+0800 (China Standard Time)

You're welcome! Actually, my dataset is a little weird :D But it has a class label for each document. The structure of my dataset is something like this: Class@@@@its Context '\n' Class@@@@its Context '\n'
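As an illustration of the layout described above (one document per line, with the class label in front of the text), here is a minimal parsing sketch. The function name parse_labelled_lines, the sample lines, and the 4-character separator are assumptions made for this example and are not taken from the repository; the thread also mentions a 10-character run of '@', so the separator should be set to whatever the data actually uses.

```python
from collections import Counter

# Assumed separator between the class label and the document text.
# Adjust it to match the real data file.
SEPARATOR = "@@@@"


def parse_labelled_lines(lines, separator=SEPARATOR):
    """Return a list of (label, text) pairs, one per non-empty line."""
    docs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cls, sep, text = line.partition(separator)
        if not sep:
            # partition() returns an empty separator when it was not found
            raise ValueError(f"separator {separator!r} not found in line: {line!r}")
        docs.append((cls.strip(), text.strip()))
    return docs


# Hypothetical sample: each line is one document, prefixed by its class.
sample = [
    "sports@@@@the match ended in a draw\n",
    "politics@@@@the vote was postponed\n",
    "sports@@@@the team signed a new striker\n",
]

docs = parse_labelled_lines(sample)
docs_per_class = Counter(label for label, _ in docs)
print(docs_per_class)
```

With this layout the class of a document comes from the prefix of its own line, so a file can hold many documents for each class without any extra grouping.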

On Wed, Nov 14, 2018 at 8:01 PM saria Goudarzvand wrote:

> Thanks for replying. It would be nice if you could share a sample of your dataset. So you mean cls, sep, text = line.partition('@@@@@@@@@@'): you have a file in which the documents are separated by @@@? What about your classes? Suppose you have five classes with 1000 documents in each. How did you differentiate classes from documents in your source data?

-- Ali Mortazavi BSc graduated in Computer Engineering | Amirkabir University of Technology (Tehran Polytechnic) http://ceit.aut.ac.ir/~mortazavi, ali_mortazavi@aut.ac.ir

Ali Mortazavi · Answer 5 · Fri Nov 16 2018 23:48:26 GMT+0800 (China Standard Time)

(count_of_that_class[i]-tmp[i]) is the number of documents in class i that do not contain the word w[j].

(number_of_docs-word_occurance_frequency) is the number of documents, across all classes, that do not contain w[j].

Dividing the first number by the second can be interpreted as the probability of class[i] occurring among the documents that do not contain w[j].

Please note that j does not appear in the code because it is not necessary: NumPy arrays handle all words at once through vectorization. For instance, tmp is a 2-d array and tmp[i] is a 1-d array.
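The computation described above can be sketched on a toy example. The variable names follow the line quoted in the question, but the presence matrix, the labels, and the way tmp and count_of_that_class are built here are assumptions for illustration; the repository may construct them differently. Each document is counted once per word (present or absent), which is why document counts appear in both the numerator and the denominator.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are words.
# presence[d, j] is 1 if document d contains word j, else 0.
presence = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
labels = np.array([0, 0, 1, 1, 1])  # class index of each document
number_of_classes = 2

number_of_docs = presence.shape[0]
# Number of documents (over all classes) that contain each word.
word_occurance_frequency = presence.sum(axis=0)

# Number of documents in each class.
count_of_that_class = np.bincount(labels, minlength=number_of_classes)

# tmp[i, j]: number of documents of class i that contain word j.
tmp = np.zeros((number_of_classes, presence.shape[1]))
for i in range(number_of_classes):
    tmp[i] = presence[labels == i].sum(axis=0)

# P(class i | word j absent), computed for every word j at once.
# Caveat: a word that occurs in every document would give a zero denominator.
p_class_condition_on_not_w = np.zeros_like(tmp)
for i in range(number_of_classes):
    p_class_condition_on_not_w[i] = (
        (count_of_that_class[i] - tmp[i])
        / (number_of_docs - word_occurance_frequency)
    )

print(p_class_condition_on_not_w)
```

For each word, the values over all classes sum to 1, as expected of a conditional distribution over classes given that the word is absent.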
