yoonkim / CNN_sentence

CNNs for sentence classification

Confused with vocab in process_data.py, need heeeeeeelp

Larry955 opened this issue · comments

I'm a newbie in sentiment analysis, and recently I've been trying to apply a CNN to it. Yoon's paper has helped me a lot and I really appreciate that.

I want to understand every piece of code in this repo, but I ran into trouble reading process_data.py. The variable vocab is a dictionary, and I expected it to store how often each word occurs in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon uses a set to store the words in each line, which means duplicate words are removed. In that case, how can we count the number of times each word occurs?

    vocab = defaultdict(float)   # dict mapping each word to its frequency
    with open(pos_file, "rb") as f:
        for line in f:       
            rev = []
            rev.append(line.strip())
            if clean_string:
                orig_rev = clean_str(" ".join(rev))
            else:
                orig_rev = " ".join(rev).lower()
            words = set(orig_rev.split()) # set() removes duplicate words within the current line

Can anybody help me? Thanks a lot!

It seems we get frequency per review: vocab counts how many reviews contain each word, not how many times the word occurs in total.
A word W is more likely to be an indicator of bad reviews if it appears in many bad reviews than if it appears many times in a single review.
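
A minimal sketch of the difference between the two ways of counting. It is written in Python 3 rather than the Python 2 the repo targets, and the two toy reviews and the names doc_freq and term_freq are made up for illustration; they are not from process_data.py:

```python
from collections import defaultdict

# Toy reviews, made up for illustration (not from the MR dataset).
reviews = ["bad bad bad movie", "bad plot"]

doc_freq = defaultdict(float)   # number of reviews containing each word
term_freq = defaultdict(float)  # total occurrences across all reviews

for rev in reviews:
    tokens = rev.lower().split()
    for word in set(tokens):    # set() drops repeats within one review
        doc_freq[word] += 1
    for word in tokens:         # no set(): every occurrence counts
        term_freq[word] += 1

print(doc_freq["bad"], term_freq["bad"])  # 2.0 4.0
```

"bad" occurs four times in total but in only two reviews, so the set-based count is 2.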

This count is used later when adding 'unknown words'.
If you scroll down the code you'll find:

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.    
    0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25,0.25,k)  

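Here we skip words that appear in fewer than min_df reviews. Note that with the default min_df=1 the check passes for every word in vocab, since each word appears in at least one review.
I think the intent would have been clearer with a higher threshold.
For example: filter out words that appear in fewer than 10 reviews.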
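
To see what min_df does in practice, here is a small sketch that reuses the function quoted above on a toy vocabulary. The counts in vocab and the zero vector standing in for a pre-trained embedding are hypothetical values chosen for illustration:

```python
import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    # Same logic as the function quoted above.
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

# Hypothetical per-review counts: word -> number of reviews containing it.
vocab = {"bad": 12.0, "plot": 3.0, "movie": 40.0}
# Stand-in for a pre-trained vector; real ones would come from word2vec.
pretrained = {"movie": np.zeros(300)}

vecs_default = dict(pretrained)
add_unknown_words(vecs_default, vocab)             # default min_df=1
vecs_strict = dict(pretrained)
add_unknown_words(vecs_strict, vocab, min_df=10)   # higher threshold

print(sorted(vecs_default))  # ['bad', 'movie', 'plot']
print(sorted(vecs_strict))   # ['bad', 'movie']
```

With min_df=10, "plot" (3 reviews) gets no random vector, while "bad" (12 reviews) does; "movie" keeps its pre-trained vector in both cases.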

@talevy23
Thanks a lot!! Your explanation really inspired me and cleared up my confusion. Filtering out words that appear in fewer than 10 (or any other number of) reviews makes sense. So we can conclude that the code only cares about how many reviews a word appears in, not how often it appears within a single review, right?