yoonkim / CNN_sentence

CNNs for sentence classification

sentence length and padding

attardi opened this issue

Why do you pad all sentences to the same length, currently fixed to 56?
It should not be necessary, since in the paper you say that the "pooling scheme naturally deals with variable sentence lengths".
Shouldn't padding depend on filter size?
Right now it is fixed at 5 in the call to
make_idx_data_cv(revs, word_idx_map, i, max_l=56, k=300, filter_h=5)
BTW: k is not used.

It's because we do SGD with mini-batches, and each mini-batch contains sentences of varying lengths, so they are padded to a common length. One could instead sort/group the batches by sentence length, in which case there would be no need to pad (as is often done in NMT).
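To make the two options concrete, here is a minimal sketch, not the repository's actual code. It assumes index 0 is reserved for padding and that each side of a sentence gets filter_h - 1 padding positions; `pad_sentence`, `group_by_length` and `pad_id` are hypothetical names used only for illustration. The values max_l=56 and filter_h=5 are the ones from the call quoted above.

```python
from collections import defaultdict


def pad_sentence(word_ids, max_l=56, filter_h=5, pad_id=0):
    """Pad a list of word indices to max_l + 2 * (filter_h - 1) entries.

    Assumption: pad_id (0 here) is an index reserved for padding.
    """
    pad = filter_h - 1
    # Leading padding lets the widest filter slide over the first word.
    padded = [pad_id] * pad + list(word_ids)
    # Trailing padding fills the row up to the common length.
    # If the sentence has more than max_l words, nothing is added and the
    # row ends up LONGER than the others (it is not truncated here).
    padded += [pad_id] * (max_l + 2 * pad - len(padded))
    return padded


def group_by_length(sentences):
    """Alternative to padding: bucket sentences so each batch has one length."""
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    return dict(buckets)


row = pad_sentence([3, 7, 9])
buckets = group_by_length([[1, 2], [3], [4, 5]])
```

With the first function every row has the same length, so a mini-batch can be stored as one integer matrix. With the second, each bucket can be turned into its own batch without any padding beyond what the filter width needs.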

A follow-up question: if the allowed sentence length n is greater than the actual length of a sentence, what are the vectors for the remaining positions? Are they set to zero, or are their elements given random values?

Traceback (most recent call last):
  File "conv_net_sentence.py", line 311, in <module>
    datasets = make_idx_data_cv(revs, word_idx_map, i, max_l=56,k=300, filter_h=5)
  File "conv_net_sentence.py", line 283, in make_idx_data_cv
    train = np.array(train,dtype="int")
ValueError: setting an array element with a sequence.
Following your code, I ran into this error. Is it caused by the same thing you're talking about?

You should change the line train = np.array(train,dtype="int") to the following:
train = np.array(train,dtype="object")
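One caveat: np.array with an integer dtype raises this ValueError when the rows in the list do not all have the same length. Switching to dtype="object" makes the error go away, but the result is then an array of Python lists rather than a rectangular integer matrix, which may cause trouble later. A likely cause (an assumption, not confirmed in this thread) is that some sentences are longer than max_l. Below is a rough sketch for finding the offending rows and forcing them to a common length. `find_ragged_rows`, `fit_to_length` and the sample `rows` are hypothetical and exist only for illustration. If the last element of each row is a label, it would need to be split off before truncating.

```python
from collections import Counter


def find_ragged_rows(rows):
    """Return the most common row length and the (index, length) of every other row."""
    lengths = [len(r) for r in rows]
    expected = Counter(lengths).most_common(1)[0][0]
    odd = [(i, n) for i, n in enumerate(lengths) if n != expected]
    return expected, odd


def fit_to_length(row, length, pad_id=0):
    """Truncate or right-pad a row so it has exactly `length` entries."""
    row = list(row)[:length]
    return row + [pad_id] * (length - len(row))


# Made-up index rows: the third one is too long.
rows = [[0, 0, 5, 8, 0, 0], [0, 0, 4, 0, 0, 0], [0, 0, 1, 2, 3, 4, 9, 0, 0]]
expected, odd = find_ragged_rows(rows)
fixed = [fit_to_length(r, expected) for r in rows]
```

Once every row has the same length, the original dtype="int" call should work on `fixed` without the dtype change. Raising max_l to cover the longest sentence in the data is another option.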