sentence length and padding

Question

attardi opened this issue 9 years ago

Why do you pad all sentences to the same length, currently fixed to 56?
It should not be necessary, since in the paper you say that the "pooling scheme naturally deals with variable sentence lengths".
Shouldn't padding depend on filter size?
Right now it is fixed at 5 in the call to
make_idx_data_cv(revs, word_idx_map, i, max_l=56, k=300, filter_h=5)
BTW: k is not used.

Yoon Kim · Answer 1 · Tue Jan 12 2016 15:25:05 GMT+0800 (China Standard Time)

It's because we do SGD with mini-batches, and each mini-batch contains sentences of varying lengths, so they are padded to a common length to fit in one matrix. One could instead sort/group the batches by sentence length, and then there would be no need to pad (as is often done in NMT).
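
The two options above can be sketched roughly as follows. This is only an illustration, not the repository's code: pad_to_fixed_length and group_by_length are hypothetical helper names, and using index 0 as the padding index is an assumption. The leading pad of filter_h - 1 shows how the padding can depend on the filter size.

```python
from collections import defaultdict


def pad_to_fixed_length(idx_sentences, max_l, filter_h, pad_idx=0):
    """Pad each word-index sequence to max_l + 2 * (filter_h - 1) entries.

    pad_idx=0 is an assumed padding index, not taken from the original code.
    """
    pad = filter_h - 1              # padding tied to the filter height
    target = max_l + 2 * pad        # common length shared by every row
    padded = []
    for sent in idx_sentences:
        if len(sent) > max_l:
            raise ValueError("sentence is longer than max_l")
        row = [pad_idx] * pad + list(sent)        # leading padding
        row += [pad_idx] * (target - len(row))    # trailing padding
        padded.append(row)
    return padded


def group_by_length(idx_sentences):
    """Bucket sentences by length so that a batch drawn from one bucket needs no padding."""
    buckets = defaultdict(list)
    for sent in idx_sentences:
        buckets[len(sent)].append(list(sent))
    return dict(buckets)


sentences = [[4, 8, 15], [16, 23], [42, 7, 9]]
padded = pad_to_fixed_length(sentences, max_l=5, filter_h=3)
buckets = group_by_length(sentences)
```

With padding, every row has the same length and the whole data set fits in one integer matrix. With grouping, each bucket is rectangular by construction, at the cost of batches that are no longer drawn uniformly at random.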

Kwan Yuet Stephen Ho · Answer 2 · Wed Sep 14 2016 04:16:11 GMT+0800 (China Standard Time)

A follow-up question: if the allowed sentence length n is greater than the actual length of a sentence, what vectors are used for the remaining positions? Are they set to zero, or are their elements given random values?

yingying.huang · Answer 3 · Mon Dec 09 2019 10:01:14 GMT+0800 (China Standard Time)

Traceback (most recent call last):
  File "conv_net_sentence.py", line 311, in <module>
    datasets = make_idx_data_cv(revs, word_idx_map, i, max_l=56,k=300, filter_h=5)
  File "conv_net_sentence.py", line 283, in make_idx_data_cv
    train = np.array(train,dtype="int")
ValueError: setting an array element with a sequence.
Following your code, I ran into the error above. Is it caused by the same issue you're talking about?

moses9591 · Answer 4 · Tue Sep 22 2020 15:50:43 GMT+0800 (China Standard Time)

You should change the line train = np.array(train,dtype="int") as follows:
train = np.array(train,dtype="object")
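
For context, NumPy raises this ValueError when the rows passed to np.array have different lengths. A likely cause here (an assumption, not confirmed in this thread) is that some sentences are longer than max_l, so their rows end up longer than the rest. The sketch below reproduces the error on toy data and compares the dtype="object" change with truncating or padding the rows first. fit_to_length is a hypothetical helper, not part of conv_net_sentence.py.

```python
import numpy as np


def fit_to_length(rows, length, pad_idx=0):
    """Truncate or right-pad every row to exactly `length` entries.

    pad_idx=0 is an assumed padding index.
    """
    fitted = []
    for row in rows:
        row = list(row[:length])                            # drop anything past `length`
        fitted.append(row + [pad_idx] * (length - len(row)))  # fill short rows
    return fitted


ragged = [[1, 2, 3, 4], [5, 6]]  # toy rows of unequal length

try:
    np.array(ragged, dtype="int")
    raised = False
except ValueError:
    raised = True  # same kind of error as in the traceback above

# dtype="object" avoids the error, but the result is a 1-D array of Python lists.
as_object = np.array(ragged, dtype="object")

# Making the rows rectangular first yields a regular 2-D integer array.
as_int = np.array(fit_to_length(ragged, 4), dtype="int")
```

Note that the object array has shape (2,) rather than (2, 4), so later code that slices columns or expects an integer matrix may still fail. Making the rows equal in length, or raising max_l to cover the longest sentence, keeps the array rectangular.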