Lab41 / pythia

Supervised learning for novelty detection in text

Home Page: http://lab41.github.io/pythia/


Type and expected format of some featurization techniques unclear

pcallier opened this issue

The first three arguments to bow() are str, str, and list; all three should contain normalized text with stop words removed. The second argument is currently unnormalized.
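A minimal sketch of the fix: both text arguments should pass through the same stop-word-removing normalization before reaching bow(). The helper below is hypothetical (a stand-in for the repo's actual normalize pipeline), with a deliberately tiny stop-word list:

```python
import re

# Tiny illustrative stop-word set; the real pipeline's list is larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "over"}

def normalize_no_stop(text):
    """Lowercase, strip punctuation, and drop stop words.
    Hypothetical stand-in for the repo's normalization, not its actual code."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

# Both str arguments to bow() should be normalized identically before the call:
doc = normalize_no_stop("The quick brown fox jumps over the lazy dog.")
ref = normalize_no_stop("A dog chased the fox!")
```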

The first two arguments to st() are str and list. The skipthoughts code itself performs normalization very similar to ours (tokenization plus adding spaces), so it looks like we send it the right input.

The first two arguments to lda() are str and str; both should be normalized with stop words removed. It looks like we send it the right input.

However, in preprocess.py we train LDA via build_lda() on the CountVectorizer results for an array of raw body_text entries. I think we should instead train LDA on the CountVectorizer results for an array of normalized, stop-word-free entries.
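The train/inference mismatch can be seen with a toy count vectorizer (pure-Python sketch, not the repo's build_lda()): a vocabulary built from raw text contains stop words that the normalized inference input will never produce, so those LDA dimensions are wasted:

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # illustrative subset

def tokenize(text, remove_stops=False):
    """Toy tokenizer mimicking CountVectorizer's lowercasing word split."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

raw_corpus = ["The cat sat on the mat", "A dog and the cat"]

# Vocabulary from raw text (what build_lda currently trains on):
raw_vocab = {t for doc in raw_corpus for t in tokenize(doc)}

# Vocabulary from normalized text (what lda() receives at inference time):
norm_vocab = {t for doc in raw_corpus for t in tokenize(doc, remove_stops=True)}

# Training vocabulary entries the inference input can never activate:
mismatch = raw_vocab - norm_vocab
```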

The first two arguments to cnn() are list and list; both are normalized with stop words removed, but since they use one_hot encoding they should likely include stop words (the one_hot vocabulary contains stop words and punctuation). Training is also mismatched, as the model is trained on raw data, so fixes are needed in two places. (This may be why cnn() is performing so poorly.)
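A toy stand-in for a Keras-style one_hot encoder (hash each token to an index) shows why stripping stop words at featurization time while training on raw text is a problem: the sequences differ in length and alignment, so the CNN sees inputs unlike anything it trained on:

```python
import re
import zlib

STOP_WORDS = {"the", "a", "an"}  # illustrative subset

def one_hot_indices(text, vocab_size=50, remove_stops=False):
    """Hash each token to an index in [0, vocab_size).
    Toy stand-in for a one_hot encoder, not the actual cnn() code path."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stops:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [zlib.crc32(t.encode()) % vocab_size for t in tokens]

sentence = "The cat sat on the mat"
raw_seq = one_hot_indices(sentence)                       # what training saw
stripped_seq = one_hot_indices(sentence, remove_stops=True)  # what cnn() gets
```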

The first two arguments to wordonehot() are list and list, and both are raw. There is XML normalization before the text is sent to one_hot, so this looks correct.

The first two arguments to w2v() are str and str; we currently send a raw document alongside normalized background document text (stop words and punctuation removed).

Our w2v() approach analyzes the first and last sentences of the document, with follow-on variants in run_w2v(), run_w2v_elemwise(), and run_w2v_matrix(). All three variants use punkt to find the first and last sentences of the text passed in (the document on one call, the background text on the next) and then call normalize.remove_stop_words() on the resulting sentences.
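The sentence-selection step can be sketched as follows. The naive regex split here is only a stand-in for punkt sentence tokenization; the point is that after selecting the sentences, keeping stop words preserves the context windows a word2vec model was trained with:

```python
import re

def first_and_last_sentences(text):
    """Return the first and last sentences of a document.
    Naive split on sentence-final punctuation; a stand-in for punkt."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    return sentences[0], sentences[-1]

doc = "The fox ran. It was fast. The dog gave up."
first, last = first_and_last_sentences(doc)
# Keeping stop words here preserves the context windows word2vec expects;
# removing them (as run_w2v currently does) alters those windows.
```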

Two potential issues:

  • Both original inputs to w2v() should be normalized via xml_normalize(), keeping stop words and punctuation
  • At least run_w2v() and run_w2v_elemwise() should not be removing stop words; this is possibly also true for run_w2v_matrix()
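The first bullet's proposed normalization might look like the sketch below. This is a guess at xml_normalize()'s behavior (strip markup, collapse whitespace, keep stop words and punctuation); the actual implementation lives in the repo's normalize module:

```python
import re

def xml_normalize_sketch(text):
    """Hypothetical sketch of xml_normalize(): strip tags and collapse
    whitespace, but retain stop words and punctuation."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

cleaned = xml_normalize_sketch("<p>The  fox, quickly, ran.</p>")
```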