Allow multiple tokens per feature data row
de-code opened this issue
This is carried over from #90 (comment)
Since the segmentation data uses the first two tokens of a line, it would make sense for DeLFT to have an option to use both. Currently it only uses the first one.
Potential solution:
- an option to specify the columns with the tokens (similar to the features)
- concatenate the word embeddings and other token-related vectors
Probably need to change a few places that expect a single token as an input.
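To make the idea concrete, here is a minimal sketch of the two bullet points above. The function names (`parse_row`, `embed_tokens`), the `token_indices` parameter, and the toy embedding table are all hypothetical, not DeLFT's actual API; the point is just that configurable token columns can be parsed out of a row and their embedding vectors concatenated into one input vector.

```python
import numpy as np

# Hypothetical sketch: a feature data row where the first two columns
# are tokens (as in the segmentation model) and the rest are features.
# The token column indices are configurable, similar to feature columns.

def parse_row(line, token_indices=(0, 1)):
    """Split a space-separated row into its tokens and remaining features."""
    cells = line.rstrip("\n").split(" ")
    tokens = [cells[i] for i in token_indices]
    features = [c for i, c in enumerate(cells) if i not in token_indices]
    return tokens, features

def embed_tokens(tokens, lookup, dim=4):
    """Concatenate the embedding vectors of all tokens in the row."""
    vectors = [lookup.get(t, np.zeros(dim)) for t in tokens]
    return np.concatenate(vectors)

# toy lookup table standing in for real word embeddings
lookup = {"Introduction": np.ones(4), "1": np.full(4, 0.5)}

tokens, features = parse_row("Introduction 1 LINESTART ALLCAP")
vec = embed_tokens(tokens, lookup)
print(tokens, features, vec.shape)  # two tokens -> one (8,) vector
```

Downstream layers that currently expect a single token vector would then receive this wider concatenated vector, which is one of the "few places" that would need changing.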
/cc @kermitt2 @lfoppiano
I have now implemented something in: elifesciences/sciencebeam-trainer-delft#185
I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (which has line numbers removed). There seems to be about a 1 percentage point difference.
We can think about it once the features channel is merged.
Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end), which is then tokenized within DeLFT, or not tokenized if it is only using character features. I could see a slight improvement with max chars 30, for example.
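A rough sketch of that whole-line feature, under stated assumptions: the function names (`line_feature`, `char_encode`) and the alphabet are hypothetical, not the actual implementation; the only detail taken from the description above is that the full line is appended as one extra feature and capped at a maximum character length (e.g. 30) before character-level encoding.

```python
# Hypothetical sketch of treating the whole line as an extra feature
# (appended at the end), truncated before character-level encoding.

MAX_CHARS = 30  # the "max chars 30" setting mentioned above

def line_feature(line, max_chars=MAX_CHARS):
    """Return the whole line, truncated, as a single feature string."""
    return line[:max_chars]

def char_encode(text, alphabet):
    """Map each character to an index for a char-embedding channel (0 = unknown)."""
    return [alphabet.get(c, 0) for c in text]

alphabet = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}

feat = line_feature("introduction to the experiments section")
ids = char_encode(feat, alphabet)
print(len(feat), len(ids))  # both capped at 30
```

Capping the line length keeps the character channel a fixed, small size, which may explain why a modest cap like 30 characters already showed a slight improvement.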
Related PRs: