Allow multiple tokens per feature data row
de-code opened this issue
This is carried over from #90 (comment)
Since the segmentation data uses the first two tokens of a line, it would make sense for DeLFT to have an option to use both. Currently it only uses the first one.
Potential solution:
- an option to specify the columns with the tokens (similar to the features)
- concatenate the word embeddings and other token-related vectors
Probably need to change a few places that expect a single token as an input.
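To make the idea concrete, here is a minimal sketch of the two bullet points above. The function names (`parse_row`, `embed_tokens`), the `token_indices` parameter, and the toy embedding table are all hypothetical, not DeLFT's actual API; the point is just that configurable token columns can be parsed out of a row and their embedding vectors concatenated into one input vector.

```python
import numpy as np

# Hypothetical sketch: a feature data row where the first two columns
# are tokens (as in the segmentation model) and the rest are features.
# The token column indices are configurable, similar to feature columns.

def parse_row(line, token_indices=(0, 1)):
    """Split a space-separated row into its tokens and remaining features."""
    cells = line.rstrip("\n").split(" ")
    tokens = [cells[i] for i in token_indices]
    features = [c for i, c in enumerate(cells) if i not in token_indices]
    return tokens, features

def embed_tokens(tokens, lookup, dim=4):
    """Concatenate the embedding vectors of all tokens in the row."""
    vectors = [lookup.get(t, np.zeros(dim)) for t in tokens]
    return np.concatenate(vectors)

# toy lookup table standing in for real word embeddings
lookup = {"Introduction": np.ones(4), "1": np.full(4, 0.5)}

tokens, features = parse_row("Introduction 1 LINESTART ALLCAP")
vec = embed_tokens(tokens, lookup)
print(tokens, features, vec.shape)  # two tokens -> one (8,) vector
```

Downstream layers that currently expect a single token vector would then receive this wider concatenated vector, which is one of the "few places" that would need changing.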
/cc @kermitt2 @lfoppiano
I have now implemented something in: elifesciences/sciencebeam-trainer-delft#185
I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (which has line numbers removed). There seems to be about a 1 percentage point difference.
We can think about it once the features channel is merged.
Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end), which is then tokenized within DeLFT, or not tokenized if it is only using character features. I could see a slight improvement with max chars 30, for example.
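A rough sketch of that whole-line feature, under stated assumptions: the function names (`line_feature`, `char_encode`) and the alphabet are hypothetical, not the actual implementation; the only detail taken from the description above is that the full line is appended as one extra feature and capped at a maximum character length (e.g. 30) before character-level encoding.

```python
# Hypothetical sketch of treating the whole line as an extra feature
# (appended at the end), truncated before character-level encoding.

MAX_CHARS = 30  # the "max chars 30" setting mentioned above

def line_feature(line, max_chars=MAX_CHARS):
    """Return the whole line, truncated, as a single feature string."""
    return line[:max_chars]

def char_encode(text, alphabet):
    """Map each character to an index for a char-embedding channel (0 = unknown)."""
    return [alphabet.get(c, 0) for c in text]

alphabet = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}

feat = line_feature("introduction to the experiments section")
ids = char_encode(feat, alphabet)
print(len(feat), len(ids))  # both capped at 30
```

Capping the line length keeps the character channel a fixed, small size, which may explain why a modest cap like 30 characters already showed a slight improvement.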
Related PRs: