kermitt2 / delft

a Deep Learning Framework for Text

Allow multiple tokens per feature data row

de-code opened this issue · comments

This is carried over from #90 (comment)

Since the segmentation data uses the first two tokens of a line, it would make sense to have an option to use both in DeLFT. Currently only the first token is used.

Potential solution:

  • an option to specify the columns with the tokens (similar to the features)
  • concatenate the word embeddings and other token related vectors

Probably need to change a few places that expect a single token as an input.
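A minimal sketch of the concatenation idea, assuming configurable token columns. The names `row_vector`, `token_columns`, and `lookup_embedding` are illustrative only, not existing DeLFT APIs:

```python
import numpy as np

def row_vector(row, token_columns, lookup_embedding, feature_vector):
    """Build the input vector for one data row.

    row              : list of column values for the row (tokens + features)
    token_columns    : indices of the columns holding tokens, e.g. [0, 1]
    lookup_embedding : function mapping a token string to its word embedding
    feature_vector   : pre-computed vector for the remaining feature columns
    """
    # look up the word embedding of every configured token column ...
    token_vectors = [lookup_embedding(row[i]) for i in token_columns]
    # ... and concatenate them with the other token-related feature vectors
    return np.concatenate(token_vectors + [feature_vector])

# usage (illustrative): first two columns are tokens, rest are features
# x = row_vector(["Abstract", "abstract", "BLOCKSTART"], [0, 1], emb, feats)
```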

/cc @kermitt2 @lfoppiano

I have now implemented something in: elifesciences/sciencebeam-trainer-delft#185

I also included low-level results. I am not sure whether they are conclusive, as I only have a single run with the updated dataset (which has line numbers removed). There seems to be a difference of about 1 percentage point.

We can think about it once the features channel is merged.

Related to that, for the segmentation model I have now implemented an optional feature where I add the whole line as a separate feature (at the end), which is then tokenized within DeLFT, or left untokenized if only character features are used. I could see a slight improvement with a maximum of 30 characters, for example.
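A rough sketch of the "whole line as an extra feature" variant under character features, assuming the line text is clipped to a maximum length and mapped to character ids. The function and parameter names are hypothetical, not DeLFT's actual API:

```python
def encode_line_chars(line, char_index, max_chars=30, pad_id=0, unk_id=1):
    """Encode the raw line text as a fixed-length sequence of character ids."""
    # keep only the first `max_chars` characters of the line
    ids = [char_index.get(c, unk_id) for c in line[:max_chars]]
    # right-pad so every row yields a vector of the same length
    ids += [pad_id] * (max_chars - len(ids))
    return ids

# usage (illustrative): a small character index built from the training data
# encode_line_chars("Abstract This paper presents ...", {"A": 2, "b": 3})
```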

Related PRs: