decouple reading input and process_sentence

amir-zeldes opened this issue 6 years ago · comments

Refactor to read all sentences first, then send them one at a time to process_sentence in xrenner_xrenner.py. This will allow document-level logic, such as genre classification over all tokens or a document-wide pass for entity recognition using a CRF.

Amir Zeldes · Answer 1 · Thu Jun 21 2018 05:20:56 GMT+0800 (China Standard Time)

@loganpeng1992: this is an important architectural change which will allow us to incorporate document-level features in prediction. We'll do this in a couple of steps to make things less complicated. For now, step 1:

1. Look at the code here: https://github.com/amir-zeldes/xrenner/blob/master/xrenner/modules/xrenner_xrenner.py#L135-L179
2. Try debugging a run through a sample document to see what it does.
3. The goal is to first read in all sentences, and only call process_sentence on each sentence after they have all been read:
   - Note that variables like current_sentence and tokoffset are needed by process_sentence and change along the way, so you will have to move their calculation to the 'lower' loop calling process_sentence.
   - The purpose of the new 'higher' loop is to have a chance to see all tokens in the document before any coref/entity processing has actually been done, which happens in process_sentence.

No need to add any new functionality in step 1: just try to decouple reading the input for all sentences, so that we can call process_sentence on each of them later.
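
As a rough illustration of the two-loop structure described above, here is a minimal Python sketch. It is not the actual xrenner code: the blank-line-separated, tab-delimited input format, the helper names read_sentences and process_document, and the three-argument process_sentence callback are all assumptions made for the example. It also assumes current_sentence is a 1-based sentence counter and tokoffset is the number of tokens in all preceding sentences.

```python
from typing import Callable, List, Tuple

# One token row from a tab-separated parse (hypothetical input format).
Token = Tuple[str, ...]


def read_sentences(lines: List[str]) -> List[List[Token]]:
    """'Higher' loop: collect every sentence before any processing happens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(tuple(line.split("\t")))
        elif current:
            # A blank line closes the sentence being collected.
            sentences.append(current)
            current = []
    if current:
        # The input may not end with a blank line.
        sentences.append(current)
    return sentences


def process_document(lines: List[str], process_sentence: Callable) -> list:
    """Read everything first, then call process_sentence once per sentence."""
    sentences = read_sentences(lines)

    # All tokens are visible here, before any per-sentence processing.
    # Document-level logic could be added at this point in a later step.
    all_tokens = [tok for sent in sentences for tok in sent]
    assert len(all_tokens) == sum(len(sent) for sent in sentences)

    # 'Lower' loop: the per-sentence counters are now computed here
    # instead of while reading the input.
    results = []
    tokoffset = 0
    for current_sentence, sent in enumerate(sentences, start=1):
        results.append(process_sentence(sent, current_sentence, tokoffset))
        tokoffset += len(sent)
    return results


# Usage with a stand-in callback that only reports what it was given.
sample = ["1\tThe\t2", "2\tdog\t0", "", "1\tIt\t2", "2\tbarks\t0"]
report = process_document(sample, lambda sent, n, off: (n, off, len(sent)))
print(report)  # [(1, 0, 2), (2, 2, 2)]
```

Keeping the reading step free of side effects means step 1 changes no behavior: process_sentence still receives the same values in the same order, only later.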

Amir Zeldes · Answer 2 · Thu Jun 21 2018 05:21:20 GMT+0800 (China Standard Time)

Another important thing: use the develop branch for this, not Chinese-dev.

Amir Zeldes · Answer 3 · Wed Aug 19 2020 23:59:56 GMT+0800 (China Standard Time)

Implemented in V2.1.