Properly blend entity guesser features

amir-zeldes opened this issue 6 years ago · comments

In xrenner_marker, guessing entities from affix morphology, syntax, prior likelihood, and potentially a CRF should be done using a classifier. This should ideally be done after #90.

Amir Zeldes · Answer 1 · Thu Jun 21 2018 05:27:59 GMT+0800 (China Standard Time)

This part will become more useful once Logan does #88, but we can already get started on this. We'll break it into a couple of steps to make things simple:

1. Check out the develop branch (not Chinese-dev)
2. Look at the code here and see what it does on a document in the debugger: https://github.com/amir-zeldes/xrenner/blob/master/xrenner/modules/xrenner_marker.py#L268-L291

This code is responsible for disambiguating ambiguous entity types (e.g. 'star' can be a person, or a celestial object). Currently it just adds up the probabilities assigned by different kinds of evidence:
* Morphological, i.e. substrings (words in -er are often 'person')
* Dependency-based (subjects of 'said' are often 'person', patients of 'eat' are often 'object', etc.)
* Similarity-based (using most similar words from embeddings)
* Bias for default entity (usually 'abstract', but configurable)

Ideally we'd like a classifier to get all of these scores and learn to decide, rather than just believing whoever shouts the loudest. In a further step we'll add a document-wide pass using a CRF classifier (#90).
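
To make the current behavior concrete, here is a minimal sketch of additive evidence combination. It is not the actual xrenner code: the function name `sum_evidence`, the source labels, the bias value, and the scores are all made up for illustration.

```python
from collections import defaultdict


def sum_evidence(evidence, default_entity="abstract", default_bias=0.125):
    """Add up per-entity scores from every evidence source and pick the max.

    `evidence` maps a (hypothetical) source name to {entity_type: score}.
    The default entity gets a fixed bias, mirroring the 'bias for default
    entity' idea; the bias value here is an arbitrary assumption.
    """
    totals = defaultdict(float)
    for source_scores in evidence.values():
        for entity, score in source_scores.items():
            totals[entity] += score
    totals[default_entity] += default_bias
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Illustrative scores for an ambiguous head word such as 'star'.
evidence = {
    "morph": {"person": 0.5, "object": 0.25},
    "dep": {"person": 0.25, "object": 0.75},
    "sim": {"person": 0.25, "object": 0.5},
}
best, totals = sum_evidence(evidence)
```

In this toy input the dependency evidence dominates, so 'object' wins even though the morphological evidence prefers 'person'. A classifier could instead learn how much weight each source deserves.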

So for now, the first task is to add code to dump out what xrenner is seeing to a text file with feature values and run that on data for which we have a gold answer, such as GUM. The next task will be to train a classifier on this data, and finally integrate it into the framework so that models which include a classifier can consult it at this point.
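
The dump step could look roughly like the sketch below, which writes one tab-separated row per markable. The column names and the `dump_features` helper are assumptions for illustration; the real feature set would come from whatever resolve_mark_entity() actually computes.

```python
import csv
import io

# Hypothetical column layout: identifiers, a few feature scores, and the
# entity type that the existing summing logic predicted.
COLUMNS = ["doc", "token_id", "head", "morph_person", "dep_person",
           "sim_person", "predicted"]


def dump_features(rows, out_file, fieldnames=COLUMNS):
    """Write feature dicts as TSV; missing features default to 0.0."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n", restval="0.0")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage with an in-memory buffer; a real run would pass an open text file.
buffer = io.StringIO()
dump_features(
    [{"doc": "doc1", "token_id": 3, "head": "star",
      "morph_person": 0.25, "dep_person": 0.5, "predicted": "person"}],
    buffer,
)
dump_text = buffer.getvalue()
```

Keeping a document name and token ID in every row is a design choice that makes it possible to look up the gold answer later.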

Amir Zeldes · Answer 2 · Thu Jun 21 2018 05:29:48 GMT+0800 (China Standard Time)

This can be our checklist:

* Code to dump feature values in resolve_mark_entity()
* Script to look up gold answer and add to dump file
* Train classifier
* Integrate classifier into resolve_mark_entity()
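
For the second checklist item, a gold lookup script might work along these lines. This is only a sketch: it assumes the dump is tab-separated with `doc` and `token_id` columns and that gold entity types are available as a dictionary keyed on those two values, none of which is settled in this thread.

```python
import csv
import io


def add_gold_labels(dump_file, gold, out_file, key_fields=("doc", "token_id")):
    """Copy dump rows to out_file, appending a gold_entity column.

    `gold` maps (doc, token_id) string tuples to an entity type. Rows with
    no gold answer are skipped, since they cannot be used for training.
    Returns the number of rows written.
    """
    reader = csv.DictReader(dump_file, delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["gold_entity"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    matched = 0
    for row in reader:
        label = gold.get(tuple(row[k] for k in key_fields))
        if label is None:
            continue
        row["gold_entity"] = label
        writer.writerow(row)
        matched += 1
    return matched


# Toy data: two dumped markables, only one of which has a gold answer.
dump = io.StringIO(
    "doc\ttoken_id\thead\tmorph_person\n"
    "doc1\t3\tstar\t0.25\n"
    "doc1\t9\ttable\t0.0\n"
)
gold = {("doc1", "3"): "person"}
labeled = io.StringIO()
n_matched = add_gold_labels(dump, gold, labeled)
```

The labeled file can then serve as training data for whichever classifier is chosen in the third step.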