Properly blend entity guesser features
amir-zeldes opened this issue · comments
In xrenner_marker, guessing entities from affix morphology, syntax, prior likelihood, and potentially CRF should be done using a classifier. Ideally this should be done after #90.
This part will become more useful once Logan does #88, but we can already get started on this. We'll break it into a couple of steps to make things simple:
- Check out the develop branch (not Chinese-dev)
- Look at the code here and see what it does on a document in the debugger: https://github.com/amir-zeldes/xrenner/blob/master/xrenner/modules/xrenner_marker.py#L268-L291
- This code is responsible for disambiguating ambiguous entity types (e.g. 'star' can be a person, or a celestial object). Currently it just adds up the probabilities assigned by different kinds of evidence:
* Morphological, i.e. substrings (words ending in -er are often 'person')
* Dependency based (subjects of 'said' are often 'person', the patient of 'eat' is often 'object', etc.)
* Similarity based (using most similar words from embeddings)
* Bias for default entity (usually 'abstract', but configurable)
- Ideally we'd like a classifier to get all of these scores and learn to decide, rather than just believe whoever shouts the strongest. In a further step we'll add a document-wide pass using a CRF classifier (#90)
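To make the "adds up the probabilities" behavior concrete, here is a minimal sketch of that kind of additive scoring. It is not the actual code in xrenner_marker.py: the function name `add_up_evidence`, the source keys (`morph`, `dep`, `sim`), the bias value and the numbers are all made up for illustration.

```python
from collections import defaultdict


def add_up_evidence(evidence, default_entity="abstract", default_bias=0.1):
    """Sum the probabilities from every evidence source per entity type.

    `evidence` maps a source name to {entity_type: probability}.
    The bias value is an arbitrary illustrative number.
    """
    totals = defaultdict(float)
    for source_scores in evidence.values():
        for entity_type, prob in source_scores.items():
            totals[entity_type] += prob
    # The default entity gets a small head start.
    totals[default_entity] += default_bias
    # Whoever "shouts the strongest" in total wins.
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Invented scores for an ambiguous head word.
evidence = {
    "morph": {"person": 0.6, "object": 0.2},
    "dep": {"person": 0.3, "object": 0.5},
    "sim": {"object": 0.4},
}
best, totals = add_up_evidence(evidence)
print(best, totals)
```

Note that every source counts equally here, which is exactly what a trained classifier could improve on by learning how much to trust each kind of evidence.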
So for now, the first task is to add code to dump out what xrenner is seeing to a text file with feature values and run that on data for which we have a gold answer, such as GUM. The next task will be to train a classifier on this data, and finally integrate it into the framework so that models which include a classifier can consult it at this point.
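As a rough sketch of the first two steps (dumping feature values, then attaching gold answers), something like the following could work. The column names, the tab-separated layout, the `(doc, start, end)` key and the `_` placeholder for unmatched mentions are assumptions for illustration, not an existing xrenner or GUM format.

```python
import csv
import io

# Hypothetical columns: mention position plus one score per evidence source.
FIELDS = ["doc", "start", "end", "head", "morph_person", "dep_person", "sim_person"]


def dump_features(mentions, out):
    """Write one tab-separated row of feature values per mention."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for mention in mentions:
        writer.writerow(mention)


def add_gold(dump, gold, out):
    """Copy the dump and append a gold column, looked up by (doc, start, end)."""
    reader = csv.DictReader(dump, delimiter="\t")
    writer = csv.DictWriter(out, fieldnames=FIELDS + ["gold"], delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        key = (row["doc"], int(row["start"]), int(row["end"]))
        # Mentions with no gold counterpart are marked "_" so they can be filtered later.
        row["gold"] = gold.get(key, "_")
        writer.writerow(row)


# Invented example data; in practice the dump would go to a file on disk.
mentions = [
    {"doc": "doc1", "start": 3, "end": 4, "head": "star",
     "morph_person": 0.1, "dep_person": 0.7, "sim_person": 0.4},
    {"doc": "doc1", "start": 10, "end": 11, "head": "dinner",
     "morph_person": 0.5, "dep_person": 0.0, "sim_person": 0.1},
]
gold = {("doc1", 3, 4): "person"}

dump_buffer = io.StringIO()
dump_features(mentions, dump_buffer)
dump_buffer.seek(0)
merged_buffer = io.StringIO()
add_gold(dump_buffer, gold, merged_buffer)
lines = merged_buffer.getvalue().splitlines()
print("\n".join(lines))
```

Keeping the dump and the gold lookup as separate steps means the dump code inside xrenner does not need to know anything about the gold data.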
This can be our checklist:
- Code to dump feature values in resolve_mark_entity()
- Script to look up gold answer and add to dump file
- Train classifier
- Integrate classifier into resolve_mark_entity()
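For the last two checklist items, here is a toy sketch of training on dumped features and then consulting the result. The issue does not say which classifier to use; a small multiclass perceptron stands in here only because it fits in a few lines, and a real model would more likely come from an established library. The feature names, labels and training rows are invented.

```python
from collections import defaultdict


def score(weights, label, features):
    """Weighted sum of the feature values for one candidate label."""
    return sum(weights[label][name] * value for name, value in features.items())


def predict(weights, labels, features):
    """Return the label with the highest score (first label wins ties)."""
    return max(labels, key=lambda label: score(weights, label, features))


def train_perceptron(examples, labels, epochs=10):
    """Stand-in learner: a plain multiclass perceptron over feature dicts."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(weights, labels, features)
            if guess != gold:
                # Pull the gold label's weights up and the wrong guess's weights down.
                for name, value in features.items():
                    weights[gold][name] += value
                    weights[guess][name] -= value
    return weights


labels = ["object", "person"]
# Invented rows: in the first one the dependency cue is louder (0.7 vs 0.6),
# so simply adding up scores would pick 'object', but gold says 'person'.
examples = [
    ({"morph_person": 0.6, "dep_object": 0.7}, "person"),
    ({"morph_person": 0.1, "dep_object": 0.7}, "object"),
]
weights = train_perceptron(examples, labels)
predictions = [predict(weights, labels, features) for features, _ in examples]
print(predictions)
```

At the integration point, a model that ships with trained weights could call something like `predict` instead of summing scores, while models without a classifier keep the current behavior.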