amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

Properly blend entity guesser features

amir-zeldes opened this issue

In xrenner_marker, guessing entities from affix morphology, syntax, prior likelihood, and potentially a CRF should be done using a classifier. Ideally this should be done after #90.

This part will become more useful once Logan does #88, but we can already get started on this. We'll break it into a couple of steps to keep things simple:

  1. Check out the develop branch (not Chinese-dev)
  2. Look at the code here and see what it does on a document in the debugger: https://github.com/amir-zeldes/xrenner/blob/master/xrenner/modules/xrenner_marker.py#L268-L291
  3. This code is responsible for disambiguating ambiguous entity types (e.g. 'star' can be a person or a celestial object). Currently it just adds up the probabilities assigned by different kinds of evidence:
    * Morphological, i.e. substrings (words ending in -er are often 'person')
    * Dependency-based (subjects of 'said' are often 'person', patients of 'eat' are 'object', etc.)
    * Similarity-based (using the most similar words from embeddings)
    * Bias for default entity (usually 'abstract', but configurable)
  4. Ideally we'd like a classifier to get all of these scores and learn to decide, rather than just believing whoever shouts the loudest. In a further step we'll add a document-wide pass using a CRF classifier (#90).
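
To make step 3 concrete, here is a rough sketch of what "adding up the probabilities" amounts to. This is not the actual code in xrenner_marker.py: the function name `additive_guess`, the source names (`morph`, `dep`, `sim`) and all of the numbers are made up for illustration.

```python
from collections import defaultdict


def additive_guess(evidence, default_entity="abstract", default_bias=0.1):
    """Sum per-entity scores over all evidence sources and pick the largest.

    `evidence` maps a (hypothetical) source name to {entity_type: probability}.
    The default entity gets a small head start, standing in for the bias.
    """
    totals = defaultdict(float)
    totals[default_entity] += default_bias
    for source_scores in evidence.values():
        for entity, prob in source_scores.items():
            totals[entity] += prob
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Invented scores for an ambiguous head noun such as 'star'
evidence = {
    "morph": {"person": 0.2, "object": 0.1},
    "dep": {"person": 0.6},
    "sim": {"object": 0.5, "person": 0.1},
}
best, totals = additive_guess(evidence)
print(best, totals)
```

The weakness is visible here: one large score from a single source can swamp everything else, and nothing is learned about which sources are reliable for which entity types.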

So for now, the first task is to add code to dump out what xrenner is seeing to a text file with feature values and run that on data for which we have a gold answer, such as GUM. The next task will be to train a classifier on this data, and finally integrate it into the framework so that models which include a classifier can consult it at this point.
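
A minimal sketch of what the dump could look like, assuming one tab-separated row per mention with one column per (evidence source, entity type) pair and a final gold column. The column layout, the helper names `feature_row` and `dump_rows`, the mention IDs and the gold lookup dictionary are all assumptions, not existing xrenner code or the GUM format.

```python
import csv
import io

ENTITY_TYPES = ["person", "object", "abstract"]  # assumed subset of entity labels
SOURCES = ["morph", "dep", "sim"]  # hypothetical evidence source names


def feature_row(mention_id, head, evidence, gold=None):
    """Flatten the per-source scores for one mention into a single flat row."""
    row = {"id": mention_id, "head": head}
    for source in SOURCES:
        for entity in ENTITY_TYPES:
            row[f"{source}_{entity}"] = evidence.get(source, {}).get(entity, 0.0)
    row["gold"] = gold if gold is not None else "_"  # "_" marks a missing gold label
    return row


def dump_rows(rows, handle):
    """Write the rows as tab-separated values with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0].keys()), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical gold answers keyed by mention ID (in practice looked up from gold data)
gold_lookup = {"doc1_m1": "person"}

rows = [
    feature_row(
        "doc1_m1",
        "star",
        {"morph": {"object": 0.1}, "dep": {"person": 0.6}},
        gold_lookup.get("doc1_m1"),
    )
]
buffer = io.StringIO()  # stands in for an open text file
dump_rows(rows, buffer)
print(buffer.getvalue())
```

Keeping the gold label as a separate, final column means the dump step and the gold lookup step can stay in two separate scripts, as in the checklist.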

This can be our checklist:

  • Code to dump feature values in resolve_mark_entity()
  • Script to look up gold answer and add to dump file
  • Train classifier
  • Integrate classifier into resolve_mark_entity()
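
For the last two checklist items, here is a toy sketch of training on dumped rows and then consulting the result only when a model actually has one. The classifier type is not decided in this issue; a plain multiclass perceptron is used here purely as a stand-in because it fits in a few lines. The names `train_perceptron`, `predict` and `resolve_with_model`, the feature names and the training rows are all hypothetical.

```python
def predict(weights, features):
    """Return the label whose weight vector scores the features highest."""
    def score(label):
        return sum(weights[label].get(name, 0.0) * value
                   for name, value in features.items())
    # sorted() makes tie-breaking deterministic (alphabetically first label wins)
    return max(sorted(weights), key=score)


def train_perceptron(rows, labels, epochs=10):
    """Multiclass perceptron: on each error, reward gold and penalize the guess."""
    weights = {label: {} for label in set(labels)}
    for _ in range(epochs):
        for features, gold in zip(rows, labels):
            guess = predict(weights, features)
            if guess != gold:
                for name, value in features.items():
                    weights[gold][name] = weights[gold].get(name, 0.0) + value
                    weights[guess][name] = weights[guess].get(name, 0.0) - value
    return weights


def resolve_with_model(features, weights=None, default_entity="abstract"):
    """Consult the classifier if the model ships one, otherwise fall back.

    The fallback here just returns the default entity; in the real code it
    would be the existing additive logic.
    """
    if weights is None:
        return default_entity
    return predict(weights, features)


# Invented training rows in the spirit of the feature dump
rows = [
    {"dep_person": 0.6, "morph_object": 0.1},
    {"sim_object": 0.5, "morph_person": 0.2},
    {"morph_abstract": 0.3},
]
labels = ["person", "object", "abstract"]
weights = train_perceptron(rows, labels)

print(resolve_with_model(rows[0], weights))
print(resolve_with_model(rows[0]))  # no classifier in the model
```

Making the classifier optional keeps existing models working unchanged, since they simply take the fallback path.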