amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer



asharkinasuit opened this issue

As I dig into the concept of a pronoun, it becomes murky what exactly should be included in the pronoun list. Taking the English model as an example, it seems to include at least personal, reflexive and possessive pronouns, plus some demonstrative pronouns ("these" is a notable absence). It leaves out several other categories, such as interrogative and indefinite pronouns (see also Wikipedia). With the assumptions of the tool in mind, which classes of pronoun would be best to include in the list?

Yes, you're right, and thanks for spotting the missing 'these' - I'll add that.

The question of what should be in the list is not one you can answer without a specific goal in mind. What language are you working on? For the English model, the exclusion of interrogatives etc. follows the most commonly used guidelines for English, those of the OntoNotes corpus, though some aspects of the model have been customized to comply with the GUM corpus instead (this is the purpose of the override file). Part of the purpose of xrenner is to allow such choices to be made just by changing a line in a configuration file, rather than by supplying a new training corpus with those items included or excluded.

In other words, you can absolutely add interrogatives, or anything else you like! It all depends on your target annotation scheme.
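To make the idea concrete, here is a rough sketch of treating the pronoun list as something derived from whichever classes your annotation scheme covers. This is not xrenner's actual configuration format or API: the `PRONOUN_CLASSES` table, the `build_pronoun_list` helper, and the word lists are all hypothetical and deliberately incomplete, purely for illustration.

```python
# Hypothetical illustration only: these class names, word lists, and the
# helper below are assumptions, not xrenner's real configuration format.
PRONOUN_CLASSES = {
    "personal": ["I", "you", "he", "she", "it", "we", "they"],
    "reflexive": ["myself", "yourself", "himself", "herself", "itself"],
    "possessive": ["my", "your", "his", "her", "its", "our", "their"],
    "demonstrative": ["this", "that", "these", "those"],
    "interrogative": ["who", "whom", "whose", "which", "what"],
}


def build_pronoun_list(included_classes):
    """Return a sorted, de-duplicated, lowercased list for the chosen classes."""
    words = set()
    for name in included_classes:
        words.update(word.lower() for word in PRONOUN_CLASSES[name])
    return sorted(words)


# A scheme that leaves interrogatives out.
default_scheme = build_pronoun_list(
    ["personal", "reflexive", "possessive", "demonstrative"]
)

# A scheme whose guidelines also treat interrogatives as markables.
extended_scheme = build_pronoun_list(list(PRONOUN_CLASSES))
```

Switching schemes then means changing which classes are listed, not retraining anything; the resulting words would still need to be written out in whatever format the model's own files expect.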

Thank you for clearing that up. I've been working on a model for Dutch, so I'll have a look into what requirements that would introduce.

I don't know much about existing Dutch annotation guidelines, but if there aren't any, the TüBa-D/Z guidelines for German might be a good starting point, as might the coreference-annotated Potsdam Commentary Corpus (PCC).