amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer


Hasa/isa ontology

asharkinasuit opened this issue

It seems the hasa.tab and isa.tab are the rudiments of an ontology with knowledge about entities and categories of entities. For instance, using these lists you can derive that an actor is a man, and a man has a body, but also that a person has a body and that a man is a person. On the other hand, the ontology seems somewhat incomplete because you can't derive that a member of parliament is a person, to name just one example.
Two questions arise.

  • Wouldn't it be better not to include facts that can be derived transitively? If an actor is a man and a man is a person, then from [person HASA body] we already know an actor has a body, making [man HASA body] obsolete (see the sketch after this list).
  • Wouldn't this run the risk of exploding if it is to be anything like a complete ontology? I suppose we might try to limit what's included based on some heuristic about what information could conceivably be needed for the specific purpose of coreference resolution, but then what would the heuristic be? (Maybe frequency in a given corpus could be one option, so one would only include things most relevant with respect to how often they occur, but then what corpus is to be used?)
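
To make the transitivity idea concrete, here is a minimal sketch (in Python, with made-up facts; I'm not assuming xrenner's actual file format or data structures):

```python
# A minimal sketch of the transitive derivation described above, assuming
# isa/hasa facts are plain (child, parent) and (owner, part) string pairs.
# The names and facts here are illustrative, not taken from xrenner itself.
from collections import defaultdict

isa = {("actor", "man"), ("man", "person")}
hasa = {("person", "body")}

def isa_ancestors(entity, isa_pairs):
    """Return all categories reachable from `entity` via ISA links."""
    parents = defaultdict(set)
    for child, parent in isa_pairs:
        parents[child].add(parent)
    seen, stack = set(), [entity]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def derived_hasa(entity, isa_pairs, hasa_pairs):
    """HASA facts inherited through the ISA hierarchy."""
    categories = {entity} | isa_ancestors(entity, isa_pairs)
    return {part for owner, part in hasa_pairs if owner in categories}

print(derived_hasa("actor", isa, hasa))  # {'body'}: [man HASA body] would be redundant
```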

There are a few different answers to these questions, theoretical and practical.

The practical answer is that isa and hasa information, rather than a complete ontology, is relatively simple to obtain and relatively useful for the system. There is little or no manual curation of these resources, so redundancies of the type 'actor HASA body' and 'person HASA body' are not really a concern, since they're not harmful. There is also some redundancy with entity_deps, BTW, but not all of the hasa information comes from possessive dependencies (a lot of it comes from high-accuracy patterns in large Web corpus data).

The more theoretical answer is that even if an actor HASA body, this doesn't mean we are likely to mention it: facts about the world and what languages do in practice are correlated, but not the same. Maybe more importantly, it's possible that x HASA y, yet speakers are unlikely to use a pronoun and say 'its y' when the antecedent of the possessor is x. The benefit of this corpus-derived data is that, if it was collected using a good strategy, it may correlate with actual usage of pronominalized possessors. So in a way, a rationalistic ontology could be missing usage information that the purely harvested data does represent.

But really, practicality is what motivated the design choice so far, as well as the availability of PPDB in the case of isa. Then again, a lot of these ideas come from students, and then we implement something more idiosyncratic (like the dynamic hasa facility). When we do error analysis, I like asking: "as a human being, how would you know to get this right?", and then thinking about how we can make that information available to the system. So if you have ideas about expanding these, please let us know!

I was wondering about how several of the .tab files seem to include frequency information, because that makes them dependent on a specific corpus. In my case I haven't done much with that, except when running the make_entity_deps script. To me it feels like something is off if you propose a model for a language but then base it on a specific corpus, although I can see that there isn't much you can do about it, since you have to start somewhere, and if it's a decent-sized corpus, that's probably enough. By the way, if there is also something like a make_hasa script that you used, I would of course be interested in that, though it seems Dutch has fewer resources available than English :)
Interesting to know the theory behind it, which of course makes sense: there are a lot more relations you could define in an ontology than are interesting to talk about or mention. I actually had a look at the corpus I've been using to glean info for my model, but it seemed that extracting info from possessive dependencies would probably be rather too impure: many things that are possessed are specific to a given context and don't seem like the kind of thing you'd care to note explicitly as compatible with a HASA relationship. (Then again, I guess that's why you would include the frequency, but it would still seem to add an unnecessarily long tail of low-frequency items to the list.)

The frequency information is helpful for disambiguation. It's true that these absolute counts correspond to specific corpora, but that's just laziness on my part: you can imagine that they are normalized by corpus size and represent probabilities. I don't think that makes them irrelevant, BTW: I think the way humans do coreference resolution also factors in a lot of prior probabilities and plausibility checks.
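
For example, you can picture the normalization like this (purely a sketch, assuming a hypothetical tab-separated layout of possessor, possessed and raw count; the actual hasa.tab columns may differ):

```python
# Hypothetical example: turn raw hasa counts into relative frequencies.
# Assumes a tab-separated file with columns: possessor, possessed, count.
from collections import Counter

pair_counts = Counter()
with open("hasa.tab", encoding="utf8") as f:
    for line in f:
        possessor, possessed, count = line.rstrip("\n").split("\t")
        pair_counts[(possessor, possessed)] += int(count)

total = sum(pair_counts.values())
probs = {pair: n / total for pair, n in pair_counts.items()}
print(probs.get(("person", "body"), 0.0))  # corpus-relative probability
```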

More concretely, if you have some data and you're training a classifier using utils/make_classifier.py, it can learn how much importance to attach to features such as hasa in different situations. So far I've mainly been using scale-invariant learning approaches (e.g. gradient boosting), but you can also scale and normalize your data and use regression or neural networks, or whatever you like. In that case I would recommend passing features like hasa, lexdep and lexsimdep through a StandardScaler. Depending on the learning approach you take, you don't necessarily have to worry about a long tail or irrelevant features, since the classifier will learn how much weight to give them. And besides, if something is attested in that tail, it should still be preferred over something that isn't attested at all!
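
For illustration, here's a minimal scikit-learn sketch of those two options (not xrenner's actual training code; X and y are random stand-ins for whatever feature matrix and labels you export):

```python
# A rough sketch of the two options discussed above, using scikit-learn.
# X stands in for a numeric feature matrix with columns such as hasa,
# lexdep and lexsimdep; y stands in for binary coreference labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((200, 3))      # placeholder for [hasa, lexdep, lexsimdep]
y = rng.integers(0, 2, 200)   # placeholder labels

# Option 1: gradient boosting is scale-invariant, so raw counts are fine.
gbm = GradientBoostingClassifier().fit(X, y)

# Option 2: for regression or neural networks, pass count features
# like hasa through a StandardScaler first.
logreg = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

print(gbm.predict(X[:5]), logreg.predict(X[:5]))
```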

As for building hasa, I basically generated it from English using CQP on some large Web corpora, so there isn't a script at the moment. I literally just downloaded the matches from queries such as "X's Y" or "Y of X". I imagine you could get similar results for Dutch by searching a tagged Web corpus for something like "N van N" and variants.
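
A rough first pass in Python over a tagged corpus might look like this (entirely a sketch: the one-token-per-line word/TAG format, the N-prefixed tagset and the filename are all assumptions, and variants with determiners like "het lichaam van de man" would need extra handling):

```python
# Hypothetical sketch: harvest Dutch hasa candidates ("Y van X" => X HASA Y)
# from a POS-tagged corpus in a simple word<TAB>TAG one-token-per-line format.
from collections import Counter

def read_tokens(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                word, tag = line.rstrip("\n").split("\t")
                yield word, tag

hasa_counts = Counter()
tokens = list(read_tokens("corpus.tagged"))
for i in range(len(tokens) - 2):
    (w1, t1), (w2, _), (w3, t3) = tokens[i : i + 3]
    # Match the bare trigram pattern "N van N", e.g. "deur van huis"
    if t1.startswith("N") and w2.lower() == "van" and t3.startswith("N"):
        hasa_counts[(w3.lower(), w1.lower())] += 1  # (possessor, possessed)

for (possessor, possessed), n in hasa_counts.most_common(20):
    print(f"{possessor}\t{possessed}\t{n}")
```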