knowitall / taggers

Easily identify and label sentence intervals using various taggers.

Handle recursive types

schmmd opened this issue · comments

It would be great if taggers could be recursive, i.e. build on each other's output. For example, we would first tag "Number", and then a date tagger could refer to the tokens spanned by a "Number" type.

This would be an addition to the PatternTagger. The OpenRegex library must operate over a sequence of tokens. Presently, this is a Seq[Lemmatized[ChunkedToken]]. To use type information, we would need this to operate over a Seq[TypedToken].

case class TypedToken(token: Lemmatized[ChunkedToken], types: Set[Type])

The types collection would need to contain all types that overlap that token position. It might be additionally helpful to know which types start and end on this particular token, so an index might need to be stored as well (Type has a token interval so we can compute this if we have the token index).
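A minimal sketch of the indexed variant described above. Interval, Type, and Token here are simplified stand-ins invented for illustration, not the library's real classes (Token replaces Lemmatized[ChunkedToken]), and the interval is assumed to be half-open. The index field and the two helper methods are hypothetical names.

```scala
object TypedTokenSketch {
  // Stand-in for the library's token interval; assumed half-open: [start, end).
  final case class Interval(start: Int, end: Int)

  // Stand-in for Type: a descriptor plus the token interval it covers.
  final case class Type(descriptor: String, interval: Interval)

  // Stand-in for Lemmatized[ChunkedToken].
  final case class Token(string: String, postag: String)

  // TypedToken extended with the token's position in the sentence.
  final case class TypedToken(token: Token, index: Int, types: Set[Type]) {
    // Types whose interval begins on this token.
    def typesStartingHere: Set[Type] =
      types.filter(_.interval.start == index)

    // Types whose interval ends on this token (last covered index is end - 1).
    def typesEndingHere: Set[Type] =
      types.filter(_.interval.end - 1 == index)
  }
}
```

Storing the index keeps the start and end checks local to the token, at the cost of one extra Int per token.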

We would then need an overload of findTags that takes a Seq[Lemmatized[ChunkedToken]] and a collection of the types found in the sentence so far. From this we would need to build a Seq[TypedToken] and rework the OpenRegex wrapper to use this additional information. This will cause a small slowdown because we need to create a new object for each token, but it shouldn't be major. We might want to think about the cost of running multiple pattern taggers on a single sentence, however, or at least keep it in mind.
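The construction step might look roughly like the following. This is only a sketch: Interval, Type, and Token are simplified stand-ins (Token replaces Lemmatized[ChunkedToken]), the interval is assumed to be half-open, and typedTokens is a hypothetical helper name rather than part of the existing tagger API.

```scala
object TypedSentenceSketch {
  // Stand-in for the library's token interval; assumed half-open: [start, end).
  final case class Interval(start: Int, end: Int) {
    def contains(i: Int): Boolean = start <= i && i < end
  }

  // Stand-ins for Type and Lemmatized[ChunkedToken].
  final case class Type(descriptor: String, interval: Interval)
  final case class Token(string: String, postag: String)

  final case class TypedToken(token: Token, types: Set[Type])

  // Attach to each token every previously found type whose interval covers it.
  // Allocates one TypedToken per token, which is the slowdown mentioned above.
  def typedTokens(sentence: Seq[Token], found: Iterable[Type]): Seq[TypedToken] =
    sentence.zipWithIndex.map { case (token, i) =>
      TypedToken(token, found.filter(_.interval.contains(i)).toSet)
    }
}
```

This scans every found type for every token, which is fine for short sentences; if many taggers run over one sentence, the types could instead be grouped by position once up front.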

In the regular expression language, we will add further token attributes to match on. Presently we have string and postag; we will add type=Person, which is true if any type with the descriptor Person overlaps the token. We might also want to be able to specify typeStart=Person and typeEnd=Person, which would match only the token on which such a type starts or ends.
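The three proposed attributes could be evaluated against a typed token along these lines. This sketch does not use the OpenRegex API; the classes are simplified stand-ins, the interval is assumed to be half-open, and the function names are hypothetical.

```scala
object TypePredicateSketch {
  // Stand-ins; the interval is assumed half-open: [start, end).
  final case class Interval(start: Int, end: Int)
  final case class Type(descriptor: String, interval: Interval)

  // Only the fields the predicates need: the token position and its overlapping types.
  final case class TypedToken(index: Int, types: Set[Type])

  // type=descriptor: some overlapping type has this descriptor.
  def hasType(descriptor: String)(t: TypedToken): Boolean =
    t.types.exists(_.descriptor == descriptor)

  // typeStart=descriptor: such a type begins on this token.
  def hasTypeStart(descriptor: String)(t: TypedToken): Boolean =
    t.types.exists(ty => ty.descriptor == descriptor && ty.interval.start == t.index)

  // typeEnd=descriptor: such a type ends on this token.
  def hasTypeEnd(descriptor: String)(t: TypedToken): Boolean =
    t.types.exists(ty => ty.descriptor == descriptor && ty.interval.end - 1 == t.index)
}
```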

John, can you close this out?