knowitall / taggers

Easily identify and label sentence intervals using various taggers.

Create TypePatternTagger to ease tagging types

schmmd opened this issue

Hi John, how about we move your contribution into Taggers? I just need to think of a good way for it to fit in, so any help is appreciated ;-)

Maybe we can create a new tagger called TypePatternTagger, unless you can think of a better name. This tagger would perform a substitution for the type-matching syntax. Do you have any suggestions for that syntax? I thought of <<TypeName>>, but I only somewhat like it. I think it would need to expand to the sequence <typeStart='TypeName'> <typeCont='TypeName'>*.

With this new tagger, we could have patterns such as:

<<VerbPhrase>> <<NounPhrase>> <pos='JJ'>
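
As a rough illustration of the substitution being proposed, here is a minimal sketch in Scala. It is not code from Taggers: the object name `TypePatternSketch` and the method `expand` are hypothetical, and it assumes type names consist of word characters only.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of Taggers.
object TypePatternSketch {
  // Matches the proposed <<TypeName>> shorthand; the name is captured in group 1.
  // Assumption: type names contain only word characters.
  private val typeShorthand: Regex = """<<(\w+)>>""".r

  // Expand every <<TypeName>> into the start/continue sequence from the proposal.
  def expand(pattern: String): String =
    typeShorthand.replaceAllIn(pattern, m => {
      val name = m.group(1)
      // quoteReplacement guards against '$' or '\' being read as replacement syntax.
      Regex.quoteReplacement(s"<typeStart='$name'> <typeCont='$name'>*")
    })
}
```

Under those assumptions, `TypePatternSketch.expand("<<VerbPhrase>> <<NounPhrase>> <pos='JJ'>")` yields `<typeStart='VerbPhrase'> <typeCont='VerbPhrase'>* <typeStart='NounPhrase'> <typeCont='NounPhrase'>* <pos='JJ'>`.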

What do you think? Any chance you could look at this on Monday? I think it would be pretty straightforward and it would get you used to my changes.

Nope, didn't get an e-mail when it was opened.
It seems to me that the replacement would need to be

( <typeStart='x' & typeEnd='x'> | ( <typeStart='x'> <typeCont='x'>* <typeEnd='x'> ) )

I'll look at this in the afternoon; right now I'm trying to get Dan some Entity Linking results on different data.

John

I think they are the same because it's greedy. Note typeCont means
not typeStart (but it could be typeEnd too).
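
The "same because it's greedy" claim can be checked with a toy simulation. This is only a sketch, not the Taggers matcher: each token is encoded as one letter describing a single type's position on it, and both candidate expansions are translated by hand into ordinary regular expressions over those letters. The encoding, the object name `GreedyComparison`, and the assumption that annotations are well formed (every start is eventually followed by an end) are all mine.

```scala
import scala.util.matching.Regex

// Toy model, not the Taggers matcher.
object GreedyComparison {
  // Assumed encoding of one type's position on each token:
  //   S = type starts here, C = type continues, E = type ends here,
  //   B = single-token type (starts and ends here), O = type absent.
  // typeCont is read as "present but not starting", so it covers both C and E.

  // <typeStart='x'> <typeCont='x'>*
  val startThenCont: Regex = "[SB][CE]*".r

  // ( <typeStart='x' & typeEnd='x'> | ( <typeStart='x'> <typeCont='x'>* <typeEnd='x'> ) )
  val explicitEnd: Regex = "B|S[CE]*E".r

  // (start, end) offsets of every match over an encoded sentence.
  def spans(re: Regex, encoded: String): List[(Int, Int)] =
    re.findAllMatchIn(encoded).map(m => (m.start, m.end)).toList
}
```

On the encoded sentence `"OSCEOBOSE"`, both expressions produce the spans `(1,4)`, `(5,6)` and `(7,9)`. The greedy `*` consumes the ending token in the first form, and backtracking hands it to the explicit end in the second.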

lazy val typesContinuingAtToken = types -- typesBeginningAtToken -- typesEndingAtToken

but I guess we could change that.

OH, woops, I must have an older version.

I agree the replacement pattern you suggested should work.

Yeah, I changed it and it's confusing. Do you think the current definition
is OK? It seems better than the old one to me (typeCont just means that
we're on a token where the type is continuing).
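
To make the difference between the two definitions concrete, here is a small sketch. The `Type` case class and `TokenView` class are hypothetical stand-ins, not the Taggers classes; only the `lazy val` names follow the line quoted earlier in the thread, and the "new" definition is my reading of "not typeStart (but it could be typeEnd too)".

```scala
// Hypothetical model, not the Taggers classes.
object TypeContSketch {
  // A tagged type: a name plus the token interval [start, end) it covers.
  final case class Type(name: String, start: Int, end: Int)

  // View of the types touching the token at position `index`.
  final class TokenView(index: Int, all: Set[Type]) {
    lazy val types: Set[Type] = all.filter(t => t.start <= index && index < t.end)
    lazy val typesBeginningAtToken: Set[Type] = types.filter(_.start == index)
    lazy val typesEndingAtToken: Set[Type] = types.filter(_.end == index + 1)

    // Older definition quoted in the thread: strictly interior tokens only.
    lazy val oldTypesContinuingAtToken: Set[Type] =
      types -- typesBeginningAtToken -- typesEndingAtToken

    // Newer definition as described: any token that does not start the type.
    lazy val typesContinuingAtToken: Set[Type] =
      types -- typesBeginningAtToken
  }
}
```

For a `NounPhrase` covering tokens 2 to 4, the two definitions agree on token 3 and disagree only on the last token: the old one reports nothing continuing there, while the new one still reports the `NounPhrase`.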

On Mon, Sep 30, 2013 at 9:09 AM, John Gilmer notifications@github.com wrote:

> OH, woops, I must have an older version.



The definition seems fine.

I think <> is OK, but I've come to think of "<>" as meaning a token. What other characters are at our disposal?

{VerbPhrase}
'VerbPhrase'
^VerbPhrase^

Let's do {VerbPhrase}, but you will want to be careful because braces are also
regular expression syntax. I think you will need to:

  1. Split by whitespace.
  2. See if a token matches """\{.*\}""".r and perform the substitution if
    there is a match.
  3. Join back together on space.

Example pattern (to make sure we still like it!):

{VerbPhrase} {NounPhrase} <postag='JJ'>
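
The three steps above could look roughly like this. It is a sketch under assumptions, not the Taggers implementation: `SplitExpansion` and `expand` are hypothetical names, the expansion target is the start/continue sequence suggested at the top of the thread, and the token test uses `\w+` rather than `.*` so that only simple names match.

```scala
// Hypothetical helper, not part of Taggers.
object SplitExpansion {
  // A whole whitespace-delimited token of the form {TypeName}.
  // The braces are escaped because they are regex syntax.
  private val TypeToken = """\{(\w+)\}""".r

  def expand(pattern: String): String =
    pattern
      .split("\\s+")                      // 1. split by whitespace
      .map {
        // 2. substitute tokens that are exactly {TypeName}
        case TypeToken(name) => s"<typeStart='$name'> <typeCont='$name'>*"
        case other           => other
      }
      .mkString(" ")                      // 3. join back together on space
}
```

With these assumptions, `SplitExpansion.expand("{VerbPhrase} {NounPhrase} <postag='JJ'>")` gives `<typeStart='VerbPhrase'> <typeCont='VerbPhrase'>* <typeStart='NounPhrase'> <typeCont='NounPhrase'>* <postag='JJ'>`.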

FYI, backticks put your text in code mode. Argh, but they don't work when sent as an email!

Argh... this was a horrible suggestion. You can't split by whitespace!
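
The thread does not say why splitting fails. One plausible reason (my assumption) is that a shorthand is not always a whitespace-delimited token of its own: it may carry a quantifier, as in `{VerbPhrase}+`, or sit inside parentheses. Token expressions such as `<typeStart='x' & typeEnd='x'>` also contain spaces themselves. A sketch of one alternative is to substitute in place on the whole pattern string. The names `InPlaceExpansion` and `expand` are hypothetical, and wrapping the expansion in plain parentheses is a choice made for this sketch.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of Taggers.
object InPlaceExpansion {
  // {TypeName} anywhere in the pattern. Requiring a leading letter keeps
  // numeric regex quantifiers such as {2} or {1,3} from being mistaken for a type.
  private val TypeRef: Regex = """\{([A-Za-z]\w*)\}""".r

  def expand(pattern: String): String =
    TypeRef.replaceAllIn(pattern, m => {
      val name = m.group(1)
      // Parenthesized so a quantifier following the shorthand applies to the
      // whole expansion rather than only to its last element.
      Regex.quoteReplacement(s"(<typeStart='$name'> <typeCont='$name'>*)")
    })
}
```

For example, `InPlaceExpansion.expand("{VerbPhrase}+ <postag='JJ'>{2}")` gives `(<typeStart='VerbPhrase'> <typeCont='VerbPhrase'>*)+ <postag='JJ'>{2}`. One limitation remains: a `{Name}` that appears inside a quoted string value would be rewritten too, so a real implementation would likely need to tokenize the pattern properly.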