dbpedia / fact-extractor

Fact Extraction from Wikipedia Text


Normalize Date Expressions in Training Set

marfox opened this issue · comments

Lots of frame elements (FEs) are dates:

  • absolute, e.g., May 2008
  • relative, e.g., the previous season
  • interval, e.g., from 2008 to 2015

A normalizer based on this context-free grammar should be implemented at training set building time.
The grammar is written in ANTLR format.
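For a concrete picture, here is a rough sketch of what normalizing the three expression types from the list above might look like. The output conventions here are an assumption for illustration, not the grammar's actual format:

```python
import re

MONTHS = {'january': '01', 'february': '02', 'march': '03', 'april': '04',
          'may': '05', 'june': '06', 'july': '07', 'august': '08',
          'september': '09', 'october': '10', 'november': '11', 'december': '12'}

def normalize(expr):
    """Normalize one date expression to a TIMEX-like value (illustrative only)."""
    expr = expr.lower().strip()
    m = re.match(r'(%s) (\d{4})$' % '|'.join(MONTHS), expr)
    if m:  # absolute: "May 2008" -> "2008-05"
        return '%s-%s' % (m.group(2), MONTHS[m.group(1)])
    m = re.match(r'from (\d{4}) to (\d{4})$', expr)
    if m:  # interval: "from 2008 to 2015" -> ("2008", "2015")
        return (m.group(1), m.group(2))
    if 'previous' in expr:  # relative: needs an anchor date to resolve
        return 'PREV'
    return None
```

Relative expressions are the hard case: they can only be resolved against an anchor date taken from context, which a grammar alone cannot provide.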

N.B.

Please submit your pull request to (or work on) the date-normalizer branch.

Hello Marco, I found out that ANTLR 4 is able to produce Python output given a grammar file. This would be ideal, but the grammar you provided contains Java code instead of Python code. Are you using some kind of tool to automate the generation of this grammar? How hard do you think it would be to convert that code to Python?

Edit: I am trying to do it via Vim regexes right now; the results seem promising ;)

On 5/5/15 7:53 PM, Emilio Dorigatti wrote:

> Hello Marco, I found out that ANTLR 4 is able to produce Python output given a grammar file.

Perfect!

> This would be ideal but the grammar you provided contains Java code instead of Python code. Are you using some kind of tool to automate the generation of this grammar?

Nope, it's manually curated.

> How hard do you think it is to convert that code to Python?

I don't know, but shouldn't the grammar syntax/rules be independent from the implementation? Maybe you just have to change the import statements and the classes we are using, please check.

Reply to this email directly or view it on GitHub
#43 (comment).

There is some Java code in the grammar file which is simply copy-pasted
into the generated Python code. I think I managed to convert it correctly;
now I am trying to use the generated code.


BTW, please refer to #44 for the work you've been doing up to now.

I cannot really understand how to use this. Given the sentence il campionato si svolgerà nel corso delle prossime 3 settimane ("the championship will take place over the next 3 weeks"), the result is

$ java -cp "/home/emilio/GSoC/lib/antlr-4.5-complete.jar:$CLASSPATH" org.antlr.v4.runtime.misc.TestRig DateAndTime week_duration test -tree
line 1:3 token recognition error at: 'ca'
line 1:5 token recognition error at: 'mp'
line 1:8 token recognition error at: 'on'
line 1:11 token recognition error at: 'to'
line 1:14 token recognition error at: 'si'
line 1:17 token recognition error at: 'sv'
line 1:19 token recognition error at: 'ol'
line 1:21 token recognition error at: 'ger'
line 1:24 token recognition error at: 'à'
line 1:0 no viable alternative at input 'il'
(week_duration il i a nel corso delle prossime 3 settimane)

Isn't this parser supposed to take as input the whole sentence and understand which rule to apply? Or will we need to try all the rules?
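A likely explanation for the errors above: an ANTLR lexer reports a "token recognition error" whenever the input contains characters that no lexer rule matches, so feeding it a whole sentence full of ordinary Italian words fails. A toy sketch of the difference between a strict lexer and one that skips unknown input (plain Python, not the actual ANTLR lexer; the token rules here are invented for illustration):

```python
import re

# Invented token rules covering only date-related words and numbers.
TOKEN_RE = re.compile(r'\d{4}|settimane|prossime|\d+')

def tokenize_strict(text):
    """Fail on any character no token rule covers (like a lexer without a catch-all rule)."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError('token recognition error at: %r' % text[pos])
        tokens.append(m.group())
        pos = m.end()
    return tokens

def tokenize_skipping(text):
    """Skip unknown words instead of failing (what whole-sentence input would need)."""
    return TOKEN_RE.findall(text)
```

With these rules, `tokenize_skipping('le prossime 3 settimane')` yields the three known tokens, while `tokenize_strict('il campionato')` raises on the first uncovered character, much like the errors in the output above.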

Not sure what the class you call is doing.
I used to generate the parser and lexer classes by just invoking the jar:
java -jar antlr-4.5-complete.jar DateAndTime.g4
Then the generated parser is supposed to apply the rules and output the transformation.

But beware of the FIXME comments in the grammar file!
We need first to implement the Date objects and enumerations.
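For reference, a minimal Python sketch of what those objects might look like. The class names `DateEntity` and `DateEnum` and the enum values come from this thread's output; the exact fields and full member list are assumptions:

```python
from enum import Enum

class DateEnum(Enum):
    """Normalized date/time expression types (TIMEX-style tags; partial, from thread output)."""
    TIMEX_DATE = 1
    TIMEX_SEASON = 8
    TIMEX_START_TIME = 9
    TIMEX_WEEKDAY = 11
    TIMEX_YEAR = 12

class DateEntity(object):
    """A normalized date expression: its type plus a normalized string value (or None)."""
    def __init__(self, type_, value=None):
        self.type = type_
        self.value = value

    def to_dict(self):
        # Matches the {'type': ..., 'value': ...} shape seen in the normalizer output below
        return {'type': self.type, 'value': self.value}
```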

Yes, I did implement DateEntity and DateEnum, then generated the parser and the lexer with the command you posted. Now how can I apply the rules to a sample sentence?

I implemented a class that consumes the generated ANTLR classes.
This is the minimal code to make them run.
Then you have to process parser.results, which should be a list of DateEntity objects.

    // Requires the ANTLR 4 runtime on the classpath
    String entityValue = "foo";
    if (entityValue != null) {
        // Set up parser
        DateAndTimeParser parser = new DateAndTimeParser(null);
        parser.setBuildParseTree(false); // Don't need trees
        // Set up lexer
        ANTLRInputStream input = new ANTLRInputStream(entityValue);
        DateAndTimeLexer lexer = new DateAndTimeLexer(input);
        lexer.setLine(1); // Notify lexer of input position
        lexer.setCharPositionInLine(0);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        parser.setInputStream(tokens); // Notify parser of the new token stream
        // Start the parser; make sure it doesn't crash on unrecognized expressions
        try {
            parser.value();
        } catch (Exception e) {
            e.printStackTrace();
            _logger.warning("Skipping normalization for badly stated entity");
        }
        // parser.results should be a list of normalized expressions
    }

Okay, I managed to get a first rough version of the Python tokenizer working; just pushed it to this repo.

I tried to apply the date normalizer to individual entities found in the sentences (here), but the results aren't very encouraging. This is the code that I added (see date_normalizer.get_tokens)

            try:
                date_norm = date_normalizer.get_tokens(entity)
                if date_norm and any(x['value'] for x in date_norm):
                    print '--- entity: "%s" norm ' % entity + str(date_norm)
            except:
                pass

and this is the output:

--- entity: "Nazionale Under-21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "1997" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1997'}]
--- entity: "nel 1991" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1991'}]
--- entity: "1991" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1991'}]
--- entity: "nel 1982" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1982'}]
--- entity: "il 2004" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'2004'}]
--- entity: "il 1999" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1999'}]
--- entity: "Under 21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "argentino" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_DATE: 1>, 'value': '01:_'}]
--- entity: "Malta" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '$now$'}]
--- entity: "4 incontri" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '04:_:_'}]
--- entity: "unica stagione" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '01:_:_'}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}]
--- entity: "1934" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1934'}]
--- entity: "Real" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '$now$'}]
--- entity: "Under-21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "giovanili" norm [{'type': <DateEnum.TIMEX_WEEKDAY: 11>, 'value': '$thursday$'}]
--- entity: "Under-16" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '16:_:_'}]
--- entity: "1960" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1960'}]
--- entity: "stagione 1922-1923" norm [{'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1922'}, {'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1923'}]
--- entity: "nel 2000" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'2000'}]
--- entity: "uno scampolo" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '01:_:_'}]

We should keep the date rules only and discard the time ones in the grammar.


And probably adapt some rules to fit into our scenario


I removed all rules related to times. The rules regarding years work quite well, but the rules regarding seasons are not so reliable. @marfox how should I add the matches to the training data? Should I modify an existing field or add a new one?

On 5/14/15 6:03 PM, Emilio Dorigatti wrote:

> I removed all rules related to times. The rules regarding years work quite well but rules regarding seasons are not so reliable.

Sure, this is expected. We are dealing with different kinds of seasons, i.e., the soccer domain vs. seasons of the year. Those rules should be adapted accordingly.

> @marfox how should I add the matches to the training data? Should I modify an existing field or add a new one?

They should override the 'Tempo' or 'Durata' frame elements (depending on which rule applies, see the DURATION rules).
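A sketch of what that override might look like, assuming each training row carries per-chunk FE labels. The row structure and the (label, value) attachment are hypothetical; only the 'Tempo'/'Durata' labels come from the thread:

```python
def override_date_fes(row, normalized):
    """Override crowd-annotated Tempo/Durata frame elements with normalizer output.

    row: {'sentence': ..., 'fes': {chunk_text: fe_label}}  (hypothetical structure)
    normalized: {chunk_text: normalized_value} produced by the date normalizer
    """
    out = dict(row['fes'])
    for chunk, value in normalized.items():
        if out.get(chunk) in ('Tempo', 'Durata'):
            # Hypothetical choice: keep the FE label and attach the normalized value
            out[chunk] = (out[chunk], value)
    return dict(row, fes=out)
```

Example: given a row whose chunk "nel 1991" is labeled 'Tempo', passing `{'nel 1991': u'1991'}` would turn that label into `('Tempo', u'1991')` while leaving other FEs untouched.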

One major thing: the original grammar was intended to work on already recognized date entities, not on a whole sentence.

For instance, no results here:
Nel giugno del 2009 rescinde il contratto con il Newcastle e rimane svincolato. ("In June 2009 he terminates his contract with Newcastle and becomes a free agent.")

I see.
This approach has a big drawback: it depends on the quality of the crowdsourced annotations.
The next step is to make the grammar directly annotate date entities, thus skipping the crowd.
Its input would then be the whole sentence.
@e-dorigatti , feel free to close this issue once satisfied, I'll open a new one with the new requirements.

How will we handle situations of conflict between the dates recognized by the grammar and the ones annotated from the crowd?

Also, I am starting to think that using the ANTLR 4 tool is a bit of overkill. This grammar is just a set of regexes after all; 1694 lines of grammar and almost 9500 lines of Python code seem a bit too much to me.
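A regex-based alternative could indeed stay small. For illustration, a sketch covering the year expressions seen in the output above; the Italian interval form ("dal … al …") and the rule set are assumptions, and the type tags mirror the thread's output:

```python
import re

# Minimal regex-based normalizer for Italian year expressions,
# e.g. "nel 1991", "il 2004", "1997" -> TIMEX_YEAR.
YEAR_RE = re.compile(r'\b(?:nel|il)?\s*((?:1[0-9]|20)\d{2})\b')
# Assumed Italian interval form: "dal 2008 al 2015"
INTERVAL_RE = re.compile(r'\bdal\s+(\d{4})\s+al\s+(\d{4})\b')

def normalize_dates(text):
    """Return a list of {'type': ..., 'value': ...} dicts for date expressions in text."""
    results = []
    for m in INTERVAL_RE.finditer(text):
        results.append({'type': 'TIMEX_INTERVAL', 'value': (m.group(1), m.group(2))})
    for m in YEAR_RE.finditer(text):
        results.append({'type': 'TIMEX_YEAR', 'value': m.group(1)})
    return results
```

Note that the interval rule and the year rule overlap on interval input; a real implementation would have to resolve such conflicts, e.g. by preferring the longest match.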

On 5/20/15 3:06 PM, Emilio Dorigatti wrote:

> How will we handle situations of conflict between the dates recognized by the grammar and the ones annotated from the crowd?

We won't ask the crowd to annotate dates at all; that's the main purpose of the next step.

> Also, I am starting to think that using the antlr4 tool is a bit of an overkill. This grammar is just a set of regexes after all, 1694 lines of grammar and almost 9500 lines of python code seem a bit too much to me..

You read my mind. :-)
The original grammar was intended to support very fine-grained normalizations over a larger set of date and time expressions. I'm currently isolating the ones we need to tag as Time and Duration FEs.

Okay, so I will be using that as input for creating a set of regexes in the next ticket?

Sounds good