dbpedia / fact-extractor

Fact Extraction from Wikipedia Text


Normalize Date Expressions in Training Set

marfox opened this issue · comments

Lots of frame elements (FEs) are dates:

  • absolute, e.g., May 2008
  • relative, e.g., the previous season
  • interval, e.g., from 2008 to 2015

A normalizer based on this context-free grammar should be implemented at training set building time.
The grammar is written in ANTLR format.
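For a concrete picture, here is a rough sketch of what normalizing the three expression types from the list above might look like. The output conventions here are an assumption for illustration, not the grammar's actual format:

```python
import re

MONTHS = {'january': '01', 'february': '02', 'march': '03', 'april': '04',
          'may': '05', 'june': '06', 'july': '07', 'august': '08',
          'september': '09', 'october': '10', 'november': '11', 'december': '12'}

def normalize(expr):
    """Normalize one date expression to a TIMEX-like value (illustrative only)."""
    expr = expr.lower().strip()
    m = re.match(r'(%s) (\d{4})$' % '|'.join(MONTHS), expr)
    if m:  # absolute: "May 2008" -> "2008-05"
        return '%s-%s' % (m.group(2), MONTHS[m.group(1)])
    m = re.match(r'from (\d{4}) to (\d{4})$', expr)
    if m:  # interval: "from 2008 to 2015" -> ("2008", "2015")
        return (m.group(1), m.group(2))
    if 'previous' in expr:  # relative: needs an anchor date to resolve
        return 'PREV'
    return None
```

Relative expressions are the hard case: they can only be resolved against an anchor date taken from context, which a grammar alone cannot provide.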

N.B.

Please submit your pull request to (or work on) the date-normalizer branch.

Hello Marco, I found out that ANTLR 4 is able to produce Python output given a grammar file. This would be ideal, but the grammar you provided contains Java code instead of Python code. Are you using some kind of tool to automate the generation of this grammar? How hard do you think it would be to convert that code to Python?

Edit: I am trying to do it via Vim regexes right now; the results seem promising ;)

On 5/5/15 7:53 PM, Emilio Dorigatti wrote:

> Hello Marco, I found out that ANTLR 4 is able to produce Python output given a grammar file.

Perfect!

> This would be ideal but the grammar you provided contains Java code instead of Python code. Are you using some kind of tool to automate the generation of this grammar?

Nope, it's manually curated.

> How hard do you think it is to convert that code to Python?

I don't know, but shouldn't the grammar syntax/rules be independent from the implementation? Maybe you just have to change the import statements and the classes we are using, please check.

Reply to this email directly or view it on GitHub
#43 (comment).

There is some Java code in the grammar file which is simply copy-pasted
into the generated Python code. I think I managed to convert it correctly;
now I am trying to use the generated code.


BTW, please refer to #44 for the work you've been doing up to now.

I cannot really understand how to use this. Given the sentence il campionato si svolgerà nel corso delle prossime 3 settimane ("the championship will take place over the next 3 weeks"), the result is

$ java -cp "/home/emilio/GSoC/lib/antlr-4.5-complete.jar:$CLASSPATH" org.antlr.v4.runtime.misc.TestRig DateAndTime week_duration test -tree
line 1:3 token recognition error at: 'ca'
line 1:5 token recognition error at: 'mp'
line 1:8 token recognition error at: 'on'
line 1:11 token recognition error at: 'to'
line 1:14 token recognition error at: 'si'
line 1:17 token recognition error at: 'sv'
line 1:19 token recognition error at: 'ol'
line 1:21 token recognition error at: 'ger'
line 1:24 token recognition error at: 'à'
line 1:0 no viable alternative at input 'il'
(week_duration il i a nel corso delle prossime 3 settimane)

Isn't this parser supposed to take as input the whole sentence and understand which rule to apply? Or will we need to try all the rules?
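A likely explanation for the errors above: an ANTLR lexer reports a "token recognition error" whenever the input contains characters that no lexer rule matches, so feeding it a whole sentence full of ordinary Italian words fails. A toy sketch of the difference between a strict lexer and one that skips unknown input (plain Python, not the actual ANTLR lexer; the token rules here are invented for illustration):

```python
import re

# Invented token rules covering only date-related words and numbers.
TOKEN_RE = re.compile(r'\d{4}|settimane|prossime|\d+')

def tokenize_strict(text):
    """Fail on any character no token rule covers (like a lexer without a catch-all rule)."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError('token recognition error at: %r' % text[pos])
        tokens.append(m.group())
        pos = m.end()
    return tokens

def tokenize_skipping(text):
    """Skip unknown words instead of failing (what whole-sentence input would need)."""
    return TOKEN_RE.findall(text)
```

With these rules, `tokenize_skipping('le prossime 3 settimane')` yields the three known tokens, while `tokenize_strict('il campionato')` raises on the first uncovered character, much like the errors in the output above.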

Not sure what the class you call is doing.
I used to generate the parser and lexer classes by just invoking the jar:
java -jar antlr-4.5-complete.jar DateAndTime.g4
Then the generated parser is supposed to apply the rules and output the transformation.

But beware of the FIXME comments in the grammar file!
We need first to implement the Date objects and enumerations.
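For reference, a minimal Python sketch of what those objects might look like. The class names `DateEntity` and `DateEnum` and the enum values come from this thread's output; the exact fields and full member list are assumptions:

```python
from enum import Enum

class DateEnum(Enum):
    """Normalized date/time expression types (TIMEX-style tags; partial, from thread output)."""
    TIMEX_DATE = 1
    TIMEX_SEASON = 8
    TIMEX_START_TIME = 9
    TIMEX_WEEKDAY = 11
    TIMEX_YEAR = 12

class DateEntity(object):
    """A normalized date expression: its type plus a normalized string value (or None)."""
    def __init__(self, type_, value=None):
        self.type = type_
        self.value = value

    def to_dict(self):
        # Matches the {'type': ..., 'value': ...} shape seen in the normalizer output below
        return {'type': self.type, 'value': self.value}
```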

Yes, I did implement DateEntity and DateEnum, then generated the parser and the lexer with the command you posted. Now how can I apply the rules to a sample sentence?

I implemented a class that consumes the generated ANTLR classes.
This is the minimal code to make them run.
Then you have to process parser.results, which should be a list of DateEntity objects.

    // Requires the ANTLR 4 runtime on the classpath
    String entityValue = "foo";
    if (entityValue != null) {
        // Set up parser
        DateAndTimeParser parser = new DateAndTimeParser(null);
        parser.setBuildParseTree(false); // Don't need trees
        // Set up lexer
        ANTLRInputStream input = new ANTLRInputStream(entityValue);
        DateAndTimeLexer lexer = new DateAndTimeLexer(input);
        lexer.setLine(1); // Notify lexer of input position
        lexer.setCharPositionInLine(0);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        parser.setInputStream(tokens); // Notify parser of the new token stream
        // Start the parser; make sure it doesn't crash on unrecognized expressions
        try {
            parser.value();
        } catch (Exception e) {
            e.printStackTrace();
            _logger.warning("Skipping normalization for badly stated entity");
        }
        // parser.results should be a list of normalized expressions
    }

Okay, I managed to get a first rough version of the Python tokenizer working; just pushed it to this repo.

I tried to apply the date normalizer to individual entities found in the sentences (here), but the results aren't very encouraging. This is the code that I added (see date_normalizer.get_tokens)

            try:
                date_norm = date_normalizer.get_tokens(entity)
                if date_norm and any(x['value'] for x in date_norm):
                    print '--- entity: "%s" norm ' % entity + str(date_norm)
            except:
                pass

and this is the output:

--- entity: "Nazionale Under-21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "1997" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1997'}]
--- entity: "nel 1991" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1991'}]
--- entity: "1991" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1991'}]
--- entity: "nel 1982" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1982'}]
--- entity: "il 2004" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'2004'}]
--- entity: "il 1999" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1999'}]
--- entity: "Under 21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "argentino" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': None}, {'type': <DateEnum.TIMEX_DATE: 1>, 'value': '01:_'}]
--- entity: "Malta" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '$now$'}]
--- entity: "4 incontri" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '04:_:_'}]
--- entity: "unica stagione" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '01:_:_'}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}]
--- entity: "1934" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1934'}]
--- entity: "Real" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '$now$'}]
--- entity: "Under-21" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '21:_:_'}]
--- entity: "giovanili" norm [{'type': <DateEnum.TIMEX_WEEKDAY: 11>, 'value': '$thursday$'}]
--- entity: "Under-16" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '16:_:_'}]
--- entity: "1960" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1960'}]
--- entity: "stagione 1922-1923" norm [{'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_SEASON: 8>, 'value': None}, {'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1922'}, {'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'1923'}]
--- entity: "nel 2000" norm [{'type': <DateEnum.TIMEX_YEAR: 12>, 'value': u'2000'}]
--- entity: "uno scampolo" norm [{'type': <DateEnum.TIMEX_START_TIME: 9>, 'value': '01:_:_'}]

We should keep the date rules only and discard the time ones in the grammar.


And probably adapt some rules to fit into our scenario


I removed all rules related to times. The rules regarding years work quite well, but the rules regarding seasons are not so reliable. @marfox how should I add the matches to the training data? Should I modify an existing field or add a new one?

On 5/14/15 6:03 PM, Emilio Dorigatti wrote:

> I removed all rules related to times. The rules regarding years work quite well but rules regarding seasons are not so reliable.

Sure, this is expected. We are dealing with different kinds of seasons, i.e., the soccer domain vs. seasons of the year. Those rules should be adapted accordingly.

> @marfox how should I add the matches to the training data? Should I modify an existing field or add a new one?

They should override the 'Tempo' or 'Durata' frame elements (depending on which rule applies, see the DURATION rules).
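A sketch of what that override might look like, assuming each training row carries per-chunk FE labels. The row structure and the (label, value) attachment are hypothetical; only the 'Tempo'/'Durata' labels come from the thread:

```python
def override_date_fes(row, normalized):
    """Override crowd-annotated Tempo/Durata frame elements with normalizer output.

    row: {'sentence': ..., 'fes': {chunk_text: fe_label}}  (hypothetical structure)
    normalized: {chunk_text: normalized_value} produced by the date normalizer
    """
    out = dict(row['fes'])
    for chunk, value in normalized.items():
        if out.get(chunk) in ('Tempo', 'Durata'):
            # Hypothetical choice: keep the FE label and attach the normalized value
            out[chunk] = (out[chunk], value)
    return dict(row, fes=out)
```

Example: given a row whose chunk "nel 1991" is labeled 'Tempo', passing `{'nel 1991': u'1991'}` would turn that label into `('Tempo', u'1991')` while leaving other FEs untouched.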

One major thing: the original grammar was intended to work on already recognized date entities, not on a whole sentence.

For instance, no results here:
Nel giugno del 2009 rescinde il contratto con il Newcastle e rimane svincolato. ("In June 2009 he terminates his contract with Newcastle and becomes a free agent.")

I see.
This approach has a big drawback: it depends on the quality of the crowdsourced annotations.
The next step is to make the grammar directly annotate date entities, thus skipping the crowd.
Its input would then be the whole sentence.
@e-dorigatti , feel free to close this issue once satisfied, I'll open a new one with the new requirements.

How will we handle situations of conflict between the dates recognized by the grammar and the ones annotated from the crowd?

Also, I am starting to think that using the ANTLR 4 tool is a bit of overkill. This grammar is just a set of regexes after all; 1694 lines of grammar and almost 9500 lines of Python code seem a bit too much to me.
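A regex-based alternative could indeed stay small. For illustration, a sketch covering the year expressions seen in the output above; the Italian interval form ("dal … al …") and the rule set are assumptions, and the type tags mirror the thread's output:

```python
import re

# Minimal regex-based normalizer for Italian year expressions,
# e.g. "nel 1991", "il 2004", "1997" -> TIMEX_YEAR.
YEAR_RE = re.compile(r'\b(?:nel|il)?\s*((?:1[0-9]|20)\d{2})\b')
# Assumed Italian interval form: "dal 2008 al 2015"
INTERVAL_RE = re.compile(r'\bdal\s+(\d{4})\s+al\s+(\d{4})\b')

def normalize_dates(text):
    """Return a list of {'type': ..., 'value': ...} dicts for date expressions in text."""
    results = []
    for m in INTERVAL_RE.finditer(text):
        results.append({'type': 'TIMEX_INTERVAL', 'value': (m.group(1), m.group(2))})
    for m in YEAR_RE.finditer(text):
        results.append({'type': 'TIMEX_YEAR', 'value': m.group(1)})
    return results
```

Note that the interval rule and the year rule overlap on interval input; a real implementation would have to resolve such conflicts, e.g. by preferring the longest match.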

On 5/20/15 3:06 PM, Emilio Dorigatti wrote:

> How will we handle situations of conflict between the dates recognized by the grammar and the ones annotated from the crowd?

We won't ask the crowd to annotate dates at all; that's the main purpose of the next step.

> Also, I am starting to think that using the antlr4 tool is a bit of an overkill. This grammar is just a set of regexes after all, 1694 lines of grammar and almost 9500 lines of python code seem a bit too much to me..

You read my mind. :-)
The original grammar was intended to support very fine-grained normalizations over a larger set of date and time expressions. I'm currently isolating the ones we need to tag as Time and Duration FEs.

Okay, so I will be using that as input for creating a set of regexes in the next ticket?

Sounds good