malllabiisc / RESIDE

EMNLP 2018: RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information

About Riedel & GIDS raw datasets

charosen opened this issue

Hi guys,

Thanks for the awesome contributions.

I have read your paper and I am now interested in the construction of the Riedel & GIDS raw datasets (riedel_train.json, riedel_test.json, gids_train.json, gids_test.json, gids_dev.json).

As mentioned in your paper, RESIDE used the Stanford CoreNLP tool to extract NLP features from sentences, and I'm a little confused about how you built the Riedel & GIDS raw datasets (riedel_train.json, riedel_test.json, gids_train.json, gids_test.json, gids_dev.json):

  1. About CoreNLP usage: I found that many Python wrappers of CoreNLP don't work well with the kbp and entitylink annotators. Did you just start a CoreNLP server and then send requests to it for preprocessing (see the sketch after this list)?
  2. I've noticed that riedel_*.json's format differs slightly from gids_*.json's: riedel_*.json has a top-level openie key, while in gids_*.json the openie key is embedded inside the corenlp key. To my understanding, for Riedel you used CoreNLP to extract openie features from a sentence and then used CoreNLP again to extract depparse features from the same sentence, whereas for GIDS you used CoreNLP to extract all features from a sentence in a single pass. Is that right? Also, the preprocessing code does not seem to be compatible with the format of the gids_*.json dataset provided.
  3. For the CoreNLP openie features, did you activate just the tokenize,ssplit,pos,lemma,depparse,natlog,openie annotators? For the CoreNLP dependency tree features, did you activate just the tokenize,ssplit,pos,lemma,parse,depparse,ner,entitylink,coref,kbp annotators? Also, since RESIDE seems to use only the dependency tree feature from the depparse annotator and the relation phrase feature from the openie annotator, did other annotators like kbp, coref, and entitylink contribute anything to the preprocessing of the dataset?
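For context, here is a minimal sketch of what querying a locally running CoreNLP server directly over HTTP might look like. The host/port and example sentence are assumptions; the two annotator lists are the ones from question 3 above:

```python
import json
import requests

# Assumes a CoreNLP server is already running locally, e.g. started with:
#   java -mx8g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
CORENLP_URL = "http://localhost:9000"  # assumed host/port

def annotate(text, annotators):
    """Send raw text to the CoreNLP server and return the parsed JSON."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    return resp.json()

sentence = "Barack Obama was born in Hawaii."  # illustrative example only

# Pass 1: OpenIE features (annotator list from question 3)
openie_out = annotate(sentence,
                      "tokenize,ssplit,pos,lemma,depparse,natlog,openie")

# Pass 2: dependency-tree features (annotator list from question 3)
dep_out = annotate(sentence,
                   "tokenize,ssplit,pos,lemma,parse,depparse,ner,entitylink,coref,kbp")
```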

Thanks a lot.

Best

More questions:

  1. SharmisthaJat/RE-DS-Word-Attention-Models seems to have already updated their riedel2010 data, which contains 522611 train sentences and 172448 test sentences. So could you provide the raw Riedel dataset you used, which contains 570084 train sentences?
  2. When using CoreNLP to preprocess the raw sentences, I've noticed that some sentences contain non-English characters, which raises a tokenizer warning (i.e., WARN edu.stanford.nlp.process.PTBLexer - Untokenizable: � (U+FFFD, decimal: 65533)), and that some pretty long raw sentences in the dataset may cause a CoreNLP server timeout. I wonder whether you preprocessed the raw sentences (removing non-English characters, removing stop words, dropping long sentences, ...) before feeding them to CoreNLP?
  3. Also, the riedel2010 data provided by SharmisthaJat/RE-DS-Word-Attention-Models seems to contain only caseless sentences (the same as the sentences in rsent), whereas CoreNLP needs properly edited and capitalized full sentences. How did you get the properly edited and capitalized full sentences in sent?

Hi @charosen,
Please find your answers below:
Post 1:

  1. We used the CoreNLP server, as it is much faster and more scalable.
  2. Yes, I am sorry about that problem. Actually, there is no scientific reason why the "openie" key is top-level in riedel but not in gids. Both files contain the same information; the difference is that for riedel the openie preprocessing was done later and added as a separate key, while for gids it was done together with the rest. You can find 'openie' in gids as well. I will try to make this consistent in the future.
  3. Yes, we used the annotators you listed. kbp, coref, and entitylink were not required.

Post 2:

  1. Just mix the train and validation data, and it becomes the original Riedel dataset.
  2. Yes, that is a problem when dealing with this dataset. We simply ignored the non-English characters. For long documents, you can split them into smaller sentences and pass those to CoreNLP; I think if each chunk has around 250 words, CoreNLP processes it quite quickly (see the sketch after this list).
  3. Yes, CoreNLP gives the best results when the text retains its capitalization. We took the dataset from the original source and then ran CoreNLP on it.
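To illustrate the kind of cleanup described in point 2, here is a minimal sketch; the regex, the 250-word limit, and the function names are assumptions for illustration, not the authors' actual preprocessing code:

```python
import re

def strip_non_english(text):
    # Replace non-ASCII characters (e.g. the U+FFFD replacement character
    # behind the PTBLexer "Untokenizable" warnings) with a space.
    return re.sub(r"[^\x00-\x7f]+", " ", text)

def chunk_words(text, max_words=250):
    # Naively split a long document into chunks of at most `max_words`
    # words so a single CoreNLP request does not time out.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

raw_document = "Some long article text \ufffd with an untokenizable character."
for chunk in chunk_words(strip_non_english(raw_document)):
    print(chunk)  # in practice, send each chunk to the CoreNLP server
```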

Thanks

Thanks for your detailed reply!

Your reply helps a lot, but I'm still confused about the original source you mentioned in your answer. I found that the original NYT10 release by Riedel has capitalized sentences and entity guids but no mids, while the riedel2010 data used by Jat has only caseless sentences and entity mids. So how did you construct the raw Riedel dataset, which contains both capitalized sentences and entity mids? Where is the original source you mentioned in your answer? Can you provide it?

Thanks again! :p

Hi @charosen,
You got it right. By the original source I mean the dataset shared by Sebastian Riedel (http://iesl.cs.umass.edu/riedel/ecml/). I aligned the dataset shared by Jat et al. with it (you can just convert the sentences in the original dataset to lower case and get an exact match) to recover the entities' mids.
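For illustration, a minimal sketch of that alignment; the function name and the inputs are hypothetical: cased_sentences holds sentences from the original cased Riedel release, and caseless_records holds (caseless sentence, entity mids) pairs from Jat et al.'s version:

```python
def align_mids(cased_sentences, caseless_records):
    """Recover entity mids for cased sentences by exact lowercase match."""
    # Index the caseless sentences by their text for O(1) lookup.
    mids_by_sentence = dict(caseless_records)
    aligned = {}
    for sentence in cased_sentences:
        # Lowercasing the cased sentence should give an exact match.
        mids = mids_by_sentence.get(sentence.lower())
        if mids is not None:
            aligned[sentence] = mids
    return aligned

# Toy usage with made-up records:
cased = ["Barack Obama was born in Hawaii ."]
records = [("barack obama was born in hawaii .", ["/m/02mjmr", "/m/03gh4"])]
print(align_mids(cased, records))
```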

Thanks for your patient reply!!

I will start working on it.

Have a great day!