huggingface / hmtl

🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP


Any details on how to set up the ACE2005 dataset?

parap1uie-s opened this issue · comments

Hi, I am confused about the setup of the ACE2005 dataset.

I got the dataset called ace_2005_td_v7_LDC2006T06, I'm aware of issue #2 about the dataset setup, and I tried the preprocess.py script you posted in this link:

#2

I set the parameters:

path_to_ace2005 = "ace_2005_td_v7/"
saving_path = "ace2005/"

However, I don't know how to split the data into train/val/test sets.

The script produces a lot of files with ‘.sgm.coref’ and ‘.sgm.like_conll’ extensions under ace2005/.

In addition, the script has been running for more than 12 hours on a decent server. Right now there are 800 files under ace2005/. Is that normal?

Thanks in advance.

Update:

The preprocess.py run finished after a few days.

Right now there are 1198 files under ace2005/

Hello,
Sorry for the late answer.

Hmm, this sounds really slow... On my laptop, the whole preprocessing takes ~15 minutes (I use a standard MacBook Pro, nothing fancy), so it's definitely surprising...
Have you measured which operations take the most time?
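(If it helps, running the script under Python's built-in profiler will show where the time goes; the invocation below assumes preprocess.py can be run directly as a script.)

python -m cProfile -s cumtime preprocess.py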
One way to speed things up is to pass disable=['parser', 'tagger', 'ner'] when loading the spaCy tokenizer. Note that I haven't tested it, nor optimized the script for runtime, since it's a one-off job.
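A minimal sketch of that change (the model name en_core_web_sm is an assumption, use whatever spaCy model the script actually loads):

import spacy

# Disabling the parser, tagger and NER keeps only tokenization,
# which is enough for producing tokens and is much faster.
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

doc = nlp("A quick sanity check that tokenization alone still works.")
print([token.text for token in doc])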

The main splits can be found here (repo from Miwa and Bansal).

That said, the parameters seem good. I like to set saving_path=<some_tmp_folder> since, as you noticed, the script produces individual preprocessed files (output in the CoNLL format for convenience). The next step is fairly simple: read the split files and copy them into train/dev/test folders.

mkdir train
for i in `cat split/train`; do cp "tmp/${i}.sgm.like_conll" train; done

For coref training examples, I found it easier to dump everything in the same file:

for i in `cat split/train`; do (cat "tmp/${i}.sgm.coref"; echo) >> single_file_train.gold_conll; done

(The last two snippets are bash commands).
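For reference, here is a rough Python equivalent of the two loops above, extended to all three splits (the tmp/ and split/ folder names are carried over from the commands above; the dev/test split file names are assumptions):

import shutil
from pathlib import Path

tmp_dir = Path("tmp")        # saving_path given to preprocess.py
split_dir = Path("split")    # split lists from Miwa and Bansal, one document id per line

for split in ["train", "dev", "test"]:
    out_dir = Path(split)
    out_dir.mkdir(exist_ok=True)
    with open(f"single_file_{split}.gold_conll", "w") as coref_file:
        for doc_id in (split_dir / split).read_text().splitlines():
            doc_id = doc_id.strip()
            if not doc_id:
                continue
            # CoNLL-like examples: one preprocessed file per document
            shutil.copy(tmp_dir / f"{doc_id}.sgm.like_conll", out_dir)
            # Coref examples: concatenate everything into one gold_conll file per split
            coref_file.write((tmp_dir / f"{doc_id}.sgm.coref").read_text())
            coref_file.write("\n")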

Hi,

It seems the script takes too much time.

However, I didn't trace the script step by step to measure what takes the time...

And I noticed the number of files in the splits, which is 511 = 351 + 80 + 80.

I got 1198 files under ace2005/, with ‘.sgm.coref’ and ‘.sgm.like_conll’ extensions.

Could this be due to a different version of the dataset?

In addition, the bash commands work perfectly.

Thanks!

Hi,

I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, the LDC license for ACE2005 is not available at my university.
May I know if you could by any chance share the dataset for research purposes?

Many thanks,
Regards,
kc

@daviddongkc Try asking LDC. We can't really share it ourselves for licensing reasons.

More generally, have you tried looking at the list of publicly available datasets from https://huggingface.co/datasets? Is there anything that would work for your research?