Any details on setting up the ACE2005 dataset?
parap1uie-s opened this issue
Hi, I am confused about how to set up the ACE2005 dataset.
I have the dataset ace_2005_td_v7_LDC2006T06, I'm aware of issue #2 about dataset setup, and I tried the preprocess.py script you shared in this link:
I set the parameters:
path_to_ace2005 = "ace_2005_td_v7/"
saving_path = "ace2005/"
However, I don't know how to split the output into train/val/test sets.
The script produces a lot of files ending in .sgm.coref and .sgm.like_conll under ace2005/.
In addition, the script has been running for more than 12 hours on a decent server. Right now there are 800 files under ace2005/. Is that normal?
Thanks in advance.
Update:
preprocess.py finished running after a few days.
There are now 1198 files under ace2005/.
Hello,
Sorry for the late answer.
Hmm, this sounds really slow... On my laptop, the whole preprocessing takes ~15 minutes (I use a standard MacBook Pro, nothing fancy). So it's definitely surprising...
Have you measured which operations take the most time?
One way to speed things up is to pass disable=['parser', 'tagger', 'ner'] when loading the spaCy pipeline, so that only the tokenizer runs. Note that I haven't tested it, nor optimized the script for time performance, since it was a one-off job.
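For reference, here is a minimal sketch of that idea (untested, and the model name en_core_web_sm is an assumption; use whichever model preprocess.py already loads):

import spacy

# Load the pipeline with the heavy components disabled, so only the
# tokenizer runs; this should make tokenization much faster.
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

doc = nlp("He visited New York in 2005.")
print([token.text for token in doc])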
The main splits can be found here (repo from Miwa and Bansal).
That said, the parameters look good. I like to set saving_path=<some_tmp_folder>
since, as you noticed, the script produces individual preprocessed files (output in the CoNLL format for convenience). The next step is fairly simple: read the split files and copy the corresponding outputs into train/dev/test folders.
mkdir train
for i in `cat split/train`; do cp tmp/$i.sgm.like_conll train; done
For coref training examples, I found it easier to dump everything in the same file:
for i in `cat split/train`; do (cat "tmp/${i}.sgm.coref"; echo) >> single_file_train.gold_conll; done
(The last two snippets are bash commands).
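If you prefer to do the same thing from Python, here is a rough equivalent of the two loops above (the split/train, split/dev, split/test file names and the tmp/ folder are assumptions matching the commands above; adjust them to wherever you saved the splits):

import shutil
from pathlib import Path

for split in ["train", "dev", "test"]:
    out_dir = Path(split)
    out_dir.mkdir(exist_ok=True)
    doc_ids = Path("split", split).read_text().split()

    # One big coref file per split, like single_file_train.gold_conll above.
    with open(f"single_file_{split}.gold_conll", "w") as coref_out:
        for doc_id in doc_ids:
            # Copy the per-document CoNLL-like file into the split folder.
            shutil.copy(Path("tmp", f"{doc_id}.sgm.like_conll"), out_dir)
            # Append the coref file, followed by a blank line.
            coref_out.write(Path("tmp", f"{doc_id}.sgm.coref").read_text() + "\n")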
Hi,
The script does seem to take too much time.
However, I didn't trace it step by step to measure what takes the time...
I also noticed that the number of files in the splits is 511 (351 + 80 + 80).
I got 1198 files under ace2005/, counting both the .sgm.coref and .sgm.like_conll files.
Could this be caused by a different version of the dataset?
In addition, the bash command works perfectly.
Thanks!
Hi,
I am doing research on information extraction and urgently need the ACE2005 dataset. Unfortunately, an LDC license for ACE2005 is not available at my university.
May I know if you could, by any chance, share the dataset for research purposes?
Many thanks,
Regards,
kc
@daviddongkc Try asking LDC. We can't really share it ourselves for licensing reasons.
More generally, have you tried looking at the list of publicly available datasets from https://huggingface.co/datasets? Is there anything that would work for your research?