huggingface / hmtl

🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP

Any details to setup the ACE2005 dataset?

parap1uie-s opened this issue · comments

Hi, I am confused about how to set up the ACE2005 dataset.

I got the dataset called ace_2005_td_v7_LDC2006T06. I'm aware of issue 2 about dataset setup, and I tried the script you wrote at this link:


I set the parameters:

path_to_ace2005 = "ace_2005_td_v7/"
saving_path = "ace2005/"

However, I don't know how to split the output into train/val/test sets.

The script produces a lot of files ending in ‘.sgm.coref’ and ‘.sgm.like_conll’ under ace2005/

In addition, the script has been running for more than 12 hours on a decent server. Right now there are 800 files under ace2005/. Is that normal?

Thanks in advance.


The script finished running after a few days.

Right now there are 1198 files under ace2005/

Sorry for the late answer.

Hmm, this sounds really slow... On my laptop, the whole preprocessing takes ~15 minutes (I use a standard MacBook Pro, nothing fancy). So it's definitely surprising...
Have you measured which operations take the most time?
One way to speed things up is to pass disable=['parser', 'tagger', 'ner'] when calling the spaCy tokenizer. Note that I haven't tested this, nor have I optimized the script for speed, since it only needs to run once.
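One way to answer the "which operations take time" question is to run a single document through the profiler from Python's standard library. The sketch below is only an illustration: `tokenize` and `preprocess_document` are hypothetical stand-ins, not functions from the actual preprocessing script, and would need to be replaced by the real per-document call.

```python
import cProfile
import io
import pstats
import time


def tokenize(text):
    # Hypothetical stand-in for the spaCy call made by the real script.
    time.sleep(0.01)
    return text.split()


def preprocess_document(text):
    # Hypothetical stand-in for the per-document work of the real script.
    tokens = tokenize(text)
    return len(tokens)


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains appear first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(preprocess_document, "a short example sentence")
print(report)
```

If most of the cumulative time lands in the tokenizer call, disabling the unused pipeline components as suggested above is the natural first thing to try.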

The main splits can be found here (repo from Miwa and Bansal).

That said, the parameters seem fine. I like to set saving_path=<some_tmp_folder> since, as you noticed, the script produces individual preprocessed files (output in the CoNLL format for convenience). The next step is fairly simple: read the split files and copy the listed files into train/dev/test folders.

mkdir -p train
for i in $(cat split/train); do cp "tmp/${i}.sgm.like_conll" train/; done

For coref training examples, I found it easier to dump everything in the same file:

for i in `cat split/train`; do (cat "tmp/${i}.sgm.coref"; echo) >> single_file_train.gold_conll; done

(The last two snippets are bash commands).
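For anyone who prefers to handle all three splits in one pass, here is a rough Python equivalent of the copy loop. It is a sketch under assumptions: the split folder is assumed to contain files named train, dev and test with one document id per line (only split/train appears in the commands above), and `distribute_splits` is a made-up helper name. It also reports ids that have no preprocessed file instead of failing on them.

```python
import shutil
import tempfile
from pathlib import Path


def distribute_splits(split_dir, tmp_dir, out_dir,
                      names=("train", "dev", "test"),
                      suffix=".sgm.like_conll"):
    """Copy each preprocessed file into the folder of the split that lists it.

    Returns a dict mapping split name to the document ids that had no file
    in tmp_dir, so gaps are reported instead of silently skipped.
    """
    split_dir, tmp_dir, out_dir = Path(split_dir), Path(tmp_dir), Path(out_dir)
    missing = {}
    for name in names:
        target = out_dir / name
        target.mkdir(parents=True, exist_ok=True)
        missing[name] = []
        # Ids are read as whitespace-separated tokens (one per line works).
        for doc_id in (split_dir / name).read_text().split():
            source = tmp_dir / (doc_id + suffix)
            if source.is_file():
                shutil.copy(source, target / source.name)
            else:
                missing[name].append(doc_id)
    return missing


# Tiny demo on throwaway data; the document ids below are made up.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "split").mkdir()
    (root / "tmp").mkdir()
    (root / "split" / "train").write_text("doc_a\ndoc_b\n")
    (root / "split" / "dev").write_text("doc_c\n")
    (root / "split" / "test").write_text("doc_d\n")
    for doc_id in ("doc_a", "doc_b", "doc_c"):
        (root / "tmp" / (doc_id + ".sgm.like_conll")).write_text("token\n")

    missing = distribute_splits(root / "split", root / "tmp", root / "out")
    copied = sorted(p.name for p in (root / "out" / "train").iterdir())

print(copied)
print(missing)
```

On real data the call would look like `distribute_splits("split", "tmp", ".")`, matching the folder names used in the bash commands.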


The script does seem to take far too much time.

However, I didn't trace the script step by step to measure what takes the time...

I also noticed that the splits list 511 files in total (351 + 80 + 80).

I have 1198 files under ace2005/, ending in ‘.sgm.coref’ and ‘.sgm.like_conll’.

Could this be due to a different version of the dataset?

In addition, the bash command works perfectly.
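If every document produced both a .sgm.coref and a .sgm.like_conll file, 1198 files would correspond to 599 documents, 88 more than the 511 listed in the splits. A quick way to see which documents make up that difference is to compare the ids on disk with the ids in the split files. This is a sketch only: the dev and test file names and both helper names are assumptions, and the demo ids are made up.

```python
import tempfile
from pathlib import Path

SUFFIXES = (".sgm.like_conll", ".sgm.coref")


def produced_ids(tmp_dir):
    """Document ids found in tmp_dir, with the known suffixes stripped."""
    ids = set()
    for path in Path(tmp_dir).iterdir():
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                ids.add(path.name[:-len(suffix)])
    return ids


def listed_ids(split_dir, names=("train", "dev", "test")):
    """Union of the document ids listed in the split files."""
    ids = set()
    for name in names:
        ids.update((Path(split_dir) / name).read_text().split())
    return ids


# Tiny demo on throwaway data.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "split").mkdir()
    (root / "tmp").mkdir()
    (root / "split" / "train").write_text("doc_a\n")
    (root / "split" / "dev").write_text("doc_b\n")
    (root / "split" / "test").write_text("doc_c\n")
    for doc_id in ("doc_a", "doc_b", "doc_x"):
        for suffix in SUFFIXES:
            (root / "tmp" / (doc_id + suffix)).write_text("token\n")

    produced = produced_ids(root / "tmp")
    listed = listed_ids(root / "split")
    unlisted = sorted(produced - listed)  # on disk but in no split
    absent = sorted(listed - produced)    # in a split but not on disk

print(len(produced), len(listed))
print(unlisted, absent)
```

An empty `absent` list would mean every document named in the splits was produced, in which case the extra files can simply be left out of the train/dev/test folders.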



I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not have an LDC licence for ACE2005.
Could you by any chance share the dataset for research purposes?

Many thanks,

@daviddongkc Try asking LDC. We can't really share it ourselves for licensing reasons.

More generally, have you tried looking at lists of publicly available datasets? Is there anything that would work for your research?