huggingface / hmtl

🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP


Any details on how to set up the ACE2005 dataset?

parap1uie-s opened this issue · comments

Hi, I am confused about the setup of the ACE2005 dataset.

I got the dataset called ace_2005_td_v7_LDC2006T06, I'm aware of issue #2 about the dataset setup, and I tried the preprocess.py script you posted in this link:

#2

I set the parameters:

path_to_ace2005 = "ace_2005_td_v7/"
saving_path = "ace2005/"

However, I don't know how to split the data into train/val/test sets.

The script produces a lot of files with ‘.sgm.coref’ and ‘.sgm.like_conll’ extensions under ace2005/.

In addition, the script has been running for more than 12 hours on a decent server. Right now there are 800 files under ace2005/. Is that normal?

Thanks in advance.

Update:

The preprocess.py run finished after a few days.

Right now there are 1198 files under ace2005/

Hello,
Sorry for the late answer.

Hmm, this sounds really slow... On my laptop, the whole preprocessing takes ~15 minutes (I use a standard MacBook Pro, nothing fancy), so it's definitely surprising...
Have you measured which operations take the most time?
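(If it helps, running the script under Python's built-in profiler will show where the time goes; the invocation below assumes preprocess.py can be run directly as a script.)

python -m cProfile -s cumtime preprocess.py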
One way to speed things up is to pass disable=['parser', 'tagger', 'ner'] when loading the spaCy tokenizer. Note that I haven't tested it, nor optimized the script for runtime, since it's a one-off job.
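A minimal sketch of that change (the model name en_core_web_sm is an assumption, use whatever spaCy model the script actually loads):

import spacy

# Disabling the parser, tagger and NER keeps only tokenization,
# which is enough for producing tokens and is much faster.
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

doc = nlp("A quick sanity check that tokenization alone still works.")
print([token.text for token in doc])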

The main splits can be found here (repo from Miwa and Bansal).

That said, the parameters seem good. I like to set saving_path=<some_tmp_folder> since, as you noticed, the script produces individual preprocessed files (output in the CoNLL format for convenience). The next step is fairly simple: read the split files and copy them into train/dev/test folders.

mkdir train
for i in `cat split/train`; do cp "tmp/${i}.sgm.like_conll" train; done

For coref training examples, I found it easier to dump everything in the same file:

for i in `cat split/train`; do (cat "tmp/${i}.sgm.coref"; echo) >> single_file_train.gold_conll; done

(The last two snippets are bash commands).
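For reference, here is a rough Python equivalent of the two loops above, extended to all three splits (the tmp/ and split/ folder names are carried over from the commands above; the dev/test split file names are assumptions):

import shutil
from pathlib import Path

tmp_dir = Path("tmp")        # saving_path given to preprocess.py
split_dir = Path("split")    # split lists from Miwa and Bansal, one document id per line

for split in ["train", "dev", "test"]:
    out_dir = Path(split)
    out_dir.mkdir(exist_ok=True)
    with open(f"single_file_{split}.gold_conll", "w") as coref_file:
        for doc_id in (split_dir / split).read_text().splitlines():
            doc_id = doc_id.strip()
            if not doc_id:
                continue
            # CoNLL-like examples: one preprocessed file per document
            shutil.copy(tmp_dir / f"{doc_id}.sgm.like_conll", out_dir)
            # Coref examples: concatenate everything into one gold_conll file per split
            coref_file.write((tmp_dir / f"{doc_id}.sgm.coref").read_text())
            coref_file.write("\n")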

Hi,

It seems the script takes too much time.

However, I didn't trace the script step by step to measure what takes the time...

And I noticed the number of files in the splits, which is 511 = 351 + 80 + 80.

I got 1198 files under ace2005/, with ‘.sgm.coref’ and ‘.sgm.like_conll’ extensions.

Could this be due to a different version of the dataset?

In addition, the bash commands work perfectly.

Thanks!

Hi,

I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, the LDC license for ACE2005 is not available at my university.
May I know if you could by any chance share the dataset for research purposes?

Many thanks,
Regards,
kc

@daviddongkc Try asking LDC. We can't really share it ourselves for licensing reasons.

More generally, have you tried looking at the list of publicly available datasets from https://huggingface.co/datasets? Is there anything that would work for your research?