Suggestions to run it against other datasets

Jurys22 opened this issue 3 years ago

Hi! I'm pretty new to deep learning and ASTE.

Could you suggest the steps needed to run this on another dataset?
Do I need to label my dataset following this data structure (https://github.com/xuuuluuu/SemEval-Triplet-data/blob/master/README.md#data-description)?
How can I modify the code on Colab for new datasets?
Any other advice?

Thank you

Chia Yew Ken · Answer 1 · Wed Dec 29 2021 15:49:20 GMT+0800 (China Standard Time)

Hi, yes, you would need to annotate your data in the same format. Inside the folder "aste/data/triplet_data", create a folder called "new_data" and put train.txt, dev.txt and test.txt in it. Then select the new dataset for training by changing line 11 of aste/main.sh to "--names new_data, " and line 12 to "--seeds 0, ".
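The folder setup described above can be sketched in Python as follows. This is only an illustration: the function name `install_dataset` and its parameters are hypothetical and not part of the repository, and the sketch assumes your three annotated split files already exist in some source directory. It does not edit aste/main.sh; that change still has to be made by hand as described.

```python
import shutil
from pathlib import Path

# File names the answer above asks for inside the new dataset folder.
SPLITS = ("train.txt", "dev.txt", "test.txt")


def install_dataset(source_dir, repo_root, name="new_data"):
    """Copy the three split files into <repo_root>/aste/data/triplet_data/<name>.

    Hypothetical helper for illustration; not provided by the repository.
    """
    source = Path(source_dir)
    target = Path(repo_root) / "aste" / "data" / "triplet_data" / name

    # Fail early if any of the three required files is missing.
    missing = [s for s in SPLITS if not (source / s).is_file()]
    if missing:
        raise FileNotFoundError(f"missing split files: {missing}")

    target.mkdir(parents=True, exist_ok=True)
    for split in SPLITS:
        shutil.copy(source / split, target / split)
    return target
```

For example, `install_dataset("my_annotations", "Span-ASTE")` would place the files under Span-ASTE/aste/data/triplet_data/new_data, where "my_annotations" stands for whatever folder holds your labeled files.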

Jurys22 · Answer 2 · Wed Jan 05 2022 00:20:59 GMT+0800 (China Standard Time)

Thank you!
I was reading a closed issue about data format, and I am wondering:

1 - Has the data format changed?
From:
Exactly=O as=O posted=O plus=O a=O great=O value=T-POS .=O####Exactly=O as=O posted=O plus=O a=O great=S value=O .=O####[([6], [5], 'POS')]
To:
Exactly as posted plus a great value . [([6], [5], 'POS')]

2 - Looking at the data generated in the Colab under Span-ASTE/aste/data/triplet_data/14lap, I see that the train, test and dev files have a similar structure:
Train
Not even safe mode boots .####Not=O even=O safe=T-NEG mode=T-NEG boots=O .=O####Not=S even=O safe=O mode=O boots=O .=O####[([2, 3], [0], 'NEG')]

Test
A lot of features and shortcuts on the MBP that I was never exposed to on a normal PC .####A=O lot=O of=O features=T-NEU and=O shortcuts=TT-NEU on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####A=O lot=S of=S features=O and=O shortcuts=O on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####[([3], [1, 2], 'NEU'), ([5], [1, 2], 'NEU')]

Eval
It was slow , locked up , and also had hardware replaced after only 2 months !####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=T-NEG replaced=O after=O only=O 2=O months=O !=O####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=O replaced=S after=O only=O 2=O months=O !=O####[([10], [11], 'NEG')]

Do I then need to manually label all three sets for the first tests on my dataset?
If so, once I am sure that it works on my type of dataset, should the final data format look something like the following? (I reuse the same sentence for the example, but of course the sentences will differ in the real scenario.)

Train:
Exactly as posted plus a great value . [([6], [5], 'POS')]

Test and Dev:
Exactly as posted plus a great value .

Thank you

Chia Yew Ken · Answer 3 · Mon Jan 10 2022 15:13:31 GMT+0800 (China Standard Time)

Hi, the data format that the training script needs is the same as in Span-ASTE/aste/data/triplet_data/14lap/train.txt, as in the sample below. The train, dev and test samples all use the same format.

I charge it at night and skip taking the cord with me because of the good battery life .####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=O battery=T-POS life=T-POS .=O####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=S battery=O life=O .=O####[([16, 17], [15], 'POS')]
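To check your own annotations against this format, a line can be decoded with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `parse_line` is hypothetical, and it assumes that fields are separated by "####", that the first field is the whitespace-tokenized sentence, and that the last field is a Python-style list of (target indices, opinion indices, polarity) tuples with zero-based token indices, as in the samples in this thread.

```python
import ast


def parse_line(line):
    """Return (tokens, triplets) where each triplet is (target, opinion, polarity) as text.

    Hypothetical helper for illustration; only the first and last fields are read.
    """
    fields = line.rstrip("\n").split("####")
    tokens = fields[0].split()
    # literal_eval safely parses the list of tuples without executing code.
    raw_triplets = ast.literal_eval(fields[-1])

    readable = []
    for target_idx, opinion_idx, polarity in raw_triplets:
        target = " ".join(tokens[i] for i in target_idx)
        opinion = " ".join(tokens[i] for i in opinion_idx)
        readable.append((target, opinion, polarity))
    return tokens, readable


# A shorter training line quoted earlier in this thread.
sample = (
    "Not even safe mode boots ."
    "####Not=O even=O safe=T-NEG mode=T-NEG boots=O .=O"
    "####Not=S even=O safe=O mode=O boots=O .=O"
    "####[([2, 3], [0], 'NEG')]"
)
tokens, triplets = parse_line(sample)
print(triplets)  # [('safe mode', 'Not', 'NEG')]
```

Printing the decoded target and opinion words next to each sentence is a quick way to spot off-by-one index errors in hand-labeled data.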

Chia Yew Ken · Answer 4 · Mon Jan 10 2022 16:43:55 GMT+0800 (China Standard Time)

Hi, to make it more convenient to apply to new datasets, you can omit the tags component of the annotation, and include just the sentence and triplet information, such as the sample below. Each line in the train, dev and test set can have the same format.

I charge it at night and skip taking the cord with me because of the good battery life .#### #### ####[([16, 17], [15], 'POS')]
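If your annotations live in some other structure, lines in this shortened format can be generated programmatically. The sketch below is an assumption-laden illustration rather than repository code: `format_line` is a hypothetical name, and it assumes tokens are joined with single spaces and that the two omitted tag fields are each written as a single space between "####" separators, exactly as in the sample above.

```python
def format_line(tokens, triplets):
    """Build one dataset line from tokens and (target_idx, opinion_idx, polarity) triplets.

    Hypothetical helper for illustration; indices are zero-based token positions.
    """
    for target_idx, opinion_idx, _polarity in triplets:
        for i in list(target_idx) + list(opinion_idx):
            # Catch annotations that point outside the sentence.
            if not 0 <= i < len(tokens):
                raise IndexError(f"token index {i} is out of range")

    sentence = " ".join(tokens)
    # repr of a list of (list, list, str) tuples matches the format in the samples.
    triplet_field = repr([(list(t), list(o), p) for t, o, p in triplets])
    return f"{sentence}#### #### ####{triplet_field}"


tokens = ("I charge it at night and skip taking the cord with me "
          "because of the good battery life .").split()
line = format_line(tokens, [([16, 17], [15], "POS")])
print(line)
```

Writing one such line per sentence into train.txt, dev.txt and test.txt gives files in the format described in this answer.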