chiayewken / Span-ASTE

Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Suggestions to run it against other datasets

Jurys22 opened this issue

Hi! I'm pretty new to deep learning and ASTE.

Could you please suggest the steps needed to run this on another dataset?
Do I need to label my dataset following this data structure (https://github.com/xuuuluuu/SemEval-Triplet-data/blob/master/README.md#data-description)?
How can I modify the Colab code for new datasets?
Any other advice?

Thank you

Hi, yes, you would need to annotate the data in the same format. In the folder "aste/data/triplet_data", create a folder called "new_data" and put train.txt, dev.txt and test.txt inside. Then specify the new dataset for training by changing line 11 of aste/main.sh to "--names new_data, " and line 12 to "--seeds 0, ".
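Before editing aste/main.sh, it can help to confirm that the new folder really contains the three split files. The snippet below is only a minimal sketch: the helper name `missing_split_files` is hypothetical and not part of the repository, and the path in the usage note assumes you run it from the repository root.

```python
from pathlib import Path

# The three split files named in the reply above.
REQUIRED_FILES = ("train.txt", "dev.txt", "test.txt")


def missing_split_files(dataset_dir):
    """Return the required split files that are absent from dataset_dir."""
    folder = Path(dataset_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

For example, `missing_split_files("aste/data/triplet_data/new_data")` should return an empty list once all three files are in place.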

Thank you!
I was reading a closed issue about data format, and I am wondering:

1 - Has the data format changed?
From:
Exactly=O as=O posted=O plus=O a=O great=O value=T-POS .=O####Exactly=O as=O posted=O plus=O a=O great=S value=O .=O####[([6], [5], 'POS')]
To:
Exactly as posted plus a great value . [([6], [5], 'POS')]

2 - Looking at the data generated in the Colab under Span-ASTE/aste/data/triplet_data/14lap, I see that the train, test and dev files have a similar structure:
Train
Not even safe mode boots .####Not=O even=O safe=T-NEG mode=T-NEG boots=O .=O####Not=S even=O safe=O mode=O boots=O .=O####[([2, 3], [0], 'NEG')]

Test
A lot of features and shortcuts on the MBP that I was never exposed to on a normal PC .####A=O lot=O of=O features=T-NEU and=O shortcuts=TT-NEU on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####A=O lot=S of=S features=O and=O shortcuts=O on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####[([3], [1, 2], 'NEU'), ([5], [1, 2], 'NEU')]

Eval
It was slow , locked up , and also had hardware replaced after only 2 months !####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=T-NEG replaced=O after=O only=O 2=O months=O !=O####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=O replaced=S after=O only=O 2=O months=O !=O####[([10], [11], 'NEG')]

Do I then need to manually label all three sets for the first tests on my dataset?
If so, once I am sure that it works on my type of dataset, should the final data format be something like the following? (I use the same sentence in both examples, but of course they would differ in a real scenario.)

Train:
Exactly as posted plus a great value . [([6], [5], 'POS')]

Test and Dev:
Exactly as posted plus a great value .

Thank you

Hi, the data format that the training script needs is the same as in Span-ASTE/aste/data/triplet_data/14lap/train.txt, as in the sample below. The train, dev and test samples all use the same format.

I charge it at night and skip taking the cord with me because of the good battery life .####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=O battery=T-POS life=T-POS .=O####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=S battery=O life=O .=O####[([16, 17], [15], 'POS')]
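To make the layout of such a line concrete, here is a rough sketch of how it could be read back: the fields are separated by "####", the first field is the tokenized sentence, and the last field is a Python-style list of (target indices, opinion indices, sentiment) triplets. The function names `parse_line` and `readable_triplets` are hypothetical and are not the repository's own loader. The sketch also assumes each index list is ordered, so the span runs from its first to its last index.

```python
import ast


def parse_line(line):
    """Split one annotated line into its tokens and its triplet list."""
    parts = line.rstrip("\n").split("####")
    tokens = parts[0].split()  # the sentence is already space-tokenized
    triplets = ast.literal_eval(parts[-1])  # e.g. [([16, 17], [15], 'POS')]
    return tokens, triplets


def readable_triplets(tokens, triplets):
    """Map index-based triplets to (target text, opinion text, label)."""
    result = []
    for target, opinion, label in triplets:
        target_text = " ".join(tokens[target[0]:target[-1] + 1])
        opinion_text = " ".join(tokens[opinion[0]:opinion[-1] + 1])
        result.append((target_text, opinion_text, label))
    return result
```

Applied to the sample above, the indices [16, 17] and [15] point at "battery life" and "good", so the triplet reads as ("battery life", "good", "POS"). Note that the indices are zero-based token positions.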

Hi, to make it more convenient to apply to new datasets, you can omit the tags component of the annotation, and include just the sentence and triplet information, such as the sample below. Each line in the train, dev and test set can have the same format.

I charge it at night and skip taking the cord with me because of the good battery life .#### #### ####[([16, 17], [15], 'POS')]
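If your annotations live in some other form, a small script can write them out in this simplified layout. The following is only a sketch under assumptions: `to_simplified_line` is a hypothetical helper, not part of the repository, and the set of allowed labels is taken from the samples in this thread (POS, NEG, NEU). It also checks that every index falls inside the sentence, which is an easy mistake to make when labeling by hand.

```python
ALLOWED_LABELS = {"POS", "NEG", "NEU"}  # labels seen in the samples in this thread


def to_simplified_line(tokens, triplets):
    """Build 'sentence#### #### ####[triplets]' from tokens and triplets."""
    for target, opinion, label in triplets:
        for idx in list(target) + list(opinion):
            if not 0 <= idx < len(tokens):
                raise ValueError(f"index {idx} is outside the sentence")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {label!r}")
    # repr() of a list of (list, list, str) tuples matches the sample layout.
    triplet_text = repr([(list(t), list(o), lab) for t, o, lab in triplets])
    return " ".join(tokens) + "#### #### ####" + triplet_text
```

Writing one such line per sentence into train.txt, dev.txt and test.txt should then give files in the format described above.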