fine tune on a new dataset

un-lock-me opened this issue 3 years ago · comments

Hi @thu-coai @hzhwcmhf @MaLiN2223 @zqwerty @xiaotianzi @truthless11 and thanks so much for making your code available.
I want to fine-tune the model on a new dataset whose format is very similar to the IMDB dataset (each example has a couple of sentences and a positive/negative/neutral label). Could you please advise on what changes I need to make?

I appreciate your time and help :).

mg · Answer 1 · Thu Oct 07 2021 13:58:13 GMT+0800 (China Standard Time)

Another question: to preprocess the new dataset, do I need to run all the scripts at this link: https://github.com/thu-coai/SentiLARE/tree/master/preprocess
If so, is there a particular order in which to run them?

Thanks :)

kepei1106 · Answer 2 · Wed Nov 17 2021 15:19:23 GMT+0800 (China Standard Time)

Hi, I suggest following these steps to adapt our code to your own dataset:

1. Prepare your own dataset in the same format as our provided raw datasets, such as IMDB. The links to download the raw and preprocessed datasets are provided in the README.
2. Preprocess the raw dataset with our code. If your task is sentence-level sentiment classification, you should refer to prep_sent.py. You may need additional files such as SentiWordNet and the representations of its glosses. We have mentioned this in our code.
3. Run the classification code on your own dataset just as on IMDB. Some arguments, such as the data path, may need to be modified.
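
As a rough illustration of the first step above (preparing the dataset), here is a minimal Python sketch that maps the three label names to integer ids and splits the examples into train/dev/test sets. The label ids, the split fractions, and the function names `encode_examples` and `split_examples` are assumptions made for illustration, not part of the SentiLARE code. The exact file layout should be copied from the raw IMDB files linked in the README.

```python
import random

# Hypothetical mapping: the ids the classification code expects are an
# assumption and should be checked against the provided raw datasets.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_examples(rows):
    """Turn (text, label_name) pairs into (text, label_id) pairs."""
    encoded = []
    for text, label in rows:
        key = label.strip().lower()
        if key not in LABEL_TO_ID:
            raise ValueError(f"unknown label: {label!r}")
        # Collapse runs of whitespace so each example fits on one line.
        encoded.append((" ".join(text.split()), LABEL_TO_ID[key]))
    return encoded


def split_examples(examples, dev_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle with a fixed seed, then split into train/dev/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # Toy rows standing in for a real dataset.
    rows = [
        ("Great  movie, loved it.", "Positive"),
        ("It was fine, nothing special.", "neutral"),
        ("Terrible plot and acting.", "NEGATIVE"),
    ]
    train, dev, test = split_examples(encode_examples(rows))
    print(len(train), len(dev), len(test))
```

Because the dataset described in the question has three labels rather than two, it is also worth checking whether the number of classes is set anywhere in the classification code or its arguments.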

Hope this can help you.