Work With Other Datasets

antmarakis opened this issue 3 years ago

Antonis Maronikolakis commented 3 years ago

Hi! This looks like a very interesting tool, and I am wondering whether it would be easy to use on other datasets. I see that only GLUE/NLI datasets are supported. Do you have any tips on how to use this on a simple {TEXT, LABEL} task? Thanks!

douglashiwo · Answer 1 · Thu Mar 17 2022 18:58:37 GMT+0800 (China Standard Time)

I have the same question as antmarakis. Could you kindly help?

Lukas Moldon · Answer 2 · Mon Jan 16 2023 04:55:15 GMT+0800 (China Standard Time)

Just sharing my experience with this repo, maybe this helps someone in the future:

I used this repo for a {TEXT, LABEL} task with BERT models. Since neither this type of task nor this type of model is supported in the training section of this repo, I would recommend first training your model on your dataset on your own (without using the code of this repo). While training, save the logits of each data instance after every epoch, together with the gold standard label and a unique identifier, as suggested by the authors (see "Note:" section).
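A minimal sketch of that saving step, in case it helps. It assumes the layout I understand the authors' note to describe: one JSONL file per epoch, one record per training instance with the keys "guid", "logits_epoch_N" and "gold". The function name write_epoch_dynamics, the file name pattern and the key names are my assumptions here, so please check them against the repo's note before relying on them.

```python
import json
from pathlib import Path


def write_epoch_dynamics(out_dir, epoch, guids, logits, golds):
    """Write one JSONL file of training dynamics for a single epoch.

    guids  -- unique identifier per training instance
    logits -- one list of raw class scores per instance
    golds  -- integer index of the gold label per instance
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed file name pattern; verify against the repo's "Note:" section.
    path = out_dir / f"dynamics_epoch_{epoch}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for guid, inst_logits, gold in zip(guids, logits, golds):
            record = {
                "guid": guid,
                f"logits_epoch_{epoch}": [float(x) for x in inst_logits],
                "gold": int(gold),
            }
            f.write(json.dumps(record) + "\n")
    return path
```

In a training loop you would collect the logits for every training instance during (or right after) each epoch, convert them to plain lists, and call write_epoch_dynamics("training_dynamics", epoch, guids, logits, golds) once per epoch.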

After training, you can use train_dy_filtering, as explained here, to generate DataMaps and to obtain coordinates for further data filtering. You just need to extend this line of code with an additional name, which you then use as the task name. Then, from the main directory of this repo, you can create the DataMap by calling:

python -m cartography.selection.train_dy_filtering --plot --task_name "YOUR_NEW_TASK_NAME" --model "ANY_NAME_YOU_WANT" --model_dir ""

Make sure to create a training_dynamics folder in the main directory that contains your training dynamics. The plot is saved automatically in the cartography folder, while the coordinates are written to the main directory (this can be changed with other arguments such as --plots_dir).
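For anyone wondering what the coordinates are: as I understand the Dataset Cartography definitions, each instance gets a confidence (mean probability of the gold label across epochs), a variability (standard deviation of that probability) and a correctness (fraction of epochs in which the prediction was right). The sketch below computes these for one instance from its per-epoch logits. It is an illustration under those assumptions, not the repo's own code, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def datamap_coordinates(logits_per_epoch, gold):
    """Compute DataMap coordinates for a single training instance.

    logits_per_epoch -- list of logit lists, one per epoch
    gold             -- integer index of the gold label
    """
    # Probability assigned to the gold label after each epoch.
    gold_probs = [softmax(l)[gold] for l in logits_per_epoch]
    # Whether the highest-scoring class matched the gold label in each epoch.
    correct = [
        max(range(len(l)), key=l.__getitem__) == gold
        for l in logits_per_epoch
    ]
    return {
        "confidence": mean(gold_probs),
        "variability": pstdev(gold_probs),  # population standard deviation
        "correctness": sum(correct) / len(correct),
    }
```

For example, datamap_coordinates([[2.0, 0.0], [0.0, 2.0]], gold=0) describes an instance the model got right in one epoch and wrong in the other, so its confidence and correctness both come out at one half.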

Pritam Kadasi · Answer 3 · Wed May 03 2023 18:40:14 GMT+0800 (China Standard Time)

Fixing the errors in this repo is more time-consuming than extracting training dynamics from a model that is trained independently :)