allenai / cartography

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Work With Other Datasets

antmarakis opened this issue

Hi! This looks like a very interesting tool, I am wondering if it would be easy to use on other datasets. I see only GLUE/NLI datasets are supported. Do you have any tips on how to use this on a simple {TEXT, LABEL} task? Thanks!

I have the same question as antmarakis. Could you kindly help?

Just sharing my experience with this repo, maybe this helps someone in the future:

I used this repo for a {TEXT, LABEL} task with BERT models. Since neither this type of task nor this type of model is supported in the training section of this repo, I would recommend first training your model on your dataset on your own (without using the code of this repo). While training, save the logits of each data instance after every epoch, together with the gold standard label and a unique identifier, as suggested by the authors (see "Note:" section).
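To make the saving step more concrete, here is a minimal sketch of how the per-epoch logits could be dumped from your own training loop. The function name write_epoch_dynamics is hypothetical, and the file layout (training_dynamics/dynamics_epoch_N.jsonl) and record keys ("guid", "logits_epoch_N", "gold") are assumptions based on my reading of the "Note:" section, so please check them against the README before relying on them.

```python
import json
from pathlib import Path


def write_epoch_dynamics(model_dir, epoch, guids, logits, golds):
    """Write one JSONL file of training dynamics for a single epoch.

    Assumed layout: <model_dir>/training_dynamics/dynamics_epoch_<epoch>.jsonl
    Assumed record keys: "guid", "logits_epoch_<epoch>", "gold".
    """
    out_dir = Path(model_dir) / "training_dynamics"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"dynamics_epoch_{epoch}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for guid, instance_logits, gold in zip(guids, logits, golds):
            record = {
                "guid": guid,  # unique identifier of the training instance
                f"logits_epoch_{epoch}": [float(x) for x in instance_logits],
                "gold": int(gold),  # index of the gold label in the logits
            }
            f.write(json.dumps(record) + "\n")
    return out_path
```

You would call this once at the end of every epoch with the logits collected for all training instances, converting tensors to plain lists first (for example with .tolist() in PyTorch).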

After training, you can use train_dy_filtering, as explained here, to generate DataMaps and to obtain the coordinates for further data filtering. You just need to extend this line of code with an additional name, which you then use as the task name. Then, from the main directory of this repo, call python -m cartography.selection.train_dy_filtering --plot --task_name "YOUR_NEW_TASK_NAME" --model "ANY_NAME_YOU_WANT" --model_dir "" to create the DataMap. Make sure the main directory contains a training_dynamics folder holding your training dynamics. The plot is saved automatically in the cartography folder, while the coordinates are stored in the main directory (these locations can be changed with other arguments such as --plots_dir).
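In case it helps to see what the coordinates mean, here is a rough sketch of the three statistics from the Dataset Cartography paper for a single instance: confidence (mean probability of the gold label across epochs), variability (standard deviation of that probability) and correctness (fraction of epochs in which the prediction was correct). This is not the repo's own code. The function names are hypothetical, and I am assuming the population standard deviation.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def datamap_coordinates(logits_per_epoch, gold):
    """Return (confidence, variability, correctness) for one instance.

    logits_per_epoch: one list of logits per training epoch.
    gold: index of the gold label.
    """
    gold_probs = []
    correct = 0
    for logits in logits_per_epoch:
        probs = softmax(logits)
        gold_probs.append(probs[gold])
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == gold:
            correct += 1
    n = len(gold_probs)
    confidence = sum(gold_probs) / n
    # Population standard deviation of the gold-label probability (assumption).
    variability = math.sqrt(sum((p - confidence) ** 2 for p in gold_probs) / n)
    correctness = correct / n
    return confidence, variability, correctness
```

Instances with high variability are the "ambiguous" ones in the paper's terminology, high confidence with low variability are "easy-to-learn", and low confidence with low variability are "hard-to-learn".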

Fixing errors in this repo is more time-consuming than extracting training dynamics from a model that is trained independently :)