Deduplication against evaluation sets
nopperl opened this issue
nopperl commented
Could you publish the script used to deduplicate CommonPool against the datasets used for evaluation (mentioned in Appendix F in the paper)?
Samir Yitzhak Gadre commented
Hi @nopperl! Please see this repo for our dataset pre-processing implementation. If you are only interested in de-contamination against eval sets, this modified yml should be all you need:
```yaml
models: # model directives, specifying the models to instantiate
  - dedup-isc-ft-v107
postprocess_columns: # postprocessing directives
  - dedup-isc-ft-v107-score
additional_fields: # fields in a webdataset json to carry over into the metadata
  - uid
nworkers: 2
batch_size: 512
device: 0
input_tars: "path/to/my/tars/000057{17..19}.tar" # braceexpand supported; can also be s3 paths
output_metadata_dir: "path/to/my/output/metadata" # can be an arbitrary path
custom_pypath: null # if a model, preprocessor, or postprocessor is not known, look in this python file for a user-provided custom implementation
reprocess: True # if true, process from scratch; otherwise only process tars not already processed
```
This should produce output parquet files with `uid` and `dedup-isc-ft-v107-score` fields. By thresholding the latter field at 0.604169, you can then recover the desired uids.
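The thresholding step could be sketched as follows. This is a minimal illustration, not code from the repo: the `filter_uids` helper is hypothetical, and the assumption that a *higher* `dedup-isc-ft-v107-score` indicates a likely eval-set duplicate (so uids are kept when the score is *below* the threshold) is an inference from context, not confirmed by the thread.

```python
import pandas as pd

# Threshold quoted in the comment above.
DEDUP_THRESHOLD = 0.604169

def filter_uids(df: pd.DataFrame, threshold: float = DEDUP_THRESHOLD) -> list:
    """Return uids whose dedup score falls below the threshold.

    Assumes higher dedup-isc-ft-v107-score = more likely a near-duplicate
    of an eval-set image; this direction is an assumption.
    """
    keep = df[df["dedup-isc-ft-v107-score"] < threshold]
    return keep["uid"].tolist()

# In practice the DataFrame would come from the produced metadata, e.g.
#   df = pd.read_parquet(path_to_output_metadata)  # path is illustrative
df = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "dedup-isc-ft-v107-score": [0.1, 0.9, 0.5],
})
print(filter_uids(df))  # → ['a', 'c']
```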