Deduplication against evaluation sets
nopperl opened this issue
nopperl commented
Could you publish the script used to deduplicate CommonPool against the datasets used for evaluation (mentioned in Appendix F in the paper)?
Samir Yitzhak Gadre commented
Hi @nopperl! Please see this repo for our dataset pre-processing implementation. If you are only interested in de-contamination against eval sets, this modified yml should be all you need:
```yaml
models: # model directives, specifying the models to instantiate
  - dedup-isc-ft-v107
postprocess_columns: # postprocessing directives
  - dedup-isc-ft-v107-score
additional_fields: # fields in a webdataset json to carry over into the metadata
  - uid
nworkers: 2
batch_size: 512
device: 0
input_tars: "path/to/my/tars/000057{17..19}.tar" # braceexpand supported; can also be s3 paths
output_metadata_dir: "path/to/my/output/metadata" # can be an arbitrary path
custom_pypath: null # if a model, preprocessor, or postprocessor is not known, look in this python file for a user-provided custom implementation
reprocess: True # if true, process from scratch; otherwise only process tars not already processed
```
This should produce output parquet files with `uid` and `dedup-isc-ft-v107-score` fields. By thresholding the latter field at 0.604169, you can then recover the desired uids.
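The thresholding step could be sketched as follows. This is a minimal illustration, not code from the repo: the `filter_uids` helper is hypothetical, and the assumption that a *higher* `dedup-isc-ft-v107-score` indicates a likely eval-set duplicate (so uids are kept when the score is *below* the threshold) is an inference from context, not confirmed by the thread.

```python
import pandas as pd

# Threshold quoted in the comment above.
DEDUP_THRESHOLD = 0.604169

def filter_uids(df: pd.DataFrame, threshold: float = DEDUP_THRESHOLD) -> list:
    """Return uids whose dedup score falls below the threshold.

    Assumes higher dedup-isc-ft-v107-score = more likely a near-duplicate
    of an eval-set image; this direction is an assumption.
    """
    keep = df[df["dedup-isc-ft-v107-score"] < threshold]
    return keep["uid"].tolist()

# In practice the DataFrame would come from the produced metadata, e.g.
#   df = pd.read_parquet(path_to_output_metadata)  # path is illustrative
df = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "dedup-isc-ft-v107-score": [0.1, 0.9, 0.5],
})
print(filter_uids(df))  # → ['a', 'c']
```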