uclaml / SPIN

The official implementation of Self-Play Fine-Tuning (SPIN)

Home Page: https://uclaml.github.io/SPIN/

The number of data rows looks wrong

guozhiyao opened this issue

I downloaded [train_prefs-00000-of-00001.parquet](https://huggingface.co/datasets/UCLA-AGI/SPIN_iter0/blob/main/train_prefs-00000-of-00001.parquet) and loaded it with:

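```python
import pdb

from datasets import load_dataset

# Load the local copy of SPIN_iter0 (the directory that holds the parquet file).
data = load_dataset("/data/oss_bucket_0/yunying/paper/SPIN/data/SPIN_iter0/", split="train")
print(len(data))  # total number of rows

# Collect the string form of every row to count the distinct rows.
all_val = set()
for line in data:
    all_val.add(str(line))
print(len(all_val))  # number of distinct rows
pdb.set_trace()  # drop into the debugger to inspect the data
```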

`len(data)` is 99584, but `len(all_val)` is 49792, which is exactly half. That suggests each row may appear twice. Is that expected?
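To see how the duplicates are distributed, rather than only comparing two totals, the check above can be extended to count how often each distinct row occurs. The following is a minimal sketch, not part of the SPIN code: `duplicate_report` is a hypothetical helper, the toy rows and their field names are made up for illustration, and it assumes every row is a JSON-serializable dict.

```python
import json
from collections import Counter


def duplicate_report(rows):
    """Return (total rows, distinct rows, {occurrences: number of distinct rows})."""
    # Serialize each row with sorted keys so equal dicts map to the same string.
    counts = Counter(json.dumps(row, sort_keys=True) for row in rows)
    total = sum(counts.values())
    # Histogram: how many distinct rows occur once, twice, and so on.
    multiplicity = Counter(counts.values())
    return total, len(counts), dict(multiplicity)


# Toy stand-in for the dataset (hypothetical rows, not real SPIN data).
toy_rows = [
    {"real": "a", "generated": "b"},
    {"real": "c", "generated": "d"},
    {"real": "a", "generated": "b"},
    {"real": "c", "generated": "d"},
]
total, unique, multiplicity = duplicate_report(toy_rows)
print(total, unique, multiplicity)  # 4 2 {2: 2}
```

Passing the loaded `data` object instead of `toy_rows` should work the same way, since iterating over it yields one dict per row. If the result contained only the key `2`, every distinct row would appear exactly twice; any other keys would point to an uneven duplication.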