The number of data samples is wrong
guozhiyao commented
I downloaded [train_prefs-00000-of-00001.parquet](https://huggingface.co/datasets/UCLA-AGI/SPIN_iter0/blob/main/train_prefs-00000-of-00001.parquet)
and loaded it with:
import pdb
from datasets import load_dataset
data = load_dataset("/data/oss_bucket_0/yunying/paper/SPIN/data/SPIN_iter0/", split="train")
print(len(data))
all_val = set()
for line in data:
    all_val.add(str(line))
print(len(all_val))
pdb.set_trace()
`len(data)` is 99584, but `len(all_val)` is only 49792, which is exactly half of `len(data)`. Is that expected?
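To narrow this down, it may help to count how many copies each distinct row has, rather than only the number of distinct rows. Below is a minimal sketch of that check. The helper `duplicate_summary` and the toy rows (with made-up `prompt` and `chosen` columns) are hypothetical and stand in for the real dataset; the same function should also accept the `data` object from the snippet above, since it only iterates over dict-like rows.

```python
import json
from collections import Counter


def duplicate_summary(rows):
    """Return (total rows, distinct rows, histogram of copy counts).

    Hypothetical helper for illustration; not part of the SPIN code base.
    """
    # Serialize each row with sorted keys so identical rows map to the same string.
    counts = Counter(json.dumps(row, sort_keys=True, default=str) for row in rows)
    total = sum(counts.values())
    # Histogram: number of copies -> how many distinct rows have that many copies.
    histogram = dict(Counter(counts.values()))
    return total, len(counts), histogram


# Toy stand-in for the dataset (column names are assumptions).
toy_rows = [
    {"prompt": "a", "chosen": "x"},
    {"prompt": "b", "chosen": "y"},
    {"prompt": "a", "chosen": "x"},
    {"prompt": "b", "chosen": "y"},
]

total, unique, histogram = duplicate_summary(toy_rows)
print(total, unique, histogram)
```

If the histogram for the real data contains only the key `2`, every distinct row is present exactly twice, which would suggest the file (or the directory being loaded) was read twice rather than the data containing scattered duplicates.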