Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".

Paper: https://arxiv.org/abs/2306.07174

how to build valid dataset

dasemiao opened this issue · comments

I made a Pile dataset, but how do I split off the validation set? With my self-made validation set, I always get the error "Is a directory: '/home/mdz/pywork/LongMem/pile_preprocessed_binary/valid'".

Can you provide more details to let me reproduce the error?

I think I have solved your problem.
I tried to generate a custom dataset following the format the author gives. However, when I trained a LongMem model, I hit the error "Is a directory: 'XXX/longmem/valid'". I believe the reason is that the code was written against an older version of fairseq, in which valid and test binaries were not required to run. I am running this code with fairseq 0.12, so you need to find the training script under the bundled fairseq subfolder, i.e. "xxx/longmem/fairseq/fairseq_cli/train.py", and comment out this code:

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    if cfg.dataset.combine_valid_subsets:
        task.load_dataset("valid", combine=True, epoch=1)
    else:
        for valid_sub_split in cfg.dataset.valid_subset.split(","):
            task.load_dataset(valid_sub_split, combine=False, epoch=1)

In the official code, it's on lines 128 through 133.
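Alternatively, instead of commenting out the validation loading, you can carve out a real valid split from your raw corpus before binarizing, so that train.py finds an actual "valid" dataset rather than a bare directory. A minimal sketch (the file names, split ratio, and helper name below are illustrative, not from the LongMem repo):

```python
import random

def split_corpus(src_path, train_path, valid_path, valid_frac=0.01, seed=0):
    """Split a one-document-per-line corpus into train/valid files.

    Both output files can then be binarized separately (e.g. as the
    --trainpref and --validpref inputs of fairseq-preprocess), which
    gives the trainer a real 'valid' subset to load.
    """
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(lines)
    n_valid = max(1, int(len(lines) * valid_frac))
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_valid])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_valid:])
    return len(lines) - n_valid, n_valid
```

This keeps the stock fairseq code path intact, at the cost of a small amount of training data held out for validation.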