Victorwz / LongMem

Official implementation of our NeurIPS 2023 paper "Augmenting Language Models with Long-Term Memory".

Paper: https://arxiv.org/abs/2306.07174

how to build valid dataset

dasemiao opened this issue · comments

I made a Pile dataset, but how do I split off the validation set? With my self-made validation set, I always get the error "Is a directory: '/home/mdz/pywork/LongMem/pile_preprocessed_binary/valid'".

Can you provide more details to let me reproduce the error?

I think I have solved your problem.
I tried to generate a custom dataset following the format the author gives. However, when I trained a LongMem model, I hit the error "Is a directory: 'XXX/longmem/valid'". I believe the reason is that the code was written against an older version of fairseq, in which valid and test binaries were not required to run. I am running this code with fairseq 0.12, so you need to find the training script under the bundled fairseq subfolder, i.e. "xxx/longmem/fairseq/fairseq_cli/train.py", and comment out this code:

    # Load valid dataset (we load training data below, based on the latest checkpoint)
    # We load the valid dataset AFTER building the model
    data_utils.raise_if_valid_subsets_unintentionally_ignored(cfg)
    if cfg.dataset.combine_valid_subsets:
        task.load_dataset("valid", combine=True, epoch=1)
    else:
        for valid_sub_split in cfg.dataset.valid_subset.split(","):
            task.load_dataset(valid_sub_split, combine=False, epoch=1)

In the official code, it's on lines 128 through 133.
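Alternatively, instead of commenting out the validation loading, you can carve out a real valid split from your raw corpus before binarizing, so that train.py finds an actual "valid" dataset rather than a bare directory. A minimal sketch (the file names, split ratio, and helper name below are illustrative, not from the LongMem repo):

```python
import random

def split_corpus(src_path, train_path, valid_path, valid_frac=0.01, seed=0):
    """Split a one-document-per-line corpus into train/valid files.

    Both output files can then be binarized separately (e.g. as the
    --trainpref and --validpref inputs of fairseq-preprocess), which
    gives the trainer a real 'valid' subset to load.
    """
    with open(src_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(lines)
    n_valid = max(1, int(len(lines) * valid_frac))
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:n_valid])
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[n_valid:])
    return len(lines) - n_valid, n_valid
```

This keeps the stock fairseq code path intact, at the cost of a small amount of training data held out for validation.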