google-research / FLAN

Code for mixing Enrico sets?

takiholadi opened this issue:

What is the proper way to mix the datasets provided by Enrico, and how large should the final mixture be?

Enrico sets: https://github.com/google-research/FLAN/tree/main/flan/v2#download
The mixture percentage: https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py

For now I use:

import datasets

cot_submix = datasets.load_dataset('conceptofmind/cot_submix_original')
dialog_submix = datasets.load_dataset('conceptofmind/dialog_submix_original')
niv2_submix = datasets.load_dataset('conceptofmind/niv2_submix_original')
flan2021_submix = datasets.load_dataset('conceptofmind/flan2021_submix_original')
t0_submix = datasets.load_dataset('conceptofmind/t0_submix_original')

cot_zsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
cot_fsopt = cot_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

dialog_zsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
dialog_fsopt = dialog_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

niv2_zsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
niv2_fsopt = niv2_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')

flan_zsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
flan_fsopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
flan_zsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
flan_fsnoopt = flan2021_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

t0_zsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_opt')
t0_fsopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_opt')
t0_zsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'zs_noopt')
t0_fsnoopt = t0_submix['train'].filter(lambda x: x['template_type'] == 'fs_noopt')

all_datasets = [
    flan_zsopt,
    flan_fsopt,
    flan_zsnoopt,
    flan_fsnoopt,
    #
    t0_zsopt,
    t0_fsopt,
    t0_zsnoopt,
    t0_fsnoopt,
    #
    niv2_zsopt,
    niv2_fsopt,
    #
    cot_zsopt,
    cot_fsopt,
    #
    dialog_zsopt,
    dialog_fsopt,
]

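probabilities = [
    # Same order as all_datasets. flan2021: 40%, split evenly across 4 template types
    0.4/4, 0.4/4, 0.4/4, 0.4/4,
    # t0: 32%, split evenly across 4 template types
    0.32/4, 0.32/4, 0.32/4, 0.32/4,
    # niv2: 20%, split evenly across 2 template types
    0.2/2, 0.2/2,
    # cot: 5%, split evenly across 2 template types
    0.05/2, 0.05/2,
    # dialog: 3%, split evenly across 2 template types
    0.03/2, 0.03/2,
]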

flan2022_submix = datasets.interleave_datasets(
    datasets=all_datasets,
    probabilities=probabilities,
    seed=567,
    stopping_strategy='first_exhausted',
)

flan2022_submix.to_csv('flan2022_submix.csv')
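
As a sanity check on the weights, here is a minimal sketch that rebuilds the per-template probabilities from the submix weights used in the code above and confirms they sum to 1. The dictionary names and the helper `per_template_probabilities` are hypothetical and only for illustration; the weights and template counts are copied from the code above, not taken independently from `run_example.py`.

```python
import math

# Submix weights copied from the probabilities list in the post (assumed, not re-derived).
submix_weights = {"flan2021": 0.40, "t0": 0.32, "niv2": 0.20, "cot": 0.05, "dialog": 0.03}

# Number of template types kept per submix, matching the filters in the post.
template_counts = {"flan2021": 4, "t0": 4, "niv2": 2, "cot": 2, "dialog": 2}


def per_template_probabilities(weights, counts):
    """Split each submix weight evenly across its template types (hypothetical helper)."""
    probs = []
    for name, weight in weights.items():
        probs.extend([weight / counts[name]] * counts[name])
    return probs


probabilities = per_template_probabilities(submix_weights, template_counts)

# One probability per filtered dataset, and the whole list must sum to 1.
assert len(probabilities) == sum(template_counts.values())
assert math.isclose(sum(probabilities), 1.0)
```

Because Python dictionaries preserve insertion order, the resulting list follows the same submix order as `all_datasets` above.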

The final dataset has 3,699,512 examples.

Is this correct?
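
On the size question: with `stopping_strategy='first_exhausted'`, interleaving stops as soon as any one source runs out of examples, so the final size is driven by whichever source has the smallest ratio of row count to sampling probability. The sketch below is a rough estimate under that assumption, not a reproduction of the library's sampling; the function name and the toy row counts are hypothetical and are not the real submix sizes.

```python
def estimate_first_exhausted_size(sizes, probabilities):
    """Rough expected total size when sampling stops at the first exhausted source."""
    # Source i is expected to run out after about sizes[i] / probabilities[i] total draws,
    # so the mixture ends near the smallest of those ratios.
    return int(min(n / p for n, p in zip(sizes, probabilities)))


# Hypothetical row counts and probabilities, for illustration only.
toy_sizes = [1000, 5000, 300]
toy_probs = [0.5, 0.25, 0.25]

estimate = estimate_first_exhausted_size(toy_sizes, toy_probs)
```

Plugging in the real row counts of the 14 filtered datasets and the probabilities above should give a figure in the same range as the observed size; the exact number depends on the random seed.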

@takiholadi Yes, this looks correct!

@takiholadi Do you use the output as is, or do you make the prompts uniform across the dataset?