google-research / FLAN

[Question] What license is used for the FLAN dataset (not the code)?

quq99 opened this issue

Hi,

Thanks a lot for open-sourcing the code to fetch the FLAN dataset.

I noticed that in the paper "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning" (https://arxiv.org/abs/2301.13688) you mentioned:

"to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at this https URL."

I noticed that this repo uses the Apache 2.0 license. Is the FLAN dataset fetched by the code also under the Apache 2.0 license?

Thanks a lot!

@quq99 Good question. As the Flan Collection (or P3, or Natural Instructions v2) is a compilation of hundreds of different datasets, with many different licenses, the rendered data would not be under Apache 2.0.

I am actually working on a full labelling of the dataset licenses and plan to release this publicly soon, so that users can take the subset of Flan that fits their licensing constraints.
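As a rough illustration of what taking a license-constrained subset could look like once per-dataset labels exist, here is a minimal sketch. Everything in it is an assumption made for illustration: the `DatasetEntry` type, the `filter_by_license` helper, the placeholder dataset names, and the license strings are hypothetical and do not reflect the actual labels or any API in this repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical record pairing a dataset name with its license label."""
    name: str
    license: str  # SPDX-style identifier, or "unknown" if unlabeled


def filter_by_license(entries, allowed):
    """Keep only the entries whose license is in the allowed set (case-insensitive)."""
    allowed_normalized = {lic.lower() for lic in allowed}
    return [e for e in entries if e.license.lower() in allowed_normalized]


# Placeholder entries; these are NOT real Flan datasets or real license labels.
entries = [
    DatasetEntry("example_task_a", "Apache-2.0"),
    DatasetEntry("example_task_b", "CC-BY-NC-4.0"),
    DatasetEntry("example_task_c", "MIT"),
    DatasetEntry("example_task_d", "unknown"),
]

# Select the subset that fits a (hypothetical) permissive-only constraint.
permissive = filter_by_license(entries, {"Apache-2.0", "MIT"})
print([e.name for e in permissive])
```

Treating unlabeled datasets as excluded by default, as this sketch does, is the conservative choice when the constraint is a legal one.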

@shayne-longpre Thanks a lot! Looking forward to that. When you finish, could you reply in this issue so that I know? I appreciate your work!

@quq99 Update: we plan to release this in the last week of May.

@shayne-longpre Looking forward to the license-labeled dataset. Thanks for the effort!

@shayne-longpre Any update on the license labeling mentioned above? Were you able to complete it?

@balachandarsv apologies again for the wait on this. It turns out license labelling is much more complex than we had originally anticipated.

It has gone from a side project into my next major release, with a lot more data selection/partitioning features being added, not just for Flan, but a lot of relevant data sources. It's tentatively slated for mid-July. I hope this isn't too inconvenient and apologies again on the delay.

@shayne-longpre No problem at all. Please let me know if you need help sorting the data by license. I will be happy to help! :-)

Hi @shayne-longpre, thanks for labeling all the licenses in the Flan Collection! I'm a bit confused about the Flan-T5 models' Apache-2.0 license: if some datasets in the Flan Collection have to be removed due to license constraints, how can the Flan-T5 models be released under Apache-2.0? Were they trained only on permissively licensed datasets?