mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference


Processed Orca dataset for LLaMA is not published

szutenberg opened this issue

Hi,

It was promised that the processed dataset would be published in the MLCommons cloud.

However, the instructions in https://github.com/mlcommons/inference/tree/master/language/llama2-70b still explain how to generate the dataset on our own (with processorca.py).

The md5sum of the generated file (5fe8be0a7ce5c3c9a028674fd24b00d5) is also not present in the README, which may cause problems during the review or audit process (what if a submitter generates a slightly different dataset that slightly impacts the results?). IMHO, at least the md5sum should be present in the README.
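For illustration, this is roughly how a submitter could verify the generated file against a published checksum (a minimal sketch; the filename below is an assumption about the processorca.py output and may differ on your setup):

```python
import hashlib

# Assumed output filename from processorca.py; adjust to your export directory.
DATASET_PATH = "open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
# md5sum reported in this issue for the generated dataset.
EXPECTED_MD5 = "5fe8be0a7ce5c3c9a028674fd24b00d5"

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = md5_of_file(DATASET_PATH)
    print(f"md5sum: {actual}")
    assert actual == EXPECTED_MD5, "Generated dataset does not match the reference checksum"
```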

CC @pgmpablo157321 @nv-alicheng @ashwin @attafosu

Hi Michal, I do think the link to sign the consent form (which directs the user to the gdrive link) should be posted: https://docs.google.com/forms/d/e/1FAIpQLSc_8VIvRmXM3I8KQaYnKf7gy27Z63BBoI_I1u02f4lw6rBp3g/viewform

@pgmpablo157321 can you help add that?

Hi @nvzhihanj ,

Thanks. Yes, I agree that the link to the consent form should also be in the README. However, this issue is about the dataset (24576 samples), not the weights.

We would like to make sure that:

  • submitters do use the same dataset for their submission
  • it's easy for an organization that is not an MLCommons member to reproduce the results without generating the dataset itself (the weights and tokenizer can be obtained directly from HF after signing the Meta license)

@szutenberg the dataset is also in the gdrive.
For reproducing the dataset, I think https://github.com/mlcommons/inference/tree/master/language/llama2-70b#get-dataset provides guidance on generating the pkl file. Are you suggesting that the md5sum of the pickle file be added to the README?
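For example, a quick sanity check on a regenerated pkl file could look like the sketch below (assuming the file is a pandas DataFrame pickle with the 24576 samples discussed above; the path and schema details are assumptions, not confirmed by the repo):

```python
import pandas as pd

# Assumed path to the file produced by processorca.py; adjust as needed.
DATASET_PATH = "open_orca_gpt4_tokenized_llama.sampled_24576.pkl"

# Load the processed dataset (assumed to be a pandas DataFrame pickle).
df = pd.read_pickle(DATASET_PATH)

# The processed OpenOrca dataset discussed in this issue should contain 24576 samples.
assert len(df) == 24576, f"Unexpected sample count: {len(df)}"

# Inspect the schema and a few rows to compare against a reference copy.
print(df.columns.tolist())
print(df.head(3))
```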

@szutenberg Inside the MLCommons Llama 2 link (obtained after filling out the form), there is a copy of the preprocessed dataset. Is it enough to post the link to the form and the instructions for getting the preprocessed dataset, or do we ideally want to share the dataset directly? The confidentiality restriction applies to the Llama 2 weights, so the latter should be possible. I just want to be very clear about what I have to do and what to ask for.

@pgmpablo157321 ideally we would share the dataset directly.

IMHO an md5sum in the README is a must-have since it's our own custom dataset. I'm just not sure whether, two years from now, these repro steps will generate exactly the same output...

Solved by #1638