mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference


Processed Orca dataset for LLaMA is not published

szutenberg opened this issue

Hi,

It was promised that the processed dataset would be published in the MLCommons cloud.

However, the instructions in https://github.com/mlcommons/inference/tree/master/language/llama2-70b still explain how to generate the dataset on our own (with processorca.py).

The md5sum of the generated file (5fe8be0a7ce5c3c9a028674fd24b00d5) is also not present in the README, which may cause problems during the review or audit process (what if a submitter generates a slightly different dataset that slightly impacts the results?). IMHO, at least the md5sum should be present in the README.
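For illustration, this is roughly how a submitter could verify the generated file against a published checksum (a minimal sketch; the filename below is an assumption about the processorca.py output and may differ on your setup):

```python
import hashlib

# Assumed output filename from processorca.py; adjust to your export directory.
DATASET_PATH = "open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
# md5sum reported in this issue for the generated dataset.
EXPECTED_MD5 = "5fe8be0a7ce5c3c9a028674fd24b00d5"

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = md5_of_file(DATASET_PATH)
    print(f"md5sum: {actual}")
    assert actual == EXPECTED_MD5, "Generated dataset does not match the reference checksum"
```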

CC @pgmpablo157321 @nv-alicheng @ashwin @attafosu

Hi Michal, I do think the link to sign the consent form (which directs the user to the gdrive link) should be posted: https://docs.google.com/forms/d/e/1FAIpQLSc_8VIvRmXM3I8KQaYnKf7gy27Z63BBoI_I1u02f4lw6rBp3g/viewform

@pgmpablo157321 can you help add that?

Hi @nvzhihanj ,

Thanks. Yes, I agree that the link to the consent form should also be in the README. However, this issue is about the dataset (24576 samples), not the weights.

We would like to make sure that:

  • submitters do use the same dataset for their submission
  • it's easy for an organization that is not an MLCommons member to reproduce the results without generating the dataset itself (the weights and tokenizer can be obtained directly from HF after signing the Meta license)

@szutenberg the dataset is also in the gdrive.
For reproducing the dataset, I think https://github.com/mlcommons/inference/tree/master/language/llama2-70b#get-dataset provides guidance on generating the pkl file. Are you suggesting that the md5sum of the pickle file be added to the README?
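For example, a quick sanity check on a regenerated pkl file could look like the sketch below (assuming the file is a pandas DataFrame pickle with the 24576 samples discussed above; the path and schema details are assumptions, not confirmed by the repo):

```python
import pandas as pd

# Assumed path to the file produced by processorca.py; adjust as needed.
DATASET_PATH = "open_orca_gpt4_tokenized_llama.sampled_24576.pkl"

# Load the processed dataset (assumed to be a pandas DataFrame pickle).
df = pd.read_pickle(DATASET_PATH)

# The processed OpenOrca dataset discussed in this issue should contain 24576 samples.
assert len(df) == 24576, f"Unexpected sample count: {len(df)}"

# Inspect the schema and a few rows to compare against a reference copy.
print(df.columns.tolist())
print(df.head(3))
```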

@szutenberg Inside the MLCommons Llama 2 link (obtained after filling out the form), there is a copy of the preprocessed dataset. Is it enough to post the link to the form and the instructions for getting the preprocessed dataset, or do we ideally want to share the dataset directly? The confidentiality restriction applies to the Llama 2 weights, so the latter should be possible. I just want to be very clear about what I have to do and what to ask for.

@pgmpablo157321 ideally we would share the dataset directly.

IMHO an md5sum in the README is a must-have since it's our own custom dataset. I'm just not sure whether, two years from now, these repro steps will generate exactly the same output...

Solved by #1638