mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference

Reduce the dataset size for LLAMA2

arjunsuresh opened this issue · comments

I believe there is no need to run the LLAMA2-70B model on 24576 samples. GPT-J-6B is run on 13368 samples, Stable Diffusion on 5000 samples, 3D-UNet on 43 samples, and all other MLPerf inference models are far faster. In fact, LLAMA2-70B is at least an order of magnitude slower than any other MLPerf inference model. For upcoming rounds, we should reduce the LLAMA2-70B dataset size to a four-digit number.

Another idea would be to create a representative subset for performance runs, i.e. one with a sample distribution similar to the original dataset but only 1/10th of its size.
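
Something along these lines could build such a subset. This is only a rough sketch: the pickle file name and the `tok_input_length` column are placeholders for whatever fields the processed LLAMA2 dataset actually exposes.

```python
import numpy as np
import pandas as pd

def representative_subset(df, frac=0.1, length_col="tok_input_length",
                          n_bins=20, seed=42):
    """Sample a subset whose prompt-length distribution mirrors the full set."""
    rng = np.random.default_rng(seed)
    # Bucket samples by tokenized prompt length so each length range
    # keeps roughly its original share of the dataset.
    bins = pd.qcut(df[length_col], q=n_bins, duplicates="drop")
    picked = []
    for _, group in df.groupby(bins, observed=True):
        k = max(1, round(len(group) * frac))
        picked.append(group.sample(n=k, random_state=int(rng.integers(2**31))))
    return pd.concat(picked).sort_index()

# e.g. shrink the 24576-sample set to roughly 1/10th (paths are placeholders):
# subset = representative_subset(pd.read_pickle("open_orca_processed.pkl"))
# subset.to_pickle("open_orca_subset.pkl")
```

Stratifying by prompt length is just one option; any feature that drives per-sample latency (input length, expected output length) could be used for the bucketing.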

@psyhtest That's a good option. But in the LLAMA2 case the dataset is already a subset selection, so it might be easier to reduce the dataset size itself, though this means a new accuracy threshold.
If we use a different dataset for performance runs, TEST01 might get complicated when it is introduced for LLAMA2.