[QA] Question about phase 2 long context pretraining batch size
skyshine102 opened this issue
Describe the question.
Hi InternLM team,
I was reading your great InternLM2 paper and saw that:
- phase 1: 4k pretraining, batch size = 4M (tokens) | 50% of data | 90% of training steps
- phase 2: 32k pretraining, batch size = ? (tokens) <--- is this still 4M tokens? | 50% of data (?) | 9% of training steps
Could you provide more details on whether the phase 2 batch size, in tokens, remains constant? I cannot reconcile the data quantities with the training steps :(
Hi @skyshine102 , the batch_size in phase 2 remains at 4M.
@00INDEX Thank you for the clarification. It was my misreading; my apologies.
Here is the correct table for future readers.
- phase 1: 4k pretraining, batch size = 4M (tokens) | 90% of training steps --> 90% of total data
- phase 2: 32k pretraining, batch size = 4M (tokens) | 9% of training steps --> ~10% of total data, but with mixed sequence lengths: around 50% of this 10% consists of sequences <=4k in length.
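The arithmetic above can be sketched as follows. Since the batch size in tokens stays constant across phases, the fraction of total data each phase sees equals its fraction of training steps. The total step count here is a hypothetical placeholder (the paper's exact number may differ); only the 4M-token batch and the 90%/9% step split come from the thread.

```python
# Sketch: with a constant token batch size, data fraction == step fraction.
TOKENS_PER_STEP = 4_000_000  # 4M-token batch, constant in phase 1 and phase 2
TOTAL_STEPS = 100_000        # hypothetical total; used only for illustration

phases = {
    "phase 1 (4k)": 0.90,   # fraction of training steps
    "phase 2 (32k)": 0.09,
}

total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
for name, step_frac in phases.items():
    tokens = TOKENS_PER_STEP * int(TOTAL_STEPS * step_frac)
    print(f"{name}: {tokens / 1e9:.0f}B tokens ({tokens / total_tokens:.0%} of total)")
    # phase 1 (4k): 360B tokens (90% of total)
    # phase 2 (32k): 36B tokens (9% of total)
```

So phase 2's 9% of steps corresponds to roughly 10% of the data, which resolves the mismatch in the original question.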