jzhang38 / TinyLlama

Hi,

Can I check why you did not ignore the loss for the bos token (<s>)?

TinyLlama/pretrain/tinyllama.py

Line 197 in c53075b

loss_func = FusedCrossEntropyLoss()

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

TinyLlama/scripts/prepare_slimpajama.py

Line 69 in c53075b

builder.write_reminder()

Consequently, my model checkpoint (not TinyLlama's) outputs poor qualitative results:

Prompt: "(very long text of 1027 tokens). How many yards longer was the longest passing touchdown than the shortest?"
Output: "<s>ending<s>end<s>ent<s>ended"

Interestingly, my previous checkpoint (100B tokens earlier) performed okay.

I'm trying to fix this by making the loss function ignore the <s> index (i.e. 1). I think this is a correct fix, but I'm not sure it addresses the underlying issue: the padding should have affected our model from the start, so why did the problem only appear at this iteration step?
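To make the proposed fix concrete, here is a minimal pure-Python sketch of a cross-entropy loss that skips every position whose target equals an ignored id. It is only an illustration: the `cross_entropy` helper and the toy logits are hypothetical, and it is not the repository's `FusedCrossEntropyLoss`. PyTorch's `torch.nn.CrossEntropyLoss` exposes an `ignore_index` argument with this meaning; whether the fused loss accepts the same argument should be checked against its source.

```python
import math

BOS_ID = 1  # id of <s>, as stated in the thread


def cross_entropy(logits, targets, ignore_index=None):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Toy batch: vocabulary of 3 ids, four positions; the last two targets are <s> padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2, BOS_ID, BOS_ID]

loss_all = cross_entropy(logits, targets)                           # padding counted
loss_masked = cross_entropy(logits, targets, ignore_index=BOS_ID)   # padding skipped
```

Note that masking by token id drops every position whose target is <s>, including any that sit at real document boundaries, not only the padded tail of a chunk.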

> I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

Yes I think you are right here.

TinyLlama/scripts/prepare_slimpajama.py

Line 85 in c53075b

num_processes = cpu_count()

If you have 64 CPU cores, prepare_slimpajama.py will start 64 processes, each with its own PackedDatasetBuilder, and each will call

TinyLlama/scripts/prepare_slimpajama.py

Line 26 in c53075b

def prepare_full(

TinyLlama/scripts/prepare_slimpajama.py

Line 69 in c53075b

builder.write_reminder()

That means each process will leave one chunk file that is only partly filled with text tokens, with the rest padded with sep tokens.

TinyLlama/lit_gpt/packed_dataset.py

Line 77 in c53075b

self._arr.fill(self._sep_token)

TinyLlama/lit_gpt/packed_dataset.py

Line 91 in c53075b

f.write(self._arr.tobytes(order="C"))

So we may end up with 64 files that are partly padded with sep tokens.
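The padding behaviour described above can be reproduced with a toy builder. This is a hypothetical stand-in written for illustration, not the code in lit_gpt/packed_dataset.py: the class name, the list-based buffer, and the tiny chunk size are all assumptions. It keeps only the two steps referenced in the thread, filling the buffer with the sep token and writing the buffer out as is.

```python
SEP_TOKEN = 1    # <s>, used as the separator/padding id in the thread
CHUNK_SIZE = 8   # toy value; the real script uses 2049 * 1024


class ToyPackedBuilder:
    """Hypothetical stand-in for PackedDatasetBuilder, keeping only the padding behaviour."""

    def __init__(self, chunk_size, sep_token):
        self.chunk_size = chunk_size
        self.sep_token = sep_token
        self.chunks = []                       # stands in for the files written to disk
        self._arr = [sep_token] * chunk_size   # buffer pre-filled with sep tokens
        self._idx = 0

    def _flush(self):
        self.chunks.append(self._arr)
        self._arr = [self.sep_token] * self.chunk_size  # refill with sep tokens
        self._idx = 0

    def add_array(self, tokens):
        for tok in tokens:
            self._arr[self._idx] = tok
            self._idx += 1
            if self._idx == self.chunk_size:
                self._flush()

    def write_reminder(self):
        # Writes the partly filled buffer as is; unused slots still hold sep tokens.
        self._flush()


builder = ToyPackedBuilder(CHUNK_SIZE, SEP_TOKEN)
builder.add_array([5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])  # 11 text tokens
builder.write_reminder()

last = builder.chunks[-1]
padding = sum(1 for tok in last if tok == SEP_TOKEN)
print(last, padding)  # [13, 14, 15, 1, 1, 1, 1, 1] 5
```

With one builder per process, each process contributes one such tail chunk, which is where the figure of 64 files comes from.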

> but I'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).

I think it is because the 64 files with leftover sep tokens are a relatively small portion of the entire pretraining corpus (450k small bin files after processing), especially when you consider the small size of each chunk. So I do not know why your second checkpoint became so bad. Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?

These are some preliminary thoughts; I haven't looked into it very deeply yet. Thanks for spotting this. We will fix it soon. For example, we can opt not to call builder.write_reminder() at all.

Thanks for your reply Peiyuan!

> I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus (450k small bin files after processing), especially when you consider the small size of each chunk.

It's true that the percentage of files with leftover bos tokens is relatively small. The chunk size is actually quite big, though (i.e. 2049 * 1024). This means that once a process loads the "problematic" binary chunk, it will use this file for the next 1024 iterations.

TinyLlama/scripts/prepare_slimpajama.py

Line 76 in 072536c

chunk_size: int = 2049 * 1024,

But you are right that, as the percentage of affected files is relatively small, it shouldn't have much effect. I'll let you know if I manage to fix the bug. Thanks for your help anyway!

> Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?
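The two numbers in this exchange can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the block length of 2049 tokens (a 2048-token context plus one shifted label) is inferred from the chunk size, all 450k files are assumed to be the same size, and each of the 64 tail files is treated as if it were entirely padding, which overstates the real share.

```python
BLOCK_SIZE = 2049            # tokens per training sample (assumed: 2048 context + 1 label)
CHUNK_SIZE = 2049 * 1024     # tokens per chunk file, from prepare_slimpajama.py
NUM_FILES = 450_000          # approximate number of bin files, as quoted in the thread
NUM_REMAINDER_FILES = 64     # one partly padded file per process on a 64-core machine

# How many training samples a single chunk file yields.
blocks_per_chunk = CHUNK_SIZE // BLOCK_SIZE

# Upper bound on the share of the corpus that could be padding.
worst_case_fraction = NUM_REMAINDER_FILES / NUM_FILES

print(blocks_per_chunk)                  # 1024
print(f"{worst_case_fraction:.4%}")      # 0.0142%
```

So a single affected file can supply up to 1024 consecutive samples to one process, while across the whole corpus the padded share stays well below a tenth of a percent under these assumptions.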

It degraded significantly across the instruct-eval benchmark.

#85

Why do we not set the `ignore_index` of `FusedCrossEntropy` to `bos_id`?