jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Why do we not set the `ignore_index` of `FusedCrossEntropy` to `bos_id`?

larrylawl opened this issue · comments

Hi,

Can I check why did you not ignore the loss for the bos token (<s>)?

loss_func = FusedCrossEntropyLoss()

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

builder.write_reminder()

Consequently, my model checkpoint (not TinyLlama's) outputs poor qualitative results:

Prompt: "(very long text of 1027 tokens). How many yards longer was the longest passing touchdown than the shortest?"
Output: "<s>ending<s>end<s>ent<s>ended"

Interestingly, my model of previous checkpoint (100b tokens before) performed okay.

I'm trying to fix this by specifying the loss function to ignore the <s> idx (i.e. 1). I think this is a correct fix, but i'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

Yes I think you are right here.

num_processes = cpu_count()

If you have 64 CPU cores. prepare_slimpajama.py will initiate 64 processes, each with a PackedDatasetBuilder and call

def prepare_full(

builder.write_reminder()

That means each process will leave a chunk file that is not fully filled with text tokens but rather with some sep tokens.
self._arr.fill(self._sep_token)

f.write(self._arr.tobytes(order="C"))

So we may have 64 files with some sep tokens remaining.

but i'm not sure if it fixes the underlying issue (the issue should have plagued our model from the start, why did it only happen at this iter step?).

I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus(450k small bin files after processing), especially when you consider the small size of each chunk. So I do not know why your second checkpoint became really bad. Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?

These are some of my preliminary thoughts. Haven't looked very deeply into it yet. Thanks for spotting this out. We will fix it soon. For example, we can opt to not call builder.write_reminder() at all.

Thanks for your reply Peiyuan!

I think it is because 64 files with some sep tokens remaining is a relatively small portion compared with the entire pretraining corpus(450k small bin files after processing), especially when you consider the small size of each chunk.

It's true that the % of files with some bos token remaining is relatively small. The chunk size is actually quite big (i.e. 2049 * 1028). This means that once a process loads the "problematic" binary chunk, it'll use this file for the next 1024 iterations.

chunk_size: int = 2049 * 1024,

But you are right that as the % of files is relatively small, it shouldn't affect. I'll let you know if I managed to fix the bug. Thanks for your help anyway!

Maybe it is just this specific prompt? Does the benchmark performance degrade significantly?

It degraded significantly across the instruct-eval benchmark.