DrishtiShrrrma / llama2-7b-rlhf

The Problem with Padding

When training large language models, it's common to batch multiple sequences together for efficiency. However, sequences in a batch often have varying lengths. To give every sequence in the batch the same length, shorter sequences are "padded" with filler tokens (typically a dedicated padding token, often with ID 0) to match the length of the longest sequence.

While this method works, it's not optimal. Here's why:

  1. Computational Overhead: The model has to process these padding tokens, which don't contribute any meaningful information. This increases computational costs without direct benefit.

  2. Wasted Token Capacity: Especially when many sequences are much shorter than the model's maximum context size, a large share of the model's processing capacity is spent on padding tokens, as the sketch below illustrates.
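
To make the waste concrete, here is a minimal, self-contained sketch of right-padding a batch of token-ID lists. The token values and the pad ID of 0 are illustrative assumptions, not taken from this repository's code.

```python
PAD_ID = 0  # assumed padding token ID for illustration

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [
    [12, 48, 7, 99, 3, 15, 8, 2],  # 8 tokens
    [5, 31, 2],                    # 3 tokens
    [64, 7, 21, 9, 2],             # 5 tokens
]

padded = pad_batch(batch)
total = sum(len(seq) for seq in padded)   # 24 positions the model must process
useful = sum(len(seq) for seq in batch)   # only 16 of them carry real content
print(f"{total - useful} of {total} positions are padding "
      f"({(total - useful) / total:.0%} wasted)")
# -> 8 of 24 positions are padding (33% wasted)
```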

Packing to the Rescue

The "packing" technique aims to address these inefficiencies. Here's how it works:

  1. Concatenation: Multiple text sequences are concatenated end-to-end, with each sequence separated by a special token, typically the "End Of Sequence" (EOS) token. This creates one long, continuous stream of meaningful tokens with no gaps or padding.

  2. Dividing into Chunks: This long sequence is then divided into fixed-length chunks that match the model's context size. For instance, if the model's context size is 2048 tokens, then the concatenated sequence is split every 2048 tokens.

  3. No Padding Needed: Since each chunk is exactly the size of the model's context, there's no need for padding. Every token is meaningful and contributes to training.
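
The sketch below shows the idea with plain Python lists of token IDs. The EOS ID of 2 (the value used by the Llama tokenizer) and the toy context size are illustrative assumptions; a real pipeline would work with tokenizer output and the model's actual context length.

```python
EOS_ID = 2        # end-of-sequence token ID (2 for the Llama tokenizer)
CONTEXT_SIZE = 8  # toy value; a real model would use e.g. 2048 or 4096

def pack(sequences, context_size=CONTEXT_SIZE, eos_id=EOS_ID):
    """Concatenate sequences separated by EOS, then cut into fixed-size chunks."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # EOS marks the boundary between examples
    # Keep only full chunks; any trailing remainder is dropped here
    n_chunks = len(stream) // context_size
    return [stream[i * context_size:(i + 1) * context_size] for i in range(n_chunks)]

batch = [
    [12, 48, 7, 99, 3, 15, 8],
    [5, 31],
    [64, 7, 21, 9],
]

for chunk in pack(batch):
    print(chunk)  # every chunk is exactly CONTEXT_SIZE tokens long, no padding
# -> [12, 48, 7, 99, 3, 15, 8, 2]
# -> [5, 31, 2, 64, 7, 21, 9, 2]
```

In practice this step is usually handled by the training library rather than written by hand; Hugging Face TRL, for instance, exposes a packing option in its SFT trainer that builds such constant-length chunks on the fly.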

Benefits

  1. Efficiency: The model spends no compute on padding tokens, making training faster and more computationally efficient.

  2. Better Utilization: The model's token processing capacity is better utilized since it's focusing only on meaningful tokens and not wasting computations on padding.

  3. Consistent Sequence Length: Every chunk fed to the model during training has the same fixed length, which simplifies batching and memory allocation in the training pipeline.
