hpcaitech / PaLM-colossalai

Scalable PaLM implementation in PyTorch

Warning When Using Different HuggingFace Datasets

conceptofmind opened this issue

Hello,

Any idea if this warning will impact the training of the model when using alternative datasets? Or can it be ignored? I understand that PaLM needs concatenated input sequences of length 2048.

Warning thrown:

Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.

Self-contained example:

import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Load one component of The Pile and the GPT-2 tokenizer.
dataset = load_dataset("the_pile", "enron_emails")
print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

seq_len = 2048

def tokenize(examples):
    seq_length = seq_len
    # Tokenize the raw text, then concatenate all sequences in the batch.
    examples = tokenizer(examples["text"])
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk is exactly seq_length tokens.
    if total_length >= seq_length:
        total_length = (total_length // seq_length) * seq_length
    # Split the concatenated token stream into fixed-length blocks.
    result = {
        k: [t[i : i + seq_length] for i in range(0, total_length, seq_length)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True, remove_columns=["text", "meta"]
)
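
For reference, a quick sanity check that every resulting example is exactly 2048 tokens long (the "train" split name is an assumption about this subset of The Pile):

# Every tokenized example should be exactly seq_len tokens.
lengths = {len(ids) for ids in tokenized_dataset["train"]["input_ids"]}
print(lengths)  # expected: {2048}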

Thank you,

Enrico

In my humble opinion, the reason is that the maximum sequence length of your pretrained model's position embedding is limited to 1024. You may need to modify the embedding logic. This is a common situation when pretraining BERT with longer sequences, since BERT's position embedding only covers 512 tokens. GPT-2 likely falls into the same case.
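
Concretely, the 1024 in the warning can be checked against both the tokenizer and the model config (a quick look via the standard transformers API):

from transformers import GPT2Config, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
config = GPT2Config.from_pretrained("gpt2")

# The tokenizer warns whenever a sequence exceeds model_max_length (1024 for gpt2).
print(tokenizer.model_max_length)  # 1024
# GPT-2's learned position embeddings are likewise limited to 1024 positions.
print(config.n_positions)          # 1024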

@feifeibear After testing a few different configurations, I have confirmed that the warning comes from the pretrained tokenizer config, specifically the model_max_length parameter in its JSON file. It does not seem to have any noticeable impact on training.
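
For anyone who wants to silence the warning: since only the tokenizer-side length check is involved, one option (a sketch, not required for training) is to override model_max_length when loading the tokenizer:

from transformers import GPT2TokenizerFast

# Overriding model_max_length only affects the tokenizer's length check;
# it does not change the model's position embeddings.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", model_max_length=2048)

ids = tokenizer("some long text ...")["input_ids"]  # no length warning up to 2048 tokens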

Thank you for the help.