Warning When Using Different HuggingFace Datasets
conceptofmind opened this issue · comments
Hello,
Any idea if this warning will impact the training of the model when using alternative datasets? Or can it be ignored? I understand that PaLM needs concatenated input sequences of length 2048.
Warning thrown:
Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.
Example:

```python
import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset = load_dataset("the_pile", "enron_emails")
print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
seq_len = 2048

def tokenize(examples):
    examples = tokenizer(examples["text"])
    # Concatenate all tokenized sequences, then split into chunks of seq_len tokens.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk is exactly seq_len long.
    if total_length >= seq_len:
        total_length = (total_length // seq_len) * seq_len
    result = {
        k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True,
    remove_columns=["text", "meta"],
)
```
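The concatenate-and-chunk logic can be sketched in isolation with plain Python (toy token ids and a tiny chunk size stand in for real tokenizer output and 2048):

```python
import copy
from itertools import chain

seq_len = 4  # toy chunk size standing in for 2048

# Toy batch: each entry is a list of token ids, as a tokenizer would return.
examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}

# Concatenate everything into one long sequence.
concatenated = {k: list(chain(*v)) for k, v in examples.items()}

# Truncate to a multiple of seq_len, dropping the trailing remainder.
total_length = (len(concatenated["input_ids"]) // seq_len) * seq_len

result = {
    k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
    for k, t in concatenated.items()
}
result["labels"] = copy.deepcopy(result["input_ids"])

print(result["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that the leftover tokens (`[9, 10]` here) are discarded, which is what the `total_length` truncation in the `tokenize` function above does at scale.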
Thank you,
Enrico
In my humble opinion, the reason is that the maximum sequence length of your pretrained model's embedding is limited to 1024, so you may need to modify the embedding logic. This is common when pretraining BERT on longer sequences, because BERT's embedding only supports a maximum length of 512. GPT-2 may fall into the same case.
@feifeibear After testing a few different configurations, I have confirmed that the warning comes from the pre-trained tokenizer's configuration, specifically the model_max_length parameter in its JSON file. It does not seem to have any noticeable impact on training.
Thank you for the help.