Warning When Using Different HuggingFace Datasets
conceptofmind opened this issue · comments
Hello,
Any idea if this warning will impact the training of the model when using alternative datasets? Or can it be ignored? I understand that PaLM needs concatenated input sequences of length 2048.
Warning thrown:
Token indices sequence length is longer than the specified maximum sequence length for this model (1366 > 1024). Running this sequence through the model will result in indexing errors.
Example:

```python
import copy
from itertools import chain

from datasets import load_dataset
from transformers import GPT2TokenizerFast

dataset = load_dataset("the_pile", "enron_emails")
print(dataset)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
seq_len = 2048

def tokenize(examples):
    examples = tokenizer(examples["text"])
    # Concatenate all tokenized sequences, then split into chunks of seq_len tokens.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder so every chunk is exactly seq_len long.
    if total_length >= seq_len:
        total_length = (total_length // seq_len) * seq_len
    result = {
        k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = copy.deepcopy(result["input_ids"])
    return result

tokenized_dataset = dataset.map(
    tokenize, batched=True, num_proc=16, keep_in_memory=True,
    remove_columns=["text", "meta"],
)
```
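The concatenate-and-chunk logic can be sketched in isolation with plain Python (toy token ids and a tiny chunk size stand in for real tokenizer output and 2048):

```python
import copy
from itertools import chain

seq_len = 4  # toy chunk size standing in for 2048

# Toy batch: each entry is a list of token ids, as a tokenizer would return.
examples = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}

# Concatenate everything into one long sequence.
concatenated = {k: list(chain(*v)) for k, v in examples.items()}

# Truncate to a multiple of seq_len, dropping the trailing remainder.
total_length = (len(concatenated["input_ids"]) // seq_len) * seq_len

result = {
    k: [t[i : i + seq_len] for i in range(0, total_length, seq_len)]
    for k, t in concatenated.items()
}
result["labels"] = copy.deepcopy(result["input_ids"])

print(result["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that the leftover tokens (`[9, 10]` here) are discarded, which is what the `total_length` truncation in the `tokenize` function above does at scale.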
Thank you,
Enrico
In my humble opinion, the reason is that the maximum sequence length of your pretrained model's embedding is limited to 1024, so you may need to modify the embedding logic. This is common when pretraining BERT on longer sequences, because BERT's embedding only supports a maximum length of 512. GPT-2 may fall into the same case.
@feifeibear After testing a few different configurations, I have confirmed that the warning comes from the pre-trained tokenizer's configuration, specifically the model_max_length parameter in its JSON file. It does not seem to have any noticeable impact on training.
Thank you for the help.