`PackedDatasetBuilder` does not separate with `sep_token`
I noticed that `PackedDatasetBuilder` does not separate the tokens with `sep_token`.
To illustrate, referencing `lit-llama/scripts/prepare_redpajama.py`, line 71 at `da71ade`:

```python
builder = packed_dataset.PackedDatasetBuilder(
    outdir=destination_path,
    prefix=prefix,
    chunk_size=chunk_size,
    sep_token=tokenizer.bos_id,
    dtype="auto",
    vocab_size=tokenizer.vocab_size,
)
```
and line 85 at `da71ade`:

```python
text_ids = tokenizer.encode(text)
```
A minimal reproducible example is as follows:
```python
from pathlib import Path

import numpy as np

from lit_gpt.tokenizer import Tokenizer
from lit_gpt.packed_dataset import PackedDatasetBuilder

tokenizer = Tokenizer(Path('tokenizer'))

content = 'foo'
tokenized = tokenizer.encode(content)
print(tokenized)
# prints:
# tensor([7953, 2], dtype=torch.int32)

training_dataset_builder = PackedDatasetBuilder(
    outdir='FOO',
    # Use process_id to differentiate builders
    prefix='BAR',
    chunk_size=6,
    sep_token=tokenizer.bos_id,
    dtype="auto",
    vocab_size=tokenizer.vocab_size,
)

training_dataset_builder.add_array(np.array(tokenized))
print(training_dataset_builder._arr)
# prints:
# [7953 2 1 1 1 1]

training_dataset_builder.add_array(np.array(tokenized))
print(training_dataset_builder._arr)
# prints:
# [7953 2 7953 2 1 1]
```
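
From my reading of `packed_dataset.py`, this happens because the chunk buffer is pre-filled with `sep_token`, and `add_array` simply copies token ids in at the current write offset. So `sep_token` only survives as end-of-chunk padding, never as a separator between consecutive arrays. Here is a simplified sketch consistent with the output above (this is my paraphrase, not the actual implementation; chunk flushing and file writing are omitted):

```python
import numpy as np

class PackedDatasetBuilderSketch:
    """Simplified sketch of the buffer logic in PackedDatasetBuilder."""

    def __init__(self, chunk_size, sep_token, dtype=np.int32):
        self._chunk_size = chunk_size
        self._sep_token = sep_token
        # The buffer starts out filled with sep_token, which is why the
        # trailing 1s appear in the prints above.
        self._arr = np.zeros(chunk_size, dtype=dtype)
        self._arr.fill(sep_token)
        self._idx = 0

    def add_array(self, arr):
        # Tokens are copied at the write offset; no sep_token is inserted
        # between consecutive calls to add_array.
        self._arr[self._idx : self._idx + len(arr)] = arr
        self._idx += len(arr)
```

In other words, `sep_token` behaves as padding rather than as a document separator.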
`1` represents the bos token; `2` represents the eos token.
As you can see, this translates to:

```
foo</s>foo</s><s><s>
```
Shouldn't each `foo` be wrapped in bos and eos tokens, like this?
```
# Tensor
[1 7953 2 1 7953 2]
# Plain text
<s>foo</s><s>foo</s>
```
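
For what it's worth, a workaround that should produce the framing above is to prepend `bos_id` manually before handing the ids to the builder (a hypothetical sketch, not part of the repo; depending on the tokenizer version, `encode` may also expose a `bos` flag that does this for you):

```python
import numpy as np

# Hypothetical workaround: frame each document as <s> ... </s> yourself,
# so the builder's sep_token padding no longer matters mid-chunk.
token_ids = np.asarray(tokenizer.encode(content))  # [7953, 2] as above
with_bos = np.concatenate(([tokenizer.bos_id], token_ids))

# Assuming a freshly constructed builder with chunk_size=6:
training_dataset_builder.add_array(with_bos)
training_dataset_builder.add_array(with_bos)
print(training_dataset_builder._arr)
# expected:
# [1 7953 2 1 7953 2]
```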