FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024


pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)

yangbohust opened this issue

from datasets import load_dataset
from tokenization_qwen import QWenTokenizer
import draftretriever
from tqdm import tqdm

model_path = '/home/models/qwen-7b'
tokenizer = QWenTokenizer.from_pretrained(model_path)

segment = 11  # maximum number of segments: 206
# shard names are zero-padded to five digits, e.g. train-00000-of-00206.parquet
data_files = [f"train-{i:05d}-of-00206.parquet" for i in range(segment)]
print("data_files:", data_files)


data_dir = ''  # directory containing the parquet shards
dataset = load_dataset('parquet', data_dir=data_dir, split='train', data_files=data_files)


datastore_path = 'the_stack_python_suffix_array_0_10.idx'
writer = draftretriever.Writer(
    index_file_path=datastore_path,
    max_chunk_len=512 * 1024 * 1024,
    vocab_size=tokenizer.vocab_size,
)

total_length = len(dataset)
print("number of samples: ", total_length)

for sample in tqdm(dataset, total=len(dataset)):
    token_list = tokenizer.encode(sample['content'])
    writer.add_entry(token_list)

writer.finalize()

# python build_sufix_array.py
data_files: ['train-00000-of-00206.parquet', 'train-00001-of-00206.parquet', 'train-00002-of-00206.parquet', 'train-00003-of-00206.parquet', 'train-00004-of-00206.parquet', 'train-00005-of-00206.parquet', 'train-00006-of-00206.parquet', 'train-00007-of-00206.parquet', 'train-00008-of-00206.parquet', 'train-00009-of-00206.parquet', 'train-00010-of-00206.parquet']
Generating train split: 1292995 examples [00:37, 34115.10 examples/s]
Loading dataset shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [00:00<00:00, 951.57it/s]
number of samples:  1292995
  0%|                                                                                                                                                                             | 33/1292995 [00:00<1:08:04, 316.53it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (97935 > 32768). Running this sequence through the model will result in indexing errors
 48%|██████████████████████████████████████████████████████████████████████████████████▌                                                                                        | 624359/1292995 [53:56<43:54, 253.82it/s]thread '<unnamed>' panicked at src/lib.rs:250:33:
called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 48%|██████████████████████████████████████████████████████████████████████████████████▌                                                                                        | 624365/1292995 [53:56<57:46, 192.89it/s]
Traceback (most recent call last):
  File "/home/yb/code/opensource/REST/datastore/my/build_sufix_array.py", line 40, in <module>
    token_list = tokenizer.encode(sample['content'])
  File "/root/miniconda3/envs/lookahead_yb/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2573, in encode
    encoded_inputs = self.encode_plus(
  File "/root/miniconda3/envs/lookahead_yb/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2981, in encode_plus
    return self._encode_plus(
  File "/root/miniconda3/envs/lookahead_yb/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 719, in _encode_plus
    first_ids = get_input_ids(text)
  File "/root/miniconda3/envs/lookahead_yb/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 686, in get_input_ids
    tokens = self.tokenize(text, **kwargs)
  File "/home/models/qwen-7b/tokenization_qwen.py", line 212, in tokenize
    for t in self.tokenizer.encode(
  File "/root/miniconda3/envs/lookahead_yb/lib/python3.9/site-packages/tiktoken/core.py", line 124, in encode
    return self._core_bpe.encode(text, allowed_special)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: RuntimeError(StackOverflow)

Hi, this seems to be a bug in the tiktoken package when it deals with very long texts: Extremely Long Text results in PanicException. The panic surfaces as a pyo3 PanicException, which is hard to catch in Python code. You could partition the long text into multiple segments before passing it to tokenizer.encode(sample['content']). For example:

# from ChatGPT
import textwrap

segment_length = 1000  # characters per segment

for sample in tqdm(dataset, total=len(dataset)):
    # wrap() takes the text first, then the maximum width of each segment
    segments = textwrap.wrap(sample['content'], segment_length,
                             break_long_words=False, replace_whitespace=False)
    for segment in segments:
        writer.add_entry(tokenizer.encode(segment))