SamurAIGPT / EmbedAI

An app to interact privately with your documents using the power of GPT, 100% privately, no data leaks

Home Page: https://www.thesamur.ai/?utm_source=github&utm_medium=link&utm_campaign=github_privategpt

gpt_tokenize: unknown token '?

mark420524 opened this issue · comments

commented

from flask import Flask, jsonify, render_template, flash, redirect, url_for, Markup, request
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
LLM0 GPT4All
Params: {'model': 'models/ggml-gpt4all-j-v1.3-groovy.bin', 'n_predict': 256, 'n_threads': 4, 'top_k': 40, 'top_p': 0.95, 'temp': 0.8}

  • Serving Flask app 'privateGPT'
  • Debug mode: off
    [2023-05-31 10:39:11,833] {_internal.py:186} INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
  • Running on all addresses (0.0.0.0)
  • Running on http://127.0.0.1:5000
  • Running on http://10.253.1.21:5000
    [2023-05-31 10:39:11,834] {_internal.py:186} INFO - Press CTRL+C to quit
    Loading documents from source_documents
    Loaded 1 documents from source_documents
    Split into 90 chunks of text (max. 500 characters each)
    [2023-05-31 10:39:47,710] {_internal.py:186} INFO - 127.0.0.1 - - [31/May/2023 10:39:47] "GET /ingest HTTP/1.1" 200 -
    [2023-05-31 10:40:04,057] {_internal.py:186} INFO - 127.0.0.1 - - [31/May/2023 10:40:04] "OPTIONS /get_answer HTTP/1.1" 200 -
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '€'
    gpt_tokenize: unknown token '?
    gpt_tokenize: unknown token '?

How can I fix the issue?

For the important_tokens which contain several actual words (like frankie_and_bennys), you can replace the underscore with a space and feed them normally, or add them as special tokens. I prefer the first option because that way you can use the pre-trained embeddings of their subtokens. For the ones that aren't actual words (like cb17dy), you must add them as special tokens.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

your_string = '[PRED] name [SUB] frankie and bennys frankie_and_bennys [PRED] cb17dy'

# Register the dialogue/KG markers and the non-word identifiers as special tokens
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]",
                                  "[TRIPLE]", "[SEP]", "[Q]", "[DOM]",
                                  'frankie_and_bennys', 'cb17dy'],
}
tokenizer.add_special_tokens(SPECIAL_TOKENS)

# The special tokens now map to single IDs instead of being split into subwords
print(tokenizer(your_string)['input_ids'])
print(tokenizer.convert_ids_to_tokens(tokenizer(your_string)['input_ids']))
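If a GPT-2 model is then trained or run with this enlarged vocabulary, its embedding matrix also needs to grow, otherwise the new token IDs point past the end of the embedding table. A minimal sketch, assuming the transformers library and the stock "gpt2" checkpoint (the two extra tokens are just the illustrative ones from above):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["frankie_and_bennys", "cb17dy"]})

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Resize so every newly added token ID has a row in the embedding matrix
model.resize_token_embeddings(len(tokenizer))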

This looks like a common issue with Python 3.8. You can upgrade to Python 3.10 and it should work.
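A quick way to confirm which interpreter actually runs the app is a version guard at the top of the entry script. A minimal sketch (the 3.10 floor just mirrors the advice above, not a hard requirement of the project):

import sys

# Fail fast if the interpreter is older than the version suggested in this thread
if sys.version_info < (3, 10):
    raise SystemExit(f"Python 3.10+ recommended, found {sys.version.split()[0]}")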

commented

This looks like a common issue with Python 3.8. You can upgrade to Python 3.10 and it should work.

I upgraded Python to 3.11 and it works. Thanks!