SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871


How to use the encoding with tiktoken?

rishabhgupta93 opened this issue · comments

Hey,

I am trying to get the encoding using tiktoken to initiate a token counter:

```python
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
token_counter = TokenCountingHandler(tokenizer=enc.encode)
```

But I am getting the following error:

```text
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 3
      1 import tiktoken
      2 from llama_index.callbacks import CallbackManager, TokenCountingHandler
----> 3 enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
      4 token_counter = TokenCountingHandler(tokenizer=enc.encode)

File f:\pycharmprojects\llamaindex\venv\lib\site-packages\tiktoken\registry.py:68, in get_encoding(encoding_name)
     65 assert ENCODING_CONSTRUCTORS is not None
     67 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 68     raise ValueError(
     69         f"Unknown encoding {encoding_name}. Plugins found: {_available_plugin_modules()}"
     70     )
     72 constructor = ENCODING_CONSTRUCTORS[encoding_name]
     73 enc = Encoding(**constructor())

ValueError: Unknown encoding WhereIsAI/UAE-Large-V1. Plugins found: ['tiktoken_ext.openai_public']
```

Is there any way to use the encodings with tiktoken?

Thanks

commented

It seems tiktoken only supports GPT-like models' encodings, but UAE is a BERT-based model, so its tokenizer is not registered with tiktoken.

Could you use the tokenizers package in your application? You can use tokenizers to load UAE's tokenizer.

commented

By the way, do you want to get the tokenized IDs of sentences or obtain their sentence embedding?
If you want to get their sentence embeddings, follow the usage section in the README.

Thanks for the prompt response!

I am able to create embeddings.

I just want to count the total number of tokens for which embeddings are generated, and also the number of tokens used while running the query engine.

I am using llama-index to build a RAG pipeline.
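For that use case, one possible sketch (assuming network access to the Hugging Face Hub; I haven't verified the `tokenizer` parameter against every llama-index release) is to load UAE's tokenizer via `transformers` and pass its `encode` method wherever tiktoken's `enc.encode` was used:

```python
from transformers import AutoTokenizer

# Download the BERT tokenizer that ships with UAE-Large-V1 from the HF Hub.
tokenizer = AutoTokenizer.from_pretrained("WhereIsAI/UAE-Large-V1")

def count_tokens(text: str) -> int:
    # encode() adds the special tokens ([CLS]/[SEP]) by default,
    # so the count matches what the model actually consumes.
    return len(tokenizer.encode(text))
```

`tokenizer.encode` should then work as a drop-in replacement in the original snippet, e.g. `TokenCountingHandler(tokenizer=tokenizer.encode)`, giving token counts based on UAE's own vocabulary rather than a GPT encoding.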