SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard

Home Page: https://arxiv.org/abs/2309.12871


How to use the encoding with tiktoken?

rishabhgupta93 opened this issue · comments

Hey,

I am trying to get the encoding using tiktoken to initiate a token counter:

```python
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
token_counter = TokenCountingHandler(tokenizer=enc.encode)
```

But I am getting the following error:

```text
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 3
      1 import tiktoken
      2 from llama_index.callbacks import CallbackManager, TokenCountingHandler
----> 3 enc = tiktoken.get_encoding("WhereIsAI/UAE-Large-V1")
      4 token_counter = TokenCountingHandler(tokenizer=enc.encode)

File f:\pycharmprojects\llamaindex\venv\lib\site-packages\tiktoken\registry.py:68, in get_encoding(encoding_name)
     65 assert ENCODING_CONSTRUCTORS is not None
     67 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 68     raise ValueError(
     69         f"Unknown encoding {encoding_name}. Plugins found: {_available_plugin_modules()}"
     70     )
     72 constructor = ENCODING_CONSTRUCTORS[encoding_name]
     73 enc = Encoding(**constructor())

ValueError: Unknown encoding WhereIsAI/UAE-Large-V1. Plugins found: ['tiktoken_ext.openai_public']
```

Is there any way to use the encodings with tiktoken?

Thanks

commented

It seems tiktoken only supports GPT-like models' encodings, but UAE is a BERT-based model, so its tokenizer is not registered with tiktoken.

Could you use the tokenizers package in your application? You can use tokenizers to load UAE's tokenizer.

commented

By the way, do you want to get the tokenized IDs of sentences or obtain their sentence embedding?
If you want to get their sentence embeddings, follow the usage section in the README.

Thanks for the prompt response!

I am able to create embeddings.

I just want to count the total number of tokens for which embeddings are generated, and also the number of tokens used while running the query engine.

I am using llama-index to build a RAG pipeline.
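For that use case, one possible sketch (assuming network access to the Hugging Face Hub; I haven't verified the `tokenizer` parameter against every llama-index release) is to load UAE's tokenizer via `transformers` and pass its `encode` method wherever tiktoken's `enc.encode` was used:

```python
from transformers import AutoTokenizer

# Download the BERT tokenizer that ships with UAE-Large-V1 from the HF Hub.
tokenizer = AutoTokenizer.from_pretrained("WhereIsAI/UAE-Large-V1")

def count_tokens(text: str) -> int:
    # encode() adds the special tokens ([CLS]/[SEP]) by default,
    # so the count matches what the model actually consumes.
    return len(tokenizer.encode(text))
```

`tokenizer.encode` should then work as a drop-in replacement in the original snippet, e.g. `TokenCountingHandler(tokenizer=tokenizer.encode)`, giving token counts based on UAE's own vocabulary rather than a GPT encoding.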