xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

What is the max input size?

vyau opened this issue

Hi Instructor team: when I feed content into INSTRUCTOR to generate embeddings, I see this in stdout:

max_seq_length 512

I assume that means there is an input cap of 512 bytes? What happens if my input is larger than that?
Thanks

Hi, Thanks a lot for your interest in the INSTRUCTOR!

The input text will be truncated if it is longer than 512 tokens.
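To make the truncation behavior concrete, here is a minimal sketch of what "truncated at 512" means. This is purely illustrative: a whitespace split stands in for the model's real subword tokenizer, and `truncate` is a hypothetical helper, not part of the INSTRUCTOR API.

```python
# Illustrative sketch of truncation at max_seq_length; a whitespace
# split stands in for the model's actual subword tokenizer.
MAX_SEQ_LENGTH = 512

def truncate(text: str, max_len: int = MAX_SEQ_LENGTH) -> list[str]:
    """Keep only the first max_len tokens; everything after is dropped."""
    tokens = text.split()
    return tokens[:max_len]

long_text = " ".join(f"word{i}" for i in range(1000))
kept = truncate(long_text)
print(len(kept))  # 512 tokens survive; the remaining 488 are lost
```

So for a 1000-token input, only the first 512 tokens contribute to the embedding.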

@hongjin-su
Does that mean if chunks of 1000 tokens are passed to the embedding model the remaining 488 tokens are lost?
and how would you generate the embeddings for a long text or document with multiple pages?

See here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L321.
If you look into the tokenizers docs, you will find that any tokens beyond the first 512 are discarded.

Also, sequence length is measured in tokens, not bytes or characters.
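For the long-document question above, one common workaround (not an official INSTRUCTOR feature) is to split the text into chunks that each fit in the 512-token window, embed each chunk, then mean-pool the chunk embeddings. The sketch below assumes a hypothetical `embed` callable standing in for something like `model.encode`, and again uses whitespace splitting as a stand-in for the real tokenizer.

```python
# Chunk-and-average sketch for texts longer than the 512-token window.
# `embed` is a hypothetical stand-in for the model's encode function;
# whitespace splitting stands in for the real subword tokenizer.
from typing import Callable

def chunk_tokens(tokens: list[str], size: int = 512) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed_long(text: str, embed: Callable[[str], list[float]],
               size: int = 512) -> list[float]:
    """Embed each chunk separately, then average the vectors element-wise."""
    chunks = chunk_tokens(text.split(), size)
    vecs = [embed(" ".join(chunk)) for chunk in chunks]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Whether mean-pooling chunk embeddings works well depends on the task; for retrieval, embedding each chunk separately and matching at the chunk level is another common option.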

Thanks a lot for the reply! @hynky1999
Feel free to re-open the issue if you have any further questions or comments!