xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings

What is the max input size?

vyau opened this issue

Hi Instructor team: when I feed content into INSTRUCTOR to generate embeddings, I see this in stdout:

max_seq_length 512

I assume that means there is an input cap of 512 bytes? What happens if my input is larger than that?
Thanks

Hi, Thanks a lot for your interest in the INSTRUCTOR!

The input text will be truncated if it is longer than 512 tokens.
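To make the truncation behavior concrete, here is a minimal sketch of what "truncated at 512" means. This is purely illustrative: a whitespace split stands in for the model's real subword tokenizer, and `truncate` is a hypothetical helper, not part of the INSTRUCTOR API.

```python
# Illustrative sketch of truncation at max_seq_length; a whitespace
# split stands in for the model's actual subword tokenizer.
MAX_SEQ_LENGTH = 512

def truncate(text: str, max_len: int = MAX_SEQ_LENGTH) -> list[str]:
    """Keep only the first max_len tokens; everything after is dropped."""
    tokens = text.split()
    return tokens[:max_len]

long_text = " ".join(f"word{i}" for i in range(1000))
kept = truncate(long_text)
print(len(kept))  # 512 tokens survive; the remaining 488 are lost
```

So for a 1000-token input, only the first 512 tokens contribute to the embedding.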

@hongjin-su
Does that mean if chunks of 1000 tokens are passed to the embedding model the remaining 488 tokens are lost?
and how would you generate the embeddings for a long text or document with multiple pages?

See here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L321.
If you look into the tokenizers docs, you will find that any tokens beyond the first 512 are discarded.

Also, sequence length is measured in tokens, not bytes or characters.
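For the long-document question above, one common workaround (not an official INSTRUCTOR feature) is to split the text into chunks that each fit in the 512-token window, embed each chunk, then mean-pool the chunk embeddings. The sketch below assumes a hypothetical `embed` callable standing in for something like `model.encode`, and again uses whitespace splitting as a stand-in for the real tokenizer.

```python
# Chunk-and-average sketch for texts longer than the 512-token window.
# `embed` is a hypothetical stand-in for the model's encode function;
# whitespace splitting stands in for the real subword tokenizer.
from typing import Callable

def chunk_tokens(tokens: list[str], size: int = 512) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed_long(text: str, embed: Callable[[str], list[float]],
               size: int = 512) -> list[float]:
    """Embed each chunk separately, then average the vectors element-wise."""
    chunks = chunk_tokens(text.split(), size)
    vecs = [embed(" ".join(chunk)) for chunk in chunks]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Whether mean-pooling chunk embeddings works well depends on the task; for retrieval, embedding each chunk separately and matching at the chunk level is another common option.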

Thanks a lot for the reply! @hynky1999
Feel free to re-open the issue if you have any further questions or comments!