TextSplitter in different languages
goldengrape opened this issue · comments
For summarization methods above level 3, the best practice is to use TokenTextSplitter rather than RecursiveCharacterTextSplitter, because the number of tokens produced by a string of a given character length varies greatly from language to language.
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter

text_splitter_by_char = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
text_splitter_by_token = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)
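To show why the token-based splitter is the safer choice, here is a minimal sketch of token-based splitting. This is not LangChain's implementation; the real TokenTextSplitter counts tokens with the model's tokenizer (e.g. tiktoken), whereas a plain whitespace split stands in here so the example is self-contained. The point is that the chunk budget is enforced in token units, so it holds regardless of how many characters each token spans:

```python
def split_by_tokens(text, chunk_size=3000, chunk_overlap=100):
    """Sketch of token-windowed splitting with overlap.

    A whitespace split is a stand-in for a real BPE tokenizer;
    swap in e.g. tiktoken for production use.
    """
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks

# Tiny demo: 10 "tokens", windows of 4 with 1 token of overlap.
chunks = split_by_tokens(
    "one two three four five six seven eight nine ten",
    chunk_size=4, chunk_overlap=1,
)
```

Because the window is taken over tokens, no chunk can exceed chunk_size tokens, which is exactly the guarantee a character-based splitter cannot give across languages.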
If this is not taken into account, errors from exceeding the model's max token count are likely when processing text in multiple languages.
I tested the token counts for patents from the same family, filed in different languages:
English (US10901237B2)=21823 (100%)
Simplified Chinese (CN112904591A)=30901 (142%)
Traditional Chinese (TW201940135A)=36530 (167%)
Korean (KR20190089752A)=42644 (195%)
Japanese (JP2019128599A)=51430 (236%)
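The ratios above translate directly into overflow risk for a character-based splitter. As an illustration (the 4096-token limit and the 2500-token English baseline are assumed numbers, not from the issue): if a character chunk_size is tuned so English chunks land around 2500 tokens, the same character count in other languages can blow past a 4096-token context window:

```python
# Token-count ratios measured above (English = 100%).
token_ratio = {"en": 1.00, "zh-Hans": 1.42, "zh-Hant": 1.67, "ko": 1.95, "ja": 2.36}

# Assumed for illustration: a character-based chunk_size tuned to yield
# ~2500 tokens on English text, against a 4096-token context limit.
english_tokens_per_chunk = 2500
limit = 4096

estimated = {lang: english_tokens_per_chunk * ratio for lang, ratio in token_ratio.items()}
over_limit = [lang for lang, tokens in estimated.items() if tokens > limit]
```

Under these assumptions the Traditional Chinese, Korean, and Japanese chunks all exceed the limit even though the English chunks fit comfortably, which is the failure mode described above.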
Thank you for this! I'll take this as a best practice.
I had some problems using RecursiveCharacterTextSplitter where it exceeded Python's maximum recursion depth. This happened once my document set exceeded a fairly small token count (about 2000 in total). Has anybody else experienced similar issues?