gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

TextSplitter in different languages

goldengrape opened this issue · comments

https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb

For the summarization methods above level 3, the best practice is to use TokenTextSplitter rather than RecursiveCharacterTextSplitter, because the number of tokens produced by a string of a given character length varies greatly from language to language.

from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter

text_splitter_by_char = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
text_splitter_by_token = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)

If this is not taken into account, requests are likely to exceed the model's maximum token count when processing text in languages other than English.
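As a toy illustration of why this happens (this is not LangChain code, and the tokenizer below is a crude stand-in for a real BPE tokenizer such as tiktoken's): equal-size *character* chunks are far from equal-size *token* chunks once CJK text is involved, since BPE tokenizers typically emit roughly one token per CJK character but only one token per several characters of English.

```python
# Hypothetical rough tokenizer: ASCII words count as ~1 token each,
# while each CJK character counts as its own token. This approximates
# why the same character budget "costs" more tokens in CJK languages.

def rough_token_count(text: str) -> int:
    tokens = 0
    in_word = False
    for ch in text:
        if ord(ch) > 0x2E7F:      # rough lower bound of the CJK blocks
            tokens += 1           # one token per CJK character
            in_word = False
        elif ch.isspace():
            in_word = False
        else:
            if not in_word:
                tokens += 1       # start of a new ASCII word
            in_word = True
    return tokens

english = "The patent describes a method for measuring intraocular pressure"
chinese = "该专利描述了一种测量眼内压的方法"  # similar meaning, far fewer characters

# Tokens-per-character is much higher for the Chinese string, so a
# character-based chunk_size under-budgets tokens for CJK input:
print(len(english), rough_token_count(english))
print(len(chinese), rough_token_count(chinese))
```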

I have measured the token counts for the same family of patents in different languages:

English (US10901237B2)=21823 (100%)
Simplified Chinese (CN112904591A)=30901 (142%)
Traditional Chinese (TW201940135A)=36530 (167%)
Korean (KR20190089752A)=42644 (195%)
Japanese (JP2019128599A)=51430 (236%)

Thank you for this! I'll take this as a best practice

I had some problems using RecursiveCharacterTextSplitter where it exceeded the maximum recursion depth. This happened once my document set exceeded a fairly small size (about 2000 tokens in total). Has anyone else run into similar issues?
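For what it's worth, here is a minimal sketch of how this class of bug can arise (hypothetical code, not LangChain's actual implementation): if a splitter recurses once per *emitted chunk* rather than once per *separator level*, the recursion depth grows linearly with document size and eventually exceeds Python's default recursion limit (about 1000 frames).

```python
# Hypothetical splitter that takes one chunk, then recurses on the
# remainder of the text. Depth = number of chunks, so a long document
# with a small chunk_size exhausts the stack.

def naive_split(text: str, chunk_size: int, chunks=None) -> list[str]:
    if chunks is None:
        chunks = []
    if len(text) <= chunk_size:
        chunks.append(text)
        return chunks
    chunks.append(text[:chunk_size])
    # One stack frame per chunk -- this is the design flaw being illustrated.
    return naive_split(text[chunk_size:], chunk_size, chunks)

long_text = "x" * 50_000
try:
    naive_split(long_text, chunk_size=10)  # would need ~5000 stack frames
except RecursionError:
    print("RecursionError: one stack frame per chunk exhausts the limit")
```

The fix in such a design is to replace the tail recursion over the remainder with a loop, so only the descent through separator levels (a small, bounded depth) is recursive.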