gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

TextSplitter in different languages

goldengrape opened this issue · comments

https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb

For the summarization methods above level 3, the best practice is to use TokenTextSplitter rather than RecursiveCharacterTextSplitter, because the number of tokens produced by a string of a given character length varies greatly from language to language.

from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter

text_splitter_by_char = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
text_splitter_by_token = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)

If this is not taken into account, requests are likely to exceed the model's maximum token count when processing text in languages other than English.
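As a toy illustration of why this happens (this is not LangChain code, and the tokenizer below is a crude stand-in for a real BPE tokenizer such as tiktoken's): equal-size *character* chunks are far from equal-size *token* chunks once CJK text is involved, since BPE tokenizers typically emit roughly one token per CJK character but only one token per several characters of English.

```python
# Hypothetical rough tokenizer: ASCII words count as ~1 token each,
# while each CJK character counts as its own token. This approximates
# why the same character budget "costs" more tokens in CJK languages.

def rough_token_count(text: str) -> int:
    tokens = 0
    in_word = False
    for ch in text:
        if ord(ch) > 0x2E7F:      # rough lower bound of the CJK blocks
            tokens += 1           # one token per CJK character
            in_word = False
        elif ch.isspace():
            in_word = False
        else:
            if not in_word:
                tokens += 1       # start of a new ASCII word
            in_word = True
    return tokens

english = "The patent describes a method for measuring intraocular pressure"
chinese = "该专利描述了一种测量眼内压的方法"  # similar meaning, far fewer characters

# Tokens-per-character is much higher for the Chinese string, so a
# character-based chunk_size under-budgets tokens for CJK input:
print(len(english), rough_token_count(english))
print(len(chinese), rough_token_count(chinese))
```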

I have measured the token counts for the same family of patents in different languages:

English (US10901237B2)=21823 (100%)
Simplified Chinese (CN112904591A)=30901 (142%)
Traditional Chinese (TW201940135A)=36530 (167%)
Korean (KR20190089752A)=42644 (195%)
Japanese (JP2019128599A)=51430 (236%)

Thank you for this! I'll take this as a best practice

I had some problems using RecursiveCharacterTextSplitter where it exceeded the maximum recursion depth. This happened once my document set exceeded a fairly small size (about 2000 tokens in total). Has anyone else run into similar issues?
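For what it's worth, here is a minimal sketch of how this class of bug can arise (hypothetical code, not LangChain's actual implementation): if a splitter recurses once per *emitted chunk* rather than once per *separator level*, the recursion depth grows linearly with document size and eventually exceeds Python's default recursion limit (about 1000 frames).

```python
# Hypothetical splitter that takes one chunk, then recurses on the
# remainder of the text. Depth = number of chunks, so a long document
# with a small chunk_size exhausts the stack.

def naive_split(text: str, chunk_size: int, chunks=None) -> list[str]:
    if chunks is None:
        chunks = []
    if len(text) <= chunk_size:
        chunks.append(text)
        return chunks
    chunks.append(text[:chunk_size])
    # One stack frame per chunk -- this is the design flaw being illustrated.
    return naive_split(text[chunk_size:], chunk_size, chunks)

long_text = "x" * 50_000
try:
    naive_split(long_text, chunk_size=10)  # would need ~5000 stack frames
except RecursionError:
    print("RecursionError: one stack frame per chunk exhausts the limit")
```

The fix in such a design is to replace the tail recursion over the remainder with a loop, so only the descent through separator levels (a small, bounded depth) is recursive.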