Bug: incorrect value of start_index in RecursiveCharacterTextSplitter when substring is present
maggonravi opened this issue
Ravi Maggon commented
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
splitter.split_documents([Document(page_content="chunk chunk")])
Error Message and Stack Trace (if applicable)
No response
Description
Expected output
[Document(page_content='chunk', metadata={'start_index': 0}),
Document(page_content='chun', metadata={'start_index': 6}),
Document(page_content='chunk', metadata={'start_index': 6})]
Output with current code
[Document(page_content='chunk', metadata={'start_index': 0}),
Document(page_content='chun', metadata={'start_index': 0}),
Document(page_content='chunk', metadata={'start_index': 0})]
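The repeated start_index of 0 is consistent with the splitter clamping its search offset to 0 whenever the overlap swallows the whole previous chunk. The sketch below is an assumption reconstructed from the reported output (it mirrors offset arithmetic of the form `index + previous_chunk_len - chunk_overlap`, not code copied from the library), and shows how `str.find` then keeps matching the repeated substring at position 0:

```python
# Standalone sketch of the suspected offset arithmetic (assumption:
# the splitter searches from index + previous_chunk_len - chunk_overlap,
# clamped to 0). The chunk list is what split_text returns for the repro.
text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]
chunk_overlap = 5

index, previous_chunk_len = -1, 0
starts = []
for chunk in chunks:
    # With chunk_overlap equal to the chunk length, this offset never
    # moves past 0, so find() keeps hitting the first occurrence.
    offset = max(0, index + previous_chunk_len - chunk_overlap)
    index = text.find(chunk, offset)
    starts.append(index)
    previous_chunk_len = len(chunk)

print(starts)  # [0, 0, 0]
```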
System Info
System Information
------------------
> OS: Linux
> OS Version: #1 SMP Thu Feb 1 03:51:05 EST 2024
> Python Version: 3.11.8 (main, Mar 15 2024, 12:37:54) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)]
Package Information
-------------------
> langchain_core: 0.1.46
> langchain: 0.1.12
> langchain_community: 0.0.28
> langsmith: 0.0.82
> langchain_experimental: 0.0.47
> langchain_text_splitters: 0.0.1
> langchainplus_sdk: 0.0.21
Ravi Maggon commented
The following change to TextSplitter.create_documents fixes the issue for me:
import copy
from abc import ABC
from typing import List, Optional

from langchain_core.documents import BaseDocumentTransformer, Document


class TextSplitter(BaseDocumentTransformer, ABC):
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = -1
            previous_chunk_len = 0
            for j, chunk in enumerate(self.split_text(text)):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    if j > 0:
                        # Advance the search cursor by at least the part of
                        # the previous chunk that cannot be overlap, so a
                        # repeated substring is not matched at a stale position.
                        minimum_index_offset = max(
                            0,
                            previous_chunk_len - self._chunk_overlap,
                            previous_chunk_len - len(chunk),
                        )
                    else:
                        # First chunk: index is -1, so an offset of 1 starts
                        # the search at position 0.
                        minimum_index_offset = 1
                    index = text.find(chunk, index + minimum_index_offset)
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents
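Tracing the proposed minimum_index_offset logic standalone on the repro input (the chunk list below is what split_text returns for "chunk chunk" with these settings) shows it now produces the expected start indexes:

```python
# Standalone trace of the proposed fix's offset logic on the repro input.
text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]
chunk_overlap = 5

index, previous_chunk_len = -1, 0
starts = []
for j, chunk in enumerate(chunks):
    if j > 0:
        # Skip past the non-overlapping prefix of the previous chunk.
        minimum_index_offset = max(
            0,
            previous_chunk_len - chunk_overlap,
            previous_chunk_len - len(chunk),
        )
    else:
        # index starts at -1, so an offset of 1 searches from position 0.
        minimum_index_offset = 1
    index = text.find(chunk, index + minimum_index_offset)
    starts.append(index)
    previous_chunk_len = len(chunk)

print(starts)  # [0, 6, 6] — matches the expected start_index values
```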