Bug: incorrect value of start_index in RecursiveCharacterTextSplitter when substring is present
maggonravi opened this issue
Ravi Maggon commented
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
splitter.split_documents([Document(page_content="chunk chunk")])
Error Message and Stack Trace (if applicable)
No response
Description
Expected output
[Document(page_content='chunk', metadata={'start_index': 0}),
Document(page_content='chun', metadata={'start_index': 6}),
Document(page_content='chunk', metadata={'start_index': 6})]
Output with current code
[Document(page_content='chunk', metadata={'start_index': 0}),
Document(page_content='chun', metadata={'start_index': 0}),
Document(page_content='chunk', metadata={'start_index': 0})]
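The repeated start_index of 0 is consistent with the splitter clamping its search offset to 0 whenever the overlap swallows the whole previous chunk. The sketch below is an assumption reconstructed from the reported output (it mirrors offset arithmetic of the form `index + previous_chunk_len - chunk_overlap`, not code copied from the library), and shows how `str.find` then keeps matching the repeated substring at position 0:

```python
# Standalone sketch of the suspected offset arithmetic (assumption:
# the splitter searches from index + previous_chunk_len - chunk_overlap,
# clamped to 0). The chunk list is what split_text returns for the repro.
text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]
chunk_overlap = 5

index, previous_chunk_len = -1, 0
starts = []
for chunk in chunks:
    # With chunk_overlap equal to the chunk length, this offset never
    # moves past 0, so find() keeps hitting the first occurrence.
    offset = max(0, index + previous_chunk_len - chunk_overlap)
    index = text.find(chunk, offset)
    starts.append(index)
    previous_chunk_len = len(chunk)

print(starts)  # [0, 0, 0]
```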
System Info
System Information
------------------
> OS: Linux
> OS Version: #1 SMP Thu Feb 1 03:51:05 EST 2024
> Python Version: 3.11.8 (main, Mar 15 2024, 12:37:54) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)]
Package Information
-------------------
> langchain_core: 0.1.46
> langchain: 0.1.12
> langchain_community: 0.0.28
> langsmith: 0.0.82
> langchain_experimental: 0.0.47
> langchain_text_splitters: 0.0.1
> langchainplus_sdk: 0.0.21
Ravi Maggon commented
The following change to TextSplitter.create_documents fixes the issue for me:
import copy
from abc import ABC
from typing import List, Optional

from langchain_core.documents import BaseDocumentTransformer, Document


class TextSplitter(BaseDocumentTransformer, ABC):
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = -1
            previous_chunk_len = 0
            for j, chunk in enumerate(self.split_text(text)):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    if j > 0:
                        # Advance the search cursor by at least the part of
                        # the previous chunk that cannot be overlap, so a
                        # repeated substring is not matched at a stale position.
                        minimum_index_offset = max(
                            0,
                            previous_chunk_len - self._chunk_overlap,
                            previous_chunk_len - len(chunk),
                        )
                    else:
                        # First chunk: index is -1, so an offset of 1 starts
                        # the search at position 0.
                        minimum_index_offset = 1
                    index = text.find(chunk, index + minimum_index_offset)
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents
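Tracing the proposed minimum_index_offset logic standalone on the repro input (the chunk list below is what split_text returns for "chunk chunk" with these settings) shows it now produces the expected start indexes:

```python
# Standalone trace of the proposed fix's offset logic on the repro input.
text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]
chunk_overlap = 5

index, previous_chunk_len = -1, 0
starts = []
for j, chunk in enumerate(chunks):
    if j > 0:
        # Skip past the non-overlapping prefix of the previous chunk.
        minimum_index_offset = max(
            0,
            previous_chunk_len - chunk_overlap,
            previous_chunk_len - len(chunk),
        )
    else:
        # index starts at -1, so an offset of 1 searches from position 0.
        minimum_index_offset = 1
    index = text.find(chunk, index + minimum_index_offset)
    starts.append(index)
    previous_chunk_len = len(chunk)

print(starts)  # [0, 6, 6] — matches the expected start_index values
```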