gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

Additional questions on the summarisation tutorial

gidzr opened this issue

Hey there

Thanks for putting this together. I came to the same conclusion about summarising a large document: split it, embed the sections, then rank them and pick the most relevant ones for a map_reduce pass.

However, I've been scouring the net and racking my brains to find a splitter that works by theme (e.g. keyword density) or can identify chapter/section breaks without having to pre-define what the markup looks like.

Is there a Python tool or form of analysis that can segment a text document into smaller parts more intelligently than a character-length breakpoint?

Thanks :)

That's a great question and a topic I've actually been thinking a lot about. I think I might do a tutorial on the 5 levels of text splitting and the cases each one works for.

I haven't yet seen a "semantic splitter" like you're talking about, but it's on my mind.

Good question!
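One possible shape for such a splitter, purely as a sketch: embed consecutive sentences and start a new chunk wherever similarity to the previous sentence dips. Nothing here is a library feature; the model name and threshold below are arbitrary placeholders.

```python
# Sketch of a "semantic splitter": break wherever consecutive sentences
# stop resembling each other. Model and threshold are placeholders.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_split(sentences, threshold=0.5):
    """Group a pre-split list of sentences into topically coherent chunks."""
    embeddings = model.encode(sentences)  # one vector per sentence
    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        if cos_sim(prev_emb, emb).item() < threshold:
            chunks.append(" ".join(current))  # topic shift -> close the chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```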

@gidzr I'm trying to solve the same issue. This is a bit meta, but what about an initial step of using an LLM to split the document into topical segments / chapters? I've had decent results prompting gpt-3.5-turbo to do this, but the chapters do seem on the shorter side (varying of course, but usually a few hundred words). If chapters are too short, I was considering a merge-chapters step based on simple cosine similarity of the average embedding of neighboring chapters (sbert or LLM embeddings).
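For what it's worth, a minimal sketch of that merge step, assuming sentence-transformers for the sbert embeddings. It encodes each chapter directly rather than averaging per-sentence embeddings, and the 0.8 threshold is just a placeholder:

```python
# Sketch of the "merge short chapters" pass described above.
# Model name and threshold are placeholders, not tested values.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def merge_short_chapters(chapters, threshold=0.8):
    """Fold each chapter into its predecessor when their embeddings agree."""
    embeddings = model.encode(chapters)  # one vector per chapter
    merged, merged_embs = [chapters[0]], [embeddings[0]]
    for text, emb in zip(chapters[1:], embeddings[1:]):
        if cos_sim(merged_embs[-1], emb).item() >= threshold:
            merged[-1] += "\n\n" + text  # similar neighbours -> merge
        else:
            merged.append(text)
            merged_embs.append(emb)
    return merged
```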

@gkamradt Thanks for this repo and resources! What are the 5 levels of text splitting? +1 on the usefulness of a tutorial

Here's my current thought on that:

  1. Character split
  2. Recursive character text splitter (see the sketch after this list)
  3. Document-specific chunker (code, Markdown, PDF parser w/ tables)
  4. Semantic chunking
  5. Agent chunking
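For level 2, the stock LangChain splitter looks roughly like this. The chunk sizes are arbitrary examples and `long_document_text` is a placeholder; the import path can vary by LangChain version:

```python
# Level 2: recursive character splitting with LangChain.
# chunk_size / chunk_overlap values here are arbitrary examples.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # overlap to preserve context across boundaries
    separators=["\n\n", "\n", " ", ""],  # try paragraph breaks first
)
chunks = splitter.split_text(long_document_text)  # returns a list of strings
```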

However, the answer to the question is actually getting bigger. There are more nuances to take into account with LangChain's Parent Document Retriever and its multi-vector retrieval.

To come up with a chunking strategy you really need to take the whole retrieval process into account.
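For example, the Parent Document Retriever embeds small child chunks for search but returns the larger parent chunks they came from, so the chunking and retrieval decisions are made together. A rough sketch per the LangChain docs; the vector store and embedding choices here (Chroma, OpenAI) are just examples, and `docs` is a placeholder list of Documents:

```python
# Search over small chunks, return their larger parents.
# Component choices are examples, not requirements.
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),         # holds the full parent chunks
    child_splitter=child_splitter,    # small chunks get embedded
    parent_splitter=parent_splitter,  # large chunks get returned
)
retriever.add_documents(docs)  # docs: list of Document objects (placeholder)
relevant = retriever.get_relevant_documents("some query")
```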