gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

Additional questions on the summarisation tutorial

gidzr opened this issue

Hey there

Thanks for putting this together. I came to the same conclusion about summarising a large document: split it, embed the sections, then rank them and pick the most relevant ones for a map_reduce pass.

However, I've been scouring the net and racking my brains to find a splitter that works by theme (e.g. keyword density) or can identify chapter/section breaks without having to pre-define what the markup looks like.

Is there a Python tool or form of analysis that can segment a text document into smaller parts more intelligently than a character-length breakpoint?

Thanks :)

That's a great question and a topic I've actually been thinking a lot about. I think I might do a tutorial on the 5 levels of text splitting and the cases each one works for.

I haven't yet seen a "semantic splitter" like you're talking about, but it's on my mind.

Good question!
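One possible shape for such a splitter, purely as a sketch: embed consecutive sentences and start a new chunk wherever similarity to the previous sentence dips. Nothing here is a library feature; the model name and threshold below are arbitrary placeholders.

```python
# Sketch of a "semantic splitter": break wherever consecutive sentences
# stop resembling each other. Model and threshold are placeholders.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_split(sentences, threshold=0.5):
    """Group a pre-split list of sentences into topically coherent chunks."""
    embeddings = model.encode(sentences)  # one vector per sentence
    chunks, current = [], [sentences[0]]
    for prev_emb, emb, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        if cos_sim(prev_emb, emb).item() < threshold:
            chunks.append(" ".join(current))  # topic shift -> close the chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```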

@gidzr I'm trying to solve the same issue. This is a bit meta, but what about an initial step of using an LLM to split the document into topical segments / chapters? I've had decent results prompting gpt-3.5-turbo to do this, but the chapters do seem on the shorter side (varying of course, but usually a few hundred words). If chapters are too short, I was considering a merge-chapters step based on simple cosine similarity of the average embedding of neighboring chapters (sbert or LLM embeddings).
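For what it's worth, a minimal sketch of that merge step, assuming sentence-transformers for the sbert embeddings. It encodes each chapter directly rather than averaging per-sentence embeddings, and the 0.8 threshold is just a placeholder:

```python
# Sketch of the "merge short chapters" pass described above.
# Model name and threshold are placeholders, not tested values.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def merge_short_chapters(chapters, threshold=0.8):
    """Fold each chapter into its predecessor when their embeddings agree."""
    embeddings = model.encode(chapters)  # one vector per chapter
    merged, merged_embs = [chapters[0]], [embeddings[0]]
    for text, emb in zip(chapters[1:], embeddings[1:]):
        if cos_sim(merged_embs[-1], emb).item() >= threshold:
            merged[-1] += "\n\n" + text  # similar neighbours -> merge
        else:
            merged.append(text)
            merged_embs.append(emb)
    return merged
```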

@gkamradt Thanks for this repo and resources! What are the 5 levels of text splitting? +1 on the usefulness of a tutorial

Here's my current thought on that:

  1. Character split
  2. Recursive character text splitter (see the sketch after this list)
  3. Document-specific chunker (code, Markdown, PDF parser w/ tables)
  4. Semantic chunking
  5. Agent chunking
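For level 2, the stock LangChain splitter looks roughly like this. The chunk sizes are arbitrary examples and `long_document_text` is a placeholder; the import path can vary by LangChain version:

```python
# Level 2: recursive character splitting with LangChain.
# chunk_size / chunk_overlap values here are arbitrary examples.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # overlap to preserve context across boundaries
    separators=["\n\n", "\n", " ", ""],  # try paragraph breaks first
)
chunks = splitter.split_text(long_document_text)  # returns a list of strings
```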

However, the answer to the question is actually getting bigger. There are more nuances to take into account with LangChain's Parent Document Retriever and its multi-vector retrieval.

To come up with a chunking strategy you really need to take the whole retrieval process into account.
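For example, the Parent Document Retriever embeds small child chunks for search but returns the larger parent chunks they came from, so the chunking and retrieval decisions are made together. A rough sketch per the LangChain docs; the vector store and embedding choices here (Chroma, OpenAI) are just examples, and `docs` is a placeholder list of Documents:

```python
# Search over small chunks, return their larger parents.
# Component choices are examples, not requirements.
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),         # holds the full parent chunks
    child_splitter=child_splitter,    # small chunks get embedded
    parent_splitter=parent_splitter,  # large chunks get returned
)
retriever.add_documents(docs)  # docs: list of Document objects (placeholder)
relevant = retriever.get_relevant_documents("some query")
```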