tmc / langchaingo

LangChain for Go, the easiest way to write LLM-based programs in Go

Home Page: https://tmc.github.io/langchaingo/


Could MarkdownTextSplitter be stopped at the table level, not at the row level?

chew-z opened this issue · comments

I am processing some financial texts that contain tables using MarkdownTextSplitter. It works very well, but perhaps too well for my needs.

As you can see from the logs below, I am getting very small chunks that contain some hierarchy information but only a single data row from a markdown table per chunk. This significantly diminishes the quality of the results.

If I send a larger part of the table, or the entire table, to the LLM (via TextSplitter), I get more context, and the model can sum up rows, subtract, and in general give more intelligent answers.

A single row isn't very informative on its own. With just a few random rows selected from the entire table (and not all rows match the similarity-search criteria on their own, since each contains very little information) I don't get good answers placed in a larger context.

So my question: is there a way to stop splitting at the table level rather than the row level?

2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Treść | 2018 | 2019 |
| --- | --- | --- |
| Pracownicy na stanowiskach nierobotniczych (etaty) | 1 554 | 2 343 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
**_3)_** **_Przeciętnym w roku obrotowym zatrudnieniu, z podziałem na grupy zawodowe;_**
| Treść | 2018 | 2019 |
| --- | --- | --- |
| Pracownicy na stanowiskach robotniczych i pokrewnych (etaty) | 16 | 15 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Treść | 2018 | 2019 |
| --- | --- | --- |
| Zatrudnienie wg stanu na dzień bilansowy w osobach | 1 575 | 2 363 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Treść | 2018 | 2019 |
| --- | --- | --- |
| w tym kobiety | 821 | 1 368 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Treść | 2018 | 2019 |
| --- | --- | --- |
| Ogółem przeciętne zatrudnienie (etaty) | 1 415 | 1 834 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Lp. | Treść | 01.01.-31.12.2018 r. | 01.01.-31.12.2019 r. |
| --- | --- | --- | --- |
| 8 | Wynagrodzenie z tytułu funkcji płatnika: | 3 270 687,00 | 3 515 863,00 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Wyszczególnienie | Stan na 31.12.2018 r. | Zwiększenia | Wykorzystanie | Rozwiązanie | Stan na 31.12.2019 r. |
| --- | --- | --- | --- | --- | --- |
| - niewykorzystane urlopy | 8 918 615,86 | 2 057 949,16 | 0,00 | 0,00 | 10 976 565,02 |
2024/06/28 05:31:16 ---chunk---
2024/06/28 05:31:16 # DODATKOWE INFORMACJE I OBJAŚNIENIA
## Warszawa, 24 marca 2020 r.
| Wyszczególnienie | Stan na 31.12.2018 r. | Zwiększenia | Wykorzystanie | Rozwiązanie | Stan na 31.12.2019 r. |
| --- | --- | --- | --- | --- | --- |
| - regulaminowe wygrane | 11 417 995,55 | 11 057 134,55 | 0,00 | 10 773 580,72 | 11 701 549,38 |


Hi @chew-z,
I have recently been working on chunking in another project.

I referenced and dug into langchaingo a lot.
As far as I know, langchaingo uses a markdown parser to identify Markdown elements such as tables, code, links, etc.
Some parts of the logic restrict how much information a chunk can contain, which may be causing your situation.

In the project I am currently working on, we mainly need header information in the chunk.
So, I rewrote the logic by referring to langchain's MarkdownHeaderTextSplitter.

I guess that would make your chunks closer to what you want.

Because the project I am working on is also open source, you can take a look and check whether its chunking solves your issue.

Hi @chuang8511, that sounds like something that could be worth adding to langchaingo (maybe behind a configuration flag?). I've noticed similar behavior with headings: if you have very short sections in the source document, you end up with very short splits containing a single section, rather than a more reasonably sized split containing multiple sections.
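The merging behavior described above could be sketched roughly as follows. This is only an illustration of the idea behind such a configuration flag, not langchaingo's actual splitter logic; the function name and the size heuristic are hypothetical:

```go
package main

import "fmt"

// mergeShortSections coalesces adjacent short sections into one chunk until
// a target size is reached, instead of emitting one chunk per section.
// Hypothetical sketch, not part of langchaingo.
func mergeShortSections(sections []string, targetSize int) []string {
	var chunks []string
	current := ""
	for _, s := range sections {
		switch {
		case current == "":
			current = s
		case len(current)+len(s)+1 <= targetSize:
			// still room in the current chunk: merge this section in
			current = current + "\n" + s
		default:
			// current chunk is full: emit it and start a new one
			chunks = append(chunks, current)
			current = s
		}
	}
	if current != "" {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	sections := []string{"## A\nshort", "## B\nalso short", "## C\nlonger section body here"}
	for _, c := range mergeShortSections(sections, 40) {
		fmt.Printf("---chunk---\n%s\n", c)
	}
}
```

With a target size of 40, the two short sections above end up merged into one chunk while the longer one stays separate.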

Hi @corani,
Thanks for the reply.
I would also like to integrate this logic into langchaingo.
However, having read the code, I think it will be hard to integrate behind a configuration flag.
Unless it is acceptable for langchaingo to have two MarkdownTextSplitter implementations at the same time; but I guess that would confuse other developers.

By the way, the reason we (the open source project) rewrote the logic is mainly that we want to position our chunks.
So, I propose this improvement.
If it is acceptable for langchaingo, having two MarkdownTextSplitters with different logic could also work.

@tmc what's your opinion?

Personally I have just replaced:

// append table header
	for _, row := range bodies {
		line := tableRowInMarkdown(row)

		mc.joinSnippet(fmt.Sprintf("%s\n%s", headerMD, line))

		// keep every row in a single Document
		mc.applyToChunks()
	}

with

// append table header
	buffer := headerMD
	for _, row := range bodies {
		line := tableRowInMarkdown(row)

		// accumulate rows instead of emitting a Document per row
		buffer = fmt.Sprintf("%s\n%s", buffer, line)
	}
	// keep the entire table in a single Document
	mc.joinSnippet(buffer)
	mc.applyToChunks()

in markdown_splitter.go
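The effect of that replacement can be shown with a small standalone sketch. The function name and signature here are illustrative only; `headerMD` and the rows are plain markdown strings, not langchaingo's internal types:

```go
package main

import (
	"fmt"
	"strings"
)

// joinTableRows buffers all table rows under the table header and returns
// them as one chunk, mirroring the buffered logic above. Hypothetical
// helper, not part of langchaingo.
func joinTableRows(headerMD string, rows []string) string {
	buffer := headerMD
	for _, row := range rows {
		// accumulate each row instead of emitting a chunk per row
		buffer = fmt.Sprintf("%s\n%s", buffer, row)
	}
	return buffer
}

func main() {
	header := "| Treść | 2018 | 2019 |\n| --- | --- | --- |"
	rows := []string{
		"| w tym kobiety | 821 | 1 368 |",
		"| Ogółem przeciętne zatrudnienie (etaty) | 1 415 | 1 834 |",
	}
	chunk := joinTableRows(header, rows)
	// the whole table ends up in a single chunk
	fmt.Println(strings.Count(chunk, "\n") + 1)
}
```

Compared to the original per-row `applyToChunks()` call, every row of a table now lands in the same Document, so a similarity hit on any row retrieves the full table as context.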