usametov / text-splitter

Semantic Text Splitting

Semantic Text Splitter.

This program processes a group of text files and generates a collection of semantic chunks for each file in JSON format. This project is an extension of the work found at this GitHub repository, which itself is based on another project documented here.

The Motivation

We want to avoid using a global constant to determine the size of text chunks, and we believe there is a more effective method. One potential solution is to use embeddings to identify clusters of text that share semantic similarities.

Our underlying assumption is that chunks of text that are semantically similar should be grouped together. This is based on the idea that meaning in language often extends beyond individual sentences, and that by considering larger chunks of text, we can capture more of this meaning.

Furthermore, by grouping similar sentences together, we aim to reduce the amount of noise in the data. Noise, in this context, refers to random or irrelevant information that can interfere with our ability to extract meaningful insights from the text. By reducing this noise, we hope to enhance the clarity and depth of the information we can glean from the text.

In essence, our goal is to leverage the power of embeddings and semantic similarity to create more meaningful and insightful representations of text data. This approach has the potential to significantly improve our ability to understand and interpret large bodies of text.

This notebook effectively visualizes this concept: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
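To make the idea concrete, here is a minimal sketch in Rust of grouping consecutive sentences by embedding similarity. It is not this project's actual implementation. The function names (toy_embed, cosine, semantic_chunks), the similarity threshold, and the exact way the minimum and maximum character limits are applied are all assumptions made for illustration. In particular, toy_embed is a hypothetical stand-in that hashes words into buckets; a real splitter would call an embedding model instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Size of the toy embedding vector (arbitrary, for illustration only).
const DIM: usize = 64;

/// Hypothetical stand-in for a real embedding model: counts lowercased
/// words hashed into DIM buckets. It only captures word overlap.
fn toy_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for word in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        let mut h = DefaultHasher::new();
        word.to_lowercase().hash(&mut h);
        v[(h.finish() as usize) % DIM] += 1.0;
    }
    v
}

/// Cosine similarity of two vectors; returns 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Greedily merges consecutive sentences into chunks. A sentence joins the
/// current chunk when the result stays within `max_chars` and either the
/// chunk is still shorter than `min_chars` or the sentence is similar
/// enough to the chunk (cosine >= `threshold`). Otherwise a new chunk starts.
fn semantic_chunks(
    sentences: &[&str],
    min_chars: usize,
    max_chars: usize,
    threshold: f32,
) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    for s in sentences {
        let s = s.trim();
        if s.is_empty() {
            continue;
        }
        if current.is_empty() {
            current.push_str(s);
            continue;
        }
        let cur_len = current.chars().count();
        let fits = cur_len + 1 + s.chars().count() <= max_chars;
        let too_short = cur_len < min_chars;
        let similar = cosine(&toy_embed(&current), &toy_embed(s)) >= threshold;
        if fits && (too_short || similar) {
            current.push(' ');
            current.push_str(s);
        } else {
            // Close the current chunk and start a new one with this sentence.
            chunks.push(std::mem::take(&mut current));
            current.push_str(s);
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let sentences = [
        "Embeddings map text to vectors.",
        "Similar text yields similar vectors.",
        "Docker images package a program with its dependencies.",
    ];
    for (i, chunk) in semantic_chunks(&sentences, 20, 200, 0.3).iter().enumerate() {
        println!("chunk {i}: {chunk}");
    }
}
```

In this sketch the chunk size is not fixed in advance: the character limits only bound it, and the similarity test decides where a chunk ends. A single sentence longer than max_chars is kept whole rather than split, which is a simplification.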

We will need to experiment with different values for the minimum and maximum character parameters. For that reason, we run the splitter from a Docker image as one step of our MLOps workflow.

Here's the command to run this in Docker:

docker run -v /home/user777/code/rust/text-splitter/tests:/data <image-name> /target/release/text-splitter --minchar 200 --maxchar 500 --input-files /data/inputs/files2process.txt --dir /data/inputs -o /data

License: MIT License

