safe_store

safe_store is an open-source Python library that provides essential tools for text data management, vectorization, and document retrieval. It empowers users to work with text documents efficiently and effortlessly.

Key Features:

1. Text Vectorizer

Versatile Vectorization: Convert text documents into numerical representations with TF-IDF vectorization or model-based embeddings, or use BM25 ranking for text retrieval.
Document Similarity: Find documents similar to a given query text, making it ideal for document retrieval tasks.
Interactive Visualization: Visualize document embeddings in a scatter plot to gain insights into document relationships.
No Authentication Required: Use the library without the need for API keys or authentication, making it accessible for everyone.
Commercially Usable: safe_store is 100% open-source and free to use, even for commercial purposes, under the Apache 2.0 License.
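
To make the BM25 option more concrete, here is a minimal, standalone sketch of Okapi BM25 scoring in plain Python. It is not safe_store's implementation: the function name bm25_scores, the whitespace tokenizer, and the defaults k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per document for the given query.

    Illustrative sketch only: tokenization is a lowercase whitespace split,
    and k1 / b are common defaults, not values taken from safe_store.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []

    # Average document length, used for length normalization.
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

A document that shares no term with the query scores 0.0, and rarer query terms contribute more than common ones, which is why BM25 ranks results directly instead of producing a vector per document.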

2. Generic Data Loader

Multi-format Support: Read various file formats, including PDF, DOCX, JSON, HTML, and more.
Simplified Text Extraction: Convert file content to plain text or data structures with ease.
Efficient and Time-Saving: Streamline data loading and processing tasks, reducing the need for manual extraction.

What Can You Use `safe_store` For?

Text Document Analysis: Analyze and understand the content of text documents quickly and efficiently.
Document Retrieval: Retrieve documents similar to a given query text, facilitating content recommendation and search tasks.
Text Data Preprocessing: Prepare text data for natural language processing (NLP) tasks, such as sentiment analysis and text classification.
Data Loading: Streamline the process of reading and extracting content from various file formats.

safe_store is designed to be accessible, versatile, and free for all users. It's an ideal choice for developers, data scientists, and researchers who want a user-friendly and open-source solution for working with text data.

Explore the world of text data management and analysis with safe_store today!

Text Vectorizer

Features

Vectorize and index text documents.
Retrieve similar documents based on a query.
Supports both TF-IDF vectorization and model-based embeddings.
Interactive visualization of document embeddings.
No authentication or API keys required.

Installation

To install safe_store, you can use pip:

pip install safe_store

Getting Started

Initializing the Text Vectorizer

from safe_store import TextVectorizer, VectorizationMethod
from pathlib import Path

# Create an instance of TextVectorizer
vectorizer = TextVectorizer(
    vectorization_method=VectorizationMethod.TFIDF_VECTORIZER,
    database_path="database.json",
    save_db=False
)

Adding and Indexing Documents

# Add documents for vectorization
documents = ["llm", "space", "submarines", "new york"]
for doc in documents:
    document_name = Path(__file__).parent / f"{doc}.txt"
    with open(document_name, 'r', encoding='utf-8') as file:
        text = file.read()
    vectorizer.add_document(
        document_name,
        text,
        chunk_size=100,
        overlap_size=20,
        force_vectorize=False,
        add_as_a_bloc=False,
    )

# Index the documents (perform vectorization)
vectorizer.index()
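
The chunk_size and overlap_size arguments control how each document is split before vectorization. The sketch below shows the general idea of overlapping chunks in plain Python. It is only an illustration: chunk_words is a hypothetical helper, and it counts whitespace-separated words, whereas safe_store may count tokens differently.

```python
def chunk_words(text: str, chunk_size: int = 100, overlap_size: int = 20) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Each chunk after the first repeats the last overlap_size words of the
    previous chunk. Hypothetical helper, not part of safe_store.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.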

Embedding a Query and Retrieving Similar Documents

# Embed a query and retrieve similar documents
query_text = "what is space"
similar_texts, _, _ = vectorizer.recover_text(query_text, top_k=3)

# Show the interactive document visualization
vectorizer.show_document(show_interactive_form=True)

print("Similar Documents:")
for i, text in enumerate(similar_texts):
    print(f"{i + 1}: {text}")

Calling vectorizer.show_document(show_interactive_form=True) should produce a scatter plot in which hovering over a dot displays its text. Each dot is one chunk of text, and chunks that come from the same document tend to form a cluster.
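
For readers curious about what TF-IDF retrieval does conceptually, the following is a minimal sketch in plain Python: it weights terms by a smoothed TF-IDF formula and ranks documents by cosine similarity to the query. It is not how safe_store is implemented; the helper names (tokenize, build_tfidf, cosine, top_k) and the whitespace tokenizer are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: a simple lowercase whitespace split stands in for a real tokenizer.
    return text.lower().split()


def build_tfidf(docs: list[str]) -> tuple[dict[str, float], list[dict[str, float]]]:
    """Return (idf table, one sparse TF-IDF vector per document)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, docs: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    idf, vectors = build_tfidf(docs)
    # Query terms that never occur in the corpus carry no weight.
    query_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(((cosine(query_vec, vec), i) for i, vec in enumerate(vectors)), reverse=True)
    return [(docs[i], score) for score, i in ranked[:k]]
```

In this sketch, calling top_k("what is space", docs, k=3) plays a role similar to recover_text above: it returns the best-matching texts first, together with their similarity scores.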

Generic Data Loader

Features

Read various file formats including PDF, DOCX, JSON, HTML, and more.
Convert file content to text or data structures.

Usage

To read a file using GenericDataLoader, you can use the read_file method and provide the file path:

from safe_store import GenericDataLoader
from pathlib import Path

file_path = Path("example.pdf")
file_content = GenericDataLoader.read_file(file_path)
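
A loader of this kind typically picks a reader based on the file extension and returns either plain text or a parsed data structure. The sketch below illustrates that pattern with the standard library only. It is not GenericDataLoader's actual code: read_any, the READERS table, and the handful of extensions it covers are hypothetical.

```python
import json
from pathlib import Path


def _read_text(path: Path) -> str:
    """Return the file content as plain text."""
    return path.read_text(encoding="utf-8")


def _read_json(path: Path):
    """Parse the file and return the resulting Python data structure."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical extension-to-reader table; a real loader would cover
# many more formats (PDF, DOCX, HTML, ...) with dedicated parsers.
READERS = {
    ".txt": _read_text,
    ".md": _read_text,
    ".log": _read_text,
    ".json": _read_json,
}


def read_any(path):
    """Dispatch to a reader based on the file suffix (case-insensitive)."""
    path = Path(path)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {path.suffix or path.name}")
    return reader(path)
```

Keeping the mapping in a dictionary makes it straightforward to add a format: write one reader function and register its extension.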

Supported File Types

PDF
DOCX
JSON
HTML
PPTX
TXT
RTF
MD
LOG
CPP
Java
JS
Python
Ruby
Shell Script
SQL
CSS
PHP
XML
YAML
INI
INF
MAP
BAT

In the usage example above, replace "example.pdf" with the path to your own file.

Author

ParisNeo

License

This project is licensed under the Apache 2.0 License.

About

A 100% open-source data indexing library that requires no closed-source embeddings or opaque code.
