nate-dryer/rag-preprocessor

Key Features

Tokenization: Splits text into tokens (words and sentences).
Part-of-Speech Tagging: Assigns a part of speech (verb, noun, adjective, etc.) to each token.
Named Entity Recognition: Identifies named entities in text and classifies them into predefined categories.
Lemmatization: Reduces words to their base or dictionary form.
Phone Number Extraction and Anonymization: Detects and anonymizes phone numbers in text.
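
As an illustration of the phone number feature, here is a minimal sketch using only Python's standard `re` module. It is not the project's actual implementation: the pattern (North American style numbers only), the function names `extract_phone_numbers` and `anonymize_phone_numbers`, and the `[PHONE]` placeholder are all assumptions made for this example.

```python
import re

# Hypothetical pattern: matches common North American formats such as
# (555) 123-4567, 555.987.6543 and +1 555 123 4567. Other formats are
# deliberately out of scope for this sketch.
PHONE_PATTERN = re.compile(
    r"""
    (?<!\w)                 # must not start in the middle of a word or number
    (?:\+?1[\s.-]?)?        # optional country code
    (?:\(\d{3}\)|\d{3})     # area code, with or without parentheses
    [\s.-]?                 # optional separator
    \d{3}                   # exchange
    [\s.-]?                 # optional separator
    \d{4}                   # line number
    (?!\w)                  # must not run into more digits or letters
    """,
    re.VERBOSE,
)


def extract_phone_numbers(text):
    """Return every phone number found in the text, in order of appearance."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


def anonymize_phone_numbers(text, placeholder="[PHONE]"):
    """Replace every detected phone number with a placeholder string."""
    return PHONE_PATTERN.sub(placeholder, text)
```

Calling `anonymize_phone_numbers("Call (555) 123-4567 today.")` would replace the number with `[PHONE]` and leave the rest of the sentence untouched. The lookarounds at both ends keep longer digit runs, such as order numbers, from being treated as phone numbers.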

Before you begin, ensure you have Python and NLTK installed on your system.

You can install NLTK using pip:

pip install nltk

Clone the repository to your local machine to get started:

git clone https://github.com/nate-dryer/rag-preprocessor.git
cd rag-preprocessor

Run the script using the following command:

python main.py <input_file> --format <output_format> --logging

Replace <input_file> with the path to your text file and <output_format> with either json or csv, depending on the output format you want.
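
For readers curious how a command line like this could be handled, the following is a rough sketch built on the standard `argparse`, `json`, and `csv` modules. It is not taken from `main.py`: the helper names `build_parser` and `serialize`, the record fields (`token`, `pos`, `lemma`), and the choice to make `--format` required are assumptions made for illustration.

```python
import argparse
import csv
import io
import json


def build_parser():
    """Build a parser that mirrors the command shown above (hypothetical)."""
    parser = argparse.ArgumentParser(description="Preprocess a text file.")
    parser.add_argument("input_file", help="path to the text file to process")
    # Assumption: the format is required; the real script may define a default.
    parser.add_argument(
        "--format",
        dest="output_format",
        choices=["json", "csv"],
        required=True,
        help="output format",
    )
    parser.add_argument(
        "--logging", action="store_true", help="enable logging output"
    )
    return parser


def serialize(records, output_format):
    """Render a list of per-token dicts as a JSON or CSV string."""
    if output_format == "json":
        return json.dumps(records, indent=2)
    buffer = io.StringIO()
    # Assumed column layout; the real output schema may differ.
    writer = csv.DictWriter(
        buffer, fieldnames=["token", "pos", "lemma"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

With this sketch, `build_parser().parse_args(["notes.txt", "--format", "csv", "--logging"])` yields a namespace with `input_file`, `output_format`, and `logging` attributes, and `serialize` turns the processed records into the requested format. Restricting `--format` with `choices` lets `argparse` reject unsupported values before any processing starts.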

Contributions to the Text Preprocessing Utility are welcome!

This project is licensed under the MIT License. For more information, refer to the LICENSE file.