LiuYuWei / document2paragraph

A Python script that converts PDF, DOCX, and other documents into paragraphs so they can be turned into embedding vectors.

document2paragraph

This Python script converts PDF, DOCX, and other documents into paragraphs so that they can then be turned into embedding vectors. It is particularly useful when text must be extracted from a variety of documents and processed further.

Features

  • Supports text extraction from PDF and DOCX files.
  • Allows custom title patterns for segmenting text.
  • Saves extracted text into a CSV file for further processing.
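The segmenting and CSV features listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names `split_by_titles` and `write_sections_csv`, the column names `title` and `paragraph`, and the sample text and pattern are all assumptions made for the example.

```python
import csv
import io
import re


def split_by_titles(text, pattern):
    """Split text into (title, body) pairs at every match of `pattern`.

    Hypothetical helper: text before the first title match is dropped.
    """
    matches = list(re.finditer(pattern, text))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this title to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group().strip(), text[match.end():end].strip()))
    return sections


def write_sections_csv(sections, fileobj):
    """Write (title, body) pairs as CSV rows with an assumed header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["title", "paragraph"])
    writer.writerows(sections)


# Assumed sample input: numbered headings at the start of a line.
sample = "1. Scope\nThis policy covers all staff.\n2. Terms\nDefinitions follow."
sections = split_by_titles(sample, r"(?m)^\d+\.\s+\S+")

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_sections_csv(sections, buffer)
```

To write a real file instead of a buffer, open it with `open(path, "w", newline="", encoding="utf-8")`, which is the mode the `csv` module documentation recommends.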

Installation

Before using the document2paragraph tool, ensure that you have Python 3 installed. Follow these steps to install the necessary dependencies:

git clone https://github.com/LiuYuWei/document2paragraph.git
cd document2paragraph
pip install -r requirements.txt

Usage

To use the document2paragraph tool, follow these steps:

  1. Place your PDF or DOCX files in an appropriate directory.
  2. Execute the script with the following command:
python main.py <document_file_path> --pattern <split_pattern> --folder <output_folder>

For example:

python main.py example.pdf --pattern "(\s*[一二三四五六七八九十]{1,3}\、)" --folder result
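The pattern in this example matches Chinese ordinal headings such as 一、 or 十一、 (one to three numeral characters followed by the enumeration comma 、), including any whitespace in front of them. The short sketch below shows what the pattern matches on an assumed sample string. It only demonstrates the regular expression with Python's `re` module; how `main.py` applies the pattern internally is not shown here and may differ.

```python
import re

# Pattern copied from the example command above.
pattern = r"(\s*[一二三四五六七八九十]{1,3}\、)"

# Assumed sample text with three headings: 一、 二、 十一、
sample = "一、總則 本辦法適用於全體員工。 二、定義 以下為名詞定義。 十一、附則 本辦法自公布日施行。"

# Heading markers found by the pattern (leading whitespace stripped).
markers = [m.strip() for m in re.findall(pattern, sample)]

# Because the pattern has a capturing group, re.split keeps the markers:
# pieces alternate between body text and heading marker.
pieces = re.split(pattern, sample)
paragraphs = [
    (title.strip(), body.strip())
    for title, body in zip(pieces[1::2], pieces[2::2])
]
```

Quoting the pattern in double quotes on the command line, as in the example, keeps the shell from interpreting the parentheses and brackets.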

Streamlit web UI

  1. Run the Streamlit web UI with the following commands:
git clone https://github.com/LiuYuWei/document2paragraph.git
cd document2paragraph
streamlit run streamlit.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

License:Apache License 2.0


Languages

Language:Python 100.0%