shangeth / DocumentExtraction

Repository from Github https://github.com/shangeth/DocumentExtraction

DocumentExtraction

This repository contains code for extracting details such as the author, author institution, companies, company target price, and BUY/SELL call from financial PDF documents.

Installation

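Install the system dependencies (Tesseract OCR and Poppler), then use the package manager pip to install the Python requirements and download the spaCy models.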

apt install tesseract-ocr
apt-get install poppler-utils

pip install -r requirements.txt

python -m spacy download en_core_web_trf
python -m spacy download en_core_web_sm

On Windows, also install Tesseract. On either platform, set the path to the Tesseract executable in your script with

import pytesseract

# Windows: path to the .exe file
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\user\Programs\Tesseract-OCR\tesseract.exe"

# Linux: run 'which tesseract' after installing tesseract to get the path
# pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"

NOTE: pytesseract is only necessary for methods using Tesseract-OCR.
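
As a convenience, the executable path can be looked up instead of hard-coded. The sketch below is not part of this repository: `find_tesseract` is a hypothetical helper that checks the system PATH first and then falls back to a list of candidate locations you supply.

```python
import os
import shutil
from typing import Iterable, Optional


def find_tesseract(candidates: Iterable[str] = ()) -> Optional[str]:
    """Return a path to a Tesseract executable, or None if none is found.

    Hypothetical helper: looks on PATH first, then tries each candidate path.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Example candidates taken from the paths shown above.
tesseract_path = find_tesseract([
    r"C:\Users\user\Programs\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
])

# With pytesseract installed, the result could then be assigned like this:
# import pytesseract
# if tesseract_path:
#     pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the function returns None, Tesseract is probably not installed or lives in a location that is not in the candidate list.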

Usage

ImageTools

from ImgProcess import ImageTools

# split the document image into regions of interest,
# skipping parts of the document that carry no useful content

pdf_image = 'path to image of document'
img_tool = ImageTools()
doc_imgs = img_tool(pdf_image)
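
To give a feel for what splitting a page into regions can mean, here is a minimal sketch of one common approach: cutting a page at blank horizontal bands. It is an illustration only and makes no claim about how `ImageTools` works internally. The function name `split_rows_on_blank_bands`, the list-of-lists image format, and the `blank` and `min_gap` parameters are all assumptions made for this example.

```python
from typing import List, Tuple


def split_rows_on_blank_bands(
    page: List[List[int]], blank: int = 255, min_gap: int = 2
) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges of content blocks.

    `page` is a grayscale image as rows of pixel values. Two blocks are
    separated when at least `min_gap` fully blank rows lie between them.
    `end` is exclusive, so page[start:end] is the block.
    """
    regions: List[Tuple[int, int]] = []
    start = None      # first row of the block being built
    last_ink = None   # most recent row that contained content
    for y, row in enumerate(page):
        has_ink = any(px != blank for px in row)
        if not has_ink:
            continue
        if start is None:
            start = y
        elif y - last_ink - 1 >= min_gap:
            # The blank band is wide enough: close the previous block.
            regions.append((start, last_ink + 1))
            start = y
        last_ink = y
    if start is not None:
        regions.append((start, last_ink + 1))
    return regions


W, B = 255, 0  # white background, black ink
toy_page = [[W, W], [B, W], [W, B], [W, W], [W, W], [W, W], [B, B], [W, W]]
blocks = split_rows_on_blank_bands(toy_page, min_gap=2)
```

On the toy page, rows 1 to 2 and row 6 contain ink and are separated by three blank rows, so two blocks are returned. A real page would first need converting to grayscale pixel rows.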

PDFReader

from reader import PDFReader

# returns the text content in a PDF file using ImageTools
# 3 available methods
# - pdfplumber
# - pytesseract
# - pytesseract_split

reader = PDFReader(pdf_method='tesseract_split')
text_content = reader('path_to_pdf_document')
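
For orientation, the sketch below shows roughly what plain text extraction with the pdfplumber library looks like. It is not taken from `PDFReader`, which may work differently. The helper names `join_pages` and `extract_text_with_pdfplumber` are hypothetical.

```python
from typing import Iterable, Optional


def join_pages(pages: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded no text."""
    return "\n".join(p for p in pages if p)


def extract_text_with_pdfplumber(pdf_path: str) -> str:
    """Return the text of every page in the PDF as one string."""
    # Imported here so join_pages stays usable without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or "" for image-only pages.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Scanned documents usually contain no embedded text, which is the case where an OCR-based method is the better choice.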

EntityRecognition

from NER import EntityRecognition

# extracts the details from the text content 
ER = EntityRecognition(pdf_method='tesseract_split')
author_institution, author, companies, target = ER(text_content)
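
To illustrate the kind of detail being extracted, here is a deliberately simple regex-based sketch that pulls a BUY/SELL call and a target price out of text. It is not the method used by `EntityRecognition`; the installation steps download spaCy models, which suggests the repository relies on them. The patterns and the function name `extract_call_and_target` are assumptions for this example.

```python
import re
from typing import Optional, Tuple

# Matches the words BUY or SELL in any letter case. This is naive: ordinary
# prose such as "investors buy" would also match.
RATING_RE = re.compile(r"\b(BUY|SELL)\b", re.IGNORECASE)

# Matches "target price", up to 20 non-digit characters, then a number
# that may contain thousands separators and a decimal part.
TARGET_RE = re.compile(
    r"target\s+price\D{0,20}?(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE
)


def extract_call_and_target(text: str) -> Tuple[Optional[str], Optional[float]]:
    """Return (call, target_price); either value is None when not found."""
    rating = RATING_RE.search(text)
    target = TARGET_RE.search(text)
    call = rating.group(1).upper() if rating else None
    price = float(target.group(1).replace(",", "")) if target else None
    return call, price


sample = "We reiterate our BUY rating with a target price of Rs 1,250.50."
call, price = extract_call_and_target(sample)
```

For the sample sentence this yields the call "BUY" and the price 1250.5. Real reports vary widely in wording, so patterns like these are at most a baseline.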

Command line

# Extract details from a single PDF file
python main.py --pdf_method='tesseract_split' --pdf_file='path_to_pdf'

# Extract details from a directory of PDF files
python main.py --pdf_method='tesseract_split' --pdf_dir='path_to_pdf_dir'

# Extract details from a directory of PDF files to a CSV file
python results.py --pdf_method='tesseract_split' --pdf_dir='path_to_pdf_dir' --csv_path='path_to_csv'
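
If you would rather collect results in your own script, the extracted fields can be written out with the standard `csv` module. The sketch below is independent of `results.py`. The column names in `FIELDS`, the `file` column, and the function `write_results_csv` are assumptions, and the actual CSV layout produced by the repository may differ.

```python
import csv
from typing import Dict, Iterable

# Assumed column layout, mirroring the values returned by EntityRecognition.
FIELDS = ["file", "author_institution", "author", "companies", "target"]


def write_results_csv(rows: Iterable[Dict[str, object]], csv_path: str) -> None:
    """Write one CSV row per document; missing fields are left empty."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            record = dict(row)
            companies = record.get("companies")
            # Flatten a list of companies into a single cell.
            if isinstance(companies, (list, tuple)):
                record["companies"] = "; ".join(map(str, companies))
            writer.writerow(record)
```

Each row is a dictionary keyed by the column names above, for example one built from the tuple that `ER(text_content)` returns plus the source file name.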

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
