minoad / plat-analysis

Plat Analysis

Summary

This project began as a simple processor for homeowners association restriction documents: it extracted the pertinent information and stored it. It has since grown to handle a diverse range of document types and to support storage across multiple database systems.

In recent months, an influx of historical newspaper clippings and other document images has expanded the project further and required more sophisticated image extraction techniques. The project now also covers development of a text classifier, sentiment analysis, and various other text analysis tools.

TODO

    • Find a way to avoid installing Java by removing the Tika dependency.
    • Add textract for additional metadata information.
    • For searchable PDFs, extract the text and store it in the database.
    • Document should accept all potential file types and dispatch by matching on the type rather than testing internally.
    • In the pdfprocessor object, collect metadata about the file and merge it into the writer object.
    • Add sqlwriter
    • Deal with these file names
    • Add filetypes
    • Come back to struct errors and figure out what is going on
    • Can I remove watermarks?
    • Convert the file write to use a protocol and implement a sqlwriter
    • Resolve WARNING:plat.ocr:Not implemented error on pages in pdf plat/data/GRAND MESA/Grand Mesa 7 Lots 83 and 84 Replat Addressing 2019-07-30 (2).pdf: unsupported filter /JBIG2Decode. File path: plat/data/GRAND MESA/Grand Mesa 7 Lots 83 and 84 Replat Addressing 2019-07-30 (2).pdf
    • Collect additional PDF file properties.
    • Secure the username and password
    • Rename plat to src
    • Remove watermarks
    • Image analysis
    • OpenCV
    • OCR document
    • Store text data
    • Analyse text data
    • Store image data in MongoDB
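
One possible shape for the "use a protocol and implement a sqlwriter" item is sketched below. It is only an illustration, not the project's code: `DocumentRecord`, `Writer`, `SqlWriter`, `store`, and the `documents` table layout are all hypothetical names, and SQLite from the standard library stands in for whichever SQL backend is eventually chosen.

```python
import json
import sqlite3
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class DocumentRecord:
    """Hypothetical container for one processed document."""
    path: str
    text: str
    metadata: dict[str, str] = field(default_factory=dict)


class Writer(Protocol):
    """Any object with a matching write() method satisfies this protocol."""

    def write(self, record: DocumentRecord) -> None: ...


class SqlWriter:
    """Stores records in SQLite (an assumed stand-in for the real SQL backend)."""

    def __init__(self, database: str = ":memory:") -> None:
        self.conn = sqlite3.connect(database)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "path TEXT PRIMARY KEY, text TEXT, metadata TEXT)"
        )

    def write(self, record: DocumentRecord) -> None:
        # Parameterized insert; metadata is serialized to JSON text.
        self.conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (record.path, record.text, json.dumps(record.metadata)),
        )
        self.conn.commit()


def store(record: DocumentRecord, writer: Writer) -> None:
    """Callers depend only on the Writer protocol, not on a concrete backend."""
    writer.write(record)
```

Because `Writer` is a structural protocol, an existing file or MongoDB writer would not need to inherit from anything; it only needs a compatible `write` method.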

Mongo Details

"IPAddress": "172.19.0.3",
"DNSNames": [
    "plat_analysis_devcontainer-mongo-1",
    "mongo",
    "18470a4f9382"
]
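
As a rough sketch of how code inside the dev container might address this service, the helper below builds a connection URI for the `mongo` DNS name listed above while keeping credentials out of the source. The environment variable names `MONGO_USER` and `MONGO_PASSWORD` are hypothetical, and port 27017 is assumed only because it is MongoDB's default; neither is confirmed by this repository.

```python
import os
from urllib.parse import quote_plus


def mongo_uri(host: str = "mongo", port: int = 27017) -> str:
    """Build a MongoDB connection URI without hardcoding credentials."""
    # MONGO_USER and MONGO_PASSWORD are assumed names, not project settings.
    user = os.environ.get("MONGO_USER")
    password = os.environ.get("MONGO_PASSWORD")
    if user and password:
        # Percent-encode so characters such as '@' or ':' do not break the URI.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
    # Fall back to an unauthenticated URI when no credentials are set.
    return f"mongodb://{host}:{port}/"
```

The resulting string can be passed to a MongoDB client library; the container IP address is deliberately not used, since it can change whenever the container is recreated.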

Setup

make build
