There are 2 repositories under the document-processing topic.
Generic framework for historical document processing
A full-featured Document Layer for your application, providing the functionality of a flexible document management system, including storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. 🌟 Star to support our work!
An include filter for Pandoc
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
Semantic extraction from conference proceedings.
A comprehensive list of annotated training datasets classified by use case.
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
A module for creating stopword lists for any language, based on a set of documents.
A Python command-line utility for automating some copyediting tasks in documents. It edits zipped, XML-based files (e.g. docx, odt, or epub) through XSLT stylesheets, and can be extended fairly easily with your own custom XSL stylesheets.
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
Document Templater is a powerful tool for automated document generation. Streamline the process of creating standard documents, such as contracts, reports, and forms, using predefined templates. This repository contains the source code for Document Templater, allowing you to easily integrate this functionality into your projects and automate docs.
Text line detection for Urdu OCR (UTRNet)
Use data from MongoDB in LangChain, Llama and OpenAI
FileGazer - deep file analysing and categorisation
Generative intent detection with Magick
An implementation of basic IR techniques from scratch.
TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator
Spire.Doc for C++ is a professional Word C++ library specifically designed for developers to create, read, write, convert, merge, split, and compare Word documents on any C++ platform with fast and high-quality performance.
ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.
This set of robots provides support for automatically obtaining information from invoices using the docDigitizer API and keeping track of the processed invoices in an Airtable repository.
A program that helps remove watermarks from a PDF document.
Merge two relatively incomplete documents into one complete document via a Python script.
AI-powered chatbot designed to simplify the job search process
Python tool for converting PDF files to text. Simplify your document processing tasks.
Minimize the time required for audit report analysis with a containerized file conversion and scraping system.
School/College Stationery List OCR and Parsing
Apply keyword procedures in a given Racket namespace using X-expressions.