There are 3 repositories under the document-processing topic.
Generic framework for historical document processing
A full-featured Document Layer for your application, providing the functionality of a flexible document management system, including storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. 🌟 Star to support our work!
⚡ Cloud-native, AI-powered document processing pipelines on AWS.
An include filter for Pandoc
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
Unofficial mirror of git://git.lyx.org/lyx.git (updated daily; not affiliated with lyx.org).
Semantic extraction from conference proceedings.
A comprehensive list of annotated training datasets classified by use case.
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.
A module for creating stopword lists for any language, based on a set of documents.
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
A Python command-line utility for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub) through XSLT stylesheets, and can be extended fairly easily with your own custom XSL stylesheets.
Text line detection for Urdu OCR (UTRNet)
Use data from MongoDB in LangChain, Llama and OpenAI
FileGazer - deep file analysis and categorisation
Phone-Oriented Processing SofTware for ARchives
Generative intent detection with Magick
An implementation of basic IR techniques from scratch.
Pdf2xNet is a .NET library for seamless integration with Xpdf tools, enabling easy conversion of PDF documents to text, images, and HTML formats within your .NET applications.
TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator
Spire.Doc for C++ is a professional Word C++ library specifically designed for developers to create, read, write, convert, merge, split, and compare Word documents on any C++ platform with fast and high-quality performance.
A document preprocessor that works in conjunction with tools like groff/troff & refer.
ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.
This set of robots provides support for automatically obtaining information from invoices using the docDigitizer API and keeping track of the processed invoices in an Airtable repository.
A program that helps remove watermarks from a PDF document.
Merges two relatively incomplete documents into one complete document via a Python script.
XmlDocumentProcessor: A .NET component for XML document processing. It analyzes XML content, performs keyword-based queries, and transforms data into HTML. Emphasizes design patterns like Strategy pattern, with a focus on class diagramming. Implements penalty for non-compliance.
AI-powered chatbot designed to simplify the job search process
Python tool for converting PDF files to text. Simplify your document processing tasks.
Adds an annotation to each element in a document, identifying what it is.
Service to vectorize documents into a FAISS vectorstore.