document-processing

There are 36 repositories under the document-processing topic.

dhlab-epfl / dhSegment
Generic framework for historical document processing
document-processing historical-data python3 segmentation tensorflow
Language:Python 370
formkiq / formkiq-core
A full-featured Document Layer for your application, providing the functionality of a flexible document management system, including storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. 🌟 Star to support our work!
amazon-web-services aws cloud-storage dms document-api document-apis document-database document-management document-management-system document-processing headless serverless ocr intelligent-document-processing optical-character-recognition document-layer
Language:Java 99
awslabs / project-lakechain
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
aws aws-cdk computer-vision document-processing generative-ai hacktoberfest machine-learning natural-language-processing retrieval-augmented-generation serverless
Language:TypeScript 85
steindani / pandoc-include
An include filter for Pandoc
pandoc pandoc-filter markdown document-processing
Language:Haskell 60
parsee-ai / parsee-core
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
document-processing llm multimodal structured-data
Language:Python 35
awslabs / rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
amazon-bedrock document-processing generative-ai intelligent-document-processing multi-modal
Language:Python 34
cburschka / lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)
document-processing latex lyx mirror
Language:C++ 34
afrozas / proceedings
Semantic extraction from conference proceedings.
conferences semantic spacy document-processing
Language:Python 31
kili-technology / awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
awesome-public-datasets datasets opendata dataset public-data public-dataset public-datasets awesome-datasets awesome-data-science data open-datasets opendatasets annotation nlp entity-extraction ner corpora entity-recognition document-processing ocr
28
aws-solutions / enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Language:JavaScript 27
MBAigner / PDFSegmenter
This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted as a CSV.
pdf document-processing python page-segmentation layout-analysis cluster-analysis annotations csv table detection-model
Language:Python 19
greed2411 / tokyo
tokyo, a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.
document-processing apache-tika clojure ring mime-types extension text-parsing text-parser extract-text filetype text-extraction
Language:Clojure 18
eklem / stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
document-processing information-retrieval nlp stopwords stopwords-removal
Language:JavaScript 14
jmanhype / DSPy-Multi-Document-Agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
ai distributed-systems document-processing knowledge-management nlp query-optimization vector-search
Language:Python 11
jeanbaptisteb / doccleaner
A Python command-line utility intended for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub), through XSLT stylesheets. Can be rather easily extended with your own custom xsl stylesheets.
text-processing docx xsl-stylesheet document-processing odt xsl-sheet xsl-transformation
Language:XSLT 6
abdur75648 / urdu-text-detection
Text line detection for Urdu OCR (UTRNet)
document-processing ocr text-detection urdu-ocr urdu-text-detection utrnet contournet
Language:Python 4
RPetitpierre / Generic_Semantic_Segmentation_of_Historical_Maps
historical-maps computer-vision document-processing
Language:Jupyter Notebook 4
CentralFloridaAttorney / zmongo_retriever
Use data from MongoDB in LangChain, Llama and OpenAI
llamacpp mongodb openai langchain data-retrieval database document-processing machine-learning mongo python data-chunking
Language:Python 3
SvenEichelsheimer / filegazer
FileGazer - deep file analysing and categorisation
document-processing ocr tesseract tika file-analysing document-categorisation content-extraction
3
caltechlibrary / popstar
Phone-Oriented Processing SofTware for ARchives
archiving digitization document-processing iphone libraries scanning shortcuts-app workflow-automation
Language:Makefile 2
Oneirocom / generative-intent-detection
Generative intent detection with Magick
investment-analysis machine-learning document-processing
Language:TypeScript 2
anne27 / Information-Retrieval
An implementation of basic IR techniques from scratch.
information-retrieval document-processing document-retrieval tfidf
Language:Python 1
cemonal / Pdf2xNet
Pdf2xNet is a .NET library for seamless integration with Xpdf tools, enabling easy conversion of PDF documents to text, images, and HTML formats within your .NET applications.
conversion conversion-tool document-processing html images library pdf pdf-converter png text xpdf xpdf-utils
Language:C# 1
dayang4321 / MSc-Team-Project-CMPU9010-2023-24-Group-3
TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator
accessibility document-processing social-good
Language:Jupyter Notebook 1
eiceblue / Spire.Doc-for-C-
Spire.Doc for C++ is a professional Word C++ library specifically designed for developers to create, read, write, convert, merge, split, and compare Word documents on any C++ platforms with fast and high-quality performance.
class-library cpp document-processing docx word
Language:C++ 1
gumienny / cn
Convert scans of handwritten notes to PDF.
image-thresholding document-processing image-processing k-means foreground-background separation tsallis rust entropy clean notes cli
Language:Rust 1
Jackojc / old-wotpp
A document preprocessor that works in conjunction with tools like groff/troff & refer.
document-processing text-engine preprocessor
Language:C++ 1
johnsirmon / clearcouncil
ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.
civic-tech data-retrieval document-processing gpt langchain langchain-python local-government open-data openai retrieval-augmented-generation transparency-enhancing-technologies wget
Language:Python 1
joseferrerh / invoices-leanautomation
This set of robots automatically extracts information from invoices using the docDigitizer API and keeps track of the processed invoices in an Airtable repository.
document-data document-processing hyperautomation idp intelligent-automation invoices ocr account-payables
Language:RobotFramework 1
thoth2357 / Watermark-removal
A program that helps remove watermarks from a PDF document.
watermarking document-processing
Language:Python 1
x1ao4 / doc-merger
Merge two relatively incomplete documents into one complete document via a Python script.
data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents
Language:Python 1
ArtemZarubin / XmlDocumentProcessor
XmlDocumentProcessor: A .NET component for XML document processing. It analyzes XML content, performs keyword-based queries, and transforms data into HTML. Emphasizes design patterns like Strategy pattern, with a focus on class diagramming. Implements penalty for non-compliance.
c-sharp document-processing dotnet xml xml-processing
Language:C# 0
rina-reimer / uwb-hacks-ai-local
AI-powered chatbot designed to simplify the job search process
ai document-processing job-search resume
Language:TypeScript 0
fonckchain / pdf-text-converter
Python tool for converting PDF files to text. Simplify your document processing tasks.
automation document-processing pdf-converter python text-extraction
Language:Python
SDpDas / Document_annotate_tool
Adds annotation to each element in document and defines what it is.
document-processing python python-docx xml
Language:Python
swiss-ai-center / document-vectorizer-service
Service to vectorize documents into a FAISS vectorstore.
natural-language-processing document-processing
Language:Python
