There are 14 repositories under the document-analysis topic.
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
A curated list of resources for the Document Understanding (DU) topic
Open-source platform for extracting structured data from documents using AI.
This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
AssemblyLine 4: File triage and malware analysis
A package for parsing PDFs and analyzing their content using LLMs.
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
RObust document image BINarization
Local adaptive image binarization
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
Document Visual Question Answering
YOLO models trained on DocLayNet: power your document intelligence with layout analysis
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Effortlessly extract information from unstructured data with this library, which uses advanced AI techniques. Compose AI components into customizable pipelines over diverse sources for your projects.
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
Trained Detectron2 object detection models for document layout analysis based on the PubLayNet dataset
Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
Deep learning models that take a document image file as input and locate the positions of paragraphs, lines, images, etc., along with their labels and confidence scores.
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
An open-source tool for visualisation of outputs of deep-learning models for document analysis tasks such as fully automatic, bounding box and OCR.
[Late Submission] Solution for Kuzushiji recognition (Kaggle competition)
A fast and accurate command line tool for extracting text from PDF files.
Adobe CEP extension for InDesign to use the Bookalope cloud services. You can download the extension from Adobe Exchange.