document-analysis

There are 17 repositories under the document-analysis topic.

opendatalab / MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Language:Python 48578
bytedance / Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
document-analysis layout-analysis ocr parser pdf pdf-converter pdf-parser python vlm-ocr
Language:Python 7749
ucbepic / docetl
A system for agentic LLM-powered data processing and ETL
agents data data-pipelines document-analysis document-processing elt etl llm python semantic-data unstructured-data unstructured-data-analysis workflow
Language:Python 3049
UglyToad / PdfPig
Read and extract text and other content from PDFs in C# (port of PDFBox)
pdfbox pdf pdf-document csharp netstandard pdf-extractor pdf-document-processor pdf-files alto-xml hocr layout-analysis document-analysis page-xml pdf-generation
Language:C# 2268
NanoNets / docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
document document-analysis document-data-extraction document-information-extraction extraction llm-ocr llms machine-learning nlp ocr ocr-benchmark ocr-onpremise onprem onprem-ocr onprem-vision onpremise rag table-extraction unstructured-data vlms
Language:Python 1800
AlibabaResearch / AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence documentai multimodal multimodal-deep-learning ocr computer-vision vision-language-transformer end-to-end-ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language document document-analysis document-recognition document-understanding document-intelligence vision-language-model
Language:C++ 1796
tstanislawek / awesome-document-understanding
A curated list of resources for the Document Understanding (DU) topic
awesome-list machine-learning information-extraction key-information-extraction document-understanding robotic-process-automation document-analysis document-layout-analysis ocr natural-language-processing deep-learning nlp awesome pdf rpa pdf-documents document-intelligence unstructured-data intelligent-processing document-ai
1474
DocumindHQ / documind
Open-source platform for extracting structured data from documents using AI.
ai developer-tools document-analysis document-extraction extract-data llms ocr open-source parser pdf pdf-converter pdf-extractor pdf-extractor-llm
Language:JavaScript 1451
Yuliang-Liu / Curve-Text-Detector
This repository provides train & test code, dataset, det. & rec. annotation, evaluation script, annotation tool, and ranking.
deep-learning document-analysis object-detection scene-text
Language:Jupyter Notebook 652
ispras / dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc docx odt documents excel pdf txt ocr scanned-documents document-content-extraction table-of-contents table-recognition html docx-parser html-parser pdf-parser document-analysis logical-structure-extraction
Language:Python 620
wenwenyu / PICK-pytorch
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
document-analysis document-understanding graph-convolutional-network graph-learning graph-neural-networks key-information-extraction
Language:Python 570
CybercentreCanada / assemblyline
AssemblyLine 4: File triage and malware analysis
malware-analysis malware-research file-analysis malware-detection malware-analyzer cybersecurity incident-response infosec malware assemblyline automation-framework cert cyber-security document-analysis framework python3 security-automation security-automation-framework security-tools
Language:Python 384
jpWang / LiLT
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
document-ai document-analysis document-understanding information-extraction multilingual-models multimodal-pre-trained-model nlp
Language:Python 357
pandora-analysis / pandora
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
infosec document-analysis malware-detection document-analyzing
Language:Python 272
lazyFrogLOL / llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
llm nlp ocr pdfparser rag chunking document-analysis pdf-parser text-chunking
Language:Python 270
masyagin1998 / robin
RObust document image BINarization
python opencv keras neural-networks deep-learning document-binarization ocr computer-vision u-net document-analysis
Language:Python 184
ppaanngggg / yolo-doclaynet
YOLO models trained on DocLayNet - power your Document Intelligence with Layout Analysis
document-analysis layout-analysis ultralytics yolo yolov8 doclaynet
Language:Python 140
anisha2102 / docvqa
Document Visual Question Answering
visual-question-answering computer-vision deep-learning document-analysis
Language:Python 127
chriswolfvision / local_adaptive_binarization
Local adaptive image binarization
computer-vision document-analysis document-binarization
Language:C++ 126
mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
artificial-intelligence chat-application document-analysis generative-ai langchain large-language-models natural-language-processing openai-chatgpt question-answering retrieval-augmented-generation streamlit gpt-3
Language:Python 126
aws-samples / amazon-textract-transformer-pipeline
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
amazon-textract huggingface-transformers document-analysis ocr
Language:Python 101
monniert / docExtractor
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
document-analysis segmentation historical-data pytorch
Language:Python 88
Xyntopia / pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python
Language:Python 86
abdur75648 / UTRNet-High-Resolution-Urdu-Text-Recognition
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
document-analysis high-resolution hrnet icdar icdar2023 ocr scene-text-recognition text-detection text-recognition unet urdu urdu-nlp urdu-ocr utrnet computer-vision deep-learning machine-learning pytorch urdu-synth
Language:Python 61
ZeningLin / ViBERTgrid-PyTorch
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
document-ai document-analysis information-extraction key-information-extraction visual-information-extraction
Language:Python 53
JPLeoRX / detectron2-publaynet
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
object-detection instance-segmentation computer-vision detectron2 publaynet python python3 machine-learning neural-network document-classification document-layout document-layout-analysis layout-analysis document-analysis neural-networks artificial-intelligence deep-learning faster-rcnn pytorch
Language:Python 50
BjornMelin / docmind-ai-llm
DocMind AI is a powerful, open-source Streamlit application leveraging LlamaIndex, LangGraph, and local Large Language Models (LLMs) via Ollama, LMStudio, llama.cpp, or vLLM for advanced document analysis. Analyze, summarize, and extract insights from a wide array of file formats—securely and privately, all offline.
ai-agents document-analysis hybrid-search langchain langgraph-supervisor-py llama-cpp llamacpp lmstudio local-llm multimodal-embeddings ollama private-ai-agents python qdrant sentence-transformers streamlit torch transformers vllm
Language:Python 45
aws-solutions / enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Language:JavaScript 40
ankanbhunia / AdverseBiNet
Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)
binarization document-analysis generative-adversarial-network adversarial-learning deep-learning
Language:Python 38
lin-tan / DocTer
For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey
deep-learning document-analysis fuzzing natural-language-processing software-reliability software-text-analytics testing
37
AILab-UniFI / GNN-TableExtraction
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
document-analysis graph-neural-networks
Language:Python 36
retab-dev / retab
The developer starter pack for document processing
api document-analysis llm structured-generation openai
Language:Jupyter Notebook 34
microsoft / synthetic-rag-index
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
azure large-language-model llm rag retrieval-augmented-generation serverless document-analysis few-shot-learning
Language:Python 32
AdemBoukhris457 / Doctra
📄🔍 Parse, extract, and analyze documents with ease 📄🔍
ai documentparsing gemini ocr openai python vlm document-analysis extract-data pdf-parser image-restoration pdf2markdown
Language:Jupyter Notebook 29
muhd-umer / pyramidtabnet
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
computer-vision deep-learning document-analysis implementation table-detection table-structure-recognition pytorch
Language:Python 28
CaseDrive / publaynet-models
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
artificial-intelligence computer-vision deep-learning detectron2 document-analysis document-classification document-layout document-layout-analysis faster-rcnn instance-segmentation layout-analysis machine-learning neural-network neural-networks object-detection publaynet python python3 pytorch
Language:Python 27
