pdf-parser

There are 8 repositories under the pdf-parser topic.

PaddlePaddle / PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
ocr chineseocr pdf2markdown pp-ocr pp-structure document-parsing document-translation kie ai4science pdf-extractor-rag pdf-parser rag paddleocr-vl
Language:Python 62969
opendatalab / MinerU
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Language:Python 48281
py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python
Language:Python 9569
bytedance / Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
document-analysis layout-analysis ocr parser pdf pdf-converter pdf-parser python vlm-ocr
Language:Python 7735
yobix-ai / extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
extraction pdf tika unstructured unstructured-data data-pipelines docx etl etl-pipelines llm machine-learning natural-language-processing nlp ocr pdf-parser rag rust
Language:Rust 1618
dromara / yft-design
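yft-design is a powerful, visually stunning online design tool built with Vue3, fabric.js, and Element Plus. An open-source take on Gaoding Design (稿定设计) based on fabric.js, it offers poster design and image editing, and suits many scenarios such as poster generation, e-commerce product images, long-form article images, and cover editing for videos and official accounts.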
canvas-editor clipper element-plus fabric-editor fabricjs image-crop online-design online-editor pdf-editor pdf-parser poster-design psd-editor psd-parse text2path vue3-fabric
Language:TypeScript 1457
NanoNets / docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
image-to-markdown llm markdown ocr pdf-to-markdown structured-data ai document-parser document-parsing pdf-parser pdf-to-json structured-data-capture tables
Language:Python 1002
adithya-s-k / marker-api
Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.
fastapi marker pdf-converter pdf-files pdf-parser pdf-parsing api rest-api
Language:Python 920
drmingler / docling-api
Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.
api fastapi markdown-parser pdf-chatbot pdf-conversion pdf-converter pdf-parser pdf-parsing pdf-to-markdown
Language:Python 720
ispras / dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt
Language:Python 619
iamarunbrahma / vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Language:Python 441
titipata / scipdf_parser
Python PDF parser for scientific publications: content and figures
grobid pdf python-parser scipdf-parser parser pdf-parser
Language:Python 437
michelcrypt4d4mus / pdfalyzer
Analyze PDFs with colors (and YARA)
malicious-pdf-files malware-analysis pdf pdf-documents pdf-format pdf-parser yara yara-rules yara-scanner
Language:YARA 335
sylphxltd / pdf-reader-mcp
Production-ready MCP server for PDF processing with 5-10x faster parallel processing, Y-coordinate content ordering, and 94%+ test coverage
ai-agent llm-tool mcp model-content-protocol nodejs pdf pdf-parse pdf-parser pdf-reader stdio typescript ai-tools document-processing performance
Language:TypeScript 301
lazyFrogLOL / llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking
Language:Python 270
codereverser / casparser
Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech
cams karvy kfintech cas mutual-funds mutual-fund-portfolio pdf-parser consolidated-account-statements parser capital-gains capital-gains-calculator capital-gain python3 112a
Language:Python 171
sypht-team / sypht-python-client
A python client for the Sypht API
data-extraction information-extraction api-client python python3 python3-library sypht sypht-python-client sypht-api invoice extract extract-fields extract-data-from-pdf receipt-scanner pdf-parser receipt-capture invoice-parser receipt-reader receipt-scanning document-capture
Language:Python 162
sypht-team / sypht-java-client
A Java client for the Sypht API
api-client data-extraction java java8 information-retrieval information-retrieval-engine extract-data-from-pdf extract-fields sypht-java-client sypht sypht-api invoice extract receipt-scanner pdf-parser receipt-capture invoice-parser receipt-reader receipt-scanning document-capture
Language:Java 88
datalogics / adobe-pdf-library-samples
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
ocr ocr-pdf pdf pdf-compression pdf-conversion pdf-converter pdf-document pdf-generation pdf-lib pdf-manipulation pdf-merger pdf-parser pdf-render pdf-split pdf-to-image pdf-to-text pdf-tools pdfa pdf-to-office
83
oidlabs-com / Lexoid
Multimodal document parser for high quality data understanding and extraction
llms pdf-document parser-library pdf-parser multimodal genai large-language-models ocr ocr-python html-to-markdown html-to-pdf
Language:Python 80
BitMiracle / Docotic.Pdf.Samples
C# and VB.NET samples for Docotic.Pdf library
pdf-library docotic-pdf pdf-to-text pdf-to-image pdf-compression print-pdf sign-pdf pdf-signature pdf-generation pdf-merge pdf-forms extract-images extract-text net-core pdf-annotation images-to-pdf pdf-flattener pdf-manipulation pdf-parser html-to-pdf
Language:Visual Basic .NET 78
drmingler / smart-llm-loader
smart-llm-loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, SmartLLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications.
chatbot chunking claude gemini langchain llama-index markdown openai pdf-converter pdf-parser pdf-to-markdown rag
Language:Python 71
tuffstuff9 / nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
content-extraction filepond nextjs nextjs-pdf nextjs-pdf-parse nextjs-pdf-parser nextjs-pdf-parsing pdf-parse pdf-parser pdf-parsing pdf-upload pdf2json react-pdf react-pdf-parser
Language:TypeScript 65
davendw49 / sciparser
PDF parsing toolkit for preparing academic text corpus
large-language-models pdf-parser
Language:Python 61
LianjiaTech / bella-domify
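Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support RAG, knowledge bases, full-text search, and other intelligent applications.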
document-parser parser pdf-parser
Language:Python 56
genbs / poste-italiane-parser
A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.
bancoposta fintech pdf-parser personal-finance poste-italiane postepay
Language:Python 50
k16shikano / hpdft
tools to poke pdf using haskell
pdf pdf-parser
Language:Haskell 45
ashutoshvarma / pyxpdf
Fast and memory-efficient Python PDF Parser based on xpdf sources
pdf python cython pdf-converter pdftotext pdf-parser pdfparser pdftohtml pdftopng xpdf xpdf-reader
Language:Cython 43
RapidAI / RapidDoc
A high-performance, open-source PDF data extraction tool. A one-stop tool that converts complex PDF documents into Markdown and JSON, using ONNX models.
ocr onnx parser pdf pdf-converter pdf-parser python
Language:Python 43
SimpleApp / PDFParser
Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser
pdf-parser truetype swift
Language:Swift 43
sypht-team / sypht-golang-client
A Golang client for the Sypht API
data-extraction api-client golang go golang-library golang-package sypht sypht-golang-client sypht-api invoice extract extract-fields extract-data-from-pdf receipt-scanner pdf-parser receipt-capture invoice-parser receipt-reader receipt-scanning document-capture
Language:Go 33
shine-jayakumar / Extract-Data-From-PDF-In-Python
Batch-convert pdf to text, extract data from pdf in python
pdf-converter pdf-to-text pdf-tools pdf-parser python-pdf pypdf2 pypdf data-extraction regular-expressions pdf-reader batch-converter batch-conversion data-cleaning pdf-to-excel pdf-data-extraction pandas indirectobject xpdf pdftotext python-automation
Language:Python 32
lesterchan / linkedin-pdf-resume-parser
Parse a LinkedIn PDF resume and extract name, email, education, and work experience.
linkedin pdf resume parser resume-parser pdf-parser techinasia
Language:PHP 28
ridi / content-parser
Content data parser for Ridibooks services
epub-parser comic-parser pdf-parser
Language:JavaScript 25
dunso / pdf-parser
Convert PDF content and layout information with pdf.js
pdf convertor parser pdf-parser pdfjs pdf2json
Language:JavaScript 23
lucasjvds / Scanipy
Scanipy stands for "scan it with Python"—it's your smart Python library for scanning and parsing complex PDF files like books, reports, articles, and academic papers. Utilizing cutting-edge Deep Learning algorithms, Scanipy transforms your PDFs into a treasure trove of extractable information: tables, images, equations, and text.
deep-learning ocr ocr-recognition pdf pdf-parser
Language:Python 19
