There are 19 repositories under the text-extraction topic.
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Module for automatic summarization of text documents and HTML pages.
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Heuristic-based boilerplate removal tool
Parse PDFs into markdown using Vision LLMs
Benchmarking PDF libraries
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents
Entity Disambiguation as text extraction (ACL 2022)
AWS Lambda functions to extract text from various binary formats.
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
PDF Reader Library for Native Julia.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Simple app to extract text from pictures using Tesseract
PDF text data extraction web app with OCR for scanned documents
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Yet another library to extract text from MS Office and PDF files
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Labeled examples from wiki dumps in Python
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
Get text content from any file