There are 17 repositories under the text-extraction topic.
Module for automatic summarization of text documents and HTML pages.
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Heuristic-based boilerplate removal tool
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
AWS Lambda functions to extract text from various binary formats.
Entity Disambiguation as text extraction (ACL 2022)
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, to extract text and find text within PDF documents
Benchmarking PDF libraries
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
PDF Reader Library for Native Julia.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Simple app to extract text from pictures using Tesseract
Labeled examples from wiki dumps in Python
PDF text data extraction web app with OCR for scanned documents
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
Get text content from any file
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Yet another library to extract text from MS Office and PDF files
Text extraction for Wagtail document search