text-extraction

There are 19 repositories under the text-extraction topic.

adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
web-scraping text-extraction nlp html2text text-mining crawler text-cleaning text-preprocessing article-extractor readability scraping news-crawler tei html-to-markdown corpus-builder corpus-tools rss-feed news-aggregator rag llm
Language:Python 4671
miso-belica / sumy
Module for automatic summarization of text documents and HTML pages.
html-extraction html-extractor html-page lsa nlp pagerank-algorithm python reduction summarization summarizer summary sumy text-extraction textteaser
Language:Python 3570
unidoc / unipdf
Golang PDF library for creating and processing PDF files (pure go)
golang pdf pdf-library pdf-generation pdf-document-processor text-extraction pdf-manipulation pdf-compression pdf-reports signing pdf-sign pdf-reader pdf-generator
Language:Go 2913
Goldziher / kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
ocr text-extraction async document-intelligence mcp metadata-extraction pandoc pdf-extraction pdfium python rag table-extraction tesseract
Language:Python 2352
chrismattmann / tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
tika-server python tika-python tika-server-jar parser-interface parse translation-interface usc text-extraction mime buffer memex text-recognition detection recognition nlp nlp-machine-learning nlp-library covid-19 extraction
Language:Python 1615
whitelok / image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
text-recognition text-detection convolutional-neural-networks scene-texts deep-learning ocr text-extraction deep-learning-algorithms machine-learning awesome
955
miso-belica / jusText
Heuristic based boilerplate removal tool
python text-extraction html-parser html-parsing
Language:Python 764
unidoc / unidoc
This repository has moved! https://github.com/unidoc/unipdf
unidoc golang pdf pdf-library pdf-files text-extraction pdf-invoice
Language:Go 709
ICIJ / datashare
A self-hosted search engine for documents.
named-entity-recognition text-extraction extract investigative-journalism elasticsearch datashare docker web-gui
Language:Java 626
ropensci / pdftools
Text Extraction, Rendering and Converting of PDF Documents
pdf-files pdf-format pdftools poppler poppler-library r r-package rstats text-extraction
Language:C++ 533
cdown / srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
srt subtitle subtitles subtitles-parsing text-extraction python mit-license subtitle-parser subtitle-fixer tools command-line command-line-tool library
Language:Python 498
shixzie / nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
nlp parse natural-language-processing go text-extraction text golang
Language:Go 387
flairNLP / fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus crawler news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping corpus-tools datasets image-classification image-extraction
Language:Python 367
iamarunbrahma / vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Language:Python 337
pd3f / pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
pdf text-extraction pdf-to-text pipeline machine-learning ocr language-model extract-text parsr python pd3f
Language:HTML 314
py-pdf / benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Language:Python 269
bookieio / breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
python text-mining text-extraction html-extraction html-extractor html-parsing
Language:HTML 204
weareprestatech / hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
pdf python text-extraction text-search
Language:Python 186
SapienzaNLP / extend
Entity Disambiguation as text extraction (ACL 2022)
natural-language-processing nlp entity-disambiguation entity-linking entity-disambiguation-models text-extraction pytorch acl acl2022
Language:Python 181
skylander86 / lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
text-extraction aws-lambda searchable-pdfs ocr lambda-functions pdf pdf-ocr-extraction tesseract
Language:Python 176
vsymbol / CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
computer-vision deep-learning text-extraction
Language:Python 157
archivesunleashed / aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
spark hadoop webarchives analysis apache-spark scala digital-humanities pyspark dataframe big-data-analytics python3 big-data network-graphing text-extraction
Language:Scala 143
sambitdash / PDFIO.jl
PDF Reader Library for Native Julia.
julia pdf pdf-development pdf-document pdf-files pdf-library pdf-specification text-extraction
Language:Julia 131
vaites / php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
apache tika text-extraction text-recognition ocr php-library
Language:PHP 116
victorqribeiro / ocr
Simple app to extract text from pictures using Tesseract
ocr text-extraction text-recognition image-recognition tesseract
Language:HTML 106
lu4p / cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
text-extraction docx2txt rtf-to-text odt2txt cross-platform go golang textextracting cat pdftotext pdf2txt extract-text
Language:Go 98
nainiayoub / pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
pdf-to-text streamlit streamlit-webapp text-extraction python ocr ocr-python ocr-text-reader pdf
Language:Python 87
jmriebold / BoilerPy3
Python port of Boilerpipe library
boilerpipe boilerpy html-text-extraction text-extraction full-text-extraction
Language:Python 86
docwire / docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
api c cli cpp linux macos parsing shell terminal windows tensorflow sdk text-extraction machine-learning artificial-intelligence data-extraction text-extraction-from-image data-processing text-mining extract-transform-load
Language:C++ 80
gamemaker1 / office-text-extractor
Yet another library to extract text from MS Office and PDF files
text-extraction get-text parser ms-office ms-word ms-excel ms-powerpoint xlsx docx pptx pdf
Language:TypeScript 73
iamarunbrahma / pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
document-conversion document-processing information-retrieval pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction pdf-converter
Language:Python 69
JonathanRaiman / wikipedia_ner
Labeled examples from wiki dumps in Python
wikipedia python named-entity-recognition dataset text-extraction
Language:Jupyter Notebook 67
ckorzen / pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
arxiv benchmark evaluation tex pdf extraction text-extraction
Language:TeX 66
abhinaba-ghosh / any-text
Get text content from any file
text-extraction text-extractor file-reader text reader
Language:JavaScript 65
iscc / mobi
Python-based software to unpack KindleGen-generated ebooks
text-extraction mobi kindle
Language:Python 62
rajesh-bhat / spark-ai-summit-2020-text-extraction
spark-ai summit text-extraction text-detection text-recognition cnn lstm ctc-loss keras
Language:Jupyter Notebook 60