data-extraction

There are 24 repositories under the data-extraction topic.

firecrawl / firecrawl
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
ai ai-agents ai-crawler ai-scraping ai-search crawler data-extraction html-to-markdown llm markdown scraper scraping web-crawler web-data web-data-extraction web-scraper web-scraping web-search webscraping
Language:TypeScript 66898
ScrapeGraphAI / Scrapegraph-ai
Python scraper based on AI
scraping scraping-python automated-scraper llm web-crawler web-scraping ai-scraping crawler markdown rag web-crawlers ai-crawler ai-search large-language-model web-data-extraction web-search web-scraper data-extraction web-data webscraping
Language:Python 21729
getmaxun / maxun
⚡ Easiest no code web data extraction platform • Instantly turn any website into API or spreadsheet ⚡
automation no-code scraper web-automation web-scraper web-scraping api browser browser-automation playwright self-hosted robotic-process-automation rpa no-code-web-scraper agents data-extraction webscraping hacktoberfest hacktoberfest-accepted nocode
Language:TypeScript 13826
D4Vinci / Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction mcp mcp-server playwright python scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Language:Python 8110
vi3k6i5 / flashtext
Extract Keywords from sentence or Replace keywords in sentences.
data-extraction keyword-extraction nlp search-in-text word2vec
Language:Python 5681
shcherbak-ai / contextgem
ContextGem: Effortless LLM extraction from documents
ai contract-analysis data-extraction document-intelligence generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data docx docx2md docx2txt
Language:Python 1705
JonathanLink / PDFLayoutTextStripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
layout text java pdf extract data-extraction pdfbox
Language:Java 1595
brightdata / brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
llm mcp modelcontextprotocol scraping ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction mcp-server scraping-tools structured-data web-crawling web-data web-scraping
Language:JavaScript 1557
hi-primus / optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Language:Python 1524
raznem / parsera
Lightweight library for scraping web-sites with LLMs
ai ai-scraping data-extraction llm opensource playwright python scraping webscraping
Language:Python 1235
saifyxpro / HeadlessX
A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.
automation browser-automation browserless chromedriver headless automation-api automation-platform browser-testing chrome-headless data-extraction headless-chrome headless-service playwright playwright-automation puppeteer scraping-service web-automation web-scraping container-automation
Language:JavaScript 1059
thinh-vu / vnstock
A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone
stock-market data-extraction stock-screener quantitative-finance quantitative-analysis quantitative-trading
Language:Python 1028
polyrabbit / hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
hacker-news python data-extraction hacker-news-reader rss extract-summaries hacker-news-digest spider crawler machine-learning news-aggregator chatgpt chatgpt-api openai openai-api
Language:Python 733
adrienjoly / npm-pdfreader
🚜 Parse text and tables from PDF files.
data-extraction pdf-converter parsing javascript tabular-data pdf-reader parse-tables rule-based-parsing
Language:HTML 692
ScrapeGraphAI / scrapecraft
🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.
ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping
Language:Python 551
eclaire-labs / eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving
Language:TypeScript 487
a-maliarov / amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
captcha captcha-solver amazon python3 pillow amazon-captcha amazon-scraper training-data amazoncaptcha data-extraction
Language:Python 486
py-pdf / benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Language:Python 315
jpjacobpadilla / Stealth-Requests
Undetected web-scraping & seamless HTML parsing in Python!
python http-client data html-parsing http-requests python-scraping python-web-scraper requests web-crawler web-scraping webscraping xpath data-extraction
Language:Python 311
serpapi / clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping
Language:Ruby 187
molybdenum-99 / infoboxer
Wikipedia information extraction library
wikipedia mediawiki data-extraction
Language:Ruby 176
sypht-team / sypht-python-client
A python client for the Sypht API
data-extraction information-extraction api-client python python3 python3-library sypht sypht-python-client sypht-api invoice extract extract-fields extract-data-from-pdf receipt-scanner pdf-parser receipt-capture invoice-parser receipt-reader receipt-scanning document-capture
Language:Python 162
dilawar / PlotDigitizer
A Python utility to digitize plots.
digitization data-extraction python3 image-processing
Language:Python 155
johnbumgarner / newspaper3_usage_overview
This repository provides usage examples for the Python module Newspaper3k.
beautifulsoup data-extraction news newspaper newspaper3k nlp-parsing python python-requests scraping-websites
Language:Python 148
CambioML / any-parser
Accurate, private and configurable document retrieval LLM
data-extraction document llm pdf privacy structured-data unstructured-data
Language:Python 130
nfx / go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
go data-extraction go-generics html
Language:Go 122
173TECH / sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
analytics etl data-modeling data-engineering data-science python sql automation elt data-extraction
Language:Python 120
villagecomputing / superpipe
Superpipe - optimized LLM pipelines for structured data
classification data-extraction data-labeling llm llm-evaluation llm-optimization structured-data
Language:Python 108
hermit-crab / ScrapeMate
Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages.
xpath-selector scraping css-selector extension firefox-extension chrome-extension data-mining data-extraction devtools
Language:JavaScript 105
tech-engine / goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
golang goscraper scrapy webscraper spider data-extraction web-crawler webscrapping go-scrapy
Language:Go 101
Zubdata / Google-Maps-Scraper
Google maps scraper with gui
data-extraction googlemaps gui-application leadsheets python web-scraping-software scraper automation bot google-maps-scraper google-maps-scraper-python googlemapsscraper web-automation web-bot webautomation webbot
Language:Python 100
reincubate / ricloud
Python client for Reincubate's ricloud API. Yes, it works with iOS 14 & iPhone 12 backups!
icloud-api icloud-access icloud data-extraction cloudkit python-client data-recovery
Language:Python 96
sshniro / line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
google-vision proposed-algorithm data-extraction invoice segmentation
Language:Kotlin 96
chenkovsky / cyac
High performance Trie and Ahocorasick automata (AC automata) Keyword Match & Replace Tool for python. Correct case insensitive implementation!
automata cython data-extraction double-array-trie keyword-extraction nlp search search-in-text trie
Language:Cython 95
docwire / docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
api c cli cpp linux macos parsing shell terminal windows tensorflow sdk text-extraction machine-learning artificial-intelligence data-extraction text-extraction-from-image data-processing text-mining extract-transform-load
Language:C++ 94
dav009 / flash
Golang Keyword extraction/replacement Datastructure using Tries instead of regexes
text golang search trie data-extraction go text-search
Language:Go 89