There are 118 repositories under the web-crawler topic.
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
Python scraper based on AI
Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
A next-generation crawler platform that defines crawling workflows graphically, so crawlers can be built without writing any code.
Crawlee: a web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
A collection of awesome web crawlers and spiders in different languages
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.
Cross-platform C# web crawler framework built for speed and flexibility.
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.
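At its core, a link checker like this one extracts every href from a page and then probes each URL. A minimal offline sketch of the extraction step, using only the Python standard library (the `LinkExtractor` class and example URLs here are illustrative, not taken from this project):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href=...> in an HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="/about">About</a> <a href="guide.html">Guide</a>')
print(extractor.links)
# A real checker would then issue a HEAD/GET request per collected link
# and report non-2xx responses as broken.
```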
Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
A scalable, mature and versatile web crawler based on Apache Storm
Run a high-fidelity browser-based web archiving crawler in a single Docker container
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Scalable Python web scraping scripts for 40+ popular domains
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
➖ Stripped-down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
Open-source Korean chatbot framework
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Supercrawler is a web crawler that automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits, and concurrency limits.
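Obeying robots.txt is the politeness rule shared by crawlers like this one: check each URL against the site's rules before fetching, and honor any crawl delay between requests. A small sketch with Python's standard `urllib.robotparser` (the rules, bot name, and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body as a crawler might fetch it (illustrative rules).
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching...
print(parser.can_fetch("SuperBot", "https://example.com/index.html"))  # True
print(parser.can_fetch("SuperBot", "https://example.com/private/a"))   # False
# ...and waits this many seconds between requests to the same host.
print(parser.crawl_delay("SuperBot"))  # 2
```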
Model Context Protocol (MCP) Server for Graphlit Platform
News crawling with StormCrawler - stores content as WARC
Lightweight scraper for Google News