There are 118 repositories under the web-crawler topic.
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
Python scraper based on AI
Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
A next-generation crawler platform that defines crawling workflows graphically, so crawlers can be built without writing any code.
Crawlee: a web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.
A collection of awesome web crawlers and spiders in different languages
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.
Cross-platform C# web crawler framework built for speed and flexibility.
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.
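At its core, a link checker like this one extracts every href from a page and then probes each URL. A minimal offline sketch of the extraction step, using only the Python standard library (the `LinkExtractor` class and example URLs here are illustrative, not taken from this project):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href=...> in an HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="/about">About</a> <a href="guide.html">Guide</a>')
print(extractor.links)
# A real checker would then issue a HEAD/GET request per collected link
# and report non-2xx responses as broken.
```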
Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
A scalable, mature and versatile web crawler based on Apache Storm
Run a high-fidelity browser-based web archiving crawler in a single Docker container
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Scalable Python web scraping scripts for 40+ popular domains
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
➖ Stripped-down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
Open-source Korean chatbot framework
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Supercrawler is a web crawler that automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits, and concurrency limits.
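Obeying robots.txt is the politeness rule shared by crawlers like this one: check each URL against the site's rules before fetching, and honor any crawl delay between requests. A small sketch with Python's standard `urllib.robotparser` (the rules, bot name, and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body as a crawler might fetch it (illustrative rules).
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching...
print(parser.can_fetch("SuperBot", "https://example.com/index.html"))  # True
print(parser.can_fetch("SuperBot", "https://example.com/private/a"))   # False
# ...and waits this many seconds between requests to the same host.
print(parser.crawl_delay("SuperBot"))  # 2
```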
Model Context Protocol (MCP) Server for Graphlit Platform
News crawling with StormCrawler - stores content as WARC
Lightweight scraper for Google News