web-crawling

There are 24 repositories under the web-crawling topic.

apify / crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
web-scraping web-crawling npm headless-chrome puppeteer automation apify scraping crawling crawler headless scraper web-crawler javascript nodejs playwright typescript
Language:TypeScript 20477
apify / crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
apify automation beautifulsoup crawler crawling hacktoberfest headless headless-chrome pip playwright python scraper scraping web-crawler web-crawling web-scraping
Language:Python 7129
omkarcloud / botasaurus
The All in One Framework to Build Undefeatable Scrapers
anti-bot anti-detection cloudflare-bypass cloudflare-scrape anti-detect anti-detect-browser antidetect-browser undetected undetected-chromedriver bypass-cloudflare python-web-scraper python-web-scraping scraping-framework scraping-tool undetectable web-scraping-python bot-detection scraping-python web-crawling python-scraper
Language:Python 3189
brightdata / brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping
Language:JavaScript 1556
cxcscmu / Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
crawler crawling large-language-models llm pre-training pretraining web-crawler web-crawling
Language:Python 642
scrapehero-code / amazon-scraper
A simple web scraper to extract Product Data and Pricing from Amazon
amazon-scraper page-scraper scrape-products web-scraping web-scraping-tutorials web-crawling
Language:Python 415
crwlrsoft / crawler
Library for Rapid (Web) Crawler and Scraper Development
crawling php scraper scraping scraping-websites web-crawler web-crawling web-scraping hacktoberfest crawler web-scraper
Language:PHP 366
spyboy-productions / omnisci3nt
Omnisci3nt: See What They’ve Tried to Hide. Extract deep intelligence from any domain. From subdomains to SSL certs, archived secrets to exposed ports, Omnisci3nt gives you the full picture in seconds.
dns-enumeration ip-lookup port-scanning ssl-certificate subdomain-enumeration technology-analysis web-crawling web-reconnaissance whois dmarc-record-examination reconnaissance-tool social-media-and-email-discovery wayback-machine-access directory-enumeration osint admin-login-finder admin-panel-finder admin-panel-finder-of-any-website website-hacking vulnerability-scanner
Language:Python 306
godkingjay / selenium-twitter-scraper
This is a Twitter Scraper which uses Selenium for scraping tweets. It is capable of scraping tweets from home, user profile, hashtag, query or search, and advanced searches.
scraper selenium-scraper twitter twitter-scraper web-crawling hacktoberfest hacktoberfest-accepted collaborate selenium
Language:Jupyter Notebook 292
jrbadiabo / Bet-on-Sibyl
Machine Learning Model for Sport Predictions (Football, Basketball, Baseball, Hockey, Soccer & Tennis)
machine-learning sportsanalytics sports-stats machine-learning-algorithms predictive-analysis algorithms selenium beautifulsoup python python-2 scikit-learn web-scraping web-crawling machinelearning
Language:Jupyter Notebook 275
TurnerSoftware / InfinityCrawler
A simple but powerful web crawler library for .NET
crawler robots-txt spider web-crawler web-crawling
Language:C# 252
ayakashi-io / ayakashi
:zap: Ayakashi.io - The next generation web scraping framework
web-scraping automation headless-chrome data-mining web-crawling
Language:TypeScript 215
serpapi / clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping
Language:Ruby 187
scrapinghub / scrapy-training
Scrapy Training companion code
scrapy python training web-scraping web-crawling
Language:Python 173
brianmadden / krawler
A web crawling framework written in Kotlin
webcrawler kotlin framework crawler4j link-checker web-crawler web-crawling
Language:Kotlin 131
MaxValue / Terpene-Profile-Parser-for-Cannabis-Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
cannabis data-science web-crawler-python web-crawler web-crawling python-3 terpenes plants biological-data-analysis biological-data scrapy health cannabis-strains crawler python bioinformatics analysis database aromatherapy terpene-profile
Language:Python 127
leogregianin / bancocentralbrasil
💵 💰 :brazil: Information on the official daily rates for inflation, Selic, savings, dollar, dollar PTAX, euro, and euro PTAX from the Banco Central do Brasil website
banco-central money web-scraping web-crawling brasil brazil
Language:Python 126
my8100 / scrapyd-cluster-on-heroku
Set up a free and scalable Scrapyd cluster for distributed web crawling with just a few clicks.
scrapy scrapyd cluster heroku python scrapydweb logparser web-crawling web-scraping
Language:Python 122
maxmindlin / scout-lang
A web crawling programming language
dsl programming-language web-crawling web-scraping scraper scraping scraping-websites
Language:Rust 116
SoheilKhodayari / JAW
JAW: A Graph-based Security Analysis Framework for Client-side JavaScript
client-side csrf javascript neo4j property-graph static-analysis vulnerability-detection web-crawling
Language:JavaScript 112
ScrapingAnt / amazon_scraper
Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt
scraping scraping-api scraping-websites scraping-web scraping-python scraping-data price-scraping price-scraper web-crawler web-crawling web-crawlers amazon amazon-scraper amazon-scraping-library scraper data-mining node-js js scrape-products
Language:JavaScript 87
jonasjacek / robots.txt
Simple robots.txt template. Keeps unwanted robots out (disallow). Whitelists (allows) legitimate user agents. Useful for all websites.
googlebot bingbot robots-txt robots-exclusion-standard blocking-bots user-agent web-robots seo search-engine whitelist crawlers web-crawling crawling search-engine-optimization baiduspider twitterbot
84
alyakhtar / Katastrophe
Command Line Tool to download torrents
screenshot deluge bittorrent torrent kickass-torrents command-line python web-crawling
Language:Python 83
spyboy-productions / PhantomCrawler
Boost website hits by generating requests from multiple proxy IPs.
ddos-attack-tools proxy proxy-configuration proxy-rotation web-crawling web-scrapping website-analytics website-hits
Language:Python 75
sushantPatrikar / Amazon-Flipkart-Price-Comparison-Engine
Compares the price of a product entered by the user across the e-commerce sites Amazon and Flipkart :moneybag: :bar_chart:
flipkart ecommerce-sites-amazon corresponding-prices amazon web-crawling web-crawler-python python python3 python-3 tkinter
Language:Python 68
GoTrained / Scrapy-Craigslist
Web Scraping Craigslist's Engineering Jobs in NY with Scrapy
python scrapy web-scraping web-crawling scrapy-crawler scrapy-spider scrapy-tutorial web-scraper craigslist
Language:Python 66
dongweiming / daenerys
Scraping and Web Crawling Framework For Zhihu Live
zhihu zhihulive web-crawling scraping
Language:Python 63
jgujerry / python-frameworks
Another curated list of Python frameworks
ai ai-agent ai-agent-framework api artificial-intelligence cms data-workflow deep-learning devops distributed-computing enterprise-integrations frameworks machine-learning messaging parallel-computing pipeline python task-queue web-crawling webapp
Language:HTML 59
MohamedHmini / tweetsOLAPing
Implementing an end-to-end tweet ETL/analysis pipeline.
datawarehousing datawarehouse etl-pipeline tweets tweets-classification tweets-scraper twitter-api google-api-client api-client web-crawling analysis cube-analysis multi-dimensional-analysis ssis ssas-multidimensional multithreading powerbi-report
Language:Python 58
ScaleUnlimited / flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
web-crawler web-crawling crawler crawling spider flink
Language:Java 52
mike-gee / webtranspose
Web scraping API for building AI applications.
chatbots crawling crawling-python python scraping scraping-python web-crawling web-scraping web-scraping-python
Language:Python 40
ScrapingAnt / zoominfo_scraper
Zoominfo scraper using rotating proxies and headless Chrome from ScrapingAnt
scraping scraping-api scraping-websites scraping-data scraping-tool python web-harvesting web-crawler web-crawling web-crawler-python scraper zoominfo-client datamining leadgen leadgeneration
Language:Python 34
Cheng-Lin-Li / KnowledgeGraph
This repository is for web crawling, information extraction, and knowledge graph construction.
cdr conditional conditional-random-fields crfsuite facebook-crawler facebook-graph-api information-extraction jsonlines knowledge-graph python python3 web-crawling
Language:Julia 33
chrislicodes / udacity-data-analyst-nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
udacity data data-analysis data-visualization dataset data-cleaning data-analytics data-analyst-nanodegree statistics data-wrangling data-gathering web-crawling api text-mining pandas numpy matplotlib seaborn tweepy
Language:Jupyter Notebook 33
HRN-Projects / amazon-captcha-solver
A TensorFlow (Deep Learning - CNN) based solution for tackling captcha when collecting data from Amazon.
captcha amazon-captcha captcha-solving captcha-solver python python3 tensorflow keras open-cv hrn-projects web-scraping web-scraping-solution web-crawling amazon-captcha-solver amazon-captcha-solving api flask-api captcha-solver-api captcha-images
Language:Python 31
zytedata / spidyquotes
Example site for web scraping tutorials
scraping crawling tutorials web-scraping-tutorials web-scraping web-crawling playground
Language:Julia 31
