This is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION. The goal is to build a crawler application that crawls millions of webpages.
(Image source: Medium article)
- Crawl millions of webpages
- Remove non-HTML pages (see the filtering sketch after this list)
- Performance optimization
- How many pages can be crawled per hour
- Total time to crawl millions of pages
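As a rough illustration of the non-HTML filtering step, a Scrapy spider can check the Content-Type header before parsing a response. This is a minimal sketch, not the project's actual spider; the spider name, start URL, and callback are hypothetical:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider, used only to illustrate Content-Type filtering
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # skip anything that is not an HTML page (PDFs, images, binaries, ...)
        content_type = response.headers.get("Content-Type", b"").decode().lower()
        if "text/html" not in content_type:
            return
        # parse the HTML page and keep following links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```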
![distributed_architecture](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/scrapy-redis.png)
![spider](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/Scrapy_architecture.png)
![tweh_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/%E8%87%BA%E7%81%A3%20E%20%E9%99%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
![w8h_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/%E5%95%8F%208%20%E5%81%A5%E5%BA%B7%E5%92%A8%E8%A9%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
![wiki_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/Wiki%20%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
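The distributed run shown in the first figure uses scrapy-redis, so every spider process shares one Redis-backed request queue and duplicate filter. The snippet below is a minimal sketch of that wiring under assumed names; the Redis URL, key, and spider name are illustrative and not taken from this repository:

```python
# settings.py -- share the scheduler and dedup filter through Redis (scrapy-redis)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"        # assumed local Redis instance

# spiders/example_spider.py
from scrapy_redis.spiders import RedisSpider


class ExampleRedisSpider(RedisSpider):
    # hypothetical spider; every worker pops its start URLs from this Redis key
    name = "example"
    redis_key = "example:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Seeding the queue once (e.g. `redis-cli lpush example:start_urls https://example.com`) lets any number of machines running the same spider pull work from the shared queue.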
- Skip robots.txt

```python
# edit settings.py
ROBOTSTXT_OBEY = False
```
- Use a random User-Agent

```bash
pip install fake-useragent
```

```python
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a fresh random User-Agent for every outgoing request
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
```

```python
# edit settings.py
DOWNLOADER_MIDDLEWARES = {
    "millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}
```
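How many pages the crawler can fetch per hour also depends on Scrapy's concurrency settings. The values below are illustrative assumptions for tuning, not the settings used in this project:

```python
# edit settings.py -- example throughput tuning (illustrative values)
CONCURRENT_REQUESTS = 64              # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32
DOWNLOAD_TIMEOUT = 15                 # give up on slow responses sooner
RETRY_ENABLED = False                 # trade retries for raw throughput
```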
Single spider (2023/03/21)

| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 152,958 | 1.3 | 117,409 |
| w8h | 4,759 | 0.1 | 32,203 |
| wiki* | 13,000,320 | 43 | 30,240 |
Distributed spider (4 spiders, 2023/03/24)

| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 153,288 | 0.52 | - |
| w8h | 4,921 | 0.16 | - |
| wiki* | 4,731,249 | 43.2 | 109,492 |
- Create a `.env` file (a hypothetical example is sketched after this list)
- Install Redis

```bash
sudo apt-get install redis-server
```

- Install MongoDB

```bash
sudo apt-get install mongodb
```

- Run Redis

```bash
sudo service redis-server start
```

- Run MongoDB

```bash
sudo service mongod start
```

- Install the dependencies and run a spider

```bash
cd millions-crawler
pip install -r requirements.txt
scrapy crawl [$spider_name]  # $spider_name = tweh, w8h, wiki
```
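What goes into the `.env` file depends on how the project reads its configuration; a plausible sketch with purely hypothetical variable names for the Redis and MongoDB connections looks like this:

```bash
# .env (hypothetical variable names)
REDIS_URL=redis://127.0.0.1:6379
MONGO_URI=mongodb://127.0.0.1:27017
MONGO_DATABASE=millions_crawler
```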
- GitHub | fake-useragent
- GitHub | scrapy
- 【Day 20】Anti-anti-crawling (反反爬蟲)
- Scrapy documentation
- Fixing Redis MISCONF: Redis is configured to save RDB snapshots, but is currently not able to persist o...
- Ubuntu Linux Redis installation and configuration tutorial and examples (Ubuntu Linux 安裝、設定 Redis 資料庫教學與範例)
- How to connect to a remote Linux + MongoDB server? (如何連線到遠端的 Linux + MongoDB 伺服器?)
- Scrapy-redis: the final chapter (Scrapy-redis 之終結篇)