This is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION. The goal is to build a crawler application that crawls millions of webpages.
(Image source: Medium article)
- Crawl millions of webpages
- Remove non-HTML pages (see the filtering sketch after this list)
- Performance optimization
- How many pages can be crawled per hour
- Total time to crawl millions of pages
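As a rough illustration of the non-HTML filtering step, a Scrapy spider can check the Content-Type header before parsing a response. This is a minimal sketch, not the project's actual spider; the spider name, start URL, and callback are hypothetical:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider, used only to illustrate Content-Type filtering
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # skip anything that is not an HTML page (PDFs, images, binaries, ...)
        content_type = response.headers.get("Content-Type", b"").decode().lower()
        if "text/html" not in content_type:
            return
        # parse the HTML page and keep following links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```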
![distributed_architecture](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/scrapy-redis.png)
![spider](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/Scrapy_architecture.png)
![tweh_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/%E8%87%BA%E7%81%A3%20E%20%E9%99%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
![w8h_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/%E5%95%8F%208%20%E5%81%A5%E5%BA%B7%E5%92%A8%E8%A9%A2%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
![wiki_parse_flowchart](https://raw.githubusercontent.com/NeroHin/millions-crawler/main/./image/Wiki%20%E7%88%AC%E8%9F%B2%E7%B5%90%E6%A7%8B.png)
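The distributed run shown in the first figure uses scrapy-redis, so every spider process shares one Redis-backed request queue and duplicate filter. The snippet below is a minimal sketch of that wiring under assumed names; the Redis URL, key, and spider name are illustrative and not taken from this repository:

```python
# settings.py -- share the scheduler and dedup filter through Redis (scrapy-redis)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"        # assumed local Redis instance

# spiders/example_spider.py
from scrapy_redis.spiders import RedisSpider


class ExampleRedisSpider(RedisSpider):
    # hypothetical spider; every worker pops its start URLs from this Redis key
    name = "example"
    redis_key = "example:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Seeding the queue once (e.g. `redis-cli lpush example:start_urls https://example.com`) lets any number of machines running the same spider pull work from the shared queue.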
- Skip robots.txt

```python
# edit settings.py
ROBOTSTXT_OBEY = False
```
- Use a random User-Agent

```bash
pip install fake-useragent
```

```python
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a fresh random User-Agent for every outgoing request
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random
```

```python
# edit settings.py
DOWNLOADER_MIDDLEWARES = {
    "millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}
```
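How many pages the crawler can fetch per hour also depends on Scrapy's concurrency settings. The values below are illustrative assumptions for tuning, not the settings used in this project:

```python
# edit settings.py -- example throughput tuning (illustrative values)
CONCURRENT_REQUESTS = 64              # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32
DOWNLOAD_TIMEOUT = 15                 # give up on slow responses sooner
RETRY_ENABLED = False                 # trade retries for raw throughput
```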
Single spider (2023/03/21)

| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 152,958 | 1.3 | 117,409 |
| w8h | 4,759 | 0.1 | 32,203 |
| wiki* | 13,000,320 | 43 | 30,240 |
Distributed spider (4 spiders, 2023/03/24)

| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
| --- | --- | --- | --- |
| tweh | 153,288 | 0.52 | - |
| w8h | 4,921 | 0.16 | - |
| wiki* | 4,731,249 | 43.2 | 109,492 |
- Create a `.env` file (a hypothetical example is sketched after this list)
- Install Redis

```bash
sudo apt-get install redis-server
```

- Install MongoDB

```bash
sudo apt-get install mongodb
```

- Run Redis

```bash
sudo service redis-server start
```

- Run MongoDB

```bash
sudo service mongod start
```

- Install the dependencies and run a spider

```bash
cd millions-crawler
pip install -r requirements.txt
scrapy crawl [$spider_name]  # $spider_name = tweh, w8h, wiki
```
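What goes into the `.env` file depends on how the project reads its configuration; a plausible sketch with purely hypothetical variable names for the Redis and MongoDB connections looks like this:

```bash
# .env (hypothetical variable names)
REDIS_URL=redis://127.0.0.1:6379
MONGO_URI=mongodb://127.0.0.1:27017
MONGO_DATABASE=millions_crawler
```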
- GitHub | fake-useragent
- GitHub | scrapy
- 【Day 20】Anti-anti-crawling (反反爬蟲)
- Scrapy documentation
- Fixing Redis MISCONF: Redis is configured to save RDB snapshots, but is currently not able to persist o...
- Ubuntu Linux Redis installation and configuration tutorial and examples (Ubuntu Linux 安裝、設定 Redis 資料庫教學與範例)
- How to connect to a remote Linux + MongoDB server? (如何連線到遠端的 Linux + MongoDB 伺服器?)
- Scrapy-redis: the final chapter (Scrapy-redis 之終結篇)