You can inspect response.request.headers to see which User-Agent was actually sent. To change it, use one of the following:
- In settings.py, set USER_AGENT.
- In settings.py, add a User-Agent key to the DEFAULT_REQUEST_HEADERS dict.
- In the spider code, pass a User-Agent through the headers argument of response.follow or scrapy.Request.
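As a rough illustration of the options above, here is a minimal sketch of a settings.py fragment. The User-Agent string and the CUSTOM_UA name are made-up placeholders, not values from these notes.

```python
# settings.py (fragment). CUSTOM_UA and its value are hypothetical placeholders.
CUSTOM_UA = "Mozilla/5.0 (X11; Linux x86_64) stock-rank-notes/0.1"

# Option 1: project-wide default User-Agent.
USER_AGENT = CUSTOM_UA

# Option 2: add a User-Agent key to the default request headers.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": CUSTOM_UA,
}

# Option 3 (goes in the spider code, not in settings.py): per-request headers, e.g.
#   yield scrapy.Request(url, headers={"User-Agent": CUSTOM_UA})
#   yield response.follow(href, headers={"User-Agent": CUSTOM_UA})
```

To check the result, response.request.headers.get("User-Agent") can be logged inside a callback; it returns the header value as bytes.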
https://docs.scrapy.org/en/latest/topics/debug.html
Use scrapy parse to specify directly which spider method you want to test.
e.g.: scrapy parse --spider=stock_rank -c parse -d 3 "https://histock.tw/stock/rank.aspx?&p=3&d=1"
-c: the spider method (callback) you want to test
-d: depth level
--meta: extra request meta as a JSON string, e.g. --meta='{"key":"value"}'
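To enter shell mode for debugging, call inspect_response(response, self) (from scrapy.shell) inside a callback, then run scrapy crawl stock_rank in the terminal.
To view the downloaded page in a browser, call open_in_browser(response) (from scrapy.utils.response).
To log from a spider, use self.logger.warning().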
To debug in VSCode, create runner.py, then:
- In the VSCode terminal environment, choose the Python environment that has Scrapy installed.
- In the VSCode IDE, use Run / Start Debugging.
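The notes do not show what runner.py contains. One possible version, assuming it sits next to scrapy.cfg and receives the spider name as a command-line argument (the crawl_argv helper is a hypothetical name), could look like this:

```python
# runner.py: a hypothetical launcher, assumed to sit next to scrapy.cfg.
import sys


def crawl_argv(spider_name):
    """Build the argv list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here so the module can be imported without loading Scrapy.
    from scrapy.cmdline import execute

    # Runs the crawl inside this Python process, so debugger breakpoints
    # set in the spider code can be hit.
    execute(crawl_argv(sys.argv[1]))
```

From a terminal this would be started with python runner.py stock_rank; under the VSCode debugger, stock_rank would be passed as the program argument in the debug configuration.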
e.g. scrapy genspider -t crawl stock_news https://histock.tw/stock/rank.aspx
https://docs.scrapy.org/en/latest/topics/spiders.html?highlight=Rule#crawling-rules
https://github.com/scrapy-plugins/scrapy-splash
Pull the Splash image:
docker pull scrapinghub/splash
Run Splash:
docker run -it -p 8050:8050 scrapinghub/splash
Then, in settings.py:
- add
SPLASH_URL = 'http://127.0.0.1:8050'
- add
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
- add
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
- add
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
- Official Website
- Python Selenium
- Selenium Cheat Sheet
- Selenium with Python (Chinese translation of the documentation)
pip install selenium
- Download the driver for your browser (make sure it matches your browser's version number).
- Put the driver somewhere on your PATH, such as your Python directory.