- Chrome 74, ChromeDriver 74
- Python 3.7
- Request libraries: requests, selenium, aiohttp
- Parsing libraries: lxml, beautifulsoup4, pyquery
- OCR libraries: tesseract-ocr 4.0, tesserocr 2.4
- Databases: MySQL, MongoDB
- Python storage libraries: pymysql, pymongo, redis, redis-dump
- Web libraries: Flask, Tornado
- Crawler framework: Scrapy
- Distributed crawling: Docker
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
import urllib.request
response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))
The returned response is an HTTPResponse object. It provides methods such as read(), getheader(name), getheaders(), readinto() and fileno(), and attributes such as msg, version, status, reason, debuglevel and closed.
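A minimal sketch of reading a few of these members without touching the public network. It starts a throwaway HTTP server on 127.0.0.1 and requests it once; `HelloHandler` and `inspect_local_response` are hypothetical names introduced for this illustration, and the empty `ProxyHandler` is only there so that proxy environment variables do not interfere with the loopback request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def inspect_local_response():
    """Fetch one page from a local server and collect some response members."""
    # Port 0 asks the OS to pick any free port.
    server = HTTPServer(('127.0.0.1', 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = 'http://127.0.0.1:%d/' % server.server_address[1]
        # An empty ProxyHandler makes the opener ignore proxy settings.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return {
                'status': response.status,
                'reason': response.reason,
                'content_type': response.getheader('Content-Type'),
                'body': response.read(),
            }
    finally:
        server.shutdown()
        server.server_close()


info = inspect_local_response()
print(info)
```

The same attributes and methods are available on the object returned by `urllib.request.urlopen` for an ordinary HTTP or HTTPS URL.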
To pass the data parameter, the content must first be encoded as bytes:
import urllib.parse
data = bytes(urllib.parse.urlencode({'key': 'value'}), encoding='utf8')
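As a rough illustration of what supplying data does, the sketch below attaches the encoded bytes to a `urllib.request.Request` object and inspects it without sending anything. The helper name `build_post_request` and the URL `http://httpbin.org/post` are assumptions made for this example only.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form):
    """Encode a dict as form data and attach it to a Request (nothing is sent)."""
    data = urllib.parse.urlencode(form).encode('utf8')
    return urllib.request.Request(url, data=data)


# The URL is a placeholder; no network call is made here.
request = build_post_request('http://httpbin.org/post', {'key': 'value'})
print(request.data)          # the encoded request body
print(request.get_method())  # the method switches to POST once data is set
```

Passing this `request` object (or the same `data` bytes) to `urllib.request.urlopen` would send the form as a POST request instead of a GET.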
The timeout parameter (in seconds) makes it possible to skip pages that take too long to respond:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('URL', timeout=10)
except urllib.error.URLError as e:
    # A timeout is reported as a URLError whose reason is socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('Time Out!')
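The same pattern can be wrapped in a small helper so that a crawl loop simply skips slow or failing pages. This is a sketch under assumptions: `fetch_or_skip` is a hypothetical name, and the extra `except socket.timeout` branch is there because a timeout that occurs while reading the body may be raised directly rather than wrapped in a `URLError`.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=10):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts during connection setup arrive wrapped in e.reason.
        if isinstance(e.reason, socket.timeout):
            print('Timed out while connecting:', url)
        else:
            print('Request failed:', e.reason)
        return None
    except socket.timeout:
        # Timeouts while reading the body may be raised directly.
        print('Timed out while reading:', url)
        return None
```

A caller can then write `body = fetch_or_skip(url)` and move on to the next URL whenever the result is `None`.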