will4906 / PatentCrawler

Scrapy patent crawler (no longer maintained)

Problem when running

my-dady opened this issue · comments

2018-05-30 15:33:15 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: crawler)
2018-05-30 15:33:15 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2n 7 Dec 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-05-30 15:33:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'COOKIES_DEBUG': True, 'DOWNLOAD_DELAY': 1.0, 'DOWNLOAD_TIMEOUT': 10, 'LOG_FILE': 'C:\Users\myh\Desktop\PatentCrawler-master\output\20180530_153315\PatentCrawler.log', 'NEWSPIDER_MODULE': 'crawler.spiders', 'RETRY_TIMES': 3, 'SPIDER_MODULES': ['crawler.spiders']}
2018-05-30 15:33:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'crawler.middlewares.PatentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-30 15:33:16 [scrapy.middleware] INFO: Enabled item pipelines:
['crawler.pipelines.CrawlerPipeline']
2018-05-30 15:33:16 [scrapy.core.engine] INFO: Spider opened
2018-05-30 15:33:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-30 15:33:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:17 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:18 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 1 times): unlogin
2018-05-30 15:33:19 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033; IS_LOGIN=true; WEE_SID=x1Sv9YxmnHdXesCJk04Y3SMqTX3yBIpnhcwf0uKlEOg9TlE-gYYY!309799008!187544033!1527665495142

2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:19 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 2 times): unlogin
2018-05-30 15:33:20 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033; IS_LOGIN=true; WEE_SID=enOv9ZPDdp7oeLhqlYjU_gHhiJA63dF52InwKDPUfwSJwT4OC0x4!309799008!187544033!1527665497027

2018-05-30 15:33:20 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:21 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 3 times): unlogin
2018-05-30 15:33:22 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Cookie: JSESSIONID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033; IS_LOGIN=true; WEE_SID=fdyv9Zmxa7oMcWvdvBHwiuh8nvKhmeaYnZ03iat0rUfX2SfDs-5E!309799008!187544033!1527665498545

2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/pageIsUesd-pageUsed.shtml HTTP/1.1" 200 None
2018-05-30 15:33:22 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/patentsearch/tableSearch-showTableSearchIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/login-showPic.shtml HTTP/1.1" 200 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/wee/platform/wee_security_check HTTP/1.1" 302 None
2018-05-30 15:33:23 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uilogin-loginSuccess.shtml?params=991CFE73D4DF553253D44E119219BF31366856FF4B15222669397E093A956A2C&j_loginsuccess_url= HTTP/1.1" 302 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "GET /sipopublicsearch/portal/uiIndex.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.pss-system.gov.cn
2018-05-30 15:33:24 [urllib3.connectionpool] DEBUG: http://www.pss-system.gov.cn:80 "POST /sipopublicsearch/patentsearch/showViewList-jumpToView.shtml HTTP/1.1" 200 None
2018-05-30 15:33:24 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml> (failed 4 times): unlogin
2018-05-30 15:33:24 [scrapy.core.scraper] ERROR: Error downloading <POST http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>
Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <404 http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\Program Files (x86)\anaconda\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "D:\Program Files (x86)\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 56, in process_response
(six.get_method_self(method).__class__.__name__, type(response))
AssertionError: Middleware PatentMiddleware.process_response must return Response or Request, got <class 'NoneType'>
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-30 15:33:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4368,
'downloader/request_count': 4,
'downloader/request_method_count/POST': 4,
'downloader/response_bytes': 6301,
'downloader/response_count': 4,
'downloader/response_status_count/404': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 30, 7, 33, 24, 666286),
'log_count/DEBUG': 56,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'retry/count': 3,
'retry/max_reached': 1,
'retry/reason_count/unlogin': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 5, 30, 7, 33, 16, 985230)}
2018-05-30 15:33:24 [scrapy.core.engine] INFO: Spider closed (finished)

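1. On May 29 the site changed the addresses of several search endpoints, so they need to be updated.
In url_config.py, in http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/executeTableSearch0402-executeCommandSearch.shtml
the 0402 must be changed to 0529.
There are a few more:
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/showPatentInfo0405-showPatentInfo.shtml
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/viewAbstractInfo0404-viewAbstractInfo.shtml
http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/showFullText0406-viewFullText.shtml
In each of these, the date suffix (0405 and so on) must likewise be changed to 0529.
2. In addition, the way basic search results are parsed has also partly changed and needs updating. As for exactly what to change, capture the traffic and analyze it and you will see; it is hard to explain in a few words here.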
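
The suffix change described in point 1 can also be scripted instead of edited by hand. The following is only a sketch: the helper name `update_suffix` and the regular expression are illustrative assumptions, not code from PatentCrawler, and it takes the comment's word that every listed endpoint now uses 0529.

```python
import re

# Matches the four-digit date suffix between the action name and the hyphen,
# e.g. the "0402" in ".../executeTableSearch0402-executeCommandSearch.shtml".
_SUFFIX = re.compile(r'(?<=[A-Za-z])\d{4}(?=-[A-Za-z]+\.shtml$)')


def update_suffix(url, new_suffix='0529'):
    """Return url with its four-digit endpoint suffix replaced by new_suffix.

    Hypothetical helper; URLs without such a suffix are returned unchanged.
    """
    return _SUFFIX.sub(new_suffix, url)


BASE = 'http://www.pss-system.gov.cn/sipopublicsearch/patentsearch/'
old_urls = [
    BASE + 'executeTableSearch0402-executeCommandSearch.shtml',
    BASE + 'showPatentInfo0405-showPatentInfo.shtml',
    BASE + 'viewAbstractInfo0404-viewAbstractInfo.shtml',
    BASE + 'showFullText0406-viewFullText.shtml',
]

for url in old_urls:
    print(update_suffix(url))
```

Endpoints that carry no date suffix, such as pageIsUesd-pageUsed.shtml from the log, pass through untouched.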

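My skills are limited and I have not managed to fix it yet. I urgently need the data for my graduation project. Is there a newer version? Many thanks!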

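I don't use scrapy, so I'll paste my own parsing code for your reference.
1. Basic search results: these used to require parsing an HTML page, but now everything comes back as JSON, which is actually simpler. Below is the parsing of the basic search results.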

import re


def _parse_basic(record_list):
    """Flatten the JSON search records into a list of dicts (None if empty or malformed)."""
    if not record_list:
        return None
    # Output key -> key inside each record's 'fieldMap'
    field_names = {
        'nrdAn': 'AP',
        'nrdPn': 'PN',
        'patent_id': 'ID',
        'request_number': 'APO',
        'request_date': 'APD',
        'publish_number': 'PN',
        'publish_date': 'PD',
        'invention_name': 'TIVIEW',
        'inventor': 'INVIEW',
        'proposer': 'PAVIEW',
        'agent': 'AGT',
        'agency': 'AGY',
    }
    result = []
    try:
        for record in record_list:
            field_map = record.get('fieldMap')
            basic = {}
            for key, source in field_names.items():
                value = field_map.get(source)
                # Strip the <FONT> and </FONT> tags; leave missing (None) values as they are
                basic[key] = re.sub(r'</?FONT>', '', value) if isinstance(value, str) else value
            result.append(basic)
        return result
    except Exception as e:
        print(e)
        return None
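
The tag stripping in `_parse_basic` can be tried on its own. Below is a minimal sketch, assuming records shaped the way the function reads them (a dict holding a 'fieldMap' dict). The sample values and the helper name `strip_font_tags` are made up for illustration and do not come from the site. Since `re.sub` raises a TypeError when handed None, the helper passes non-string values through unchanged.

```python
import re


def strip_font_tags(value):
    """Remove <FONT> and </FONT> markers from a string; pass anything else through."""
    if isinstance(value, str):
        return re.sub(r'</?FONT>', '', value)
    return value


# Made-up record in the shape _parse_basic expects; real responses carry more fields.
sample_record = {
    'fieldMap': {
        'TIVIEW': 'A <FONT>patent</FONT> crawler',
        'AGT': None,  # a field the server left empty
    }
}

field_map = sample_record['fieldMap']
cleaned = {key: strip_font_tags(value) for key, value in field_map.items()}
print(cleaned)  # {'TIVIEW': 'A patent crawler', 'AGT': None}
```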