scalingexcellence / scrapybook

Scrapy Book Code

Home Page: http://scrapybook.com/

problem in the example on page 46 (populating an item)

MasRa opened this issue

commented

Hi
Could you please help with this:
I followed the example on page 46 step by step, but I got the following output instead of what the book shows:

root@dev:~/book/MasoudProject/properties# scrapy crawl basic
2018-02-04 14:40:25 [scrapy] INFO: Scrapy 1.0.3 started (bot: properties)
2018-02-04 14:40:25 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-02-04 14:40:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'properties.spiders', 'SPIDER_MODULES': ['properties.spiders'], 'BOT_NAME': 'properties'}
2018-02-04 14:40:25 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-02-04 14:40:25 [boto] DEBUG: Retrieving credentials from metadata server.
2018-02-04 14:40:25 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-02-04 14:40:25 [boto] ERROR: Unable to read instance data, giving up
2018-02-04 14:40:25 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-02-04 14:40:25 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-02-04 14:40:25 [scrapy] INFO: Enabled item pipelines:
2018-02-04 14:40:25 [scrapy] INFO: Spider opened
2018-02-04 14:40:25 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-02-04 14:40:25 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-02-04 14:40:25 [scrapy] DEBUG: Crawled (200) <GET http://web:9312/properties/property_000000.html> (referer: None)
2018-02-04 14:40:25 [scrapy] ERROR: Spider error processing <GET http://web:9312/properties/property_000000.html> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in runCallbacks
current.result = callback(current.result, *args, **kw)
File "/root/book/MasoudProject/properties/properties/spiders/basic.py", line 38, in parse
item['address'] = response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 63, in setitem
(self.class.name, key))
KeyError: 'PropertiesItem does not support field: address'
2018-02-04 14:40:25 [scrapy] INFO: Closing spider (finished)
2018-02-04 14:40:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 792,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 736406),
'log_count/DEBUG': 3,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/KeyError': 1,
'start_time': datetime.datetime(2018, 2, 4, 14, 40, 25, 241964)}
2018-02-04 14:40:25 [scrapy] INFO: Spider closed (finished)

Could you please guide me on how to fix it?
Thank you

So this was while playing with your own copy, which has a different settings.py than the one in the chapter. This was the boto problem with that version of Scrapy. Nothing important - just a warning, essentially. The rest of the crawl should be fine. One way to mitigate it is to add the following two lines to settings.py:

# Disable S3
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
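
The KeyError further down your log is a separate issue from the boto warning: Scrapy raises it when the spider assigns a field that the Item class does not declare, so your items.py needs an address field. A minimal sketch of what items.py should contain (the field names other than address are assumptions here; the chapter's PropertiesItem declares more fields than this):

# items.py - minimal sketch, not the chapter's full item definition
from scrapy.item import Item, Field

class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()  # the field the KeyError says is missing
    image_urls = Field()

Once address is declared, the assignment on line 38 of basic.py should stop raising the KeyError.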
commented

Thank you so much.