scalingexcellence / scrapybook

Scrapy Book Code

Home Page:http://scrapybook.com/

Issue on chapter 3

OscarDgrouch opened this issue

This is related to Chapter 3. The book instructs me to run the Address item XPath => //*[@itemtype="http://schema.org/Place"][1]/text().
However, I'm getting this:
In [27]: response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
Out[27]:
[u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ',
u'\n ']

When I run it without the text(), I get this:
[u'\n West Hampstead, London',
u'\n Angel, London',
u'\n Tower Bridge, London',
u'\n Canary Wharf, London',
u'\n Whitechapel, London',
u'\n Chelsea, London',
u'\n Hackney, London',
u'\n Stratford, London',
u'\n Canary Wharf, London',
u'\n Chiswick, London',
u'\n Highbury, London',
u'\n Notting Hill, London',
u'\n Brixton, London',
u'\n Greenwich, London',
u'\n Canary Wharf, London',
u'\n Battersea, London',
u'\n South Kensington, London',
u'\n Camden, London',
u'\n Wimbledon, London',
u'\n West Hampstead, London',
u'\n West Hampstead, London',
u'\n Elephant And Castle, London',
u'\n Angel, London',
u'\n Heathrow, London',
u'\n Bayswater, London',
u'\n Seven Sisters, London',
u'\n Angel, London',
u'\n Angel, London',
u'\n Battersea, London',
u'\n Bethnal Green, London']
I tried playing with it and came up with this:
In [32]: response.xpath('//*[@itemtype="http://schema.org/Place"][1]/span/text()').extract()
Out[32]:
[u'West Hampstead, London',
u'Angel, London',
u'Tower Bridge, London',
u'Canary Wharf, London',
u'Whitechapel, London',
u'Chelsea, London',
u'Hackney, London',
u'Stratford, London',
u'Canary Wharf, London',
u'Chiswick, London',
u'Highbury, London',
u'Notting Hill, London',
u'Brixton, London',
u'Greenwich, London',
u'Canary Wharf, London',
u'Battersea, London',
u'South Kensington, London',
u'Camden, London',
u'Wimbledon, London',
u'West Hampstead, London',
u'West Hampstead, London',
u'Elephant And Castle, London',
u'Angel, London',
u'Heathrow, London',
u'Bayswater, London',
u'Seven Sisters, London',
u'Angel, London',
u'Angel, London',
u'Battersea, London',
u'Bethnal Green, London']

My questions: which XPath expression is right? And why am I getting an array instead of single values?

Hello, I see what you mean. I can confirm that:

scrapy shell http://web:9312/properties/index_00000.html
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
[u'\n  ', ... u'\n  ', u'\n  ']
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/span/text()').extract()
[u'West Hampstead, London', ... , u'Bethnal Green, London']

The only issue is that in the context of Chapter 3 you want to be crawling individual property pages, e.g.:

scrapy shell http://web:9312/properties/property_000000.html
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
[u'West Hampstead, London']

In Chapter 5, page 99 you can find how to crawl the index pages directly with relative XPaths (see also here).

P.S. Sorry for the typo - they are mentioned as "Relevant XPath" in that page.