All the Places

Website

A project to extract GeoJSON from the web, focusing on websites that have 'store locator' pages, such as restaurants, gas stations, and retailers. Each chain has its own bit of software (a "spider") that extracts useful information from its site. Each spider can be individually configured to throttle its request rate and act as a good citizen on the Internet. The default User-Agent for the spiders can be found here. Websites wishing to prevent our spiders from accessing their data can block that User-Agent, but please feel free to contact us with any requests or recommendations.
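
As a rough illustration of per-spider throttling: Scrapy lets a spider override project-wide settings through a `custom_settings` class attribute. The name `POLITE_SETTINGS` and the values below are assumptions chosen for illustration, not this project's defaults.

```python
# Illustrative values only. A dict like this could be assigned to a
# spider's `custom_settings` class attribute to slow that spider down.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,                  # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
}
```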

The project is built using scrapy, a Python-based web scraping framework. Each target website gets its own spider, which does the work of extracting interesting details about locations and outputting results in a useful format.

Adding a spider

To scrape a new website for locations, you'll want to create a new spider. You can copy from existing spiders or start from a blank template, but the result is always a Python class that has a parse() function that yields GeojsonPointItems. The Scrapy framework does the work of outputting the GeoJSON based on the objects that the spider generates.

Development setup

To get started, you'll want to install the dependencies for this project.

This project uses pipenv to handle dependencies and virtual environments. To get started, make sure you have pipenv installed.
With pipenv installed, make sure you have the alltheplaces repository checked out:
```
git clone git@github.com:alltheplaces/alltheplaces.git
```
Then you can install the dependencies for the project:
```
cd alltheplaces
pipenv install
```
After the dependencies are installed, make sure you can run the scrapy command without error:
```
pipenv run scrapy
```
If pipenv run scrapy ran without complaining, then you have a functional scrapy setup and are ready to write a scraper.

Create a new spider

Create a new file in locations/spiders/ with this content:
```
# -*- coding: utf-8 -*-
import scrapy
from locations.items import GeojsonPointItem

class TemplateSpider(scrapy.Spider):
    name = "template"
    allowed_domains = ["www.sample.com"]
    start_urls = (
        'https://www.sample.com/locations/',
    )

    def parse(self, response):
        pass
```
This blank/template spider will start at the given start_urls and only touch the domains listed in allowed_domains. Every web response is passed to the parse() function, with the response content in the response argument. Once you have the response content, you can perform various operations on it. The most useful is probably running XPath selections on the HTML to extract data from the page. Check out the "Tips for writing a spider" section below for more information about how to use these tools to efficiently get data out of the page.
Once you have your spider written, you can give it a test run to make sure it's finding the expected results:
```
pipenv run scrapy crawl template
```
The scrapy crawl template command runs a spider named template. If you changed the name of your spider, you should use the name you chose. By default, scrapy crawl does not save the output anywhere, but it does log the results of the spider operation fairly verbosely.

To generate GeoJSON locally, pass an output file to the crawl command. The -O option writes the results to the named file, overwriting it if it already exists:
```
pipenv run scrapy crawl template -O output.geojson
```
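
One way to sanity-check the generated file is a small script like the following. This is a minimal sketch, assuming the output is a GeoJSON FeatureCollection of Point features; the `check_geojson` helper and the inline sample are hypothetical, and in practice you would read the text from output.geojson instead.

```python
import json


def check_geojson(text):
    """Return (feature_count, problems) for a GeoJSON FeatureCollection string."""
    collection = json.loads(text)
    problems = []
    features = collection.get("features", [])
    for index, feature in enumerate(features):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            problems.append(f"feature {index}: no Point geometry")
            continue
        # GeoJSON stores coordinates as [longitude, latitude].
        lon, lat = geometry["coordinates"][:2]
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"feature {index}: coordinates out of range")
    return len(features), problems


# Hypothetical sample standing in for the contents of output.geojson.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-122.4, 37.77]},
         "properties": {"city": "San Francisco"}},
        {"type": "Feature", "geometry": None, "properties": {}},
    ],
})
count, problems = check_geojson(sample)
print(count, problems)
```

An out-of-range coordinate is often a sign that latitude and longitude were swapped in the spider.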

Finally, make sure your parse() function is yielding GeojsonPointItems that contain the location and property data that you extract from the page:

```
def parse(self, response):
    yield GeojsonPointItem(
        lat=latitude,
        lon=longitude,
        street_address="1234 Fifth Street",
        city="San Francisco",
        state="CA",
        country="US"
    )
```
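
To picture what an item like this becomes in the output, here is a minimal sketch of the general shape of a GeoJSON Feature. The `to_feature` helper is hypothetical and the property names are copied straight from the item fields; the project's exporter decides the actual property names, so treat this only as an illustration of the structure.

```python
import json


def to_feature(item):
    """Build a GeoJSON Feature from a dict of item fields (illustrative only)."""
    properties = {k: v for k, v in item.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [item["lon"], item["lat"]]},
        "properties": properties,
    }


feature = to_feature({
    "lat": 37.7749,
    "lon": -122.4194,
    "street_address": "1234 Fifth Street",
    "city": "San Francisco",
    "state": "CA",
    "country": "US",
})
print(json.dumps(feature, indent=2))
```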

Once you have a spider that logs out useful results, you can create a new branch and push it up to your fork to create a pull request. The build system will run your spider and output information about the results as a comment on your pull request.

Tips for writing a spider

Preferred discovery methods

There are usually a few ways to find locations:

An XML sitemap, often at https://<domain>/sitemap.xml. The domain's robots.txt file (https://<domain>/robots.txt) can also be useful for finding sitemaps. These can be crawled with a SitemapSpider.
A "store directory" that is a hierarchical listing of all locations. These listings are sometimes hidden in the footer or on the site map page. Keep an eye out for these, because it's a lot easier when the site enumerates all the locations for you than when you have to program a spider to find them. These can be crawled with CrawlSpider.
A "store finder" that lets the user search by location. Keep an eye on the "network" tab of your browser's developer tools to see what the request is so you can replicate it in your spider. You may be able to change the request to get the API to return all the stores. These can be crawled with a normal Spider and specific start_urls or start_requests().
If the only option is to search by latitude/longitude, the site can be crawled with Searchable Points.
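
When checking by hand whether a site offers a usable sitemap, a small standalone script can help. The sketch below uses only the standard library; the helper names, the sample robots.txt and sitemap bodies, and the "/locations/" path marker are all assumptions for illustration. In a real spider, Scrapy's SitemapSpider does this work for you.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def locations_from_sitemap(sitemap_xml, marker="/locations/"):
    """Return <loc> URLs from a sitemap that contain the given path marker."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if marker in url]


# Hypothetical bodies standing in for downloaded files.
robots = "User-agent: *\nDisallow: /cart\nSitemap: https://www.sample.com/sitemap.xml\n"
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.sample.com/about</loc></url>
  <url><loc>https://www.sample.com/locations/ca/san-francisco</loc></url>
</urlset>"""

print(sitemaps_from_robots(robots))
print(locations_from_sitemap(sitemap))
```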

Structured Data

Some websites already publish their data in a standard way. We can parse these with our StructuredDataSpider: use a SitemapSpider or CrawlSpider to obtain the pages and pass them to parse_sd, which will parse any Microdata or Linked Data with a type defined in wanted_types. You can then clean up the item or add extra attributes with inspect_item.

validator.schema.org can be really helpful when making spiders to see what structured data is available.
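
To get a feel for what Linked Data looks like inside a page, the following standalone sketch pulls JSON-LD blocks out of HTML and keeps those with a wanted type. It is not the project's StructuredDataSpider; the `JsonLdExtractor` class, the `find_types` helper, and the sample page are hypothetical. It also simplifies real-world data, where `@type` can be a list and objects can be nested.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.documents.append(json.loads("".join(self._buffer)))


def find_types(html, wanted_types):
    """Return top-level JSON-LD objects whose @type is one of wanted_types."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [doc for doc in parser.documents
            if isinstance(doc, dict) and doc.get("@type") in wanted_types]


# Hypothetical page content.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Sample Diner",
 "geo": {"@type": "GeoCoordinates", "latitude": 37.77, "longitude": -122.42}}
</script>
<script type="application/ld+json">{"@type": "WebSite", "name": "Sample"}</script>
</head></html>
"""
stores = find_types(html, ["Restaurant"])
print(stores[0]["name"], stores[0]["geo"]["latitude"])
```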

Searchable Points

For store locators that do allow searches by latitude/longitude, a grid of searchable latlon points is available for the US, CA, AU, and Europe here. Each point represents the centroid of a search where the radius distance is indicated in the file name. See the Dollar General scraper for an example of how you might utilize them for national searches.

For stores that do not have a national footprint (e.g. #1034), there are separate point files that include a state/territory attribute e.g. 'us_centroids_100mile_radius_state.csv'. This allows for points to be filtered down to specific states/territories when a national search is unnecessary.

Note: A search radius may overlap multiple states, especially when it’s centered near a state boundary. This creates a one-to-many relationship between the search radius point and the states covered by that search zone. As a result, the state files contain records that share the same lat/lon but are associated with different states. The same is true for the European and Canadian territory files.
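
The filtering described above can be sketched as follows. The column names (`latitude`, `longitude`, `state`) and the sample rows are assumptions for illustration; check the header of the actual point file before relying on them. The sketch also deduplicates points, since one point may be listed under several states.

```python
import csv
import io


def points_for_states(csv_text, states):
    """Return unique (lat, lon) search points that cover any of the given states."""
    seen = set()
    points = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["state"] not in states:
            continue
        point = (float(row["latitude"]), float(row["longitude"]))
        # One point can be listed under several states; search it only once.
        if point not in seen:
            seen.add(point)
            points.append(point)
    return points


# Hypothetical rows; the first two share a point that covers both MO and KS.
sample = """latitude,longitude,state
39.10,-94.60,MO
39.10,-94.60,KS
38.60,-90.20,MO
41.20,-96.00,NE
"""
selected = points_for_states(sample, {"MO", "KS"})
print(selected)
```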

You can send the spider to other pages

The simplest thing a spider can do is to load the start_urls, process the page, and yield the data as GeojsonPointItem objects from the parse() method. Usually that's not enough to get at useful data, though. The parse() method can also yield a Request object, which scrapy will use to add another URL to the request queue.

By default, the parse() method on the spider will be called with the response for the new request. In many cases it's easier to create a new function to parse the new page's content and pass that function in via the Request object's callback parameter like so:

```
yield scrapy.Request(
    response.urljoin(store_url.extract()),
    callback=self.parse_store
)
```

Since the next URL you want to request is usually pulled from an href in the page and relative to the page you're on, you can use the response.urljoin() method as a shortcut to build the URL for the next request.
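
To see how relative links resolve, you can experiment with the standard library's `urljoin`, which behaves much like response.urljoin() with the page's URL as the base. The page URL and href values below are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed.
page_url = "https://www.sample.com/locations/ca/"

# href values as they might appear in the page.
hrefs = [
    "san-francisco.html",              # relative to the current directory
    "/stores/1234",                    # relative to the site root
    "../ny/albany.html",               # up one directory
    "https://stores.sample.com/1234",  # already absolute, left unchanged
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```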

Using the scrapy shell

Instead of running scrapy crawl every time you want to try your spider, you can use the Scrapy shell to load a page and experiment with XPath queries. Once you're happy with a query that extracts interesting data, you can use it in your spider. This is a whole lot easier than rerunning the full crawl after each change.

To enter the shell, use scrapy shell http://example.com (where you replace the URL with your own). It will drop you into a Python shell after requesting and parsing the page. Once in the shell, you can work with the response object as if you were in your spider. The shell also offers a shortcut function called fetch() that lets you pull up a different page.

License

The data generated by our spiders is provided on our website and released under Creative Commons’ CC-0 waiver.

The spider software that produces this data (this repository) is licensed under the MIT license.
