
14D004 Scraping Project

Project Description

The data and code in this repository allow users to scrape all the available courses on datacamp.com and all the job posts on jobsinbarcelona.es using Scrapy, an open-source and collaborative framework for extracting the data you need from websites.

  • The code was written in Python 3.6 with Scrapy 1.5.1

DataCamp:

On the DataCamp courses page, you can search for courses of interest or browse all the courses by technology.

[Screenshot: browsing courses by technology]

The datacamp.py script extracts all of the course titles within these six technologies, along with each course's description, author, author's occupation, and URL.

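For orientation, here is a minimal sketch of what such a spider can look like. This is not the actual datacamp.py, and the CSS selectors are assumptions (DataCamp's markup changes over time); it only illustrates the Scrapy pattern of yielding one item dict per course:

import scrapy

class DatacampCoursesSpider(scrapy.Spider):
    # Sketch of a course spider; every selector below is hypothetical.
    name = 'datacamp'
    start_urls = ['https://www.datacamp.com/courses']

    def parse(self, response):
        # One block per course card on the listing page (assumed selector).
        for course in response.css('article.course-block'):
            yield {
                'title': course.css('h4::text').extract_first(),
                'description': course.css('p::text').extract_first(),
                'author': course.css('.author-name::text').extract_first(),
                'url': response.urljoin(course.css('a::attr(href)').extract_first()),
            }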

Jobs in Barcelona:

Jobs in Barcelona is a platform for tech-oriented jobs in Barcelona.


The jobsinbarcelona.py script scrapes all of the job listings, along with each job's company, location, publication date, source, and URL.

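The same pattern applies here, with one addition worth showing: a listings site is paginated, so the spider has to follow the "next page" link until it runs out. The sketch below is not the actual jobsinbarcelona.py, and all selectors are assumed:

import scrapy

class JobsSpider(scrapy.Spider):
    # Sketch of a paginated listing spider; selectors are hypothetical.
    name = 'jobsinbarcelona'
    start_urls = ['https://jobsinbarcelona.es/']

    def parse(self, response):
        for job in response.css('div.job-listing'):
            yield {
                'title': job.css('h2::text').extract_first(),
                'company': job.css('.company::text').extract_first(),
                'location': job.css('.location::text').extract_first(),
                'published': job.css('time::attr(datetime)').extract_first(),
                'url': response.urljoin(job.css('a::attr(href)').extract_first()),
            }

        # Keep crawling while the site exposes a "next page" link.
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)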

DataCamp Instructors:

On the DataCamp instructors page, you can find the details of all of the course instructors.


The datacamp_instruct.py script extracts all of the instructors' titles, along with their subscriber counts, occupations, and URLs. Furthermore, the script extracts their personal descriptions from their "Full Bios" (see the sketch below).

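Because each "Full Bio" lives on the instructor's own profile page, this spider crawls two levels deep: it collects the profile links from the index page, then parses each profile. A sketch of that pattern (again, not the real datacamp_instruct.py, and with assumed selectors):

import scrapy

class InstructorsSpider(scrapy.Spider):
    # Sketch of a two-level crawl; selectors are hypothetical.
    name = 'datacamp_instruct'
    start_urls = ['https://www.datacamp.com/instructors']

    def parse(self, response):
        # Level 1: follow each instructor's profile link.
        for href in response.css('a.instructor-card::attr(href)').extract():
            yield response.follow(href, callback=self.parse_instructor)

    def parse_instructor(self, response):
        # Level 2: scrape the profile, including the "Full Bio" text.
        yield {
            'name': response.css('h1::text').extract_first(),
            'occupation': response.css('.occupation::text').extract_first(),
            'bio': ' '.join(response.css('.full-bio ::text').extract()),
            'url': response.url,
        }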

Folders

  • datacamp: the Scrapy project for DataCamp courses
  • jobsinbarcelona: the Scrapy project for jobsinbarcelona.es job posts
  • datacamp_instructors: the Scrapy project for DataCamp instructors

Each is a directory with the following contents (datacamp shown as an example):

datacamp/
    scrapy.cfg            # deploy configuration file
    datacamp.csv          # scraped data exported as .csv
    datacamp.json         # scraped data exported as .json

    datacamp/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file (not used)

        middlewares.py    # project middlewares file (not used)

        pipelines.py      # project pipelines file (not used)

        settings.py       # project settings file (not used)

        spiders/          # a directory with the spiders
            __init__.py
            datacamp.py   # the code for the datacamp spider

Prerequisites

Installing Scrapy

Install the latest version of Scrapy (I recommend using Anaconda):

  • Anaconda distribution:
    conda install scrapy
  • PyPI:
    pip install scrapy
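
Either way, you can confirm that the installation worked by printing the installed version:

scrapy version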

How to run the Spiders

To put the spiders to work, go to the relevant project’s top-level directory (i.e. datacamp, jobsinbarcelona or datacamp_instructors) and run:

scrapy crawl datacamp

or

scrapy crawl jobsinbarcelona

or

scrapy crawl datacamp_instruct
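
Each command must be run from inside its own project directory, since Scrapy locates the project through the scrapy.cfg file there. For example, from the repository root:

cd datacamp
scrapy crawl datacamp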

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl datacamp -o datacamp.csv

or

scrapy crawl jobsinbarcelona -o jobsinbarcelona.csv

or

scrapy crawl datacamp_instruct -o datacamp_instructors.csv

Each of these commands generates the corresponding file (datacamp.csv, jobsinbarcelona.csv or datacamp_instructors.csv) containing all the scraped items.

You can also use other formats, like JSON:

scrapy crawl datacamp -o datacamp.json

Note: for historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you’ll end up with a broken file.
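
If you plan to re-run a spider, the JSON Lines format (.jl) is a safer target: each item is written on its own line, so appended runs cannot corrupt the file:

scrapy crawl datacamp -o datacamp.jl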
