wengsengh / Carlist.my-Web-Scraping

scrap carlist

Carlist.my Web Scraping with Scrapy

1.0 Project Background

Carlist.my is a website that lists used cars for sale in Malaysia.

This project uses Scrapy to extract the car listing data from carlist.my.

image

image

2.0 Required Packages and Extension

pip install scrapy

pip install lxml

The Chrome Extension

XPath Helper

3.0 Extract the Web Data

  1. Analyze the URL rules and format
  2. Develop a data extraction strategy
  3. Determine how data is stored
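The three steps above can be sketched without any third-party dependency. This is a minimal illustration, not the project's actual spider: the listing path and the `page_number` / `page_size` query parameters are assumptions, and the idea that the listings are embedded as a JSON-LD `ItemList` is inferred from the column names in section 4.0 (`position`, `item_offers_priceCurrency`, and so on). The helper names `page_url`, `flatten`, `extract_rows`, and `save_rows` are hypothetical, and `SAMPLE_HTML` is made-up data.

```python
import csv
import json
import re
from urllib.parse import urlencode

# Assumption: listing pages live under this path and are paginated by query string.
BASE_URL = "https://www.carlist.my/used-cars-for-sale/malaysia"


def page_url(page_number, page_size=25):
    """Step 1: build the URL of one results page (parameter names are assumed)."""
    query = urlencode({"page_number": page_number, "page_size": page_size})
    return f"{BASE_URL}?{query}"


# Step 2: find JSON-LD <script> blocks embedded in the page HTML.
JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def flatten(obj, prefix=""):
    """Flatten nested dicts, joining keys with '_' and dropping a leading '@'."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key.lstrip("@")
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        else:
            flat[name] = value
    return flat


def extract_rows(html):
    """Return one flat dict per entry of every JSON-LD ItemList in the page."""
    rows = []
    for match in JSON_LD_RE.finditer(html):
        data = json.loads(match.group(1))
        blocks = data if isinstance(data, list) else [data]
        for block in blocks:
            if not isinstance(block, dict):
                continue
            for element in block.get("itemListElement", []):
                rows.append(flatten(element))
    return rows


def save_rows(rows, path):
    """Step 3: store the rows as CSV, using the union of all keys as the header."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample page fragment, for illustration only.
SAMPLE_HTML = """
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1,
   "item": {"@type": "Car", "name": "2014 Toyota Vios 1.5 G Sedan",
            "offers": {"@type": "Offer", "priceCurrency": "MYR"}}}
]}
</script>
"""

rows = extract_rows(SAMPLE_HTML)
```

In a Scrapy spider, a function like `extract_rows` could be called on `response.text` inside the `parse` callback, with each row yielded as an item and the next `page_url` requested until a page returns no rows.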

4.0 Data Cleansing

  1. Remove columns that are not relevant, such as 'type', 'position', 'item_type', 'item_additionalType', 'item_url', 'item_image', 'item_offers_type', 'item_offers_priceCurrency', 'item_offers_itemCondition', 'item_offers_seller_url', etc.
  2. Extract the car model year and engine capacity (cc) from the 'item_name' column using regular expressions (regex).

    image
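The two cleansing steps can be sketched on plain dictionaries as follows. This is an illustration under assumptions rather than the notebook's actual code: it assumes 'item_name' contains a four-digit year and an engine displacement written in litres (for example "1.5"), which is converted to cc by multiplying by 1000. The function name `clean_row`, the output keys `model_year` and `engine_cc`, and the 'item_offers_price' column in the sample are hypothetical.

```python
import re

# Columns listed above as not relevant (the original list ends with "etc.").
IRRELEVANT_COLUMNS = {
    "type", "position", "item_type", "item_additionalType", "item_url",
    "item_image", "item_offers_type", "item_offers_priceCurrency",
    "item_offers_itemCondition", "item_offers_seller_url",
}

# Assumption: the year is a standalone four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"\b((?:19|20)\d{2})\b")
# Assumption: the engine size appears in litres with one decimal, e.g. "1.5".
LITRES_RE = re.compile(r"\b(\d\.\d)\b")


def clean_row(row):
    """Drop irrelevant columns and add model_year / engine_cc parsed from item_name."""
    cleaned = {k: v for k, v in row.items() if k not in IRRELEVANT_COLUMNS}
    name = row.get("item_name", "")
    year = YEAR_RE.search(name)
    litres = LITRES_RE.search(name)
    cleaned["model_year"] = int(year.group(1)) if year else None
    cleaned["engine_cc"] = round(float(litres.group(1)) * 1000) if litres else None
    return cleaned


# Made-up sample row, for illustration only.
sample = {
    "position": 1,
    "item_type": "Car",
    "item_name": "2014 Toyota Vios 1.5 G Sedan",
    "item_offers_price": "45000",
}
cleaned = clean_row(sample)
```

With pandas, the same patterns could be applied to a whole column through `Series.str.extract`, and the unwanted columns removed with `DataFrame.drop(columns=...)`.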

5.0 Data Visualization with Tableau

The link: https://public.tableau.com/app/profile/weng.seng/viz/carlist2/Story1?publish=yes

5.1 Toyota

image

The top listing model: Vios

The top listing model year: 2014

The top listing body type: Sedan, followed by MPV

5.2 Perodua

image

The top listing model: Myvi

The top listing model year: 2015

The top listing body type: Hatchback
