Gulilil / Seleksi-2023-Tugas-1

Seleksi Warga Basdat 2023

Data Scraping, Data Modelling, and Data Storing

Repository Description

This repository contains the programs to scrape, preprocess, model, and store data. The scraped data is the Top Electronic Deals on eBay.com. This website was chosen because eBay is well known and allows its deal pages to be scraped. In these times of globalization, electronic products are among the most-searched items on e-commerce websites. However, scraping every displayed product would be excessive, so the scraped data is limited to the first page of each type of electronic product.

The data is scraped and then preprocessed in two steps: Data Cleaning and Data Transformation. All preprocessing is done with pandas and numpy in Python. After preprocessing, the data is stored in an RDBMS, MySQL; the export is done with the mysql.connector library in Python. I initially chose PostgreSQL as the RDBMS, but its installation ran into an error and did not seem to work, so I switched to MySQL, which had already been installed on my machine for a long time. In addition, MySQL is sufficiently well known to be used for data storing.

Program Specification

The specifications used to develop the programs are listed below:

Specification | Additional Info
Python | v3.11.3
MySQL | or MariaDB
Tableau | 2023.2

How To Use

First, clone this repository:

git clone git@github.com:Gulilil/Seleksi-2023-Tugas-1.git

All the programs are stored as .ipynb files in the Data Scraping/src folder. These files need to be run sequentially: first scraping.ipynb, then preprocess.ipynb, and finally export.ipynb.

The scraping.ipynb notebook covers the steps from scraping the website to storing the data as a .json file.
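A minimal sketch of this step, using the BeautifulSoup and urllib.request libraries listed in References; the CSS selector and function names are illustrative assumptions, not the notebook's actual code:

import urllib.request
from bs4 import BeautifulSoup

def get_product_links(listing_url):
    # Fetch a listing page and collect the href of every product anchor.
    # The 's-item__link' class is an assumption about eBay's markup.
    html = urllib.request.urlopen(listing_url).read()
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.s-item__link") if a.get("href")]

def scrape_product(product_url):
    # Fetch a product page and extract a few fields into a dict.
    html = urllib.request.urlopen(product_url).read()
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    return {"title": title.get_text(strip=True) if title else None}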

The preprocess.ipynb notebook prepares the data from the previous .json file for further processing.
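A sketch of the kind of cleaning involved, assuming the raw data.json from the scraper; the column names follow the samples shown later in this README:

import pandas as pd

df = pd.read_json("data.json", orient="index")

# Split e.g. "US $669.99" into a currency code and a numeric value.
price = df["price_original"].str.extract(r"(?P<currency>[A-Z]+)\s*\$?(?P<value>[\d,.]+)")
df["price_original_currency"] = price["currency"]
df["price_original_value"] = price["value"].str.replace(",", "").astype(float)

# Normalize e.g. "IDR9,999,850.75" into a plain float.
df["price_idr"] = (
    df["price_idr"]
    .str.replace("IDR", "")
    .str.replace(",", "")
    .astype(float)
)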

The export.ipynb notebook connects to the local RDBMS, creates the database and tables, and inserts the needed data into the database.
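A minimal sketch of this export step, assuming a local MySQL server; the credentials, database name, and schema here are hypothetical:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="password")
cursor = conn.cursor()

# Create the database and a product table (schema abbreviated).
cursor.execute("CREATE DATABASE IF NOT EXISTS ebay_deals")
cursor.execute("USE ebay_deals")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS product (
        listing_id BIGINT PRIMARY KEY,
        title      VARCHAR(255),
        type       VARCHAR(100)
    )
""")

# Insert rows with a parameterized query.
rows = [(354907151520, "Microsoft 13 inch Surface Pro 8", "Laptops & Netbooks")]
cursor.executemany("INSERT INTO product (listing_id, title, type) VALUES (%s, %s, %s)", rows)
conn.commit()
conn.close()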

After the data has been inserted, the database is dumped into an .sql file with the command below:

mysqldump -u root -p {database_name} > {sql_file_name}.sql 
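For example, assuming the database is named ebay_deals (a hypothetical name), the command becomes:

mysqldump -u root -p ebay_deals > ebay_deals.sql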

JSON Structure

Scraped data is stored in Data Scraping/data/data.json. Its structure is explained below:

JSON Structure

// orient = 'index'
{
    <Dataframe Index> : {
        <Attribute 1> : <Value 1>,
        <Attribute 2> : <Value 2>,
        .
        .
        <Attribute n> : <Value n>
    },
}

// orient = 'records'
[
    {
        <Attribute 1> : <Value 1>,
        <Attribute 2> : <Value 2>,
        .
        .
        <Attribute n> : <Value n>
    },
]
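Both layouts correspond to the orient parameter of pandas' to_json; a minimal sketch, using a hypothetical one-row dataframe:

import pandas as pd

df = pd.DataFrame([{"listing_id": "333563280818", "method": "PayPal"}])

# orient='index' writes a JSON object keyed by the dataframe index.
df.to_json("data.json", orient="index")

# orient='records' writes a JSON array of row objects.
df.to_json("data_payment.json", orient="records")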

data.json

{
    "0":{
    "listing_id":"354907151520",
    "title":"Microsoft 13\" Surface Pro 8 (i5, 8GB RAM, 256GB SSD, Graphite) with Surface Pen",
    "type":"Laptops & Netbooks",
    "available":"Limited quantity available",
    "sold":"1 ",
    "price_original":"US $669.99",
    "price_idr":"IDR9,999,850.75",
    "shipping_price_original":"US $54.46 ",
    "shipping_location":"Avenel, New Jersey, United States",
    "return":"30 days returns",
    "seller":"Always Deals Store",
    "condition":"New"
    },
}

After the preprocessing steps, for readability, the data is stored in four separate .json files inside the Data Scraping/data folder: data_payment.json, data_price.json, data_product.json, and data_shipment.json. A sample of each file is shown below:

data_payment.json

[
  {
    "listing_id":"333563280818",
    "method":"PayPal"
  },
]

data_price.json

[
  {
    "price_original_currency":"EUR ",
    "price_original_value":239.99,
    "price_idr":4026996.3799999999
  },
]

data_product.json

[
  {
    "listing_id":333563280818,
    "title":"CHUWI Laptop HeroBook GemiBook CoreBook Pro\/Plus\/X Intel Core Laptop Windows",
    "type":"Laptops & Netbooks",
    "available":"2 ",
    "sold":32,
    "return":"30 days ",
    "seller":"CHUWI Official Store\uff08chuwi2018\uff09",
    "condition":"New"
  },
]

data_shipment.json

[
  {
    "shipping_availability":true,
    "shipping_price_currency":null,
    "shipping_price_value":0.0,
    "shipping_location":null
  },
]
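A sketch of how these four files could be written, assuming a cleaned dataframe with the columns shown above (the sample row here is hypothetical):

import pandas as pd

# Hypothetical cleaned dataframe; in the notebook this comes from the
# preprocessing steps.
df = pd.DataFrame([{
    "listing_id": "333563280818",
    "method": "PayPal",
    "price_original_currency": "EUR",
    "price_original_value": 239.99,
    "price_idr": 4026996.38,
}])

# Each column subset goes to its own file for readability.
df[["listing_id", "method"]].to_json("data_payment.json", orient="records")
df[["price_original_currency", "price_original_value", "price_idr"]].to_json("data_price.json", orient="records")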

Database Structure

Entity Relationship Diagram

Relational Diagram

ERD to Relational Diagram

Here are the steps to translate the ERD into a relational diagram; a hypothetical SQL sketch follows the list.

  1. A composite attribute inside an entity needs to be separated into its own table. The new table has a composite primary key consisting of the primary key of the original entity and the composite attribute itself; the former also acts as a foreign key referencing the original entity.
  2. Create a table for each weak entity, whose primary key includes the primary key of the strong entity it depends on. This key is set as a reference to the primary key of the strong entity.
  3. For each strong entity, create a table with the same attributes as the original entity, excluding the composite attributes that have been separated out.
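As a hypothetical illustration of these steps (not the repository's actual schema), the multivalued payment method from the samples above would become its own table with a composite primary key:

CREATE TABLE product (
    listing_id BIGINT PRIMARY KEY,
    title      VARCHAR(255),
    type       VARCHAR(100)
);

-- Step 1: the payment method gets its own table; the primary key combines
-- the product key and the attribute, and listing_id references product.
CREATE TABLE payment (
    listing_id BIGINT,
    method     VARCHAR(50),
    PRIMARY KEY (listing_id, method),
    FOREIGN KEY (listing_id) REFERENCES product(listing_id)
);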

Screenshot

Data Scraping

Function to scrape items and get the href of each product page on a webpage

Function to scrape product data from a product page

List of product type pages to be scraped

Data Preprocess

Splitting price attributes to currency and value

Drop unused attributes

Replace unnecessary values in string data

Drop products that might be categorized as multiple types

Split payment value into a new dataframe

Data Exporting

Establish connection to local RDBMS and create a new database

Query to create tables

Insert data to product table

Insert data to price table

Insert data to shipment table

Insert data to payment table

Data Storing

Tables in the constructed database

payment and price tables' description

product and shipment tables' description

Selecting data in product table

Selecting data in price, payment, and shipment tables

References

Here are the libraries used in the program development:

Libraries:
- BeautifulSoup (bs4)
- urllib.request
- pandas
- numpy
- mysql.connector

Visualization

The data visualization is built with the Tableau software. The resulting dashboards and the links to them are displayed as .png images below:

Author

Name | Student ID
Juan Christopher Santoso | 13521116
