An ETL pipeline built on Google Cloud Platform (GCP). First, an Airbnb dataset from Kaggle is loaded into a Google Cloud Storage (GCS) bucket.
Then, a dimensional model (star schema) is designed, and the data is transformed and loaded into BigQuery. Finally, a Looker dashboard is built on the loaded data for analysis.
The whole process is orchestrated by Mage, a modern ETL tool.
The dataset used in this project is based on listing reviews on Airbnb. Download the dataset used in this project here.
First, a GCS bucket is created, and then the dataset is uploaded to it.
![Screenshot 2023-08-26 at 10 06 34 PM](https://private-user-images.githubusercontent.com/100070155/263476367-3b53d74c-000c-439d-b7e1-ec7a41ab1548.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM1MzM5ODksIm5iZiI6MTcyMzUzMzY4OSwicGF0aCI6Ii8xMDAwNzAxNTUvMjYzNDc2MzY3LTNiNTNkNzRjLTAwMGMtNDM5ZC1iN2UxLWVjN2E0MWFiMTU0OC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgxM1QwNzIxMjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iZDBkNjI2OWMxNTQ4ZjhhZGY0NDA0ZDFiYmNmZjlmZjIwZWUxODk4MDcxMTRiNjVhNGExNWJhNjY1NGVjZmM2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.I-Knq0Kco4WgKMyck_Z_oyzL9Q66ef_08dXn2QHDUCQ)
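The upload can also be scripted instead of done through the Cloud Console. The following is a minimal sketch, assuming the `google-cloud-storage` client library and already configured GCP credentials. The helper name `upload_dataset` and the bucket, file, and object names in the usage comment are hypothetical placeholders, not values taken from this project.

```python
def upload_dataset(client, bucket_name, source_path, blob_name):
    """Upload a local file to a GCS bucket and return its gs:// URI.

    `client` is expected to be a google.cloud.storage.Client instance.
    """
    bucket = client.bucket(bucket_name)      # handle to an existing bucket
    blob = bucket.blob(blob_name)            # handle to the target object
    blob.upload_from_filename(source_path)   # performs the actual upload
    return f"gs://{bucket_name}/{blob_name}"


# Usage (requires `pip install google-cloud-storage` and GCP credentials;
# all names below are hypothetical):
#   from google.cloud import storage
#   upload_dataset(storage.Client(), "my-airbnb-bucket",
#                  "airbnb_data.csv", "raw/airbnb_data.csv")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.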
After setting up the data source, a star schema dimensional model is built in the BigQuery console.
Here's the Model Overview:
Next, a BigQuery dataset named airbnb_reviews is created.
Then, the following script is executed in BigQuery to create the required dimension and fact tables under airbnb_reviews:
CREATE OR REPLACE TABLE airbnb_reviews.DIM_DATE(
date_key INT NOT NULL,
last_review_day INT,
last_review_month INT,
last_review_year INT,
PRIMARY KEY (date_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.DIM_LOCATION(
location_key INT NOT NULL,
latitude NUMERIC,
longitude NUMERIC,
country STRING(30),
country_code STRING(2),
PRIMARY KEY (location_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.DIM_HOST(
host_key INT NOT NULL,
host_id INT,
host_name STRING(30),
isVerified BOOL,
licence STRING,
PRIMARY KEY (host_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.DIM_LISTING(
listing_key INT NOT NULL,
listing_id INT,
listing_name STRING,
listing_house_rules STRING,
neighbourhood_group STRING,
neighbourhood STRING,
cancellation_policy STRING,
room_type STRING,
isInstantBookable BOOL,
PRIMARY KEY (listing_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.FACT_PRICE(
listing_key INT,
location_key INT,
host_key INT,
date_key INT,
price INT,
service_fee INT,
FOREIGN KEY (listing_key) REFERENCES airbnb_reviews.DIM_LISTING (listing_key) NOT ENFORCED,
FOREIGN KEY (location_key) REFERENCES airbnb_reviews.DIM_LOCATION (location_key) NOT ENFORCED,
FOREIGN KEY (host_key) REFERENCES airbnb_reviews.DIM_HOST (host_key) NOT ENFORCED,
FOREIGN KEY (date_key) REFERENCES airbnb_reviews.DIM_DATE (date_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.FACT_REVIEWS(
listing_key INT,
location_key INT,
host_key INT,
date_key INT,
review_count INT,
review_per_month NUMERIC,
review_rate_number INT,
FOREIGN KEY (listing_key) REFERENCES airbnb_reviews.DIM_LISTING (listing_key) NOT ENFORCED,
FOREIGN KEY (location_key) REFERENCES airbnb_reviews.DIM_LOCATION (location_key) NOT ENFORCED,
FOREIGN KEY (host_key) REFERENCES airbnb_reviews.DIM_HOST (host_key) NOT ENFORCED,
FOREIGN KEY (date_key) REFERENCES airbnb_reviews.DIM_DATE (date_key) NOT ENFORCED
);
CREATE OR REPLACE TABLE airbnb_reviews.FACT_LISTING_INFO(
listing_key INT,
location_key INT,
host_key INT,
date_key INT,
minimum_nights INT,
host_listings_count INT,
days_available INT,
FOREIGN KEY (listing_key) REFERENCES airbnb_reviews.DIM_LISTING (listing_key) NOT ENFORCED,
FOREIGN KEY (location_key) REFERENCES airbnb_reviews.DIM_LOCATION (location_key) NOT ENFORCED,
FOREIGN KEY (host_key) REFERENCES airbnb_reviews.DIM_HOST (host_key) NOT ENFORCED,
FOREIGN KEY (date_key) REFERENCES airbnb_reviews.DIM_DATE (date_key) NOT ENFORCED
);
![Screenshot 2023-08-26 at 11 28 11 PM](https://private-user-images.githubusercontent.com/100070155/263490641-d47eb2ef-91a6-4fc2-ac19-1e6df7af7695.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM1MzM5ODksIm5iZiI6MTcyMzUzMzY4OSwicGF0aCI6Ii8xMDAwNzAxNTUvMjYzNDkwNjQxLWQ0N2ViMmVmLTkxYTYtNGZjMi1hYzE5LTFlNmRmN2FmNzY5NS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgxM1QwNzIxMjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mOWU2YjA4MDQ0NWM5OWJiMDNjMmM4NTY2NDA4ODA3NTM0NGExZjJkNWJjMmNjZDk1ZGM3YTY2OGNmZDMzZjI2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.2M7BTswlKgt9YHfof6tO7GUy09GYiTvFJg1a4yKSIIU)
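Because the primary and foreign keys above are declared NOT ENFORCED, BigQuery stores them as metadata but does not reject rows that violate them. A pipeline may therefore want to check the keys itself before exporting. The following is a minimal sketch of such a check in plain Python; the helper names and the sample rows are hypothetical and are not part of this project's scripts.

```python
def find_orphan_keys(fact_rows, dim_rows, key):
    """Return sorted fact-side key values that have no matching dimension row."""
    dim_keys = {row[key] for row in dim_rows}
    return sorted({row[key] for row in fact_rows if row[key] not in dim_keys})


def find_duplicate_keys(dim_rows, key):
    """Return sorted key values that occur more than once in a dimension."""
    seen, dupes = set(), set()
    for row in dim_rows:
        (dupes if row[key] in seen else seen).add(row[key])
    return sorted(dupes)


# Illustrative rows shaped like DIM_HOST and FACT_PRICE (values are made up).
dim_host = [{"host_key": 0, "host_name": "Ann"}, {"host_key": 1, "host_name": "Bo"}]
fact_price = [
    {"host_key": 0, "price": 120},
    {"host_key": 1, "price": 80},
    {"host_key": 7, "price": 95},   # no DIM_HOST row has host_key 7
]

print(find_orphan_keys(fact_price, dim_host, "host_key"))  # [7]
print(find_duplicate_keys(dim_host, "host_key"))           # []
```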
Next, Mage is set up on a VM instance, and the following scripts are used for the Data Loader, Transformer, and Data Exporter blocks, respectively:
Here's an overview of the Mage UI:
![Screenshot 2023-08-26 at 11 15 08 PM](https://private-user-images.githubusercontent.com/100070155/263488229-5d5283a8-60b4-451a-b3cb-79bf06473e70.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM1MzM5ODksIm5iZiI6MTcyMzUzMzY4OSwicGF0aCI6Ii8xMDAwNzAxNTUvMjYzNDg4MjI5LTVkNTI4M2E4LTYwYjQtNDUxYS1iM2NiLTc5YmYwNjQ3M2U3MC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgxM1QwNzIxMjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hN2YyNzc3YWE1NWZlMGUzMWQzYWRmOTY3NTkzMDJhYTg3MmJhMTViYmRiOGVhMmQ4ZDQyN2RjYjM1MzViMWNkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.rokKTx7s5iBvmeQJ0Yau8YQuXI1a3cnrQj3nvf1XNJg)
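Conceptually, the Transformer block splits the flat source rows into the dimension and fact tables defined earlier. The sketch below shows one way to do that with surrogate keys in plain Python. It is not the project's actual Mage script: the function name, the input field names, and the key assignment scheme are assumptions made for illustration.

```python
def build_host_dim_and_price_fact(rows):
    """Split flat listing rows into DIM_HOST-like and FACT_PRICE-like records.

    Each distinct host_id gets one surrogate host_key; fact rows carry only
    that key plus the numeric measures.
    """
    host_keys = {}      # host_id -> surrogate host_key
    dim_host = []
    fact_price = []
    for row in rows:
        host_id = row["host_id"]
        if host_id not in host_keys:
            host_keys[host_id] = len(host_keys)   # next unused key
            dim_host.append({
                "host_key": host_keys[host_id],
                "host_id": host_id,
                "host_name": row["host_name"],
            })
        fact_price.append({
            "host_key": host_keys[host_id],
            "price": int(row["price"]),
            "service_fee": int(row["service_fee"]),
        })
    return dim_host, fact_price


# Made-up sample rows; real input would come from the Data Loader block.
sample = [
    {"host_id": 501, "host_name": "Ann", "price": "120", "service_fee": "24"},
    {"host_id": 502, "host_name": "Bo", "price": "80", "service_fee": "16"},
    {"host_id": 501, "host_name": "Ann", "price": "200", "service_fee": "40"},
]
dim_host, fact_price = build_host_dim_and_price_fact(sample)
print(len(dim_host), [f["host_key"] for f in fact_price])  # 2 [0, 1, 0]
```

In a Mage pipeline, logic like this would usually sit inside the function that the block template marks with `@transformer`, and would most likely operate on a pandas DataFrame rather than a list of dictionaries.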
Next, the ETL process is triggered and the data is exported into the BigQuery tables.
The following query is executed in the BigQuery console to create a table for further analysis:
CREATE OR REPLACE TABLE airbnb_reviews.TBL_REVIEWS_LOOKER AS(
SELECT
h.host_name,
h.isVerified,
l.listing_name,
l.neighbourhood,
l.room_type,
l.isInstantBookable,
n.latitude,
n.longitude,
n.country,
d.last_review_month,
d.last_review_year,
p.price,
p.service_fee,
r.review_count,
r.review_rate_number,
i.minimum_nights,
i.host_listings_count,
i.days_available
FROM airbnb_reviews.DIM_HOST h
LEFT JOIN airbnb_reviews.FACT_PRICE p
ON h.host_key = p.host_key
LEFT JOIN airbnb_reviews.FACT_REVIEWS r
ON p.host_key = r.host_key
LEFT JOIN airbnb_reviews.FACT_LISTING_INFO i
ON r.host_key = i.host_key
LEFT JOIN airbnb_reviews.DIM_LISTING l
ON i.listing_key = l.listing_key
LEFT JOIN airbnb_reviews.DIM_LOCATION n
ON i.location_key = n.location_key
LEFT JOIN airbnb_reviews.DIM_DATE d
ON i.date_key = d.date_key
ORDER BY l.listing_id
);
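To see what this flattening query does, the sketch below reproduces a reduced version of it on an in-memory SQLite database. SQLite stands in for BigQuery here purely so the example runs locally; the tables are cut down to a few columns and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DIM_HOST (host_key INTEGER PRIMARY KEY, host_name TEXT);
    CREATE TABLE DIM_LISTING (listing_key INTEGER PRIMARY KEY, room_type TEXT);
    CREATE TABLE FACT_PRICE (listing_key INTEGER, host_key INTEGER, price INTEGER);

    INSERT INTO DIM_HOST VALUES (0, 'Ann'), (1, 'Bo');
    INSERT INTO DIM_LISTING VALUES (0, 'Private room'), (1, 'Entire home/apt');
    INSERT INTO FACT_PRICE VALUES (0, 0, 120), (1, 1, 80);
""")

# Same pattern as the query above: start from a dimension, attach a fact
# table, then use the fact's keys to pull in the other dimensions.
rows = conn.execute("""
    SELECT h.host_name, l.room_type, p.price
    FROM DIM_HOST h
    LEFT JOIN FACT_PRICE p ON h.host_key = p.host_key
    LEFT JOIN DIM_LISTING l ON p.listing_key = l.listing_key
    ORDER BY l.listing_key
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Note that joining the fact tables to each other on host_key alone, as the full query does, yields one row per listing only if each host_key occurs once in every fact table; if a host_key repeats, the joins multiply rows.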
Next, a dashboard is created using the BigQuery table from the previous step:
![Screenshot 2023-08-26 at 11 32 01 PM](https://private-user-images.githubusercontent.com/100070155/263490784-fad637d8-dead-4bb5-841d-eda72781fae1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjM1MzM5ODksIm5iZiI6MTcyMzUzMzY4OSwicGF0aCI6Ii8xMDAwNzAxNTUvMjYzNDkwNzg0LWZhZDYzN2Q4LWRlYWQtNGJiNS04NDFkLWVkYTcyNzgxZmFlMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODEzJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgxM1QwNzIxMjlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xZmIxOThiM2Y1N2U4OTYxODEzMzVmYjdkM2M5MmIwZjBmZDNmYTMzYTU3YzQ3ZDhkZTE3NTI3NWRlMjU1Mzc2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.KY-I_T_bCB4FwM6dtBVVxb5JWCsNbTyaXUmIgMUkW2Q)