AdityaMathur08 / PEI


High Level Design:

Data Ingestion: The first step is to ingest the data from the e-commerce platform into Databricks. This could be done using Databricks' built-in connectors for various data sources.
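A minimal sketch of what this ingestion step could look like in a Databricks notebook (the bucket path, file name and format here are illustrative assumptions, not the project's actual configuration):

```python
from pyspark.sql import SparkSession

# Obtain the active Spark session (provided automatically in Databricks notebooks).
spark = SparkSession.builder.appName("pei-ingestion").getOrCreate()

# Read a raw CSV export from the e-commerce platform into a Spark DataFrame.
# The S3 path below is a placeholder.
raw_orders = (
    spark.read
    .option("header", "true")        # first row holds column names
    .option("inferSchema", "true")   # let Spark infer column types
    .csv("s3://example-raw-bucket/orders/orders.csv")
)
raw_orders.printSchema()
```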

Data Processing: Once the data is ingested, it needs to be processed. This could involve cleaning the data, transforming it into a suitable format, and extracting useful features. This can be done using Databricks' support for Apache Spark, which allows for distributed data processing.
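Continuing the ingestion sketch above, a few typical cleaning steps might look like this (column names are assumptions):

```python
from pyspark.sql import functions as F

# Illustrative cleaning/transformation of the ingested orders DataFrame.
clean_orders = (
    raw_orders
    .dropDuplicates(["order_id"])                        # remove duplicate orders
    .filter(F.col("order_id").isNotNull())               # drop rows without a key
    .withColumn("order_date", F.to_date("order_date"))   # normalise date strings
    .withColumn("profit", F.col("profit").cast("double"))
)
```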

Data Storage: After processing, the data should be stored in a suitable format for analysis. This could be a data lake, a data warehouse, or a database, depending on the needs of the business. Databricks has built-in support for various storage options, including Azure Data Lake Storage, Amazon S3, and others.
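A sketch of persisting the processed data from the previous step as a Delta table (the database and table names are placeholders):

```python
# Persist the processed data as a Delta table so it can be queried later.
(
    clean_orders.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("pei.orders_processed")   # placeholder database.table name
)
```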

Data Analysis: The processed data can then be analyzed to provide valuable insights. This could involve creating dashboards, running ad-hoc queries, or building machine learning models. Databricks provides a unified analytics platform that allows for all of these.

Data Visualization: Finally, the results of the analysis should be visualized in a way that is easy for business stakeholders to understand. This could be done using Databricks' built-in visualization tools, or by integrating with other tools like Power BI or Tableau.

(High-level design diagram)

Running a Databricks cluster on AWS - setup

Tasks:

01. Create raw tables for each source dataset:

The extraction strategy is to use a Python script to upload the raw data to an S3 object store. Link to the code
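A minimal sketch of such an upload (the library choice, bucket name, object key and local path are assumptions; see the linked script for the project's actual implementation):

```python
import boto3

# Upload one raw source file to the S3 object store.
# Bucket, key and local file path are placeholders, not the project's values.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="data/raw/Customers.csv",      # local source file
    Bucket="example-raw-data-bucket",       # target S3 bucket
    Key="raw/customers/Customers.csv",      # object key inside the bucket
)
```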

This is followed by creating external tables in the Spark catalog. Link to the code
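An illustrative version of the external-table DDL, run from a notebook (table name, schema, format, options and location are assumptions, not the project's exact DDL):

```python
# Register the uploaded S3 files as an external (unmanaged) table in the Spark
# catalog; the data stays in S3 and only the metadata lives in the catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw.customers
    USING CSV
    OPTIONS (header = 'true', inferSchema = 'true')
    LOCATION 's3://example-raw-data-bucket/raw/customers/'
""")
```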

External tables were created for the three source datasets:

- Customers
- Products
- Orders

02. Create an enriched table for customers and products

Link to the code

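The exact transformations live in the linked notebook; a rough sketch of what the customer enrichment could look like (schema, column names and cleaning rules are assumptions based on the task description):

```python
from pyspark.sql import functions as F

# Illustrative enrichment of the raw customers table: trim and standardise text
# fields, drop duplicate keys, and persist to an 'enriched' schema.
customers_enriched = (
    spark.table("raw.customers")
    .withColumn("customer_name", F.initcap(F.trim("customer_name")))
    .withColumn("country", F.upper(F.trim("country")))
    .dropDuplicates(["customer_id"])
)

(
    customers_enriched.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("enriched.customers")
)
```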

03. Create an enriched table which has

The curated table is created from the enriched data in step 2 using the PySpark DataFrame API, as sketched below. It contains:

- Order information
    - Profit rounded to 2 decimal places
- Customer name and country
- Product category and sub-category

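A sketch of the join behind this curated table (table and column names are assumptions, not the project's actual schema):

```python
from pyspark.sql import functions as F

# Join enriched orders with customers and products, keep the requested
# attributes, and round profit to 2 decimal places.
orders = spark.table("enriched.orders")
customers = spark.table("enriched.customers")
products = spark.table("enriched.products")

curated_orders = (
    orders
    .join(customers, "customer_id", "left")
    .join(products, "product_id", "left")
    .select(
        "order_id",
        "order_date",
        F.round("profit", 2).alias("profit"),   # profit rounded to 2 dp
        "customer_name",
        "country",
        "category",
        "sub_category",
    )
)

curated_orders.write.format("delta").mode("overwrite") \
    .saveAsTable("curated.orders_enriched")
```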

04. Create an aggregate table that shows profit by

For this task, I use the curated table created in step 3 and build the aggregate tables with the PySpark DataFrame API; see the sketch after the list below.

Link to the code

- Year
- Product Category
- Product Sub-Category
- Customer
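Two of these groupings sketched with the DataFrame API (table and column names are assumptions; the other groupings follow the same pattern):

```python
from pyspark.sql import functions as F

curated = spark.table("curated.orders_enriched")

# Profit by year.
profit_by_year = (
    curated
    .withColumn("year", F.year("order_date"))
    .groupBy("year")
    .agg(F.round(F.sum("profit"), 2).alias("total_profit"))
)

# Profit by product category.
profit_by_category = (
    curated
    .groupBy("category")
    .agg(F.round(F.sum("profit"), 2).alias("total_profit"))
)
```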

05. Using SQL output the following aggregates

The queries are run in the SQL endpoint editor against the table created in the previous task; an example query is shown after the list below.

- Profit by Year
- Profit by Year + Product Category
- Profit by Customer
- Profit by Customer + Year
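One of these aggregates written as Spark SQL; it is wrapped in spark.sql here so it can also be run from a notebook, but the same statement works in the SQL endpoint editor (table and column names are assumptions):

```python
# Profit by customer and year.
profit_by_customer_year = spark.sql("""
    SELECT customer_name,
           year(order_date)      AS order_year,
           round(sum(profit), 2) AS total_profit
    FROM   curated.orders_enriched
    GROUP  BY customer_name, year(order_date)
    ORDER  BY total_profit DESC
""")
profit_by_customer_year.show()
```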

Adding Test Suite:

For the modules used in the extraction and transformation pipelines, tests that verify the reliability of the functions are included in the project's pipeline_test module.

Link to the code module

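An illustrative pytest-style test in the spirit of the pipeline_test module (the tested behaviour, the fixture and the schema are assumptions, not the project's actual tests):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local Spark session for unit tests; a Databricks cluster is not required.
    return (
        SparkSession.builder
        .master("local[1]")
        .appName("pipeline-tests")
        .getOrCreate()
    )


def test_profit_is_rounded_to_two_decimals(spark):
    # Build a tiny input frame and check the rounding rule used in step 3.
    df = spark.createDataFrame([(1, 12.3456)], ["order_id", "profit"])
    result = df.selectExpr("order_id", "round(profit, 2) AS profit").collect()
    assert result[0]["profit"] == pytest.approx(12.35)
```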

Conclusion:

This project showcases a comprehensive data pipeline implemented in Databricks, covering data ingestion, processing, storage, analysis, and visualization. By leveraging Databricks' powerful capabilities and integration with various data sources and tools, it enables efficient and scalable data workflows for actionable insights and decision-making. The adherence to Test-Driven Development (TDD) principles ensures the reliability and correctness of the implemented functionality, providing confidence in the project's outcomes.


Languages

Jupyter Notebook 52.1%, Python 47.9%