gupta-aayushkr / NYC-Taxi-Project

An end-to-end data engineering project using Azure Synapse Analytics to analyze and transform NYC taxi data.

Home Page: https://medium.com/@aayushkumargupta/nyc-taxi-project-using-azure-synapse-analytics-68f9a8b220c1

NYC Taxi Data Analysis Project

Overview

This data engineering project uses Azure Synapse Analytics to analyze and transform New York City taxi data released by nyc.gov. It covers the entire data processing pipeline, from raw data ingestion to reporting, using Azure Synapse Analytics, Apache Spark, and Power BI.

Table of Contents

  1. NYC Taxi Data Overview
  2. Project Resources and Architecture
  3. Architecture Explanation & Project Working
  4. Synapse Pipeline Orchestration
  5. Power BI Reporting
  6. Budget Analysis For Project
  7. Conclusion and Future Enhancements

NYC Taxi Data Overview

The project analyzes NYC taxi data, which covers three taxi types (Yellow Taxis, Green Taxis, and For-Hire Vehicles) and treats the city's boroughs as distinct administrative divisions. Seven main tables feed the project: Trip Data, Taxi Zone, Calendar, Trip Type, Payment Type, Rate Code, and Vendor.

Project Resources and Architecture

The project relies on Azure Synapse Analytics, using Azure Data Lake Storage, Serverless SQL Pool, Apache Spark, Synapse Pipelines, and Power BI. These services integrate with a Synapse workspace, which keeps storage, transformation, orchestration, and reporting for a big data project manageable from one place.

Architecture Explanation & Project Working

Detailed explanations cover loading raw data into the Raw Container, transforming data from the Bronze Schema to the Silver Schema, and then transforming it into the Gold Schema. The project uses External Tables, CETAS (CREATE EXTERNAL TABLE AS SELECT), and Stored Procedures for data processing.
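As a rough illustration of the Bronze-to-Silver step, the Python helper below only renders the text of a CETAS statement of the general shape a serverless SQL pool accepts. It is a sketch, not the repository's code: the table, folder, data source, and file format names (`silver.taxi_zone`, `silver/taxi_zone`, `nyc_taxi_src`, `parquet_file_format`, `bronze.taxi_zone`) are hypothetical.

```python
def render_cetas(target_table: str, location: str, data_source: str,
                 file_format: str, select_sql: str) -> str:
    """Return the text of a CETAS statement built from caller-supplied names."""
    return (
        f"CREATE EXTERNAL TABLE {target_table}\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f")\n"
        f"AS\n"
        f"{select_sql};"
    )


# Hypothetical names, for illustration only (not taken from the repository).
statement = render_cetas(
    target_table="silver.taxi_zone",
    location="silver/taxi_zone",
    data_source="nyc_taxi_src",
    file_format="parquet_file_format",
    select_sql="SELECT * FROM bronze.taxi_zone",
)
print(statement)
```

In this sketch the SELECT reads from a Bronze external table and the result is written to the Silver folder, where it becomes queryable as a new external table. The same shape would apply to the Silver-to-Gold step with different names.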

Synapse Pipeline Orchestration

Synapse pipelines orchestrate the stages of data processing: creating Silver External Tables, partitioning Trip Data in the Silver Schema, and transforming data from the Silver Schema to the Gold Schema. Triggers schedule these pipelines.
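The Trip Data partitioning step can be pictured as a loop over partitions, with one write per partition folder. The sketch below assumes partitioning by year and month, which the text above does not specify; the base folder `silver/trip_data` and the date range are likewise hypothetical.

```python
from datetime import date


def month_partitions(start: date, end: date) -> list[tuple[int, int]]:
    """List (year, month) pairs from start's month through end's month, inclusive."""
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        pairs.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs


def partition_location(base: str, year: int, month: int) -> str:
    """Build a folder path for one partition, with a zero-padded month."""
    return f"{base}/year={year}/month={month:02d}"


# Hypothetical date range and base folder, for illustration only.
locations = [
    partition_location("silver/trip_data", y, m)
    for y, m in month_partitions(date(2020, 11, 1), date(2021, 2, 1))
]
for loc in locations:
    print(loc)
```

In a pipeline, each generated location could be passed as a parameter to a stored procedure that writes that partition, with a loop activity iterating over the list.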

Power BI Reporting

Power BI reports cover the payment methods passengers use and taxi demand in NYC, giving a basis for decision-making.
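The kind of aggregation behind such reports can be sketched in a few lines of Python. The rows below are made-up sample data, and the column layout (trip date, borough, payment type) is an assumption rather than the project's actual Gold Schema.

```python
from collections import Counter

# Made-up sample rows: (trip_date, borough, payment_type).
trips = [
    ("2020-01-01", "Manhattan", "Credit card"),
    ("2020-01-01", "Manhattan", "Cash"),
    ("2020-01-01", "Brooklyn", "Credit card"),
    ("2020-01-02", "Manhattan", "Credit card"),
]

# Trips per payment method (the payment-methods report).
payment_counts = Counter(payment for _, _, payment in trips)

# Trips per borough (a simple proxy for taxi demand).
demand_by_borough = Counter(borough for _, borough, _ in trips)

print(payment_counts.most_common())
print(demand_by_borough.most_common())
```

In the project itself this grouping would happen in the Gold Schema or in Power BI; the sketch only shows the shape of the two measures.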

Budget Analysis For Project

A budget analysis section outlines the incurred costs, primarily from Azure Synapse Analytics Workspace, SQL Serverless Pool, Pipelines, and Storage.

Conclusion and Future Enhancements

The project successfully demonstrates the capabilities of Azure Synapse Analytics. Future enhancements could include cost optimization, real-time data processing, machine learning integration, data governance, security measures, and improved Power BI dashboards.

Feel free to explore the GitHub repository and use the code as a reference or starting point for similar projects.

Languages

Language: TSQL 100.0%