SidEnigma / covid19_23

Reporting of COVID-19 cases, deaths, and hospital and ICU occupancies through week 47 of 2023, using a data engineering pipeline in Azure and visualized in Power BI

Home Page: https://www.linkedin.com/pulse/data-engineering-pipeline-azure-factory-covid-19-project-sahoo-er4he/



Data Engineering Pipeline in Azure

Reporting of COVID-19 cases, deaths, and hospital and ICU occupancies through week 47 of 2023, using a data engineering pipeline in Azure and visualized in Power BI.

You can find the tutorial for this project in this article: https://www.linkedin.com/pulse/data-engineering-pipeline-azure-factory-covid-19-project-sahoo-er4he/

Project architecture

The data flow begins by retrieving files from a GitHub link and storing them in a data lake storage account. Next, the data is transformed in Azure Data Factory using data flows and pipelines. After processing, it is loaded into an Azure SQL Database. The final step is using Power BI to visualize the data read from the SQL database.
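A minimal Python sketch of the first ingestion step is shown below, purely for illustration: it downloads a raw file from GitHub and writes it into the data lake. The storage account name, container, and file paths are assumptions, and the project itself performs this step with an ADF Copy Activity rather than custom code.

```python
# Illustrative sketch of the ingestion step: GitHub -> Data Lake (ADLS Gen2).
# Account name, container and file paths are placeholders, not the project's real names.
import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

SOURCE_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/cases_and_deaths.csv"  # placeholder

def ingest_to_data_lake() -> None:
    # Download the raw CSV from GitHub.
    response = requests.get(SOURCE_URL, timeout=60)
    response.raise_for_status()

    # Connect to the ADLS Gen2 account (hierarchical namespace enabled).
    service = DataLakeServiceClient(
        account_url="https://covidreportingsa.dfs.core.windows.net",  # placeholder account
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("raw")  # placeholder container

    # Upload the file into the raw zone of the data lake.
    file_client = file_system.get_file_client("ecdc/cases_and_deaths.csv")
    file_client.upload_data(response.content, overwrite=True)

if __name__ == "__main__":
    ingest_to_data_lake()
```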

Data pipeline

Data Flow for Copy Activity

The Copy Activity copies the 'cases_and_deaths' file from GitHub to the data lake storage. A linked service needs to be created for each kind of source or sink, and every file also needs a dataset that specifies the type and schema of the incoming or outgoing data, so two datasets and two linked services must be set up for this task. A similar process is followed for the 'hospital_admissions' file, but with an automated pipeline. The copy activity for the 'cases_and_deaths' file is shown in the picture below.

(Figure: Copy_Activity2)
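Once the linked services, datasets, and pipeline are in place, the copy pipeline can also be triggered and monitored from Python with the azure-mgmt-datafactory SDK. The sketch below is illustrative only; the subscription, resource group, factory, and pipeline names are placeholders.

```python
# Minimal sketch: trigger the copy pipeline and poll its run status.
# All names below are placeholders, not the project's real resource names.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "covid-reporting-rg"
FACTORY_NAME = "covid-reporting-adf"
PIPELINE_NAME = "pl_ingest_cases_and_deaths"  # hypothetical pipeline name

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off a run of the copy pipeline.
run = adf.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME, parameters={})

# Poll until the run finishes.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status

print(f"Pipeline run {run.run_id} finished with status: {status}")
```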

Data Pipelines for transformation steps

Two pipelines are created: one for the 'cases_and_deaths' file and one for 'hospital_admissions'. Each pipeline comprises various activity components such as copy, filter, select, split, pivot, lookup, join, sort, and sink, each serving a specific purpose. To move the transformed files into the SQL database we built earlier, a copy activity is added after each data flow, and both data flows are connected into their pipelines.

(Figures: cases_deaths_dataflow, hospital_admission_dataflow)
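For reference, the kind of reshaping these data flows perform can be approximated in pandas. The sketch below is not the project's actual transformation logic, and the column names are assumed from ECDC-style source files; it only illustrates the filter, select, pivot, and sort steps mentioned above.

```python
# Approximate pandas equivalent of the ADF dataflow steps (filter, select, pivot, sort).
# Column names are assumptions; the real transformation runs inside Azure Data Factory.
import pandas as pd

raw = pd.read_csv("cases_and_deaths.csv")

# Filter: keep only European rows with a valid country code.
europe = raw[(raw["continent"] == "Europe") & raw["country_code"].notna()]

# Select: keep only the columns needed downstream.
selected = europe[["country", "country_code", "population",
                   "indicator", "daily_count", "date"]]

# Pivot: one row per country/date, one column per indicator (cases vs. deaths).
pivoted = selected.pivot_table(
    index=["country", "country_code", "population", "date"],
    columns="indicator",
    values="daily_count",
    aggfunc="sum",
).reset_index()

# Sort: order the output before writing it to the sink.
result = pivoted.sort_values(["country", "date"])
result.to_csv("processed_cases_and_deaths.csv", index=False)
```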

Dashboard Visualization

(Figure: dashboard1)
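Power BI reads directly from the Azure SQL Database, so a quick way to sanity-check what the dashboard will see is to query the database yourself. A minimal sketch with pyodbc, where the server, database, credentials, and table name are all placeholders:

```python
# Minimal sketch: inspect the data Power BI will visualize, straight from Azure SQL.
# Server, database, credentials and table name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=covid-reporting-server.database.windows.net;"
    "DATABASE=covid-db;"
    "UID=<sql-admin>;PWD=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 5 * FROM dbo.cases_and_deaths ORDER BY [date] DESC")
for row in cursor.fetchall():
    print(row)
conn.close()
```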

Instructions on how to create the resources (for Azure's free one-month trial); a programmatic alternative is sketched after this list:

  • All resources can be created from the Azure portal's homepage.
  • A resource group is used to group the resources related to a particular project. You can select any region.
  • Create a data factory, give it a unique name, and select V2. Click Next, enable 'Configure Git later', and then click 'Review + create'.
  • A storage account stores the raw and processed files. Select your subscription and resource group, give the account a unique name (you can always append your own name), and select 'Locally-redundant storage' for redundancy, as it is the most cost-effective option. Enable the hierarchical namespace to tell Azure that this is a Data Lake. You can keep the defaults for everything else.
  • Now, create an SQL Database for storing the results. Name the database and click 'Create new' for Server. In the window that pops up, provide a unique server name, select 'Use SQL authentication' as the authentication method, and create credentials for the server login.
  • You don't need an SQL elastic pool; choose 'Development' as your workload environment for cost-effectiveness. Click 'Configure database' and change the Compute + storage option to 'Basic'. Also choose 'Locally-redundant backup storage' for backup storage redundancy.
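If you prefer not to click through the portal, the resource group and Data Lake storage account can also be created with the Azure management SDKs for Python. This is only a sketch under assumed names and region; the data factory and SQL database would be created analogously.

```python
# Sketch: create the resource group and Data Lake (ADLS Gen2) storage account via the SDKs.
# Subscription, names and region are placeholders, not the project's real values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "covid-reporting-rg"
LOCATION = "westeurope"
STORAGE_ACCOUNT = "covidreportingsa"  # must be globally unique

credential = DefaultAzureCredential()

# Resource group to hold all of the project's resources.
resource_client = ResourceManagementClient(credential, SUBSCRIPTION_ID)
resource_client.resource_groups.create_or_update(RESOURCE_GROUP, {"location": LOCATION})

# Storage account with hierarchical namespace enabled (i.e. a Data Lake Gen2),
# using locally redundant storage as the cheapest redundancy option.
storage_client = StorageManagementClient(credential, SUBSCRIPTION_ID)
poller = storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    STORAGE_ACCOUNT,
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location=LOCATION,
        is_hns_enabled=True,
    ),
)
print("Created storage account:", poller.result().name)
```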


Languages

Python 85.8%, TSQL 14.2%