krmsmsk / ETL

Trying to understand ETL

Let's Talk About ETL

In this article, I will briefly explain ETL.

ETL (Extract, Transform, Load):

ETL stands for extract, transform, and load. First, the ETL tool extracts the data from different sources. Next, it changes the data according to specific business rules, formats, and conventions. For example, the ETL tool could convert all transaction values to US dollars, even if the sales were in other currencies. Finally, it loads the transformed data to the target system, such as a data warehouse.

ELT (Extract, Load, Transform):

ELT stands for extract, load, and transform. It is similar to ETL, except that it swaps the last two steps of the sequence. All the data is first loaded into a system that accepts raw or unstructured data, such as a data lake, and is transformed only when required. ELT relies on the processing power and scalability of cloud platforms to run those transformations inside the target system, which can support near-real-time data integration.
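
To make the difference in ordering concrete, here is a minimal ELT sketch in Python. It is only an illustration under my own assumptions: an in-memory SQLite database stands in for the data lake or warehouse, and the table name `raw_sales`, the view name `clean_sales`, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a data lake or warehouse

# Load: store the raw values untouched, even the inconsistent ones.
conn.execute("CREATE TABLE raw_sales (order_id INTEGER, amount TEXT, currency TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [(1, "100.00", "usd"), (2, "50.00", "EUR")],  # hypothetical sample rows
)

# Transform: defined inside the target system and evaluated only when queried.
conn.execute(
    """
    CREATE VIEW clean_sales AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(currency) AS currency
    FROM raw_sales
    """
)

clean = conn.execute("SELECT * FROM clean_sales ORDER BY order_id").fetchall()
print(clean)
```

The raw table keeps the data exactly as it arrived, and the cleanup happens only when someone reads from the view.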

Why is ETL important?

Today, we obtain data from many different sources. ETL processes bring that data into one consistent place and format, which improves its quality and makes it easier to integrate, understand, and interpret.

How does ETL work?

Extract, transform, and load (ETL) works by moving data from the source system to the destination system at periodic intervals. The ETL process works in three steps:

1 - Extract the relevant data from the source database
2 - Transform the data so that it is better suited for analytics
3 - Load the data into the target database
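
The three steps above can be sketched in a few lines of Python. This is a minimal example under assumptions of my own: the CSV text stands in for the source database, an in-memory SQLite database stands in for the target, and the exchange rates are made-up numbers used only to show a currency conversion to US dollars.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read a file or query a database.
SOURCE_CSV = """order_id,amount,currency
1,100.00,USD
2,50.00,EUR
3,2000.00,JPY
"""

# Illustrative exchange rates to US dollars (made-up numbers, not real rates).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "JPY": 0.007}


def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert every amount to US dollars."""
    return [
        (
            int(row["order_id"]),
            round(float(row["amount"]) * RATES_TO_USD[row["currency"]], 2),
        )
        for row in rows
    ]


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_usd (order_id INTEGER, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales_usd VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(SOURCE_CSV)), conn)

rows = conn.execute(
    "SELECT order_id, amount_usd FROM sales_usd ORDER BY order_id"
).fetchall()
print(rows)
```

Running the same three functions on a schedule is one simple way to get the "periodic intervals" behavior described above.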

Basic data transformation

- Data cleansing
Data cleansing removes errors and maps source data to the target data format. For example, you can map empty data fields to the number 0, map the data value “Parent” to “P,” or map “Child” to “C.”

- Data deduplication
Deduplication in data cleansing identifies and removes duplicate records.

- Data format revision
Format revision converts data, such as character sets, measurement units, and date/time values, into a consistent format. For example, a food company might have different recipe databases with ingredients measured in kilograms and pounds. The ETL process can convert everything to a single unit, such as pounds.
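
The three basic transformations above can be sketched together in Python. This is only an illustration: the field names (`id`, `relation`, `amount`, `unit`, `date`), the sample rows, and the `DD/MM/YYYY` input date format are assumptions I made for the example.

```python
from datetime import datetime

CODE_MAP = {"Parent": "P", "Child": "C"}  # value mapping from the cleansing example
KG_TO_LB = 2.20462  # approximate pounds per kilogram


def cleanse(record):
    """Data cleansing: map empty fields to 0 and long labels to short codes."""
    cleaned = dict(record)
    if cleaned["amount"] in (None, ""):
        cleaned["amount"] = 0
    cleaned["relation"] = CODE_MAP.get(cleaned["relation"], cleaned["relation"])
    return cleaned


def deduplicate(records, key_fields):
    """Data deduplication: keep the first record seen for each key."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def revise_format(record):
    """Format revision: express amounts in pounds and dates as YYYY-MM-DD."""
    revised = dict(record)
    if revised["unit"] == "kg":
        revised["amount"] = round(revised["amount"] * KG_TO_LB, 2)
        revised["unit"] = "lb"
    # Assumes the source writes dates as DD/MM/YYYY.
    revised["date"] = datetime.strptime(revised["date"], "%d/%m/%Y").date().isoformat()
    return revised


# Hypothetical source rows: one exact duplicate, one empty amount, mixed units.
source_rows = [
    {"id": 1, "relation": "Parent", "amount": 2, "unit": "kg", "date": "31/12/2023"},
    {"id": 1, "relation": "Parent", "amount": 2, "unit": "kg", "date": "31/12/2023"},
    {"id": 2, "relation": "Child", "amount": "", "unit": "lb", "date": "01/01/2024"},
]

cleansed = [cleanse(row) for row in source_rows]
unique_rows = deduplicate(cleansed, key_fields=("id",))
result = [revise_format(row) for row in unique_rows]
print(result)
```

Each function copies the record instead of changing it in place, so the source rows stay untouched if a later step fails.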

Advanced data transformation

- Derivation
Derivation applies business rules to your data to calculate new values from existing values. For example, you can derive profit from revenue by subtracting expenses, or calculate the total cost of a purchase by multiplying the price of each item by the number of items ordered.

- Joining
In data preparation, joining links related data from different data sources. For example, you can find the total purchase cost of one item by adding the purchase value from different vendors and storing only the final total in the target system.

- Splitting
You can divide a column or data attribute into multiple columns in the target system. For example, if the data source saves the customer name as “Jane John Doe,” you can split it into a first, middle, and last name.

- Summarization
Summarization improves data quality by reducing a large number of data values into a smaller dataset. For example, customer order invoice values can have many different small amounts. You can summarize the data by adding them up over a given period to build a customer lifetime value (CLV) metric.

- Encryption
You can protect sensitive data, and comply with data protection laws or privacy requirements, by encrypting it before it streams to the target database.
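
The derivation, joining, splitting, and summarization examples above can be sketched in Python as follows. All function names, field names, and input values are hypothetical and chosen only for illustration. The summarization here is a plain total per customer, which is a simplification rather than a full customer lifetime value calculation. Encryption is left out because it would need a third-party library.

```python
from collections import defaultdict


def derive_order_total(price, quantity):
    """Derivation: total cost is the item price times the number ordered."""
    return price * quantity


def derive_profit(revenue, expenses):
    """Derivation: profit is revenue minus expenses."""
    return revenue - expenses


def join_vendor_purchases(*vendor_sources):
    """Joining: add up the purchase value of each item across vendors."""
    totals = defaultdict(float)
    for source in vendor_sources:
        for row in source:
            totals[row["item"]] += row["purchase_value"]
    return dict(totals)


def split_name(full_name):
    """Splitting: break one name column into first, middle, and last name."""
    parts = full_name.split()
    if not parts:
        raise ValueError("full_name is empty")
    return {
        "first_name": parts[0],
        "middle_name": " ".join(parts[1:-1]),
        "last_name": parts[-1] if len(parts) > 1 else "",
    }


def summarize_invoices(invoices):
    """Summarization: total invoice value per customer over the given rows."""
    totals = defaultdict(float)
    for customer_id, amount in invoices:
        totals[customer_id] += amount
    return dict(totals)


# Hypothetical inputs for each transformation.
order_total = derive_order_total(price=2.5, quantity=4)
profit = derive_profit(revenue=1000.0, expenses=750.0)
item_totals = join_vendor_purchases(
    [{"item": "flour", "purchase_value": 120.0}],
    [{"item": "flour", "purchase_value": 80.0}, {"item": "sugar", "purchase_value": 40.0}],
)
name_columns = split_name("Jane John Doe")
customer_totals = summarize_invoices([("c1", 10.0), ("c2", 5.0), ("c1", 2.5)])
```

Splitting on whitespace only works for simple names like the one in the example; real customer names usually need more careful rules.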

٭ Let's make a small application with SSIS (SQL Server Integration Services). I will build only a simple example to show how the system works. Even so, it will take some time to explain.

At first, the Control Flow pane welcomes us. Most of our work happens in the Data Flow pane, so let's go there.

The Toolbox contains the tools necessary for our ETL operations.

Let's start by adding an Excel Source for our extraction and double-clicking it.
Then we need to click New... to create a new Excel connection manager.

After we click New..., this dialog opens. Here we select the location of our Excel file.

After we click OK, we choose the Excel sheet to process.

Then we can preview our data by clicking the Preview button.

If we have more sources, such as a text file, we can add them too. I accidentally selected a Destination in this image. I then changed it to a Source.

If we don't need to transform the data, we can load it directly into our database or any other destination.
If we do need to transform it, we use one of the transformation tools in the Toolbox.

I used a Conditional Split to split our data with specific filters.

When I opened the Conditional Split editor, I saw a Columns folder and some ready-made functions that are provided automatically.
Then I opened the Columns folder to build the filters.
I selected the City column and defined two filters on it.

Then I loaded the filtered data into two different destinations.
Rows where the city is Bothell are loaded into a table I created in the database.
Rows where the city is Dallas are loaded into an Excel file that I created.
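
For readers who think more easily in code, the routing logic of this step can be sketched in Python. This is not how SSIS implements the Conditional Split; it is only a small analogy. The function name, the output names, and the sample rows are hypothetical.

```python
def conditional_split(rows, conditions):
    """Send each row to the first output whose condition matches.

    Rows that match no condition go to a "default" output.
    """
    outputs = {name: [] for name in conditions}
    outputs["default"] = []
    for row in rows:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
                break
        else:
            # No condition matched this row.
            outputs["default"].append(row)
    return outputs


# Hypothetical rows standing in for the Excel sheet.
rows = [
    {"Name": "A", "City": "Bothell"},
    {"Name": "B", "City": "Dallas"},
    {"Name": "C", "City": "Seattle"},
]

outputs = conditional_split(
    rows,
    {
        "to_database": lambda row: row["City"] == "Bothell",
        "to_excel": lambda row: row["City"] == "Dallas",
    },
)
print(outputs)
```

Each row goes to exactly one output, so a row that matches no filter is kept in the default output instead of being lost.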

★ In this article, I tried to explain the concept of ETL with SSIS in a simple way. My goal is to help you understand these concepts more easily. I hope it was educational and instructive for you. ★
