deva-246 / DataCleaning-Excel-PowerQueryEditor

DataCleaning-Excel-PowerQueryEditor

Microsoft Excel is one of the most widely used tools for handling and analyzing data. At the same time, one tiny mistake in the data can cause headaches. Simple errors such as stray spaces, value errors, inconsistent formats, and duplicates often escape notice.

Imagine the chaos when you handle large volumes of information. Keeping your data clean and organized goes a long way toward working accurately and efficiently.

Data cleaning

Data cleaning, data cleansing, or data scrubbing is the act of first identifying any issues or bad data, then systematically correcting these issues. If the data is unfixable, you will need to remove the bad elements to properly clean your data.
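The identify, correct, or remove cycle described above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the field names `name` and `age`, and the helpers `find_issues` and `clean` are all hypothetical and not part of this repository.

```python
# Hypothetical sample rows; the field names are assumptions for illustration.
rows = [
    {"name": "Alice", "age": "34"},
    {"name": "  Bob ", "age": "41"},      # fixable: stray spaces
    {"name": "Carol", "age": "unknown"},  # unfixable: age is not a number
]


def find_issues(row):
    """Step 1: identify problems in a single row."""
    issues = []
    if row["name"] != row["name"].strip():
        issues.append("stray_spaces")
    if not row["age"].strip().isdigit():
        issues.append("bad_age")
    return issues


def clean(rows):
    """Step 2: correct what can be fixed, remove what cannot."""
    cleaned, removed = [], []
    for row in rows:
        if "bad_age" in find_issues(row):
            removed.append(row)  # cannot be repaired, so it is dropped
            continue
        cleaned.append({"name": row["name"].strip(), "age": int(row["age"])})
    return cleaned, removed


cleaned, removed = clean(rows)
print(cleaned)
print(removed)
```

Keeping the removed rows in a separate list, rather than discarding them silently, makes it easier to review what was dropped.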

Unclean data normally comes as a result of human error, scraping data, or combining data from multiple sources. Multichannel data is now the norm, so inconsistencies across different data sets are to be expected.

You have to clean this bad data before you start analyzing it, especially if you will be running it through machine learning models.

Why? Because bad data can produce misleading or incorrect insights. If you rely on those insights to make important business decisions, the consequences could be devastating.

It can also be costly. At the very least, working with bad data is a massive waste of time.

It’s useful here to think of the commonly used phrase: “garbage in, garbage out”.

This is a great way to illustrate that if you use data that is bad or unclean, you are likely to get results that are also bad. If you put good data in, you’re likely to get good results.

Data cleaning using Excel

Microsoft Excel is part of the Microsoft Office suite of software products. It is a spreadsheet application that not only stores data in tabular form (rows and columns) but also offers calculation functions, graphing tools, pivot tables, and much more. Before you can do any analysis with a dataset, it should be correct, consistent, and complete.

Here are 8 effective data cleaning techniques:

  1. Remove duplicates

  2. Remove irrelevant data

  3. Standardize capitalization

  4. Convert data type

  5. Clear formatting

  6. Fix errors

  7. Language translation

  8. Handle missing values
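Several of these techniques can be illustrated outside Excel with a minimal Python sketch. It covers only four of them (standardizing capitalization, converting data types, handling missing values, and removing duplicates). The sample data, the column names `city` and `sales`, and the choice to fill a missing value with 0 are assumptions made for the example.

```python
# Hypothetical raw rows; column names and values are assumptions.
raw = [
    {"city": "london", "sales": "100"},
    {"city": "LONDON", "sales": "100"},  # duplicate once capitalization is fixed
    {"city": "paris", "sales": ""},      # missing value
    {"city": "Berlin", "sales": "250"},
]


def clean_rows(rows, default_sales=0):
    seen = set()
    result = []
    for row in rows:
        # Standardize capitalization (and trim stray spaces).
        city = row["city"].strip().title()
        # Convert the text to a number; fill a missing value with a default.
        text = row["sales"].strip()
        sales = int(text) if text else default_sales
        # Remove duplicates, comparing rows after standardization.
        key = (city, sales)
        if key in seen:
            continue
        seen.add(key)
        result.append({"city": city, "sales": sales})
    return result


cleaned = clean_rows(raw)
print(cleaned)
```

Duplicates are checked after standardization on purpose: "london" and "LONDON" only match once both have been converted to "London".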

Power Query in Excel

Power Query is an Excel tool used to import data from different sources, transform (change) it as required, and return a refined dataset to the workbook. Every change made to the data is recorded and saved as a step. Later, whenever the data source is updated, the same steps are reapplied automatically with a click of the “Refresh” button.

Power Query in Excel performs extract, transform, and load (ETL) operations on a dataset. All transformations (steps or changes) applied to the data are collectively known as a query. By performing these transformations, the data is said to be shaped.
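The idea of a query as an ordered list of recorded steps that is replayed on refresh can be sketched in Python. This is a toy model and not how Power Query is implemented (Power Query records its steps in its own formula language). The `Query` class, its methods, and the sample data are hypothetical.

```python
class Query:
    """Toy stand-in for a query: an ordered list of recorded steps."""

    def __init__(self, source):
        self.source = source  # callable that returns the current source rows
        self.steps = []       # (name, function) pairs, in the order applied

    def add_step(self, name, func):
        self.steps.append((name, func))
        return self

    def refresh(self):
        """Re-read the source and replay every recorded step in order."""
        rows = self.source()
        for _name, func in self.steps:
            rows = func(rows)
        return rows


data = ["  apple", "Banana ", "apple"]

query = (
    Query(lambda: list(data))
    .add_step("Trimmed Text", lambda rows: [r.strip() for r in rows])
    .add_step("Lowercased Text", lambda rows: [r.lower() for r in rows])
    .add_step("Removed Duplicates", lambda rows: list(dict.fromkeys(rows)))
)

first = query.refresh()

# The source changes; refreshing applies the same steps to the new data.
data.append("CHERRY ")
second = query.refresh()

print(first)
print(second)
```

Because the steps are stored rather than the results, nothing has to be redone by hand when the source grows.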

The major advantage of Power Query in Excel is that it is a fast and efficient way of working with large datasets. It is also reusable, as the same query can be applied again to a new dataset. Moreover, with just a few clicks, one can have access to cleansed and sorted data.

Power Query can be installed as an add-in in Excel 2010 and 2013. In Excel 2016 and later versions, Power Query is a built-in Excel feature. It can be accessed from the “Get Data” drop-down (in the “Get & Transform Data” group) on the Data tab of Excel.

To use Power Query in Excel, the following steps need to be performed:

Import data: Import data from the different sources. The data source can be a text file, an Excel workbook, a web page, a PDF, and so on. With Power Query, one can work with data from a wide range of sources, sizes, and shapes.

Transform data: Change, sort and shape data as per the requirements. For instance, one can delete or insert a row and/or column, replace a missing value, delete a duplicate entry, filter a column, and so on. These changes are recorded as a query in the sequence in which they are applied to the data.

Consolidate data: Consolidate or combine the data from the different sources. Once integrated, a consolidated dataset can be generated. The merging and appending of queries are carried out at this stage.

Load data: Load the data to a worksheet once it has been transformed and consolidated. Loading the data returns the output to the workbook. The output can be in the form of a table, a pivot chart, or a pivot table. Prior to loading, one can preview the data to ensure it is on the right track.
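The four stages above (import, transform, consolidate, load) can be mirrored in a short Python sketch using only the standard library. It is an analogy, not Power Query itself. The in-memory CSV text, the `NAMES` lookup table, and every function name are assumptions made for the example. Here "append" stacks tables with the same columns, while "merge" adds a column by matching on a key.

```python
import csv
import io

# Hypothetical sources: two in-memory CSV "files" and a lookup table.
JAN_CSV = "id,amount\n1,10\n2,\n2,\n"
FEB_CSV = "id,amount\n3,30\n"
NAMES = {"1": "Ann", "2": "Ben", "3": "Cy"}


def import_data(text):
    """Import: read rows from a source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop duplicate rows, fill missing amounts, convert types."""
    out, seen = [], set()
    for row in rows:
        key = (row["id"], row["amount"])
        if key in seen:
            continue
        seen.add(key)
        out.append({"id": row["id"], "amount": int(row["amount"] or 0)})
    return out


def append_queries(*tables):
    """Consolidate (append): stack tables that share the same columns."""
    return [row for table in tables for row in table]


def merge_names(rows, names):
    """Consolidate (merge): add a column by looking up each row's id."""
    return [{**row, "name": names.get(row["id"])} for row in rows]


def load(rows):
    """Load: write the final table out, here as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "amount", "name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


jan = transform(import_data(JAN_CSV))
feb = transform(import_data(FEB_CSV))
combined = merge_names(append_queries(jan, feb), NAMES)
output = load(combined)
print(output)
```

Each stage is a separate function so that the same transformation can be reused on every source before the results are combined.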