Data Analysis using Python

Data analysts and data scientists use Python for data analysis because the language is easy to understand, scalable, and flexible. In addition, it has an extensive collection of libraries for numerical computation, data manipulation, and data visualization. In this project I will show the skills I learned in using Python for data analysis. I will start with the different libraries, including hands-on practice in Google Colab. Next, I will demonstrate my data cleaning skills, and once the data has been cleaned I will show some data visualizations.

[Part 1 NumPy] NumPy is one of the most important libraries for data analysis. It is a numeric computing library that performs mathematical operations efficiently. Computers store every number in binary (1s and 0s), and a general-purpose Python number can take up far more memory than the value itself needs. NumPy lets you choose the data type of an array, and therefore how many bits each element uses. NumPy provides a few kinds of objects (an object is a collection of data together with the methods that operate on it). You can store data in lists or in arrays. An array is a container that holds a fixed number of items, all of the same type. In my practice I learned how to create arrays, index arrays, slice arrays, and determine the size, shape, and dimension of arrays. This was quite fun and I hope you enjoy reading my code.
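The array operations listed above can be sketched roughly as follows. The values and variable names are made up for illustration and are not taken from my notebook.

```python
import numpy as np

# Create a 2-D array and choose an 8-bit integer type explicitly
# (illustrative values; int8 holds whole numbers from -128 to 127).
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int8)

first_row = a[0]      # indexing: the first row
element = a[1, 2]     # indexing: row 1, column 2
middle = a[:, 1:3]    # slicing: every row, columns 1 and 2

print(a.size)      # total number of elements
print(a.shape)     # (rows, columns)
print(a.ndim)      # number of dimensions
print(a.itemsize)  # bytes used by each element
print(a.nbytes)    # bytes used by the whole array

# The same values stored as 64-bit integers take eight times the space.
wide = a.astype(np.int64)
print(wide.nbytes)
```

Picking the smallest data type that still fits the values is one way to keep large arrays from using more memory than they need.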

[Part 2 Pandas] Pandas is an open-source Python package that is widely used for data science, data analysis, and machine learning tasks. It is built on top of another package, NumPy, which provides support for multi-dimensional arrays. Pandas is important because it lets us analyze large data sets and draw conclusions based on statistical theory. It can also clean messy data sets and make them readable and relevant, and relevant data is essential in data science.
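As a small sketch of what cleaning a messy data set can look like, the example below trims stray whitespace, drops a duplicate row, and fills a missing value. The table is invented for illustration, and filling with the column mean is just one possible choice, not necessarily the right one for real data.

```python
import numpy as np
import pandas as pd

# A tiny, made-up "messy" table: stray spaces, a repeated row, a missing age.
raw = pd.DataFrame({
    "name": [" Ann", "Bob ", "Bob ", "Cara"],
    "age": [29, 35, 35, np.nan],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip()                 # remove stray whitespace
clean = clean.drop_duplicates()                           # drop the repeated row
clean["age"] = clean["age"].fillna(clean["age"].mean())   # fill the gap with the mean
clean = clean.reset_index(drop=True)                      # renumber the rows

print(clean)
```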

[Part 3 Pandas DataFrame] A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is laid out as a table. A DataFrame consists of three principal components: the data, the rows, and the columns. You can create a DataFrame from lists, a dictionary, or a list of dictionaries, among other things, but most of the time your data will already exist, either as a CSV file or in a SQL database. With a DataFrame you can perform indexing, find data types, determine the size and shape, convert data types, and even return Boolean values.
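The DataFrame operations mentioned above might look something like this. The items and prices are placeholder values chosen only to keep the example short.

```python
import pandas as pd

# Build a DataFrame from a dictionary of columns (placeholder values).
df = pd.DataFrame({
    "item": ["pen", "book", "lamp"],
    "price": ["1.50", "12.00", "30.00"],  # stored as text on purpose
})

# The same kind of table can be built from a list of dictionaries.
records = [
    {"item": "pen", "price": "1.50"},
    {"item": "book", "price": "12.00"},
]
df_from_records = pd.DataFrame(records)

print(df.dtypes)   # data type of each column
print(df.shape)    # (rows, columns)
print(df.size)     # total number of cells

# Convert the price column from text to floating-point numbers.
df["price"] = df["price"].astype(float)

second_item = df.loc[1, "item"]   # label-based indexing
first_row = df.iloc[0]            # position-based indexing

# A comparison returns a Boolean Series, which can be used to filter rows.
is_expensive = df["price"] > 10
expensive = df[is_expensive]
print(expensive)
```

For data that already lives in a file, `pd.read_csv` takes a path to a CSV file and returns a DataFrame in the same form.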

[Part 4 Visualization using Matplotlib] Data visualization allows us to present data in the form of graphs and charts. Matplotlib is a Python library for creating these visualizations. Like pandas, it is built on NumPy arrays, and it can produce graphs, charts, and plots. It is a very flexible library, but one disadvantage is that it can require more code to produce a plot. Pyplot is a Matplotlib module that provides a MATLAB-like interface. MATLAB is a multi-paradigm programming language and numeric computing environment developed by MathWorks. After importing the module (import matplotlib.pyplot as plt), we can create line graphs (including curves such as parabolas), histograms, bar graphs, and scatter plots.
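Here is a minimal sketch of the pyplot interface drawing three of the plot types named above. The data is generated on the spot, and the output file name `example_plots.png` is an arbitrary choice for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generated data for illustration only.
x = np.linspace(-3, 3, 50)
y = x ** 2                              # points on a parabola
rng = np.random.default_rng(0)
values = rng.normal(size=200)           # random sample for the histogram

plt.figure(figsize=(12, 3))

plt.subplot(1, 3, 1)                    # first of three side-by-side panels
plt.plot(x, y)
plt.title("Line plot of y = x^2")

plt.subplot(1, 3, 2)
plt.hist(values, bins=10)
plt.title("Histogram")

plt.subplot(1, 3, 3)
plt.scatter(x, y)
plt.title("Scatter plot")

plt.tight_layout()
plt.savefig("example_plots.png")        # in Colab the figure also renders inline
```

Outside a notebook, calling `plt.show()` instead of `plt.savefig` opens the figure in a window.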

Syelding / Python-Data-Analysis
