bgizaa / using-pandas-library-for-visualization

Using pandas for data visualization

The dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months.

Before starting, install Anaconda Navigator and the Jupyter Notebook editor, and download the Supermarket dataset (read below as 'supermarket.csv').

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from pandas import DataFrame
import seaborn as sns
df = pd.read_csv('supermarket.csv') # read the file to visualize.
df.head(5)
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Date Time Payment cogs gross margin percentage gross income Rating
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 26.1415 548.9715 1/5/2019 13:08 Ewallet 522.83 4.761905 26.1415 9.1
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3.8200 80.2200 3/8/2019 10:29 Cash 76.40 4.761905 3.8200 9.6
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 16.2155 340.5255 3/3/2019 13:23 Credit card 324.31 4.761905 16.2155 7.4
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76 4.761905 23.2880 8.4
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 30.2085 634.3785 2/8/2019 10:37 Ewallet 604.17 4.761905 30.2085 5.3
df.Branch.unique()
array(['A', 'C', 'B'], dtype=object)

To look specifically at female customers in Mandalay who bought a quantity greater than 1:

new_df = df.loc[(df['City'] == 'Mandalay') & (df['Gender'] == 'Female') & (df['Quantity'] > 1)]
new_df.head(5)
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Date Time Payment cogs gross margin percentage gross income Rating
9 692-92-5582 B Mandalay Member Female Food and beverages 54.84 3 8.226 172.746 2/20/2019 13:27 Credit card 164.52 4.761905 8.226 5.9
10 351-62-0822 B Mandalay Member Female Fashion accessories 14.48 4 2.896 60.816 2/6/2019 18:07 Ewallet 57.92 4.761905 2.896 4.5
15 299-46-1805 B Mandalay Member Female Sports and travel 93.72 6 28.116 590.436 1/15/2019 16:19 Cash 562.32 4.761905 28.116 4.5
19 319-50-3348 B Mandalay Normal Female Home and lifestyle 40.30 2 4.030 84.630 3/11/2019 15:30 Ewallet 80.60 4.761905 4.030 4.4
28 145-94-9061 B Mandalay Normal Female Food and beverages 88.36 5 22.090 463.890 1/25/2019 19:48 Cash 441.80 4.761905 22.090 9.6
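
The same row filter can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. The sketch below is only an illustration: the `toy` frame and its values are made up, and only the column names are borrowed from the supermarket file.

```python
import pandas as pd

# Toy data with the same column names as the supermarket file (values are made up).
toy = pd.DataFrame({
    "City": ["Mandalay", "Yangon", "Mandalay", "Mandalay"],
    "Gender": ["Female", "Female", "Male", "Female"],
    "Quantity": [3, 5, 2, 1],
})

# Boolean-mask form, as used above.
mask = (toy["City"] == "Mandalay") & (toy["Gender"] == "Female") & (toy["Quantity"] > 1)
by_mask = toy.loc[mask]

# Equivalent query-string form.
by_query = toy.query("City == 'Mandalay' and Gender == 'Female' and Quantity > 1")

print(by_query)
```

Both forms return the same rows; `query` is mostly a readability choice when several conditions are combined.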

To concentrate on specific columns and plot them

In this case we will plot how many transactions were recorded at each branch

import seaborn as sns
sns.countplot(x="Branch", data = df).set_title("Branch Frequency") 
Text(0.5, 1.0, 'Branch Frequency')

In this section we will create a new column that holds, for every row, how often that row's product line appears in the dataset

df['Frequency of Product'] = df['Product line'].map(df['Product line'].value_counts())
df.head(4)
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Date Time Payment cogs gross margin percentage gross income Rating Frequency of Product
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 26.1415 548.9715 1/5/2019 13:08 Ewallet 522.83 4.761905 26.1415 9.1 152
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3.8200 80.2200 3/8/2019 10:29 Cash 76.40 4.761905 3.8200 9.6 170
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 16.2155 340.5255 3/3/2019 13:23 Credit card 324.31 4.761905 16.2155 7.4 160
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76 4.761905 23.2880 8.4 152
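
As a possible alternative to `map` plus `value_counts`, the per-row frequency can be computed with a grouped `transform`. This is a minimal sketch on a made-up `toy` frame (the product names and the column names `freq_map` and `freq_transform` are illustrative only); it checks that both routes give the same column.

```python
import pandas as pd

# Made-up product lines; only the column name mirrors the supermarket file.
toy = pd.DataFrame({"Product line": ["Food", "Fashion", "Food", "Food", "Fashion"]})

# Approach from the text: look up each row's product line in the counts table.
toy["freq_map"] = toy["Product line"].map(toy["Product line"].value_counts())

# Alternative: compute each group's size and broadcast it back to every row.
toy["freq_transform"] = toy.groupby("Product line")["Product line"].transform("size")

print(toy)
```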

In this section we will create a dataframe derived from the main dataframe 'df' that keeps only the product line and its frequency

forbranchfreq = df[['Product line','Frequency of Product']]
forbranchfreq.head(5)
Product line Frequency of Product
0 Health and beauty 152
1 Electronic accessories 170
2 Home and lifestyle 160
3 Health and beauty 152
4 Sports and travel 166

In this section we will group by product line so that each product line appears only once, which we can usefully plot later, and sort the result from most to least frequent

# Every row of a product line already holds that line's count, so max() returns it once per group.
# Using sum() here would add the count to itself count times and report the count squared.
forbranchfreq100 = forbranchfreq.groupby(['Product line']).max()
forbranchfreq120 = forbranchfreq100.sort_values(by=['Frequency of Product'], ascending=False)
forbranchfreq120.head(3)
Frequency of Product
Product line
Fashion accessories 178
Food and beverages 174
Electronic accessories 170
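
A shorter route to the same ranking, sketched here under the assumption that only the counts are needed, is to call `value_counts` on the product line column directly, which skips the helper column and the groupby step. The `toy` frame is made up. The second half shows what happens when a helper column that repeats the count on every row is summed per group.

```python
import pandas as pd

# Made-up product lines; only the column names mirror the supermarket file.
toy = pd.DataFrame(
    {"Product line": ["Food", "Fashion", "Food", "Food", "Fashion", "Health"]}
)

# One entry per product line, already sorted from most to least frequent.
freq = toy["Product line"].value_counts()

# For contrast: a helper column that repeats the count on every row
# sums to count * count per group, not to the count itself.
toy["Frequency of Product"] = toy["Product line"].map(freq)
summed = toy.groupby("Product line")["Frequency of Product"].sum()

print(freq)
print(summed)
```

The ranking is the same either way, but only `freq` holds the actual number of rows per product line.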

In this section we will plot the grouped and sorted dataframe of product lines and their frequencies. From this we see that Fashion accessories come first and Health and beauty products come last.

This can inform supermarket management to put racks of fashion accessories near the entrance of the supermarket, because that product line is purchased most often.

forbranchfreq120.plot(kind="barh", color = 'blue',figsize=(15,10))
plt.xticks(rotation=45);
plt.savefig("figure1.png")

[Figure: horizontal bar chart of Frequency of Product by product line, saved as figure1.png]

Plot the frequency of each payment channel using the seaborn plotting library

sns.countplot(x="Payment", data = df).set_title("Payment Channel Frequency") 
Text(0.5, 1.0, 'Payment Channel Frequency')

[Figure: count plot of Payment, titled "Payment Channel Frequency"]

In this section we will determine the most active times in which customers shop at the supermarket.

We will derive only the hour from the Time column by taking the first two characters of each value

df['STime'] = df['Time'].str[:2]
df.head(5)
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Date Time Payment cogs gross margin percentage gross income Rating Frequency of Product STime
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 26.1415 548.9715 1/5/2019 13:08 Ewallet 522.83 4.761905 26.1415 9.1 152 13
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3.8200 80.2200 3/8/2019 10:29 Cash 76.40 4.761905 3.8200 9.6 170 10
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 16.2155 340.5255 3/3/2019 13:23 Credit card 324.31 4.761905 16.2155 7.4 160 13
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 23.2880 489.0480 1/27/2019 20:33 Ewallet 465.76 4.761905 23.2880 8.4 152 20
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 30.2085 634.3785 2/8/2019 10:37 Ewallet 604.17 4.761905 30.2085 5.3 166 10
fortime = df[['Total', 'STime']]
fortime = fortime.groupby(['STime']).sum()
fortime.head(5)
Total
STime
10 31421.4810
11 30377.3295
12 26065.8825
13 34723.2270
14 30828.3990
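
Slicing the first two characters works here because every time shown is a zero-padded "HH:MM" string with a two-digit hour. As a hedged alternative, the hour can be parsed with `pd.to_datetime`, which returns integers and does not depend on padding. The `toy` frame and the `Hour` column name are illustrative assumptions, and the "9:05" value is made up to show the difference.

```python
import pandas as pd

# Made-up rows; "9:05" is included on purpose to show the padding issue.
toy = pd.DataFrame({
    "Time": ["13:08", "10:29", "9:05", "20:33"],
    "Total": [548.97, 80.22, 10.00, 489.05],
})

# String slicing assumes a two-digit hour; "9:05" would yield "9:".
sliced = toy["Time"].str[:2]

# Parsing with an explicit format gives integer hours regardless of padding.
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M").dt.hour

# Same aggregation as in the text, keyed on the integer hour.
per_hour = toy.groupby("Hour")["Total"].sum()
print(per_hour)
```

Integer hours also sort numerically, which keeps the bars in time order if an hour before 10 ever appears.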

From the bar chart produced below we can observe that the total amount spent is highest at 7 pm (hour 19).

fortime.plot(kind="bar", color = 'red',figsize=(15,10))
plt.xticks(rotation=45);
plt.xlabel("Time")
plt.ylabel("Amount")
plt.savefig("figure2.png")

[Figure: bar chart of Total (y-axis "Amount") by hour (x-axis "Time"), saved as figure2.png]
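
The busiest hour can also be read off programmatically instead of by eye. The sketch below rebuilds only the five hourly totals printed by `fortime.head(5)` above, so it reports the peak within that sample. The name `fortime_sample` is an illustrative assumption, and applying the same calls to the full `fortime` table is left to the reader.

```python
import pandas as pd

# Hourly totals copied from the five rows printed by fortime.head(5).
fortime_sample = pd.DataFrame(
    {"Total": [31421.4810, 30377.3295, 26065.8825, 34723.2270, 30828.3990]},
    index=pd.Index(["10", "11", "12", "13", "14"], name="STime"),
)

# Index label of the row with the largest total.
peak_hour = fortime_sample["Total"].idxmax()

# Full ranking, largest total first.
ranked = fortime_sample.sort_values(by="Total", ascending=False)

print(peak_hour)
print(ranked)
```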

Pivot table of the total amount by customer type and hour of day, with row and column totals

forcustomertype = df[['Customer type','STime','Total']]
forcustomertype = forcustomertype.drop_duplicates()  # drops rows identical in all three columns, which is why 999 rows remain although the index runs to 999
forcustomertype
Customer type STime Total
0 Member 13 548.9715
1 Normal 10 80.2200
2 Normal 13 340.5255
3 Member 20 489.0480
4 Normal 10 634.3785
... ... ... ...
995 Normal 13 42.3675
996 Normal 17 1022.4900
997 Member 13 33.4320
998 Normal 15 69.1110
999 Member 13 649.2990

999 rows × 3 columns

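pivottable = pd.pivot_table(forcustomertype, index=["Customer type"], columns=["STime"], values=["Total"], aggfunc="sum", margins=True, margins_name='Amount', fill_value=0)  # the string "sum" is the form newer pandas versions recommend over np.sum
pivottable = pivottable.style.format("{:,.0f}")  # returns a Styler that displays whole numbers with thousands separators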
pivottable
                Total
STime              10      11      12      13      14      15      16      17      18      19      20   Amount
Customer type
Member         12,267  15,228  13,730  16,007  19,048  18,750  10,601  12,775  11,659  21,058  12,913  164,034
Normal         19,154  15,150  12,336  18,716  11,781  12,240  14,625  11,670  14,371  18,642  10,057  158,743
Amount         31,421  30,377  26,066  34,723  30,828  30,990  25,226  24,445  26,030  39,700  22,970  322,778
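
To make the role of `margins=True` concrete, the sketch below builds the same kind of pivot on a made-up `toy` frame and compares it with a plain groupby followed by `unstack`. The values are illustrative only. The margins row and column (named 'Amount', as in the text) are simply the sums across customer types and across hours.

```python
import pandas as pd

# Made-up transactions; only the column names mirror the supermarket file.
toy = pd.DataFrame({
    "Customer type": ["Member", "Normal", "Member", "Normal", "Member"],
    "STime": ["10", "10", "11", "11", "11"],
    "Total": [100.0, 50.0, 30.0, 20.0, 70.0],
})

# Pivot with totals: the extra 'Amount' row and column hold the sums.
pivot = pd.pivot_table(
    toy, index="Customer type", columns="STime", values="Total",
    aggfunc="sum", margins=True, margins_name="Amount", fill_value=0,
)

# Same cell values without the margins: group on both keys, then unstack the hour.
grouped = toy.groupby(["Customer type", "STime"])["Total"].sum().unstack(fill_value=0)

print(pivot)
print(grouped)
```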

Exporting the pivot table to Excel

pivottable.to_excel("pivottableforcustomertype.xlsx") 
