Winxent / Store-statistic

Data Analysis on store sales and profits dataset using Python

Store-statistic

Goal: Data Analysis on store sales and profits dataset using Python and data visualisation using Tableau.

Introduction

Analyzing store profit and sales is crucial for a business's success because it provides actionable insights into its financial health and performance. This analysis helps in identifying trends, assessing the impact of various factors on profitability, and making informed decisions to optimize sales strategies and operations. It enables businesses to allocate resources effectively, respond to market changes, and ultimately maximize their profitability and sustainability.

Below is the raw dataset file:

https://docs.google.com/spreadsheets/d/1Scs5u9jgiYKOVWj_CVlRHaSSn-RuTEpQ/edit?usp=sharing&ouid=107402225492318840480&rtpof=true&sd=true

Data Cleaning (Python)

Check whether the chosen dataset needs cleaning: go through every field and look for null values or incorrect data types.

Import the pandas library and load the dataset in Google Colaboratory:

https://colab.research.google.com/drive/1SLPRove3m7KUGXSEDrB1DWQCFbnysIyd?usp=sharing

import pandas as pd
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Expert+-+Superstore+-+Master.xlsx - Master.csv")
df.head()

A – Remove duplicate rows

First, count the duplicated rows:

df.duplicated().sum()

Chaining duplicated() and sum() returns the total number of duplicate rows: 2.

View the duplicated rows:

df[df.duplicated(keep=False)]
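The steps above find the duplicates; dropping them is one further call. A minimal sketch on invented toy data (the `toy` frame and its values are assumptions for illustration, not rows from the store dataset):

```python
import pandas as pd

# Toy data standing in for the store dataset (values are invented).
toy = pd.DataFrame({
    "Order ID": ["A1", "A2", "A2", "A3"],
    "Sales": [100, 250, 250, 80],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduped))
```

drop_duplicates keeps the first occurrence by default, which is usually what is wanted when the repeated rows are exact copies.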

B – Handle missing values

df.isna().sum()

No missing values were found.

C – Correct data formats

Check data type

df.dtypes

Because the dataset has many columns, the default display could not show all of them. The options below make pandas display every column and row:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head())

Data type errors:

  1. Profit per Customer should be float
  2. Profit per Order should be float
  3. Sales Forecast should be float
  4. Sales per Customer should be float
  5. Order Date should be a date
  6. Profit should be float
  7. Sales should be int
  8. Ship Date should be a date

On investigation, the numeric columns cannot be converted directly because their values contain commas.

1. String to float datatype

We need to remove the comma first

df['Profit per Customer']=df['Profit per Customer'].str.replace(',','')

Then we change them to float type:

df['Profit per Customer'] = df['Profit per Customer'].astype(float)

Repeat both steps for every column that should be float.
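Rather than repeating the two steps by hand for each column, they can be wrapped in a small helper and applied to a list of columns. This is a sketch on invented values; the helper name `to_float` and the toy frame are assumptions for illustration:

```python
import pandas as pd

# Toy frame with comma-formatted numbers stored as strings (invented values).
toy = pd.DataFrame({
    "Profit per Customer": ["1,234.50", "12.00", "-3,000.25"],
    "Profit": ["10.5", "2,000", "-7"],
})

def to_float(column: pd.Series) -> pd.Series:
    """Remove commas from a string column, then cast it to float."""
    return column.str.replace(",", "", regex=False).astype(float)

# Hypothetical list of columns that should be float.
float_columns = ["Profit per Customer", "Profit"]
toy[float_columns] = toy[float_columns].apply(to_float)

print(toy.dtypes)
```

If the commas are thousands separators, another option is to pass thousands="," to pd.read_csv so the values are parsed as numbers at load time.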

2. String to date datatype

Convert the column with pd.to_datetime:

df['Order Date'] = pd.to_datetime(df['Order Date'])

Repeat for Ship Date
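Both date columns can be converted in one loop, after which date arithmetic becomes possible. The sketch below uses invented dates, and the "%Y-%m-%d" format is an assumption; the real dataset may store dates differently, in which case the format string would need to change:

```python
import pandas as pd

# Toy order and ship dates as strings (invented values, assumed ISO format).
toy = pd.DataFrame({
    "Order Date": ["2015-01-03", "2015-02-14"],
    "Ship Date": ["2015-01-07", "2015-02-16"],
})

for col in ["Order Date", "Ship Date"]:
    # An explicit format avoids ambiguous day/month guessing.
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# With real datetime columns, subtraction yields the days taken to ship.
days_to_ship = (toy["Ship Date"] - toy["Order Date"]).dt.days
print(days_to_ship.tolist())
```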

3. String to int datatype

Remove the commas from the Sales column, then cast it to int:

df['Sales']=df['Sales'].str.replace(',','')
df['Sales'] = df['Sales'].astype(int)

Now all the datatypes are correct

D – Drop irrelevant columns

Number of Records carries no information because its value is always 1, as confirmed with the unique and nunique functions:

df["Number of Records"].nunique()

ans: 1

Since it offers nothing to compare, the column is dropped:

df.drop('Number of Records', axis=1,inplace = True)
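The same nunique check can be run over every column at once to list all constant columns. A minimal sketch on invented toy data (the column values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy frame where one column never varies (invented values).
toy = pd.DataFrame({
    "Number of Records": [1, 1, 1],
    "Sales": [100, 250, 80],
    "Country": ["France", "Spain", "France"],
})

# A column with a single distinct value cannot distinguish any rows.
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]

# drop(columns=...) returns a new frame and leaves the original untouched.
trimmed = toy.drop(columns=constant_columns)

print(constant_columns)
print(trimmed.columns.tolist())
```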

Since each row always has exactly one customer, Sales per Customer, Profit per Customer and Profit per Order can be dropped, as they repeat the Sales and Profit columns.

df.drop(['Sales per Customer','Profit per Customer','Profit per Order'], axis=1,inplace = True)

Sales Forecast and Unit Estimated can also be dropped, as they add nothing significant to the analysis.

df.drop(['Sales Forecast','Unit Estimated'], axis=1,inplace = True)

E – Fix inconsistent data entry

Check inconsistent data entry using unique function

df["City"].unique()
list(df['City'].unique())

After checking the columns one by one, no inconsistent data entries were found.
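Reading long unique() listings by eye can miss entries that differ only in letter case. One way to automate that part of the check is sketched below on invented city names; it only catches case variants, not misspellings:

```python
import pandas as pd

# Toy city column with case variants of the same name (invented values).
cities = pd.Series(["London", "london", "Paris", "LONDON", "Paris"])

# Group the raw spellings by their lower-cased form and count distinct spellings.
normalized = cities.str.lower()
variants = cities.groupby(normalized).nunique()

# Any normalized name with more than one raw spelling is inconsistent.
suspicious = variants[variants > 1].index.tolist()
print(suspicious)
```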

F – Trim whitespaces

No columns contain extra whitespace.
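A whitespace check of this kind can be sketched by comparing each text column with its stripped version. The toy frame and the `text_columns` list below are assumptions for illustration:

```python
import pandas as pd

# Toy frame with leading and trailing spaces in one column (invented values).
toy = pd.DataFrame({
    "City": ["London", " Paris", "Berlin "],
    "Segment": ["Consumer", "Corporate", "Home Office"],
})

# Hypothetical list of text columns to check.
text_columns = ["City", "Segment"]

# Count the values that change when surrounding whitespace is stripped.
padded_counts = {
    col: int((toy[col] != toy[col].str.strip()).sum()) for col in text_columns
}
print(padded_counts)

# Trimming is then a single str.strip() per column.
for col in text_columns:
    toy[col] = toy[col].str.strip()
```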

G – Correct spelling errors

There are no spelling errors

H – Correct numerical errors

There are no numerical errors

Data Analysis (Python)

The dataset has 10000 rows and 22 columns, a mix of categorical and numerical types, which describe the store's orders. Actual versus scheduled days to ship, shipping status and shipping mode describe delivery. Segment, category and sub-category classify the product. Product name, customer name and manufacturer serve as identifiers. City, Country, Region and State give the location. Order ID and order date track the order. Profit, Profit Ratio, Sales and Quantity are the measures used for analysis.

Key performance indicators: Sales, Profit and Profit Ratio can be used to analyse the performance of the store. The analysis can be broken down by product category, location, shipping mode and so on, and also by manufacturer.

A new indicator, Cost, can be derived from Profit Ratio and Sales:

df['Cost']=df['Sales']*(1-(df['Profit Ratio'].str.rstrip('%').astype('float') / 100.0))
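The formula follows from treating Profit Ratio as Profit divided by Sales, so that Cost = Sales - Profit = Sales * (1 - Profit Ratio). That definition is an assumption about how the dataset computes the ratio. A worked sketch on invented numbers:

```python
import pandas as pd

# Toy rows: one profitable order and one loss-making order (invented values).
toy = pd.DataFrame({"Sales": [400, 100], "Profit Ratio": ["25%", "-50%"]})

# Turn "25%" into 0.25 before using it in arithmetic.
ratio = toy["Profit Ratio"].str.rstrip("%").astype(float) / 100.0

# Cost = Sales * (1 - ratio); a negative ratio means cost exceeds sales.
toy["Cost"] = toy["Sales"] * (1 - ratio)

# Sanity check: Sales minus Cost should reproduce the implied profit.
implied_profit = toy["Sales"] - toy["Cost"]
print(toy["Cost"].tolist(), implied_profit.tolist())
```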

Data Exploration

Describing the datasets

df.describe()

Shape and Size of your dataset

In order to have a better data description we usually check the Shape and Size of our dataset along with the general description of datasets such as count, unique values etc.

df.shape

10000 rows and 24 columns after adding 2 new indicators

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head(16))

Investigating the shape and size of the dataset tells us:

  1. The size of the dataset
  2. Which columns are numerical and which are categorical
  3. The main descriptive fields, which give some insight into the data stored in the dataset. Examples: a. Sub-Category b. Manufacturer c. Ship Mode d. Location e. Segment f. Ship Status

Data Aggregation:

Aggregation gives a compact view of the data that makes trends and values easier to understand, and helps generate insight from the characteristics of the data. A store owner might look at sales to decide which products perform better and deserve more focus, or check which products produce a negative profit.
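The question of which products produce a negative profit can be answered with a grouped sum. A minimal sketch on invented figures (grouping by Sub-Category is an illustrative choice; Product Name would work the same way):

```python
import pandas as pd

# Toy orders (invented values).
toy = pd.DataFrame({
    "Sub-Category": ["Tables", "Tables", "Phones", "Paper", "Paper"],
    "Profit": [-120.0, 20.0, 300.0, 15.0, 5.0],
})

# Total profit per sub-category, lowest first.
profit_by_subcategory = toy.groupby("Sub-Category")["Profit"].sum().sort_values()

# Sub-categories whose total profit is below zero.
loss_makers = profit_by_subcategory[profit_by_subcategory < 0].index.tolist()
print(loss_makers)
```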

1. Check the types of columns we have

df.dtypes

2. For Categorical Columns we check the count of unique entries and their values

df['Ship Status'].unique()

array(['Shipped Early', 'Shipped Late', 'Shipped On Time'], dtype=object)

df['Category'].unique()

array(['Office Supplies', 'Technology', 'Furniture'], dtype=object)

df['Country'].unique()

array(['United Kingdom', 'France', 'Germany', 'Italy', 'Spain', 'Netherlands', 'Sweden', 'Belgium', 'Austria', 'Ireland', 'Portugal', 'Finland', 'Denmark', 'Norway', 'Switzerland'], dtype=object)

df['Discount'].unique()

array(['0%', '10%', '15%', '40%', '50%', '60%', '35%', '20%', '30%', '45%', '70%', '65%', '80%', '85%'], dtype=object)

df['Region'].unique()

array(['North', 'Central', 'South'], dtype=object)

df['Segment'].unique()

array(['Corporate', 'Consumer', 'Home Office'], dtype=object)

df['Ship Mode'].unique()

array(['Standard Class', 'Second Class', 'Same Day', 'First Class'], dtype=object)

df['Sub-Category'].unique()

array(['Storage', 'Accessories', 'Labels', 'Phones', 'Copiers', 'Appliances', 'Fasteners', 'Art', 'Envelopes', 'Binders', 'Bookcases', 'Machines', 'Paper', 'Supplies', 'Tables', 'Chairs', 'Furnishings'], dtype=object)

Some categorical columns have too many unique values to be displayed.

For the numerical columns, let's find the minimum and maximum values:

df.describe()

Summary Statistics:

Summary statistics condense a large dataset into a few insightful numbers that capture the gist of the data. A business owner can use them to understand the general situation, make decisions and monitor changes. They describe the trends in the dataset concisely using measures of location and spread.

There are 3 types of summary statistics:

1. Measures of location:

Mean (average of the data set), median (middle value of the data set) and mode (most frequent value).

df.mode()

To count how often the modal values occur:

df[df["Product Name"] == 'Eldon File Cart, Single Width'].count()

count: 30

df[df["Profit"] == 0].count()

Count: 293
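The mode and its frequency can also be read in one step with value_counts, which avoids hard-coding the modal value into a filter. A sketch on invented product names:

```python
import pandas as pd

# Toy product column (invented values).
products = pd.Series(
    ["Cart", "Stapler", "Cart", "Lamp", "Cart"], name="Product Name"
)

# value_counts tallies each distinct value, most frequent first.
counts = products.value_counts()

top_product = counts.idxmax()   # the mode
top_count = int(counts.max())   # how many times it occurs
print(top_product, top_count)
```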

df.var(numeric_only=True)

Group by

Creating table with grouped information

df.groupby('Ship Status').mean(numeric_only=True)

Orders shipped early have the highest average profit ratio.

df.groupby('Category').mean(numeric_only=True)

Technology has the highest average profit, but Office Supplies has the highest average profit ratio.

df.groupby('Country').mean(numeric_only=True)

By country, Switzerland has the highest average profit and profit ratio.

df.groupby('Discount').mean(numeric_only=True)

The higher the discount rate, the lower the earnings; in some cases a loss is incurred.

df.groupby('Segment').mean(numeric_only=True)

Corporate has slightly higher profit and sales, but Home Office has a higher profit ratio.
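A group mean alone hides how many orders sit behind it. As a sketch, named aggregation can report the average, the total and the order count side by side; the toy figures and the output column names (`avg_profit`, `total_profit`, `orders`) are invented for illustration:

```python
import pandas as pd

# Toy orders (invented values).
toy = pd.DataFrame({
    "Category": ["Technology", "Technology", "Furniture", "Furniture"],
    "Sales": [400, 200, 100, 300],
    "Profit": [80, 40, -10, 30],
})

# Named aggregation: each keyword becomes an output column.
summary = toy.groupby("Category").agg(
    avg_profit=("Profit", "mean"),
    total_profit=("Profit", "sum"),
    orders=("Profit", "size"),
)

# Rank the categories by average profit, best first.
summary = summary.sort_values("avg_profit", ascending=False)
print(summary)
```

Selecting the columns explicitly also sidesteps errors from non-numeric columns, which a bare mean() on the whole frame can raise in recent pandas versions.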

2. Measures of spread:

These describe the spread and distribution of the data and help find outliers.

Distribution in Quantiles

Here we are calculating the Quartiles by dividing the dataset into 4 groups

df.quantile([0.25, 0.5, 0.75, 1], axis=0, numeric_only=True)

IQR of Dataset

Compute the interquartile range (IQR) of Profit:

q1, q3 = df["Profit"].quantile([0.25,0.75])
iqr = q3- q1
iqr

47.25

lower_min = q1 - (1.5*iqr)
upper_max = q3 + (1.5*iqr)
print("Lower expected min of IQR = ", lower_min)
print("Upper expected max of IQR = ", upper_max)

Lower expected min of IQR = -69.875 Upper expected max of IQR = 119.125

df[(df["Profit"] < lower_min) | (df["Profit"] > upper_max)].count()

There are 1718 outliers with profit greater than 119.125.
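Because the filter combines both fences, it can be useful to count the low and high outliers separately. A minimal sketch of the same 1.5 * IQR rule on invented profit values:

```python
import pandas as pd

# Toy profit values with one extreme loss and one extreme gain (invented).
profit = pd.Series([-200, 5, 10, 12, 14, 20, 30, 45, 60, 500], name="Profit")

# Quartiles and interquartile range.
q1, q3 = profit.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Count each side on its own instead of one combined filter.
n_low = int((profit < lower).sum())
n_high = int((profit > upper).sum())
print(lower, upper, n_low, n_high)
```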

Summary:

Based on the mean, median, mode and quartiles of Profit, the average profit per order is 37.29 with a median of 14; a mean well above the median indicates positive (right) skew. The mode shows that 293 of the 10000 orders have zero profit, so the business owner should investigate why zero-profit orders are so frequent. 1718 orders are outliers with profit higher than 119.125, so it can be concluded that a subset of orders performed exceptionally well.

From the group-by tables, a few insights can be drawn. Shipping early earned the store more profit. Technology products have the highest average profit, so the business should focus more on technology. In addition, selling to Switzerland earns more profit than selling to other countries. Lastly, high discounts cause the business to lose money, so the business owner should reconsider these discount rates.
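The direction of skew can be checked directly rather than inferred by eye. As a sketch on invented values, a mean that sits above the median together with a positive skew() result points to a long right tail:

```python
import pandas as pd

# Toy profit values with a long right tail (invented).
profit = pd.Series([0, 5, 10, 14, 20, 30, 300])

mean = profit.mean()
median = profit.median()

# Series.skew() returns the sample skewness; positive means right-skewed.
skewness = profit.skew()
print(round(mean, 2), median, skewness > 0)
```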

3. Graphics and charts:

Interactive dashboard.

A data story was created in Tableau using this dataset:

https://public.tableau.com/views/StorePerformance_16971065556690/Story1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link

1. Overview:

The first page gives an overview of the dataset: its size, shape and averages, so the reader can understand the range of the data.

2. Comparison:

The second page shows profit by product and the profit generated by different states.

3. Performance:

The last page shows profit performance by factors such as segment, product category and discount. A histogram is plotted to show the distribution of profit.

Conclusion:

The data analysis shows that profit grows as the years progress, so business is increasing for this store. However, some countries generated negative profit; the business owner needs to take note and act to limit the losses. One cause of the losses may be that the discount rate is too high in some cases, so the business owner should reconsider these discount rates. Sorting the data shows that the Hoover stove generates the most profit and England is the area with the highest profit. The store owner should focus more on these high-profit areas and products. Sorting by segment, category and discount also gives an overview and shows which selection generated the most profit; for example, the Consumer segment produces the highest profit. The profit histogram shows that the most common outcome is an order with 0 profit. The store owner needs to reduce these occurrences and investigate the reasons behind them.

Using the interactive visual dashboard, we can sort products, segments and categories to compare the performance of the business based on sales, profit, area and shipment type. By combining data analysis and data visualisation, the store owner can make better decisions and get the most out of the business.
