Hypothesis Testing


SQL data extraction and A/B testing to inform business decision-making for a global retailer.

Main Files:

student.ipynb - Notebook includes data, methodologies, and modeling around hypothesis testing

Mod_3_preso.pdf - presentation summarizing findings for a non-technical audience

Additional Files Blog Post: Categorical Data

SQL Database Table Schematic Layout:

import pandas as pd
import functions as fn

## Uncomment the line below to see the source code for the imported functions
import sqlite3
connect = sqlite3.connect('Northwind_small.sqlite')
cur = connect.cursor()

List of Tables:

cur.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
df_tables = pd.DataFrame(cur.fetchall(), columns=['Table'])
0 Employee
1 Category
2 Customer
3 Shipper
4 Supplier
5 Order
6 Product
7 OrderDetail
8 CustomerCustomerDemo
9 CustomerDemographic
10 Region
11 Territory
12 EmployeeTerritory
cur.execute("""SELECT * FROM `Order`;""")
df = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
Id CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
0 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye Reims Western Europe 51100 France
1 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany
2 10250 HANAR 4 2012-07-08 2012-08-05 2012-07-12 2 65.83 Hanari Carnes Rua do Paço, 67 Rio de Janeiro South America 05454-876 Brazil
3 10251 VICTE 3 2012-07-08 2012-08-05 2012-07-15 1 41.34 Victuailles en stock 2, rue du Commerce Lyon Western Europe 69004 France
4 10252 SUPRD 4 2012-07-09 2012-08-06 2012-07-11 2 51.30 Suprêmes délices Boulevard Tirou, 255 Charleroi Western Europe B-6000 Belgium
## looking at dates to get an idea of timeframe
count                     830
unique                    480
top       2014-02-26 00:00:00
freq                        6
first     2012-07-04 00:00:00
last      2014-05-06 00:00:00
Name: OrderDate, dtype: object

Timespan is just under 2 years: July 4, 2012 - May 6, 2014
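As a quick check on the timeframe, the span can be computed directly from the `first` and `last` values in the OrderDate summary. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the notebook.

```python
from datetime import date

# 'first' and 'last' values taken from the OrderDate summary above
first_order = date(2012, 7, 4)
last_order = date(2014, 5, 6)

# Length of the order history in days and (approximately) in years
span_days = (last_order - first_order).days
span_years = span_days / 365.25
print(f"{span_days} days, about {span_years:.1f} years")
```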


Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

  • $H_0$: Discount amount has no statistically significant effect on the quantity of a product in an order.
  • $H_1$: Discounts have a statistically significant effect on the quantity of a product in an order.
  • $H_{1a}$: Certain discount values have a greater effect than others.

Importing and inspecting data from OrderDetail table:

This table from the Northwind database includes order information on:

1) Quantity
2) Discount 

Once imported, the data will be grouped by discount level, and the mean quantity sold at each level will be compared to evaluate whether the differences are statistically significant. The null hypothesis will be rejected at a significance level of 0.05, meaning a 5% probability of reporting a false positive (a Type I error).
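As a rough illustration of the grouping step, the sketch below uses a small made-up frame in place of the real OrderDetail data. The values and the `toy` and `summary` names are assumptions for demonstration only.

```python
import pandas as pd

# Toy stand-in for the OrderDetail table (values are illustrative only)
toy = pd.DataFrame({
    "Quantity": [12, 10, 5, 40, 30, 20],
    "Discount": [0.0, 0.0, 0.0, 0.05, 0.05, 0.10],
})

# Sample size and mean quantity for each discount level
summary = toy.groupby("Discount")["Quantity"].agg(["count", "mean"])
print(summary)
```

The per-group means are what the later tests compare; the counts matter because very small groups cannot be tested reliably.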

cur.execute("""SELECT * FROM OrderDetail;""")
df = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
Id OrderId ProductId UnitPrice Quantity Discount
0 10248/11 10248 11 14.0 12 0.0
1 10248/42 10248 42 9.8 10 0.0
2 10248/72 10248 72 34.8 5 0.0
3 10249/14 10249 14 18.6 9 0.0
4 10249/51 10249 51 42.4 40 0.0
#exploring dataset
specs = df.describe()
OrderId ProductId UnitPrice Quantity Discount
count 2155.000000 2155.000000 2155.000000 2155.000000 2155.000000
mean 10659.375870 40.793039 26.218520 23.812993 0.056167
std 241.378032 22.159019 29.827418 19.022047 0.083450
min 10248.000000 1.000000 2.000000 1.000000 0.000000
25% 10451.000000 22.000000 12.000000 10.000000 0.000000
50% 10657.000000 41.000000 18.400000 20.000000 0.000000
75% 10862.500000 60.000000 32.000000 30.000000 0.100000
max 11077.000000 77.000000 263.500000 130.000000 0.250000

General Observation: Unit prices range from $2.00 to $263.50, with an average price of $26.22 and an average quantity of about 24 items per order line. There are no null values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2155 entries, 0 to 2154
Data columns (total 6 columns):
Id           2155 non-null object
OrderId      2155 non-null int64
ProductId    2155 non-null int64
UnitPrice    2155 non-null float64
Quantity     2155 non-null int64
Discount     2155 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 101.1+ KB

Initial Visual Inspection:

From this dataset of 2,155 line items spanning 830 orders, the average quantity ordered is 24 regardless of discount. The minimum ordered is 1 and the maximum is 130, although the middle 50% of quantities fall between 10 and 30.

qty = df['Quantity']
qty_specs = qty.describe()
count    2155.000000
mean       23.812993
std        19.022047
min         1.000000
25%        10.000000
50%        20.000000
75%        30.000000
max       130.000000
Name: Quantity, dtype: float64
qty_mu = round(qty_specs['mean'],0)
n = len(df)
print(f'The average quantity ordered from this sample is : {qty_mu}')
print(f'There are {n} orders in this sample.')
The average quantity ordered from this sample is : 24.0
There are 2155 orders in this sample.
d =list(df['Discount'].unique())
#Discounts are as follows:
[0.0, 0.15, 0.05, 0.2, 0.25, 0.1, 0.02, 0.03, 0.04, 0.06, 0.01]
import matplotlib.pyplot as plt
import seaborn as sns
plt.axvline(qty_mu, label='Qty Mean', color='purple')
<Figure size 640x480 with 1 Axes>
import seaborn as sns
from ipywidgets import interact
def plt_discounts(d=d):
    plt.axvline(qty_mu, color='purple')
dfa = df.groupby('Discount').count()['Quantity']
0.00    1317
0.01       1
0.02       2
0.03       3
0.04       1
0.05     185
0.06       1
0.10     173
0.15     157
0.20     161
0.25     154
Name: Quantity, dtype: int64

The sample sizes for the 1%, 2%, 3%, 4%, and 6% discounts are negligible (fewer than 4 observations each). These groups will be dropped, since they are too small to compare against the other groups.
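One way to express that filter is sketched below. The 15-row cut-off (`MIN_GROUP_SIZE`) and the toy data are assumptions chosen for illustration, not the notebook's actual code.

```python
import pandas as pd

MIN_GROUP_SIZE = 15  # assumed cut-off, matching the n >= 15 rule of thumb

# Toy data: two well-populated discount levels and two sparse ones
toy = pd.DataFrame({
    "Discount": [0.0] * 20 + [0.05] * 16 + [0.01] * 1 + [0.03] * 3,
    "Quantity": list(range(1, 41)),
})

# Attach each row's group size, then keep only rows from large-enough groups
sizes = toy.groupby("Discount")["Quantity"].transform("count")
kept = toy[sizes >= MIN_GROUP_SIZE]
print(sorted(kept["Discount"].unique()))
```

Filtering on group size rather than hard-coding the discount values keeps the step valid if the underlying data changes.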

discs = {}
for disc in df['Discount'].unique():
    discs[disc] = df.groupby('Discount').get_group(disc)['Quantity']
for k,v in discs.items():
oneper = discs.pop(.01)
twoper = discs.pop(.02)
threeper = discs.pop(.03)
fourper = discs.pop(.04)
sixper = discs.pop(.06)
#visualizing distributions
fig, ax = plt.subplots(figsize=(10,5))
for k,v in discs.items():

plt.title('Quantity Distribution')
print('Distributions appear roughly equal,')
Distributions appear roughly equal,


Initial Observations:

Datatype is numeric.

The average quantity ordered in this sample is 24. There are 2,155 order lines in this sample.

Discounts are as follows: 0.0, 0.15, 0.05, 0.2, 0.25, 0.1, 0.02, 0.03, 0.04, 0.06, 0.01

The majority of order lines carry no discount (1,317 of 2,155, or 61%). The remaining discount levels, from most to least frequent, are 5%, 10%, 20%, 15%, and 25%.

For the 1%, 2%, 3%, 4%, and 6% discounts, there was too little data to evaluate their impact. These groups were removed from further testing.

Overall, the distributions appear relatively similar in shape across groups.

Since we are comparing multiple discount levels to inspect their impact on quantity ordered, an ANOVA or Kruskal-Wallis test will be run, depending on which assumptions are met:

Assumptions for ANOVA Testing (see corresponding sections):

  1. No significant outliers

    • On a quick visual inspection, the distribution is skewed and there appear to be some outliers
  2. Equal variance

    • Levene's test demonstrates that the variances are NOT equal
  3. Normality (only required if n < 15)

    • Not required for discounts: 5%, 10%, 15%, 20% and 25% since n > 15

Assumption 1: Outliers

Evaluation and removal via Z-Score testing:

for disc, disc_data in discs.items():
    idx_outs = fn.find_outliers_Z(disc_data)
    print(f'Found {idx_outs.sum()} outliers in Discount Group {disc}')
    discs[disc] = disc_data[~idx_outs]
print('\n All of these outliers were removed')
Found 20 outliers in Discount Group 0.0
Found 2 outliers in Discount Group 0.15
Found 3 outliers in Discount Group 0.05
Found 2 outliers in Discount Group 0.2
Found 3 outliers in Discount Group 0.25
Found 3 outliers in Discount Group 0.1

 All of these outliers were removed
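The source of `fn.find_outliers_Z` is not shown in this README. A minimal sketch of a Z-score filter, assuming a cut-off of 3 standard deviations and hypothetical names, might look like this:

```python
import pandas as pd

def find_outliers_z(data, threshold=3.0):
    """Return a boolean Series that is True where |z-score| exceeds the threshold.

    Assumption: population standard deviation (ddof=0) and a cut-off of 3;
    the notebook's own helper may use different settings.
    """
    z = (data - data.mean()) / data.std(ddof=0)
    return z.abs() > threshold

# Illustrative quantities with one extreme value
quantities = pd.Series([10, 12, 11, 9, 13] * 4 + [130])
mask = find_outliers_z(quantities)
cleaned = quantities[~mask]
print(f"Found {mask.sum()} outlier(s); {len(cleaned)} values kept")
```

Returning a boolean mask mirrors how the loop above uses `~idx_outs` to keep the non-outlier rows.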

Assumption 2: Equal Variance

Levene's test conducted on the cleaned dataset

#preparing data for Levene's testing
import scipy.stats as stats
datad = []
for k,v in discs.items():
    datad.append(v)
stat,p = stats.levene(discs[0.0],discs[0.05],discs[0.1],discs[0.25], discs[0.15], discs[0.20])
print(f'Levene test for equal variance results are {round(p,4)}')
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene test for equal variance results are 0.0003
The groups do NOT have equal variance

Since the groups do not have equal variance, a Kruskal-Wallis test will be conducted.
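The decision rule here can be sketched as a small helper. The `compare_groups` function and the sample lists are hypothetical; only `scipy.stats.levene`, `kruskal`, and `f_oneway` are real library calls.

```python
from scipy import stats

def compare_groups(groups, alpha=0.05):
    """Pick a test from Levene's result: Kruskal-Wallis if variances differ, else ANOVA."""
    _, levene_p = stats.levene(*groups)
    if levene_p < alpha:
        # Unequal variances: fall back to the rank-based Kruskal-Wallis test
        _, p = stats.kruskal(*groups)
        return "kruskal", p
    # Variances look comparable: one-way ANOVA is reasonable
    _, p = stats.f_oneway(*groups)
    return "anova", p

# Illustrative samples: the middle group is far more spread out than the others
unequal = [
    [10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
    [1, 30, 5, 45, 2, 60, 3, 50, 8, 40],
    [20, 21, 22, 23, 24, 20, 21, 22, 23, 24],
]
# Illustrative samples with identical spread and shifted means
equal = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [21, 22, 23, 24, 25]]

print(compare_groups(unequal)[0])
print(compare_groups(equal)[0])
```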

Looking at sample sizes to determine if normality needs to be tested.

Assumption 3: Normality

First, checking sample sizes, since the normality assumption depends on sample size. With 2-9 groups, each group needs n >= 15.

For discounts of 5%, 10%, 15%, 20%, and 25%, n > 15.

n = []
for disc, disc_data in discs.items():
    print(f'There are {len(disc_data)} samples in the {disc} discount group.')
    n.append(len(disc_data) > 15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')
There are 1297 samples in the 0.0 discount group.
There are 155 samples in the 0.15 discount group.
There are 182 samples in the 0.05 discount group.
There are 159 samples in the 0.2 discount group.
There are 151 samples in the 0.25 discount group.
There are 170 samples in the 0.1 discount group.

All samples are >15: Normality Assumption Criterion is met.

Kruskal Testing:

stat, p = stats.kruskal(discs[0.0],discs[0.05],discs[0.1],discs[0.25], discs[0.15], discs[0.20])
print(f"Kruskal test p value: {round(p,4)}")
if p < .05 :
    print(f'Reject the null hypothesis')
else:
    print(f'Fail to reject the null hypothesis')
Kruskal test p value: 0.0
Reject the null hypothesis

ANOVA Testing for comparison

stat, p = stats.f_oneway(*datad)
print(f"ANOVA test p value: {round(p,4)}")
if p < .05 :
    print(f'Reject the null hypothesis')
else:
    print(f'Fail to reject the null hypothesis')
ANOVA test p value: 0.0
Reject the null hypothesis

Post-Hoc Testing:

disc_df = fn.prep_data_for_tukeys(discs)
data group
0 12.0 0.0
1 10.0 0.0
2 5.0 0.0
3 9.0 0.0
4 40.0 0.0
... ... ...
2095 30.0 0.1
2096 77.0 0.1
2098 25.0 0.1
2099 4.0 0.1
2135 2.0 0.1

2114 rows × 2 columns
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2114 entries, 0 to 2135
Data columns (total 2 columns):
data     2114 non-null float64
group    2114 non-null object
dtypes: float64(1), object(1)
memory usage: 49.5+ KB
d =list(disc_df['group'].unique())
['0.0', '0.15', '0.05', '0.2', '0.25', '0.1']
import statsmodels.api as sms
tukey = sms.stats.multicomp.pairwise_tukeyhsd(disc_df['data'],disc_df['group'])
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
0.0 0.05 6.0639 0.001 2.4368 9.691 True
0.0 0.1 2.9654 0.2098 -0.7723 6.7031 False
0.0 0.15 6.9176 0.001 3.0233 10.8119 True
0.0 0.2 5.6293 0.001 1.7791 9.4796 True
0.0 0.25 6.1416 0.001 2.2016 10.0817 True
0.05 0.1 -3.0985 0.4621 -7.9861 1.789 False
0.05 0.15 0.8537 0.9 -4.1547 5.862 False
0.05 0.2 -0.4346 0.9 -5.4088 4.5396 False
0.05 0.25 0.0777 0.9 -4.9663 5.1218 False
0.1 0.15 3.9522 0.2311 -1.1368 9.0412 False
0.1 0.2 2.6639 0.6409 -2.3915 7.7193 False
0.1 0.25 3.1762 0.4872 -1.9479 8.3004 False
0.15 0.2 -1.2883 0.9 -6.4605 3.884 False
0.15 0.25 -0.7759 0.9 -6.0154 4.4635 False
0.2 0.25 0.5123 0.9 -4.6945 5.7191 False

Discount has a statistically significant effect on quantity purchased. The discount levels below each differ significantly from no discount (0.0), but not from one another:

disc_tukey = fn.tukey_df(tukey)
disc_tukey_trues = disc_tukey.loc[disc_tukey['reject']==True]
group1 group2 meandiff p-adj lower upper reject
0 0.0 0.05 6.0639 0.001 2.4368 9.6910 True
2 0.0 0.15 6.9176 0.001 3.0233 10.8119 True
3 0.0 0.2 5.6293 0.001 1.7791 9.4796 True
4 0.0 0.25 6.1416 0.001 2.2016 10.0817 True
def plt_discounts(d=d):
    plt.axvline(qty_mu, color='purple')
# use a separate name so the scipy `stats` module is not overwritten
disc_specs = disc_df['data'].describe()
dataqty = disc_specs['mean']
print(f'Revised average quantity sold across all orders {round(dataqty)}')
zeros = disc_df.groupby('group').get_group('0.0')['data']
nodiscav = zeros.mean()
print(f'Average quantity sold for orders where no discount was extended was: {round(nodiscav)}')
Revised average quantity sold across all orders 23.0
Average quantity sold for orders where no discount was extended was: 21.0
for k,v in discs.items():
    print(f'The average quantity sold for {k} discount is {round(v.mean())}')
The average quantity sold for 0.0 discount is 21.0
The average quantity sold for 0.15 discount is 27.0
The average quantity sold for 0.05 discount is 27.0
The average quantity sold for 0.2 discount is 26.0
The average quantity sold for 0.25 discount is 27.0
The average quantity sold for 0.1 discount is 24.0
data_mu = disc_df['data'].mean()
print(f'The average qty purchased regardless of discount or none offered is {round(data_mu)}')
The average qty purchased regardless of discount or none offered is 23.0

Various EDA to understand distributions and remaining potential outliers

The following hex-bin visualization illustrates the density of data in the 0% category. Visually, it appears that equal quantities are purchased across groups, but this may be an artifact of the light markings for the smaller samples. This is clearly not the best choice for EDA, and further exploration is required.

disc_df['group'] = disc_df['group'].astype(float)
sns.jointplot(data=disc_df, x='group', y='data', kind='hex')


Visual on distributions and potential remaining outliers:

#boxen plot
sns.catplot(data=disc_df, x='group', y='data', kind='boxen')
ax.axhline(26.75, color='k')


This plot may not suit a non-technical audience, since the additional detail could cause confusion. However, the boxen plot above clearly illustrates the distributions as well as potential remaining outliers in the .05, .15, .2 and .25 groups. These outliers will not be removed for now, in order to preserve as much of the initial dataset as possible. Sample sizes are listed under 'Assumption 3: Normality'.

fig, ax = plt.subplots(figsize=(10,8))
sns.barplot(data=disc_df, x='group', y='data', ci=68)
#plt.axhline(data_mu, linestyle='--', color='darkgray')
plt.axhline(26.75, linestyle='--', color='lightblue')
plt.title('Average Quantity Purchased at Varying Discount Levels', fontsize=20)
plt.xlabel('Discount Extended')
plt.ylabel('Average Quantity Purchased')


Effect Sizes:

import numpy as np
for disc, disc_data in discs.items():
    es = fn.Cohen_d(zeros, disc_data)
    print(f'The effect size for {disc} is {round(np.abs(es),4)}')
The effect size for 0.0 is 0.0
The effect size for 0.15 is 0.453
The effect size for 0.05 is 0.3949
The effect size for 0.2 is 0.3751
The effect size for 0.25 is 0.4109
The effect size for 0.1 is 0.1982
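The source of `fn.Cohen_d` is not included here. A common definition, which may differ from the helper's actual implementation, divides the difference in means by the pooled standard deviation. The function name and sample values below are illustrative assumptions.

```python
import numpy as np

def cohen_d(group1, group2):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Illustrative quantities: the discounted group is shifted up by 10 units
no_discount = [10, 20, 30, 40]
discounted = [20, 30, 40, 50]
d = cohen_d(no_discount, discounted)
print(round(abs(d), 4))
```

Because the sign only reflects the order of the arguments, the notebook reports the absolute value.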

Hypothesis 1 Findings and Recommendation:


Rejecting the null hypothesis :

  • $H_0$: Discount amount has no statistically significant effect on the quantity of a product in an order.

Alternative hypotheses:

*  $H_1$: Discounts have a statistically significant effect on the quantity of a product in an order.
*  $H_{1a}$: Certain discount values have a greater effect than others (see below for findings).

A 10% discount had no statistically significant effect on the quantity purchased.

Discounts of 5%, 15%, 20%, and 25% each differ significantly from no discount, with an adjusted p-value of .001, meaning a 0.1 percent chance of observing such a difference if there were no true effect. They are statistically equal to one another in their effect on quantity sold. Each has a different effect size compared to orders placed with no discount.

| Discount | AvQty | Effect Size | Effect |
|----------|-------|-------------|--------|
| 5%       | 27    | .3949       | Medium |
| 15%      | 27    | .453        | Medium |
| 20%      | 26    | .3751       | Medium |
| 25%      | 27    | .4109       | Medium |

Additional notes: The 1%, 2%, 3%, 4%, and 6% discounts in the original dataset had too little data to evaluate their impact. These groups were removed from further testing.

With outliers removed from each discount group, the revised average quantity purchased was 23 overall.


While larger discounts did demonstrate a significant effect on quantity purchased, smaller discounts had a statistically equal effect. To drive higher quantities purchased while realizing larger profit margins, offer the smaller discount.


Do some categories generate more revenue than others? Which ones?

  • $H_0$: All categories generate equal revenues.
  • $H_1$: Certain categories generate statistically higher revenue than others.
  • $H_{1a}$:

Importing and inspecting data from Product and OrderDetail tables:

These tables include product information, including:

  1. Categories

  2. Pricing and discount information to generate revenues

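##clean sql notation
## NOTE: the SELECT list is an assumption, inferred from the output columns below
cur.execute("""SELECT od.ProductId, p.ProductName, od.UnitPrice,
                      p.CategoryId, c.CategoryName, od.Discount, od.Quantity
                FROM Product AS p
                JOIN OrderDetail AS od
                ON p.Id = od.ProductId
                JOIN Category AS c
                ON c.Id = p.CategoryId;""")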
catavs = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
ProductId ProductName UnitPrice CategoryId CategoryName Discount Quantity
0 11 Queso Cabrales 14.0 4 Dairy Products 0.0 12
1 42 Singaporean Hokkien Fried Mee 9.8 5 Grains/Cereals 0.0 10
2 72 Mozzarella di Giovanni 34.8 4 Dairy Products 0.0 5
3 14 Tofu 18.6 7 Produce 0.0 9
4 51 Manjimup Dried Apples 42.4 7 Produce 0.0 40
#Revenue is calculated by subtracting discounts from the UnitPrice and multiplying by quantity ordered.
catavs['Revenue'] = (catavs['UnitPrice'] * (1 - catavs['Discount']))*catavs['Quantity']
avcrev = catavs['Revenue'].mean()
print(round(avcrev, 0))
cg = catavs['CategoryName'].unique()

587.0

Initial Visual Inspection and Observations:

There are 8 different categories sold by this company, representing 77 products.

The average revenue per order line across all categories is about $587.

Visually, three categories appear to generate noticeably higher revenue than the others. Additional testing will establish their significance and effect size.

fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(data=catavs, x='CategoryName', y='Revenue', ci=68, ax=ax)
plt.title('Initial Inspection: Average Revenue Generated Across All Categories', fontsize=18)
plt.axhline(avcrev,linestyle="--", color='k', linewidth=.8 )


catcount = len(catavs['CategoryId'].unique())
avrev = catavs['Revenue'].mean()
print(f'There are {catcount} different categories sold in this company')
print(f'The average revenue generated across all categories is {round(avrev,0)}')
There are 8 different categories sold in this company
The average revenue generated across all categories is 587.0
cats = {}
for cat in catavs['CategoryName'].unique():
    cats[cat] = catavs.groupby('CategoryName').get_group(cat)['Revenue']

In each of the different categories, the products align as such:

catsprods = {}
for cat in catavs['CategoryName'].unique():
    catsprods[cat] = catavs.groupby('CategoryName').get_group(cat)['ProductId'].unique()
for k,v in catsprods.items():
    print(f'There are {len(v)} products in the {k} Category')
There are 10 products in the Dairy Products Category
There are 7 products in the Grains/Cereals Category
There are 5 products in the Produce Category
There are 12 products in the Seafood Category
There are 12 products in the Condiments Category
There are 13 products in the Confections Category
There are 12 products in the Beverages Category
There are 6 products in the Meat/Poultry Category

Assumption 1: Outliers

Outliers removed via z-score testing.

for cat, cat_data in cats.items():
    idx_outs = fn.find_outliers_Z(cat_data)
    print(f'Found {idx_outs.sum()} outliers in Category # {cat}')
    cats[cat] = cat_data[~idx_outs]
print('\n All of these outliers were removed')
Found 5 outliers in Category # Dairy Products
Found 6 outliers in Category # Grains/Cereals
Found 1 outliers in Category # Produce
Found 8 outliers in Category # Seafood
Found 4 outliers in Category # Condiments
Found 9 outliers in Category # Confections
Found 12 outliers in Category # Beverages
Found 4 outliers in Category # Meat/Poultry

 All of these outliers were removed
pids = catavs['ProductName'].unique()
print(f'There are {len(pids)} products')
There are 77 products

Assumption 2: Equal Variance

Testing cleaned dataset for equal variance.

Since the groups do NOT have equal variance, a Kruskal-Wallis test will be conducted.

import scipy.stats as stats
datac = []
for k,v in cats.items():
    datac.append(v)
stat,p = stats.levene(*datac)
print(f'Levene test for equal variance results are {round(p,4)}')
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene test for equal variance results are 0.0
The groups do NOT have equal variance

Assumption 3: Normality

All groups have more than 15 samples, so the normality assumption is met.

n = []

for cat, cat_data in cats.items():
    print(f'There are {len(cat_data)} samples in the data set for Category {cat}.')
    n.append(len(cat_data) > 15)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')

There are 361 samples in the data set for Category Dairy Products.
There are 190 samples in the data set for Category Grains/Cereals.
There are 135 samples in the data set for Category Produce.
There are 322 samples in the data set for Category Seafood.
There are 212 samples in the data set for Category Condiments.
There are 325 samples in the data set for Category Confections.
There are 392 samples in the data set for Category Beverages.
There are 169 samples in the data set for Category Meat/Poultry.

All samples are >15: Normality Assumption Criterion is met.

Kruskal Testing:

Result: reject the null hypothesis.

stat, p = stats.kruskal(*datac)
print(f"Kruskal test p value: {round(p,4)}")
if p < .05 :
    print(f'Reject the null hypothesis')
else:
    print(f'Fail to reject the null hypothesis')
Kruskal test p value: 0.0
Reject the null hypothesis

Post-Hoc Testing:

cat_df = fn.prep_data_for_tukeys(cats)
data group
0 168.00 Dairy Products
2 174.00 Dairy Products
12 47.50 Dairy Products
13 1088.00 Dairy Products
14 200.00 Dairy Products
... ... ...
2096 2702.70 Meat/Poultry
2098 738.00 Meat/Poultry
2099 86.40 Meat/Poultry
2102 111.75 Meat/Poultry
2148 48.00 Meat/Poultry

2106 rows × 2 columns

Visual Inspection Post Data Cleaning:

def plt_discounts(d=cg):
    plt.axvline(cat_df['data'].mean(), color='purple')
    plt.title('Average Revenue Generated by Category')
tukeyc = sms.stats.multicomp.pairwise_tukeyhsd(cat_df['data'],cat_df['group'])
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
Beverages Condiments 55.7712 0.9 -78.858 190.4005 False
Beverages Confections 27.0669 0.9 -91.4026 145.5364 False
Beverages Dairy Products 194.0913 0.001 78.8965 309.286 True
Beverages Grains/Cereals 21.4034 0.9 -118.1928 160.9996 False
Beverages Meat/Poultry 411.042 0.001 265.7215 556.3625 True
Beverages Produce 296.1055 0.001 138.516 453.695 True
Beverages Seafood -46.1557 0.9 -164.9266 72.6151 False
Condiments Confections -28.7043 0.9 -168.1193 110.7106 False
Condiments Dairy Products 138.32 0.0449 1.6769 274.9631 True
Condiments Grains/Cereals -34.3678 0.9 -192.129 123.3933 False
Condiments Meat/Poultry 355.2707 0.001 192.4225 518.119 True
Condiments Produce 240.3342 0.001 66.4494 414.2191 True
Condiments Seafood -101.927 0.3439 -241.5981 37.7441 False
Confections Dairy Products 167.0244 0.001 46.2712 287.7776 True
Confections Grains/Cereals -5.6635 0.9 -149.8807 138.5537 False
Confections Meat/Poultry 383.9751 0.001 234.2101 533.7401 True
Confections Produce 269.0386 0.001 107.3415 430.7357 True
Confections Seafood -73.2226 0.6099 -197.392 50.9467 False
Dairy Products Grains/Cereals -172.6879 0.0054 -314.2272 -31.1485 True
Dairy Products Meat/Poultry 216.9507 0.001 69.7626 364.1389 True
Dairy Products Produce 102.0142 0.5181 -57.2991 261.3276 False
Dairy Products Seafood -240.247 0.001 -361.2959 -119.1982 True
Grains/Cereals Meat/Poultry 389.6386 0.001 222.6607 556.6164 True
Grains/Cereals Produce 274.7021 0.001 96.9438 452.4604 True
Grains/Cereals Seafood -67.5591 0.831 -212.024 76.9057 False
Meat/Poultry Produce -114.9365 0.5359 -297.2246 67.3516 False
Meat/Poultry Seafood -457.1977 0.001 -607.2012 -307.1942 True
Produce Seafood -342.2612 0.001 -504.1792 -180.3432 True
for cat, cat_data in cats.items():
    print(f'The average revenue for {cat} is ${round(cat_data.mean(),2)}')
The average revenue for Dairy Products is $593.86
The average revenue for Grains/Cereals is $421.17
The average revenue for Produce is $695.87
The average revenue for Seafood is $353.61
The average revenue for Condiments is $455.54
The average revenue for Confections is $426.83
The average revenue for Beverages is $399.77
The average revenue for Meat/Poultry is $810.81

Hypothesis 2: A Clean Visualization

index = list(cat_df.groupby('group').mean().sort_values('data', ascending=False).index)
import matplotlib.ticker as ticker
fig, ax = plt.subplots(figsize=(10,8))
sns.barplot(data=cat_df, x=round(cat_df['data']), y=cat_df['group'],order=index, ci=68, palette='Wistia_r')
formatter = ticker.FormatStrFormatter('$%1.2f')
plt.axvline(cat_df['data'].mean(), linestyle='--', color='gray')
plt.xlabel('Average Revenue Generated')
plt.title('Average Revenue* Generated By Category', fontsize=20)


f, ax = plt.subplots(figsize=(10,8))
#sns.despine(f, left=True, bottom=True)
sns.catplot(data=cat_df, x='group', y='data', ax=ax)
sns.catplot(data=cat_df, x='group', y='data', ax=ax)



tukeycdf = fn.tukey_df(tukeyc)

For the following pairs of categories, the Tukey test fails to reject the null hypothesis that they generate equal revenue, so they are considered statistically similar:

tukeycfalse  = tukeycdf.loc[tukeycdf['reject']==False]
group1 group2 meandiff p-adj lower upper reject
0 Beverages Condiments 55.7712 0.9000 -78.8580 190.4005 False
1 Beverages Confections 27.0669 0.9000 -91.4026 145.5364 False
3 Beverages Grains/Cereals 21.4034 0.9000 -118.1928 160.9996 False
6 Beverages Seafood -46.1557 0.9000 -164.9266 72.6151 False
7 Condiments Confections -28.7043 0.9000 -168.1193 110.7106 False
9 Condiments Grains/Cereals -34.3678 0.9000 -192.1290 123.3933 False
12 Condiments Seafood -101.9270 0.3439 -241.5981 37.7441 False
14 Confections Grains/Cereals -5.6635 0.9000 -149.8807 138.5537 False
17 Confections Seafood -73.2226 0.6099 -197.3920 50.9467 False
20 Dairy Products Produce 102.0142 0.5181 -57.2991 261.3276 False
24 Grains/Cereals Seafood -67.5591 0.8310 -212.0240 76.9057 False
25 Meat/Poultry Produce -114.9365 0.5359 -297.2246 67.3516 False

The following pairs of categories reject the null hypothesis:

tukeyctrues  = tukeycdf.loc[tukeycdf['reject']==True]
group1 group2 meandiff p-adj lower upper reject
2 Beverages Dairy Products 194.0913 0.0010 78.8965 309.2860 True
4 Beverages Meat/Poultry 411.0420 0.0010 265.7215 556.3625 True
5 Beverages Produce 296.1055 0.0010 138.5160 453.6950 True
8 Condiments Dairy Products 138.3200 0.0449 1.6769 274.9631 True
10 Condiments Meat/Poultry 355.2707 0.0010 192.4225 518.1190 True
11 Condiments Produce 240.3342 0.0010 66.4494 414.2191 True
13 Confections Dairy Products 167.0244 0.0010 46.2712 287.7776 True
15 Confections Meat/Poultry 383.9751 0.0010 234.2101 533.7401 True
16 Confections Produce 269.0386 0.0010 107.3415 430.7357 True
18 Dairy Products Grains/Cereals -172.6879 0.0054 -314.2272 -31.1485 True
19 Dairy Products Meat/Poultry 216.9507 0.0010 69.7626 364.1389 True
21 Dairy Products Seafood -240.2470 0.0010 -361.2959 -119.1982 True
22 Grains/Cereals Meat/Poultry 389.6386 0.0010 222.6607 556.6164 True
23 Grains/Cereals Produce 274.7021 0.0010 96.9438 452.4604 True
26 Meat/Poultry Seafood -457.1977 0.0010 -607.2012 -307.1942 True
27 Produce Seafood -342.2612 0.0010 -504.1792 -180.3432 True
def mult_Cohn_d(tukey_result_df, df_dict):
    '''Using a dataframe from Tukey Test Results and a 
    corresponding dictionary, this function loops through 
    each variable and returns the adjusted p-value and Cohn_d test'''
    import pandas as pd
    res = [['g1', 'g2','padj', 'd']]
    for i, row in tukey_result_df.iterrows():
        g1 = row['group1']
        g2 = row['group2']
        padj = row['p-adj']
        d = fn.Cohen_d(df_dict[g1], df_dict[g2])

        res.append([g1, g2,padj, d])

    mdc = pd.DataFrame(res[1:], columns=res[0])
    return mdc

The table below lists the category pairs that reject the null hypothesis that all categories generate equal revenue. The padj column is the adjusted p-value from the Tukey test, and the d column shows the effect size (Cohen's d).

mult_Cohn_d(tukeyctrues, cats)
g1 g2 padj d
0 Beverages Dairy Products 0.0010 -0.351635
1 Beverages Meat/Poultry 0.0010 -0.596175
2 Beverages Produce 0.0010 -0.520603
3 Condiments Dairy Products 0.0449 -0.277491
4 Condiments Meat/Poultry 0.0010 -0.517399
5 Condiments Produce 0.0010 -0.490099
6 Confections Dairy Products 0.0010 -0.340630
7 Confections Meat/Poultry 0.0010 -0.600173
8 Confections Produce 0.0010 -0.560435
9 Dairy Products Grains/Cereals 0.0054 0.349940
10 Dairy Products Meat/Poultry 0.0010 -0.310545
11 Dairy Products Seafood 0.0010 0.525083
12 Grains/Cereals Meat/Poultry 0.0010 -0.563832
13 Grains/Cereals Produce 0.0010 -0.570886
14 Meat/Poultry Seafood 0.0010 0.754594
15 Produce Seafood 0.0010 0.798069

Hypothesis 2 Findings and Recommendation:


Reject the null hypothesis:

  • 𝐻0 : All categories generate equal revenues.

Alternative hypothesis:

  • 𝐻1 : Certain categories generate statistically higher revenue than others.

Top three revenue-generating categories: Meat/Poultry, Produce, and Dairy Products.

Meat/Poultry and Produce were statistically equivalent and ranked as the top sellers; Dairy Products and Produce were also statistically equivalent. (Here, "statistically equivalent" means the Tukey test returned reject = False for the pair: the difference between their means was not significant at the .05 level.)

Category Average Revenue
Meat/Poultry $810.81
Produce $695.87
Dairy Products $593.86
Condiments $455.54
Confections $426.83
Grains/Cereals $421.17
Beverages $399.77
Seafood $353.61

The table(s) above outline how the various categories compare to each other. If the adjusted p-value in column padj is greater than .05, the two categories are considered statistically equal (the null hypothesis is not rejected). Conversely, if the adjusted p-value is less than .05, the two samples are statistically different. This is examined further through the effect size in column d: d = 0.2 is considered a 'small' effect size, 0.5 a 'medium' effect size and 0.8 a 'large' effect size.
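The effect-size bands above can be sketched in a few lines. This is a minimal illustration, not the notebook's `fn.Cohen_d` (whose source is not shown here): `cohen_d` and `effect_label` are hypothetical names, the pooled standard deviation uses the common n - 1 weighting (an assumption), and the two samples are made up.

```python
import math
import statistics

def cohen_d(group1, group2):
    """Cohen's d using the pooled sample standard deviation (assumed formula)."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance, n - 1 denominator
    var2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

def effect_label(d):
    """Map |d| onto the conventional 0.2 / 0.5 / 0.8 bands."""
    size = abs(d)
    if size < 0.2:
        return 'negligible'
    if size < 0.5:
        return 'small'
    if size < 0.8:
        return 'medium'
    return 'large'

# Made-up samples, for illustration only
low = [10, 12, 14, 16, 18]
high = [14, 16, 18, 20, 22]
d = cohen_d(low, high)
print(round(d, 3), effect_label(d))
```

The sign of d only reflects which group is listed first, so the label is based on the absolute value.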

Note: Revenue is calculated by applying any discount to the sales price and multiplying the result by the quantity sold.
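As a small worked example of that calculation, assuming the discount is stored as a fraction (0.03 meaning 3%), the revenue for a single order line could be computed as below. `line_revenue` is a hypothetical helper name and the input values are illustrative.

```python
def line_revenue(sales_price, quantity, discount):
    """Revenue for one order line; discount is a fraction such as 0.03."""
    return sales_price * (1 - discount) * quantity

# Illustrative values: price 33.25, quantity 2, discount of 3%
revenue = line_revenue(33.25, 2, 0.03)
print(round(revenue, 2))
```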


If there are additional products that align with the higher revenue generating categories, that category could be broadened to maximize revenue generating potential.

Example: Meat/Poultry currently has 6 products; this could be expanded. Conversely, the Seafood category carries 12 products, which could be narrowed. Additional analysis could show which seafood products are the best sellers so that they are preserved.

Knowing what revenue each category generates could help determine how discounts are applied by category. However, profit margins are not known and would need to be taken into consideration.


Do certain sales representatives sell more than others? Who are the top sellers?

$H_0$: All sales representatives generate equal revenue.

$H_1$: Some sales representatives generate more revenue than others.

Importing and inspecting data from Product, OrderDetail, Order and Employee Tables

These tables include information on:

1) Product information including SalesPrice, Discount and Quantity Sold
2) Sales Representative Information
                od.UnitPrice AS SalesPrice, 
                FROM Product AS p
                JOIN OrderDetail AS od
                ON od.ProductID = p.Id 
                JOIN 'Order' AS o
                ON o.Id = od.OrderId
                JOIN Employee AS e
                ON o.EmployeeID = e.ID;""")
dfr = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
ProductId ProductName Discontinued OrderId ProductId SalesPrice Quantity Discount EmployeeId LastName FirstName Region ShippedDate
0 11 Queso Cabrales 0 10248 11 14.00 12 0.00 5 Buchanan Steven British Isles 2012-07-16
1 42 Singaporean Hokkien Fried Mee 1 10248 42 9.80 10 0.00 5 Buchanan Steven British Isles 2012-07-16
2 72 Mozzarella di Giovanni 0 10248 72 34.80 5 0.00 5 Buchanan Steven British Isles 2012-07-16
3 14 Tofu 0 10249 14 18.60 9 0.00 6 Suyama Michael British Isles 2012-07-10
4 51 Manjimup Dried Apples 0 10249 51 42.40 40 0.00 6 Suyama Michael British Isles 2012-07-10
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2150 64 Wimmers gute Semmelknödel 0 11077 64 33.25 2 0.03 1 Davolio Nancy North America None
2151 66 Louisiana Hot Spiced Okra 0 11077 66 17.00 1 0.00 1 Davolio Nancy North America None
2152 73 Röd Kaviar 0 11077 73 15.00 2 0.01 1 Davolio Nancy North America None
2153 75 Rhönbräu Klosterbier 0 11077 75 7.75 4 0.00 1 Davolio Nancy North America None
2154 77 Original Frankfurter grüne Soße 0 11077 77 13.00 2 0.00 1 Davolio Nancy North America None

2155 rows × 13 columns

#Sales Revenue is calculated by multiplying the adjusted price (accounting for any discounts) times quantity
dfr['SaleRev'] = (dfr['SalesPrice'] * (1-dfr['Discount'])) * dfr['Quantity']
empcount = len(dfr['EmployeeId'].unique())
avrev = dfr['SaleRev'].mean()
print(f'There are {empcount} employees in this company associated with sales information')
print(f'The calculated average revenue generated by a sales representative in this dataset is ${round(avrev)}')
There are 9 employees in this company associated with sales information
The calculated average revenue generated by a sales representative in this dataset is $587.0

Hypothesis 3 Preliminary Visualizations:

fig, ax = plt.subplots(figsize=(8,5))
sns.distplot(dfr['SaleRev'], color='green')
plt.axvline(avrev, color='lightgreen' )
plt.xlabel('Sales Revenue')
plt.title('Average Sales Revenue Distribution', fontsize=18)


fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=dfr, x='EmployeeId', y='SaleRev', ci=68, ax=ax)
plt.title('Average Revenue by Sales Representative', fontsize=16)
plt.axhline(avrev,linestyle="--", color='gray', linewidth=.6 )
plt.xlabel('Employee Id')
plt.ylabel('Sales Revenue')


Employees are listed in the table below. Although the employee names are unique, for the sake of data inspection we'll continue to use Employee Id as the unique identifier and refer to this table for additional insight.

cur.execute("""SELECT ID, LastName, FirstName, Title, Region
                from Employee""")
empsdata = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
Id LastName FirstName Title Region
0 1 Davolio Nancy Sales Representative North America
1 2 Fuller Andrew Vice President, Sales North America
2 3 Leverling Janet Sales Representative North America
3 4 Peacock Margaret Sales Representative North America
4 5 Buchanan Steven Sales Manager British Isles
5 6 Suyama Michael Sales Representative British Isles
6 7 King Robert Sales Representative British Isles
7 8 Callahan Laura Inside Sales Coordinator North America
8 9 Dodsworth Anne Sales Representative British Isles
reps = {}
for rep in dfr['EmployeeId'].unique():
    reps[str(rep)] = dfr.groupby('EmployeeId').get_group(rep)['SaleRev']
fig, ax = plt.subplots(figsize=(10,5))
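for k,v in reps.items():
    # Loop body is missing from the source; plotting each rep's distribution is an assumption
    sns.distplot(v, label=k, ax=ax)

plt.title('Sales Revenue Distribution by Rep')
print('Distributions appear roughly equal and there appear to be outliers')
Distributions appear roughly equal and there appear to be outliers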


Initial Observations:

The data are numeric, in a sample of 2,155 order lines.

There are 9 employees in this company associated with sales information.

The average revenue generated by a sales representative is approximately $587 per order line, as calculated above.

Initial visual inspection indicates a roughly similar distribution of sales revenue across representatives, and more than half of the sales representatives reach the average. Additional testing will show whether the differences are significant.

Since we are comparing revenue across multiple sales representatives, an ANOVA test will be run. Assumptions for ANOVA testing:

  1. No significant outliers: a quick visual inspection suggests there are some outliers that could be removed

  2. Equal variance

  3. Normality: not required when each sample has more than 15 observations

Assumption 1: Outliers

for rep, rep_data in reps.items():
    idx_outs = fn.find_outliers_Z(rep_data)
    print(f'Found {idx_outs.sum()} outliers in Employee # {rep}')
    reps[rep] = rep_data[~idx_outs]
print('\n All of these outliers were removed')
Found 3 outliers in Employee # 5
Found 3 outliers in Employee # 6
Found 7 outliers in Employee # 4
Found 4 outliers in Employee # 3
Found 2 outliers in Employee # 9
Found 4 outliers in Employee # 1
Found 5 outliers in Employee # 8
Found 5 outliers in Employee # 2
Found 5 outliers in Employee # 7

 All of these outliers were removed
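The source of `fn.find_outliers_Z` is not shown in this README. A minimal sketch of z-score outlier detection, assuming the usual cutoff of 3 standard deviations and returning a plain list of booleans rather than a pandas Series, might look like the following. The function name, the threshold and the sample values are all assumptions.

```python
import statistics

def find_outliers_z(values, threshold=3.0):
    """Return one boolean per value: True where |z-score| exceeds threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; the notebook may differ
    if sd == 0:
        return [False] * len(values)
    return [abs((v - mean) / sd) > threshold for v in values]

# Made-up revenue sample with one extreme value
sample = [100.0] * 19 + [5000.0]
mask = find_outliers_z(sample)
kept = [v for v, is_out in zip(sample, mask) if not is_out]
print(sum(mask), len(kept))
```

Filtering with the inverted mask mirrors the `rep_data[~idx_outs]` step above.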
fig, ax = plt.subplots(figsize=(10,5))
for k,v in reps.items():
    # Loop body is missing from the source; plotting each rep's distribution is an assumption
    sns.distplot(v, label=k, ax=ax)

plt.title('Revised: Sales Revenue Distribution by Rep')
print('Distributions appear roughly equal and outliers are visibly removed in comparison with the other visual')
Distributions appear roughly equal and outliers are visibly removed in comparison with the other visual


Assumption 2: Equal Variance

Results are NOT equal variance for this group

#from functions import test_equal_variance
import scipy.stats as stats
datas = []
for k,v in reps.items():
    datas.append(v)  # collect each rep's revenue sample (loop body restored from context)
stat,p = stats.levene(*datas)
print(f'Levene test for equal variance results are {round(p,4)}')
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene test for equal variance results are 0.0143
The groups do NOT have equal variance
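To make the equal-variance check less of a black box, here is a sketch of the statistic behind the median-centered Levene test (the Brown-Forsythe variant, which is the default centering in `scipy.stats.levene`). It computes only the W statistic; SciPy additionally converts W to a p-value using an F distribution with (k - 1, N - k) degrees of freedom. `levene_statistic` is a hypothetical name and the two groups are made-up data.

```python
import statistics

def levene_statistic(*groups):
    """Median-centered Levene (Brown-Forsythe) W statistic."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group median
    z = [[abs(x - statistics.median(g)) for x in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Spread of the group mean deviations around the grand mean deviation
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    # Spread of the individual deviations around their group mean deviation
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

w = levene_statistic([1, 2, 3], [2, 4, 6])
print(round(w, 3))
```

A larger W means the groups' spreads differ more than would be expected if the variances were equal.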

Assumption 3: Normality

Each sample contains more than 15 observations, so the normality criterion is met.

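n = []

for rep,samples in reps.items():
    print(f'There are {len(samples)} samples in the data set for Employee #{rep}.')
    n.append(len(samples) > 15)  # record whether this sample exceeds 15 observations (restored from context)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')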

There are 114 samples in the data set for Employee #5.
There are 165 samples in the data set for Employee #6.
There are 413 samples in the data set for Employee #4.
There are 317 samples in the data set for Employee #3.
There are 105 samples in the data set for Employee #9.
There are 341 samples in the data set for Employee #1.
There are 255 samples in the data set for Employee #8.
There are 236 samples in the data set for Employee #2.
There are 171 samples in the data set for Employee #7.

All samples are >15: Normality Assumption Criterion is met.
stat, p = stats.kruskal(*datas)
print(f"Kruskal test p value: {round(p,4)}")
Kruskal test p value: 0.2968
stat, p = stats.f_oneway(*datas)
print(stat, p)
2.180626766904163 0.026222185205141618
clean_data = fn.prep_data_for_tukeys(reps)
data group
0 168.000 5
1 98.000 5
2 174.000 5
17 45.900 5
18 342.720 5
... ... ...
2077 390.000 7
2103 52.350 7
2104 386.400 7
2105 490.000 7
2123 232.085 7

2117 rows × 2 columns

Hypothesis 3: A Clean Visualization:

f, ax = plt.subplots(figsize=(8,4))
sns.barplot(data=clean_data, x='group', y='data', ci=68)
plt.xlabel('Employee Id')
plt.ylabel('Average Revenue Generated')
plt.axhline(clean_data['data'].mean(), linestyle="--", color='darkgray')

plt.title('Average Revenue Generated by Sales Representative', fontsize=20)


Observation: The plot above is misleading: because it shows only means, it exaggerates the differences between representatives. Further testing indicates there are no statistically significant differences between representatives in revenue generated over this time span. The plot that follows gives a more accurate picture of what the testing demonstrates.

mu = clean_data['data'].mean()

sns.catplot(data=clean_data, x='group', y='data', 
            kind='swarm', height=6, aspect=1.5)
formatter = ticker.FormatStrFormatter('$%1.2f')
plt.axhline(mu, color='k', linestyle="-", lw=2)
plt.xlabel('Employee Id')
plt.ylabel('Average Total Revenue in US Dollars')
plt.title('Average Total Revenue by Sales Representative', fontsize=20)
print(f'Average revenue across all sales representatives: ${round(mu,2)}')
Average revenue across all sales representatives: $494.72


#datacleaning for additional testing:
clean_data['data'] = clean_data['data'].astype(float)
clean_data['group'] = clean_data['group'].astype(str)
data group
0 168.000 5
1 98.000 5
2 174.000 5
17 45.900 5
18 342.720 5
... ... ...
2077 390.000 7
2103 52.350 7
2104 386.400 7
2105 490.000 7
2123 232.085 7

2117 rows × 2 columns

The table below shows that the null hypothesis cannot be rejected for any pair of representatives.

tukeys = sms.stats.multicomp.pairwise_tukeyhsd(clean_data['data'], clean_data['group'])
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
1 2 70.6795 0.7896 -67.9146 209.2736 False
1 3 62.1979 0.8328 -65.5036 189.8995 False
1 4 -2.0207 0.9 -121.7839 117.7426 False
1 5 2.7041 0.9 -174.3746 179.7827 False
1 6 -81.4678 0.7616 -236.6872 73.7517 False
1 7 68.5247 0.9 -84.8486 221.8981 False
1 8 -24.7293 0.9 -160.2376 110.779 False
1 9 101.7408 0.7013 -80.9369 284.4186 False
2 3 -8.4815 0.9 -149.2052 132.2421 False
2 4 -72.7002 0.725 -206.2617 60.8614 False
2 5 -67.9754 0.9 -254.6631 118.7123 False
2 6 -152.1472 0.1034 -318.2452 13.9507 False
2 7 -2.1547 0.9 -166.5288 162.2193 False
2 8 -95.4088 0.5349 -243.2531 52.4356 False
2 9 31.0613 0.9 -160.9455 223.0682 False
3 4 -64.2186 0.7606 -186.4399 58.0027 False
3 5 -59.4939 0.9 -238.2441 119.2564 False
3 6 -143.6657 0.1049 -300.7895 13.4581 False
3 7 6.3268 0.9 -148.9735 161.6271 False
3 8 -86.9273 0.5613 -224.6128 50.7583 False
3 9 39.5429 0.9 -144.7557 223.8415 False
4 5 4.7247 0.9 -168.4434 177.8929 False
4 6 -79.4471 0.7575 -230.19 71.2959 False
4 7 70.5454 0.8574 -78.2959 219.3867 False
4 8 -22.7086 0.9 -153.0653 107.648 False
4 9 103.7615 0.6578 -75.1282 282.6512 False
5 6 -84.1718 0.9 -283.5134 115.1697 False
5 7 65.8207 0.9 -132.0867 263.7281 False
5 8 -27.4334 0.9 -211.8418 156.975 False
5 9 99.0368 0.9 -122.3569 320.4304 False
6 7 149.9925 0.1837 -28.6232 328.6082 False
6 8 56.7384 0.9 -106.7935 220.2704 False
6 9 183.2086 0.1211 -21.1229 387.5401 False
7 8 -93.2541 0.6646 -255.0348 68.5267 False
7 9 33.2161 0.9 -169.7166 236.1487 False
8 9 126.4701 0.4952 -63.3213 316.2616 False

Effect Size Testing:

tukeyrepdf = fn.tukey_df(tukeys)
group1 group2 meandiff p-adj lower upper reject
0 1 2 70.6795 0.7896 -67.9146 209.2736 False
1 1 3 62.1979 0.8328 -65.5036 189.8995 False
2 1 4 -2.0207 0.9000 -121.7839 117.7426 False
3 1 5 2.7041 0.9000 -174.3746 179.7827 False
4 1 6 -81.4678 0.7616 -236.6872 73.7517 False
5 1 7 68.5247 0.9000 -84.8486 221.8981 False
6 1 8 -24.7293 0.9000 -160.2376 110.7790 False
7 1 9 101.7408 0.7013 -80.9369 284.4186 False
8 2 3 -8.4815 0.9000 -149.2052 132.2421 False
9 2 4 -72.7002 0.7250 -206.2617 60.8614 False
10 2 5 -67.9754 0.9000 -254.6631 118.7123 False
11 2 6 -152.1472 0.1034 -318.2452 13.9507 False
12 2 7 -2.1547 0.9000 -166.5288 162.2193 False
13 2 8 -95.4088 0.5349 -243.2531 52.4356 False
14 2 9 31.0613 0.9000 -160.9455 223.0682 False
15 3 4 -64.2186 0.7606 -186.4399 58.0027 False
16 3 5 -59.4939 0.9000 -238.2441 119.2564 False
17 3 6 -143.6657 0.1049 -300.7895 13.4581 False
18 3 7 6.3268 0.9000 -148.9735 161.6271 False
19 3 8 -86.9273 0.5613 -224.6128 50.7583 False
20 3 9 39.5429 0.9000 -144.7557 223.8415 False
21 4 5 4.7247 0.9000 -168.4434 177.8929 False
22 4 6 -79.4471 0.7575 -230.1900 71.2959 False
23 4 7 70.5454 0.8574 -78.2959 219.3867 False
24 4 8 -22.7086 0.9000 -153.0653 107.6480 False
25 4 9 103.7615 0.6578 -75.1282 282.6512 False
26 5 6 -84.1718 0.9000 -283.5134 115.1697 False
27 5 7 65.8207 0.9000 -132.0867 263.7281 False
28 5 8 -27.4334 0.9000 -211.8418 156.9750 False
29 5 9 99.0368 0.9000 -122.3569 320.4304 False
30 6 7 149.9925 0.1837 -28.6232 328.6082 False
31 6 8 56.7384 0.9000 -106.7935 220.2704 False
32 6 9 183.2086 0.1211 -21.1229 387.5401 False
33 7 8 -93.2541 0.6646 -255.0348 68.5267 False
34 7 9 33.2161 0.9000 -169.7166 236.1487 False
35 8 9 126.4701 0.4952 -63.3213 316.2616 False

Hypothesis 3 Results Table:

fn.mult_Cohn_d(tukeyrepdf, reps)
g1 g2 padj d
0 1 2 0.7896 -0.130744
1 1 3 0.8328 -0.114624
2 1 4 0.9000 0.004144
3 1 5 0.9000 -0.005335
4 1 6 0.7616 0.170314
5 1 7 0.9000 -0.120978
6 1 8 0.9000 0.052428
7 1 9 0.7013 -0.176757
8 2 3 0.9000 0.014775
9 2 4 0.7250 0.143003
10 2 5 0.9000 0.123877
11 2 6 0.1034 0.298368
12 2 7 0.9000 0.003517
13 2 8 0.5349 0.192116
14 2 9 0.9000 -0.049137
15 3 4 0.7606 0.124941
16 3 5 0.9000 0.108117
17 3 6 0.1049 0.277284
18 3 7 0.9000 -0.010479
19 3 8 0.5613 0.171933
20 3 9 0.9000 -0.063944
21 4 5 0.9000 -0.010070
22 4 6 0.7575 0.178198
23 4 7 0.8574 -0.133375
24 4 8 0.9000 0.051141
25 4 9 0.6578 -0.194860
26 5 6 0.9000 0.193965
27 5 7 0.9000 -0.110550
28 5 8 0.9000 0.063274
29 5 9 0.9000 -0.159411
30 6 7 0.1837 -0.275056
31 6 8 0.9000 -0.140752
32 6 9 0.1211 -0.329617
33 7 8 0.6646 0.178140
34 7 9 0.9000 -0.048048
35 8 9 0.4952 -0.239676

Findings and recommendations:


Both parametric and non-parametric tests were conducted, although the data called for a non-parametric test because the assumption of equal variance was not met. A visual inspection after outlier removal suggested there could be meaningful differences in the means, so the parametric test (ANOVA) was run for comparison. The parametric test indicated that the null hypothesis could be rejected. However, post-hoc testing supported the non-parametric result of failing to reject the null hypothesis: the smallest adjusted p-value (0.1034) was well above the accepted threshold of .05.
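The reasoning in this paragraph can be summarized as a small decision sketch using the p-values reported for Hypothesis 3 (Levene 0.0143, Kruskal 0.2968, ANOVA 0.0262). The helper names are hypothetical, and the rule of choosing the test from the Levene result alone is a simplification of the full assumption checks.

```python
ALPHA = 0.05

def choose_omnibus_test(levene_p, alpha=ALPHA):
    """Pick the omnibus test from the equal-variance check."""
    # Unequal variances (significant Levene result) call for the non-parametric test
    return 'kruskal' if levene_p < alpha else 'anova'

def reject_null(p_value, alpha=ALPHA):
    """True when the p-value is below the significance level."""
    return p_value < alpha

# p-values reported in this notebook for Hypothesis 3
levene_p, kruskal_p, anova_p = 0.0143, 0.2968, 0.0262

test = choose_omnibus_test(levene_p)
p_used = kruskal_p if test == 'kruskal' else anova_p
print(test, reject_null(p_used))
```

With these values the sketch selects the Kruskal test and does not reject the null hypothesis, even though the ANOVA p-value on its own would fall below .05.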

The Hypothesis 3 Results Table above compares each pair of representatives, listing the adjusted p-value (padj) and Cohen's d (d), which illustrates the effect size of each comparison.


Although there is no statistically significant difference and the effect sizes are small at best, best practices can still be shared by representatives with higher average revenue; for example, representatives 2, 3 and 9 have higher-than-average sales.

Perhaps a little healthy, incentivized competition might spur increased revenues, if not for one representative, then for many. Also, building on the finding that smaller discounts yield larger quantities, the sales team could increase revenue by being conservative with discount rates.


Where are our customers from that spend the most money?

$H_0$: Customers spend equal amounts regardless of region.

$H_1$: Region has an effect on total amount spent.

Importing and inspecting data from OrderDetail and Order

These tables include information on:

1) Sales total information 
2) Regions where orders shipped to indicating the location of customers
                FROM `Order` AS o
                JOIN OrderDetail AS od
                on o.ID = od.OrderId ;""")
dfreg = pd.DataFrame(cur.fetchall(), columns=[x[0] for x in cur.description])
ShipRegion OrderId ProductId UnitPrice Quantity Discount
0 Western Europe 10248 11 14.0 12 0.0
1 Western Europe 10248 42 9.8 10 0.0
2 Western Europe 10248 72 34.8 5 0.0
3 Western Europe 10249 14 18.6 9 0.0
4 Western Europe 10249 51 42.4 40 0.0
dfreg['Amount_Spent'] = ((dfreg['UnitPrice'])*(1 - dfreg['Discount']))*dfreg['Quantity']

Hypothesis 4 Preliminary Visualizations:

fig, ax = plt.subplots()
spend_mu = dfreg['Amount_Spent'].mean()
sns.distplot(dfreg['Amount_Spent'], ax=ax)
plt.axvline(spend_mu, color='lightgreen')
plt.title('Average Total Spend')
print(f'The average total spend is ${round(spend_mu,2)}')
print(f'The distribution indicates there may be outliers.')
The average total spend is $587.37
The distribution indicates there may be outliers.


regs = {}
for reg in dfreg['ShipRegion'].unique():
    regs[reg] = dfreg.groupby('ShipRegion').get_group(reg)['Amount_Spent']
regions = list(dfreg['ShipRegion'].unique())
print(f'There are {len(regions)} regions, they are {regions}.')
There are 9 regions, they are ['Western Europe', 'South America', 'Central America', 'North America', 'Northern Europe', 'Scandinavia', 'Southern Europe', 'British Isles', 'Eastern Europe'].
fig, ax = plt.subplots(figsize=(10,5))
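for k,v in regs.items():
    # Loop body is missing from the source; plotting each region's distribution is an assumption
    sns.distplot(v, label=k, ax=ax)

plt.title('Sales Revenue Distribution')
print('Distributions appear roughly equal, although there appear to be outliers')
Distributions appear roughly equal, although there appear to be outliers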


def plt_discounts(d=regions):
    plt.axvline(spend_mu, color='purple')
fig, ax = plt.subplots(figsize=(8,5))
sns.barplot(data=dfreg, x='ShipRegion', y='Amount_Spent', ci=68, palette="rocket", ax=ax)
plt.title('Total Spend by Region', fontsize=16)
plt.axhline(spend_mu,linestyle="--", color='orange', linewidth=.6 )
plt.ylabel('Total Spend')


Hypothesis 4 Initial Observations:

The data are numeric, in a sample of 2,155 order lines.

There are 9 regions reflected in this dataset.

The average total spent is $587.37.

Initial visual inspection indicates a skewed but roughly similar distribution of total sales across regions. Additional testing will show whether the differences are significant.

Since we are comparing multiple regions to inspect their impact on the amount spent, an ANOVA test will be run. Assumptions for ANOVA testing:

  1. No significant outliers: a quick visual inspection suggests there are some outliers that could be removed

  2. Equal variance

  3. Normality: not required when each sample has more than 15 observations

Hypothesis 4 Assumption 1: Outliers

Outliers were identified and removed via z-score testing. Details are below:

regs = {}
for reg in dfreg['ShipRegion'].unique():
    regs[reg] = dfreg.groupby('ShipRegion').get_group(reg)['Amount_Spent']
for reg, reg_data in regs.items():
    idx_outs = fn.find_outliers_Z(reg_data)
    print(f'Found {idx_outs.sum()} outliers in the {reg}')
    regs[reg] = reg_data[~idx_outs]
print('\n All of these outliers were removed')
Found 11 outliers in the Western Europe
Found 2 outliers in the South America
Found 1 outliers in the Central America
Found 9 outliers in the North America
Found 2 outliers in the Northern Europe
Found 1 outliers in the Scandinavia
Found 3 outliers in the Southern Europe
Found 4 outliers in the British Isles
Found 0 outliers in the Eastern Europe

 All of these outliers were removed

Hypothesis 4 Assumption 2: Equal Variance

data = []
labels = []
for k,v in regs.items():
    # Loop body restored from context: collect each region's sample and its name
    data.append(v)
    labels.append(k)
stat,p = stats.levene(*data, center = 'median')
print(f'Levene test for equal variance results are {round(p,4)}')
sig = 'do NOT' if p < .05 else 'DO'

print(f'The groups {sig} have equal variance')
Levene test for equal variance results are 0.0
The groups do NOT have equal variance

Hypothesis 4 Assumption 3: Normality

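n = []
for reg, samples in regs.items():
    print(f'There are {len(samples)} samples in the data set for Regions #{reg}.')
    n.append(len(samples) > 15)  # record whether this sample exceeds 15 observations (restored from context)
if all(n):
    print('\nAll samples are >15: Normality Assumption Criterion is met.')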
There are 734 samples in the data set for Regions #Western Europe.
There are 353 samples in the data set for Regions #South America.
There are 71 samples in the data set for Regions #Central America.
There are 418 samples in the data set for Regions #North America.
There are 141 samples in the data set for Regions #Northern Europe.
There are 69 samples in the data set for Regions #Scandinavia.
There are 134 samples in the data set for Regions #Southern Europe.
There are 186 samples in the data set for Regions #British Isles.
There are 16 samples in the data set for Regions #Eastern Europe.

All samples are >15: Normality Assumption Criterion is met.

Using non-parametric Kruskal since the data set was not of equal variance:

stat, p = stats.kruskal(*data)
print(f"Kruskal test p value: {round(p,6)}")
Kruskal test p value: 0.0
cregs = fn.prep_data_for_tukeys(regs)  
data group
0 168.0 Western Europe
1 98.0 Western Europe
2 174.0 Western Europe
3 167.4 Western Europe
4 1696.0 Western Europe
... ... ...
1933 54.0 Eastern Europe
1934 199.5 Eastern Europe
1935 200.0 Eastern Europe
1936 232.5 Eastern Europe
2054 591.6 Eastern Europe

2122 rows × 2 columns

mu = cregs['data'].mean()
tukeyr = sms.stats.multicomp.pairwise_tukeyhsd(cregs['data'],cregs['group'])
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
British Isles Central America -185.8522 0.2271 -415.6671 43.9628 False
British Isles Eastern Europe -241.5782 0.6901 -670.7752 187.6188 False
British Isles North America 108.0077 0.3368 -37.1939 253.2093 False
British Isles Northern Europe 48.1291 0.9 -135.8232 232.0814 False
British Isles Scandinavia -137.1482 0.6379 -369.3612 95.0647 False
British Isles South America -38.1848 0.9 -187.4464 111.0767 False
British Isles Southern Europe -172.9738 0.0946 -359.6391 13.6914 False
British Isles Western Europe 124.6055 0.0989 -10.6288 259.8398 False
Central America Eastern Europe -55.726 0.9 -511.6242 400.1721 False
Central America North America 293.8599 0.001 82.3968 505.3229 True
Central America Northern Europe 233.9812 0.0623 -5.7511 473.7136 False
Central America Scandinavia 48.7039 0.9 -229.7849 327.1927 False
Central America South America 147.6673 0.4491 -66.6039 361.9385 False
Central America Southern Europe 12.8783 0.9 -228.942 254.6986 False
Central America Western Europe 310.4577 0.001 105.7104 515.205 True
Eastern Europe North America 349.5859 0.1926 -70.0708 769.2426 False
Eastern Europe Northern Europe 289.7073 0.4947 -144.8807 724.2953 False
Eastern Europe Scandinavia 104.43 0.9 -352.6817 561.5417 False
Eastern Europe South America 203.3934 0.8404 -217.6853 624.472 False
Eastern Europe Southern Europe 68.6044 0.9 -367.1389 504.3476 False
Eastern Europe Western Europe 366.1837 0.1372 -50.1293 782.4968 False
North America Northern Europe -59.8786 0.9 -220.316 100.5588 False
North America Scandinavia -245.1559 0.0115 -459.2227 -31.0892 True
North America South America -146.1925 0.0045 -265.2754 -27.1097 True
North America Southern Europe -280.9815 0.001 -444.5224 -117.4406 True
North America Western Europe 16.5978 0.9 -84.3478 117.5435 False
Northern Europe Scandinavia -185.2773 0.2974 -427.3094 56.7548 False
Northern Europe South America -86.3139 0.7597 -250.4349 77.807 False
Northern Europe Southern Europe -221.1029 0.0164 -419.8505 -22.3554 True
Northern Europe Western Europe 76.4764 0.7992 -74.9996 227.9525 False
Scandinavia South America 98.9634 0.8905 -117.8778 315.8046 False
Scandinavia Southern Europe -35.8256 0.9 -279.926 208.2748 False
Scandinavia Western Europe 261.7538 0.003 54.3185 469.189 True
South America Southern Europe -134.789 0.2306 -301.9451 32.3671 False
South America Western Europe 162.7904 0.001 56.0873 269.4934 True
Southern Europe Western Europe 297.5794 0.001 142.82 452.3387 True


** Note that the sample size for Eastern Europe is relatively small (16 observations), so its confidence intervals are much wider to accommodate for it, possibly skewing the results in the table below.
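The effect of the small Eastern Europe sample on interval width can be checked with a short calculation. In the Tukey-Kramer form of the test, the half-width of each pairwise interval is proportional to sqrt((1/n_i + 1/n_j) / 2) when the critical value and pooled error term are shared across pairs, so the ratio of two interval widths depends only on the group sizes. The sketch below uses the sample counts printed above; `width_factor` is a hypothetical helper name.

```python
import math

# Group sizes after outlier removal, as printed above
n = {'Eastern Europe': 16, 'North America': 418, 'Western Europe': 734}

def width_factor(n_i, n_j):
    """Sample-size term of the Tukey-Kramer interval half-width."""
    return math.sqrt((1 / n_i + 1 / n_j) / 2)

small = width_factor(n['Eastern Europe'], n['Western Europe'])
large = width_factor(n['North America'], n['Western Europe'])
ratio = small / large
print(round(ratio, 2))
```

The ratio comes out a little above 4, which is in line with the Tukey table: the Eastern Europe vs. Western Europe interval spans roughly 833 units, against roughly 202 for North America vs. Western Europe.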

tukeyr_df = fn.tukey_df(tukeyr)
tukeyr_false = tukeyr_df[tukeyr_df['reject']==False]
group1 group2 meandiff p-adj lower upper reject
0 British Isles Central America -185.8522 0.2271 -415.6671 43.9628 False
1 British Isles Eastern Europe -241.5782 0.6901 -670.7752 187.6188 False
2 British Isles North America 108.0077 0.3368 -37.1939 253.2093 False
3 British Isles Northern Europe 48.1291 0.9000 -135.8232 232.0814 False
4 British Isles Scandinavia -137.1482 0.6379 -369.3612 95.0647 False
5 British Isles South America -38.1848 0.9000 -187.4464 111.0767 False
6 British Isles Southern Europe -172.9738 0.0946 -359.6391 13.6914 False
7 British Isles Western Europe 124.6055 0.0989 -10.6288 259.8398 False
8 Central America Eastern Europe -55.7260 0.9000 -511.6242 400.1721 False
10 Central America Northern Europe 233.9812 0.0623 -5.7511 473.7136 False
11 Central America Scandinavia 48.7039 0.9000 -229.7849 327.1927 False
12 Central America South America 147.6673 0.4491 -66.6039 361.9385 False
13 Central America Southern Europe 12.8783 0.9000 -228.9420 254.6986 False
15 Eastern Europe North America 349.5859 0.1926 -70.0708 769.2426 False
16 Eastern Europe Northern Europe 289.7073 0.4947 -144.8807 724.2953 False
17 Eastern Europe Scandinavia 104.4300 0.9000 -352.6817 561.5417 False
18 Eastern Europe South America 203.3934 0.8404 -217.6853 624.4720 False
19 Eastern Europe Southern Europe 68.6044 0.9000 -367.1389 504.3476 False
20 Eastern Europe Western Europe 366.1837 0.1372 -50.1293 782.4968 False
21 North America Northern Europe -59.8786 0.9000 -220.3160 100.5588 False
25 North America Western Europe 16.5978 0.9000 -84.3478 117.5435 False
26 Northern Europe Scandinavia -185.2773 0.2974 -427.3094 56.7548 False
27 Northern Europe South America -86.3139 0.7597 -250.4349 77.8070 False
29 Northern Europe Western Europe 76.4764 0.7992 -74.9996 227.9525 False
30 Scandinavia South America 98.9634 0.8905 -117.8778 315.8046 False
31 Scandinavia Southern Europe -35.8256 0.9000 -279.9260 208.2748 False
33 South America Southern Europe -134.7890 0.2306 -301.9451 32.3671 False
tukeyr_trues = tukeyr_df[tukeyr_df['reject']==True]
group1 group2 meandiff p-adj lower upper reject
9 Central America North America 293.8599 0.0010 82.3968 505.3229 True
14 Central America Western Europe 310.4577 0.0010 105.7104 515.2050 True
22 North America Scandinavia -245.1559 0.0115 -459.2227 -31.0892 True
23 North America South America -146.1925 0.0045 -265.2754 -27.1097 True
24 North America Southern Europe -280.9815 0.0010 -444.5224 -117.4406 True
28 Northern Europe Southern Europe -221.1029 0.0164 -419.8505 -22.3554 True
32 Scandinavia Western Europe 261.7538 0.0030 54.3185 469.1890 True
34 South America Western Europe 162.7904 0.0010 56.0873 269.4934 True
35 Southern Europe Western Europe 297.5794 0.0010 142.8200 452.3387 True
mult_Cohn_d(tukeyr_trues, regs)
g1 g2 padj d
0 Central America North America 0.0010 -0.485386
1 Central America Western Europe 0.0010 -0.548229
2 North America Scandinavia 0.0115 0.402484
3 North America South America 0.0045 0.262021
4 North America Southern Europe 0.0010 0.485062
5 Northern Europe Southern Europe 0.0164 0.481990
6 Scandinavia Western Europe 0.0030 -0.460358
7 South America Western Europe 0.0010 -0.300535
8 Southern Europe Western Europe 0.0010 -0.539438
for reg, rev in regs.items():
    print(f'The average revenue for {reg} is ${round(rev.mean(),2)}')
The average revenue for Western Europe is $586.93
The average revenue for South America is $424.14
The average revenue for Central America is $276.47
The average revenue for North America is $570.33
The average revenue for Northern Europe is $510.45
The average revenue for Scandinavia is $325.18
The average revenue for Southern Europe is $289.35
The average revenue for British Isles is $462.33
The average revenue for Eastern Europe is $220.75

Hypothesis 4 Observations and Recommendations:


The average spend for this dataset across all regions was $500.

Regional averages are reported in the table below:

Region Average Revenue
Western Europe $586.93
North America $570.33
Northern Europe $510.45
British Isles $462.33
South America $424.14
Scandinavia $325.18
Southern Europe $289.35
Central America $276.47
Eastern Europe $220.75

For these groups, the assumption of equal variance was not met, so a Kruskal-Wallis test was conducted. The p-value for the non-parametric Kruskal-Wallis test was significant, so the null hypothesis that all regions spend the same amounts is rejected at the 5% significance level.
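For readers unfamiliar with the Kruskal-Wallis test, the sketch below shows how its H statistic is built from ranks rather than raw values. It is an illustration only: it omits the tie correction, and it stops at the statistic, whereas `scipy.stats.kruskal` applies the tie correction and derives the p-value from a chi-square distribution with k - 1 degrees of freedom. The function names and the three small groups are made up.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (without the tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

h = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 2))
```

Because only ranks enter the calculation, the statistic is far less sensitive to outliers and unequal variances than the ANOVA F statistic.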

Orders shipped to Western Europe, North America and Northern Europe generated the highest revenue; statistically, these regions are equal:

Region 1 Region 2 MeanDiff Adj P Reject Null
North America Northern Europe -59.8786 0.9000 False
North America Western Europe 16.5978 0.9000 False
Northern Europe Western Europe 76.4764 0.7992 False

For bottom performers that had enough data:

Region 1 Region 2 Adj P
South America Southern Europe 0.2306


Explore best practices from regions that are top performers.
Market analysis for regions that need to be developed. Leverage knowledge gained regarding categories and discounts.

indexr = list(cregs.groupby('group').mean().sort_values('data', ascending=False).index)
['Western Europe',
 'North America',
 'Northern Europe',
 'British Isles',
 'South America',
 'Scandinavia',
 'Southern Europe',
 'Central America',
 'Eastern Europe']
avspend = cregs['data'].mean()
fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(data=cregs, x='group', y='data', ci=68,order=indexr, palette="rocket", ax=ax)
plt.title('Average Revenue by Region', fontsize=20)
plt.axhline(avspend,linestyle="--", color='orange', linewidth=.6 )
plt.ylabel('Average Revenue')


In Closing:

Since the data provided did not include the purchase prices of merchandise, the analysis focused on ways to maximize revenue. The datasets were all multi-group comparisons, and none of the groups met all the assumptions for parametric testing. All groups called for Kruskal-Wallis and post-hoc testing, detailed in this notebook.

Hypothesis testing and data analysis revealed several ways to achieve this:

  • Minimizing discounts
  • Broadening revenue-generating categories

In addition, future analysis and testing could provide insight into how to:

  • Develop regional markets
  • Develop sales staff




