matthewsia98 / Shopify-Fall-2022-Data-Science-Challenge

import pandas as pd
df = pd.read_csv(r'https://docs.google.com/spreadsheets/d/16i38oonuX1y1g7C_UAmiK9GkY7cS-64DfiDMNiR41LM/export?format=csv&gid=0')
df
      order_id  shop_id  user_id  order_amount  total_items  payment_method           created_at
0            1       53      746           224            2            cash  2017-03-13 12:36:56
1            2       92      925            90            1            cash  2017-03-03 17:38:52
2            3       44      861           144            1            cash   2017-03-14 4:23:56
3            4       18      935           156            1     credit_card  2017-03-26 12:43:37
4            5       18      883           156            1     credit_card   2017-03-01 4:35:11
...        ...      ...      ...           ...          ...             ...                  ...
4995      4996       73      993           330            2           debit  2017-03-30 13:47:17
4996      4997       48      789           234            2            cash  2017-03-16 20:36:16
4997      4998       56      867           351            3            cash   2017-03-19 5:42:42
4998      4999       60      825           354            2     credit_card  2017-03-16 14:51:18
4999      5000       44      734           288            2           debit  2017-03-18 15:48:18

5000 rows × 7 columns

Question 1a

The naive average order value (AOV) is the total order amount divided by the number of orders, which comes to $3,145.13:

df['order_amount'].sum() / df['order_amount'].size
3145.128

Sorting by order amount shows why the average is skewed.

Shop 42 accounts for an implausibly large volume of sneakers: 34,063 pairs in 30 days, totalling $11,990,176.
Its largest orders are for 2,000 items and $704,000 each.

sorted_df = df.sort_values(['order_amount', 'total_items'], ascending=[False, True])
sorted_df
      order_id  shop_id  user_id  order_amount  total_items  payment_method           created_at
15          16       42      607        704000         2000     credit_card   2017-03-07 4:00:00
60          61       42      607        704000         2000     credit_card   2017-03-04 4:00:00
520        521       42      607        704000         2000     credit_card   2017-03-02 4:00:00
1104      1105       42      607        704000         2000     credit_card   2017-03-24 4:00:00
1362      1363       42      607        704000         2000     credit_card   2017-03-15 4:00:00
...        ...      ...      ...           ...          ...             ...                  ...
4219      4220       92      747            90            1     credit_card  2017-03-25 20:16:58
4414      4415       92      927            90            1     credit_card   2017-03-17 9:57:01
4760      4761       92      937            90            1           debit   2017-03-20 7:37:28
4923      4924       92      965            90            1     credit_card   2017-03-09 5:05:11
4932      4933       92      823            90            1     credit_card   2017-03-24 2:17:13

5000 rows × 7 columns

df.loc[df['shop_id'] == 42, ['total_items', 'order_amount']].sum()
total_items        34063
order_amount    11990176
dtype: int64

Question 1b

The median is a better metric to report because it is robust to outliers: a handful of extreme orders cannot pull it far from the typical order value.
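To make the outlier argument concrete, here is a minimal sketch using Python's standard `statistics` module. The list `typical_orders` is made up for illustration and is not taken from the dataset; only the 704000 value mirrors the largest order amount shown above.

```python
from statistics import mean, median

# Made-up order amounts for illustration; not rows from the challenge dataset.
typical_orders = [90, 156, 224, 284, 390]
# Add a single bulk order of the size seen for shop 42.
with_outlier = typical_orders + [704000]

mean_before, median_before = mean(typical_orders), median(typical_orders)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean jumps by several orders of magnitude; the median barely moves.
print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

In this toy example one extreme order moves the mean from 228.80 to 117524.00, while the median only shifts from 224 to 254.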

Question 1c

The median order value is $284.00.

df['order_amount'].describe()
count      5000.000000
mean       3145.128000
std       41282.539349
min          90.000000
25%         163.000000
50%         284.000000
75%         390.000000
max      704000.000000
Name: order_amount, dtype: float64

Question 2a

54 orders were shipped by Speedy Express.

SELECT COUNT(*) FROM Orders
WHERE ShipperID IN
(SELECT ShipperID FROM Shippers
WHERE ShipperName = 'Speedy Express')
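As a rough sanity check of the query shape, the sketch below runs it against a tiny in-memory SQLite database and compares it with an equivalent JOIN formulation. The table contents are hypothetical and only mimic the column names used in the query; they are not the real data behind the answer of 54.

```python
import sqlite3

# Hypothetical miniature tables; the rows are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
    INSERT INTO Shippers VALUES (1, 'Speedy Express'), (2, 'Other Shipper');
    INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 1), (13, 1);
""")

# Subquery form, as in the answer above.
subquery_count = conn.execute("""
    SELECT COUNT(*) FROM Orders
    WHERE ShipperID IN
        (SELECT ShipperID FROM Shippers
         WHERE ShipperName = 'Speedy Express')
""").fetchone()[0]

# Equivalent JOIN form; it matches because ShipperID is unique in Shippers.
join_count = conn.execute("""
    SELECT COUNT(*) FROM Orders
    JOIN Shippers ON Orders.ShipperID = Shippers.ShipperID
    WHERE Shippers.ShipperName = 'Speedy Express'
""").fetchone()[0]

print(subquery_count, join_count)
conn.close()
```

Both forms count three of the four made-up orders. The two are interchangeable here only because each ShipperID appears once in Shippers; with duplicate shipper rows the JOIN would overcount.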

Question 2b

The employee with the last name "Peacock" has the most orders, with a total of 40.

SELECT Employees.LastName, COUNT(*) AS "NumOrders"
FROM Orders
JOIN Employees
ON Orders.EmployeeID = Employees.EmployeeID
GROUP BY Orders.EmployeeID
ORDER BY 2 DESC
LIMIT 1

Question 2c

The product most ordered by customers in Germany is "Boston Crab Meat" (ProductID 40), with a total quantity of 160.

SELECT OrderDetails.ProductID, Products.ProductName, SUM(OrderDetails.Quantity) AS "Quantity"
FROM Orders
JOIN OrderDetails
ON Orders.OrderID = OrderDetails.OrderID
JOIN Products
ON OrderDetails.ProductID = Products.ProductID
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID
WHERE Customers.Country = 'Germany'
GROUP BY OrderDetails.ProductID
ORDER BY 3 DESC
LIMIT 1
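The sketch below exercises the same join-filter-aggregate pattern on a small in-memory SQLite database, to show that the country filter changes which product ranks first. All rows and product names ('Widget', 'Gadget') are hypothetical; only the table and column names follow the query above.

```python
import sqlite3

# Hypothetical miniature schema; every row here is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
    INSERT INTO Customers VALUES (1, 'Germany'), (2, 'France');
    INSERT INTO Products VALUES (100, 'Widget'), (200, 'Gadget');
    INSERT INTO Orders VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO OrderDetails VALUES (1, 100, 5), (1, 200, 8), (2, 100, 6), (3, 200, 50);
""")

# Sum quantities per product, restricted to orders placed by German customers.
top_product = conn.execute("""
    SELECT OrderDetails.ProductID, Products.ProductName,
           SUM(OrderDetails.Quantity) AS Quantity
    FROM Orders
    JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID
    JOIN Products ON OrderDetails.ProductID = Products.ProductID
    JOIN Customers ON Orders.CustomerID = Customers.CustomerID
    WHERE Customers.Country = 'Germany'
    GROUP BY OrderDetails.ProductID
    ORDER BY 3 DESC
    LIMIT 1
""").fetchone()

print(top_product)
conn.close()
```

With these made-up rows the German total for 'Widget' is 11 against 8 for 'Gadget', even though 'Gadget' has the larger quantity overall once the French order is included. Note that `LIMIT 1` returns a single row, so a tie for first place would be hidden.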
