import pandas as pd

df = pd.read_csv(r'https://docs.google.com/spreadsheets/d/16i38oonuX1y1g7C_UAmiK9GkY7cS-64DfiDMNiR41LM/export?format=csv&gid=0')
df

	order_id	shop_id	user_id	order_amount	total_items	payment_method	created_at
0	1	53	746	224	2	cash	2017-03-13 12:36:56
1	2	92	925	90	1	cash	2017-03-03 17:38:52
2	3	44	861	144	1	cash	2017-03-14 4:23:56
3	4	18	935	156	1	credit_card	2017-03-26 12:43:37
4	5	18	883	156	1	credit_card	2017-03-01 4:35:11
...	...	...	...	...	...	...	...
4995	4996	73	993	330	2	debit	2017-03-30 13:47:17
4996	4997	48	789	234	2	cash	2017-03-16 20:36:16
4997	4998	56	867	351	3	cash	2017-03-19 5:42:42
4998	4999	60	825	354	2	credit_card	2017-03-16 14:51:18
4999	5000	44	734	288	2	debit	2017-03-18 15:48:18

5000 rows × 7 columns

Question 1a

The naive average order value (AOV) is the mean of order_amount across all 5,000 orders:

df['order_amount'].mean()

3145.128

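Sorting by order amount shows why the mean is skewed: the largest orders are repeated purchases of 2,000 items for $704,000 each, placed at shop 42 by the same user (607).

Shop 42 sold 34,063 pairs of sneakers in 30 days for a total of $11,990,176, which is far out of line with the rest of the orders.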

sorted_df = df.sort_values(['order_amount', 'total_items'], ascending=[False, True])
sorted_df

	order_id	shop_id	user_id	order_amount	total_items	payment_method	created_at
15	16	42	607	704000	2000	credit_card	2017-03-07 4:00:00
60	61	42	607	704000	2000	credit_card	2017-03-04 4:00:00
520	521	42	607	704000	2000	credit_card	2017-03-02 4:00:00
1104	1105	42	607	704000	2000	credit_card	2017-03-24 4:00:00
1362	1363	42	607	704000	2000	credit_card	2017-03-15 4:00:00
...	...	...	...	...	...	...	...
4219	4220	92	747	90	1	credit_card	2017-03-25 20:16:58
4414	4415	92	927	90	1	credit_card	2017-03-17 9:57:01
4760	4761	92	937	90	1	debit	2017-03-20 7:37:28
4923	4924	92	965	90	1	credit_card	2017-03-09 5:05:11
4932	4933	92	823	90	1	credit_card	2017-03-24 2:17:13

5000 rows × 7 columns

df.loc[df['shop_id'] == 42, ['total_items', 'order_amount']].sum()

total_items        34063
order_amount    11990176
dtype: int64

Question 1b

The median is a better metric to report because it is resilient to outliers: a handful of extreme orders dominates the mean but barely moves the median.
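To illustrate the difference, here is a minimal sketch using Python's standard `statistics` module. The five order amounts are hypothetical values chosen for illustration (four typical orders plus one bulk order), not a sample drawn from the dataset.

```python
from statistics import mean, median

# Hypothetical order amounts: four typical orders plus one extreme bulk order.
order_amounts = [90, 156, 224, 284, 704000]

naive_mean = mean(order_amounts)        # pulled far upward by the single bulk order
robust_median = median(order_amounts)   # middle value, unaffected by the extreme

print(f"mean:   {naive_mean:.2f}")
print(f"median: {robust_median:.2f}")
```

The single bulk order is enough to push the mean well above every typical order, while the median stays at a value a typical customer would actually pay.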

Question 1c

The median order value is $284.00.

df['order_amount'].describe()

count      5000.000000
mean       3145.128000
std       41282.539349
min          90.000000
25%         163.000000
50%         284.000000
75%         390.000000
max      704000.000000
Name: order_amount, dtype: float64

Question 2a

Speedy Express shipped 54 orders.

SELECT COUNT(*) FROM Orders
WHERE ShipperID IN
(SELECT ShipperID FROM Shippers
WHERE ShipperName = 'Speedy Express')

Question 2b

The employee with the last name "Peacock" has the most orders, with 40 in total.

SELECT Employees.LastName, COUNT(*) AS "NumOrders"
FROM Orders
JOIN Employees
ON Orders.EmployeeID = Employees.EmployeeID
GROUP BY Orders.EmployeeID
ORDER BY 2 DESC
LIMIT 1

Question 2c

The product most ordered by customers in Germany is "Boston Crab Meat" (ProductID 40), with a total quantity of 160.

SELECT OrderDetails.ProductID, Products.ProductName, SUM(OrderDetails.Quantity) AS "Quantity"
FROM Orders
JOIN OrderDetails
ON Orders.OrderID = OrderDetails.OrderID
JOIN Products
ON OrderDetails.ProductID = Products.ProductID
JOIN Customers
ON Orders.CustomerID = Customers.CustomerID
WHERE Customers.Country = 'Germany'
GROUP BY OrderDetails.ProductID
ORDER BY 3 DESC
LIMIT 1
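A query of this shape can be sanity-checked locally with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table and column names follow the query above, but the schema is reduced to the columns the query touches and every row is made up, so the result says nothing about the real database.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);

-- Made-up rows: two German customers and one French customer.
INSERT INTO Customers VALUES (1, 'Germany'), (2, 'Germany'), (3, 'France');
INSERT INTO Products VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO Orders VALUES (100, 1), (101, 2), (102, 3);
INSERT INTO OrderDetails VALUES
    (100, 10, 5), (100, 20, 8), (101, 10, 7), (102, 20, 50);
""")

# Total quantity per product, restricted to orders from customers in Germany.
query = """
SELECT OrderDetails.ProductID, Products.ProductName,
       SUM(OrderDetails.Quantity) AS TotalQuantity
FROM Orders
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID
JOIN Products ON OrderDetails.ProductID = Products.ProductID
JOIN Customers ON Orders.CustomerID = Customers.CustomerID
WHERE Customers.Country = 'Germany'
GROUP BY OrderDetails.ProductID, Products.ProductName
ORDER BY TotalQuantity DESC
LIMIT 1
"""
top_product = conn.execute(query).fetchone()
print(top_product)
```

In the toy data the French order has the largest single quantity, so the check confirms that the country filter is applied before the quantities are summed.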
