Module 2 Code Challenge

Welcome to your Module 2 Code Challenge. This code challenge is designed to test your understanding of the Module 2 material. It covers:

Statistical Distributions
Statistical Tests
Bayesian Statistics
Linear Regression

Read the instructions carefully. You will be asked both to write code and to answer a few short answer questions.

Note on the short answer questions:

For the short answer questions, please use your own words. You may consult other sources to help craft your response, but the expectation is that you have not copied and pasted from them. These questions are not necessarily assessed on grammatical correctness or sentence structure, but you should do your best to express yourself clearly.

# Run this cell without changes to import the necessary libraries

# Use any additional libraries you like to complete this assessment 

import itertools
import numpy as np
import pandas as pd 
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pickle

import statsmodels.api as sm
from statsmodels.formula.api import ols

Part 1: Statistical Distributions [Suggested time: 20 minutes]

Normal Distributions

Let's consider check totals at a TexMex restaurant. We know that check totals in the population are normally distributed with a mean of $\mu$ = \$20 and a standard deviation of $\sigma$ = \$3.

1.1) Compute the z-score for a \$26 check.

# Code here

1.2) Approximately what percentage of all checks are less than \$26? Explain how you came to your answer.

You can answer this using the empirical rule or a z-table.
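As a rough illustration of the z-score and "percentage below" reasoning in 1.1 and 1.2, the sketch below uses made-up population values (mean 50, standard deviation 8, observation 58), not the challenge's numbers. It assumes `scipy.stats.norm.cdf` as a stand-in for a z-table lookup.

```python
from scipy import stats

# Hypothetical population parameters (NOT the values from this challenge)
mu = 50.0      # population mean
sigma = 8.0    # population standard deviation
x = 58.0       # observed value

# z-score: how many standard deviations x lies from the mean
z = (x - mu) / sigma

# Share of the population below x, read from the standard normal CDF
share_below = stats.norm.cdf(z)

print(f"z = {z:.2f}")                          # z = 1.00
print(f"share below x = {share_below:.4f}")    # share below x = 0.8413
```

With a z-score of 1, the empirical rule gives the same ballpark: 50% of values lie below the mean, plus half of the central 68%, for roughly 84%.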

"""
Written answer here
"""

Confidence Intervals

One month, a waiter gets 500 checks with a mean amount of \$19 and a standard deviation of \$3.

1.3) Use this sample to calculate a 95% confidence interval for the mean of this waiter's check amounts. Interpret the result.
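One way to build a confidence interval for a mean from summary statistics is sketched below. The sample size, mean, and standard deviation are hypothetical placeholders, and the choice of a t critical value (rather than a normal one) is an assumption; with a large sample the two are nearly identical.

```python
import math
from scipy import stats

# Hypothetical sample summary (NOT the values from this challenge)
n = 400
sample_mean = 50.0
sample_std = 10.0

# Standard error of the mean
standard_error = sample_std / math.sqrt(n)

# Two-sided 95% critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

margin = t_crit * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

An interval like this is usually read as a statement about the procedure: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean.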

# Code here

"""
Written answer here
"""

Part 2: Statistical Testing [Suggested time: 20 minutes]

The TexMex restaurant recently introduced queso to its menu.

We have random samples of 1000 "no queso" order check totals and 1000 "queso" order check totals for orders made by different customers.

In the cell below, we load the sample data for you into the arrays no_queso and queso for the "no queso" and "queso" order check totals, respectively. Then, we create histograms of the distribution of the check amounts for the "no queso" and "queso" samples.

# Run this cell without changes

# Load the sample data 
with open('data/no_queso.pkl', 'rb') as f:
    no_queso = pickle.load(f)
with open('data/queso.pkl', 'rb') as f:
    queso = pickle.load(f)

# Plot histograms

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.set_title('Sample of Non-Queso Check Totals')
ax1.set_xlabel('Amount')
ax1.set_ylabel('Frequency')
ax1.hist(no_queso, bins=20)

ax2.set_title('Sample of Queso Check Totals')
ax2.set_xlabel('Amount')
ax2.set_ylabel('Frequency')
ax2.hist(queso, bins=20)
plt.show()

Hypotheses and Errors

The restaurant owners want to know if customers who order queso spend more or less than customers who do not order queso.

2.1) Set up the null hypothesis $H_{0}$ and the alternative hypothesis $H_{A}$ for this test.

"""
Written answer here
"""

2.2) What does it mean to make a `Type I` error or a `Type II` error in this specific context?

"""
Written answer here
"""

Sample Testing

2.3) Run a statistical test on the two samples. Can you reject the null hypothesis?

Use a significance level of $\alpha = 0.05$. You can assume the two samples have equal variance.

You can use scipy.stats, which has already been imported as stats, to find the answer if you like. Refer to its statistical testing documentation as needed.
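A two-sample comparison of this kind can be sketched with `scipy.stats.ttest_ind`. The arrays below are synthetic stand-ins generated with NumPy (the names `group_a` and `group_b` and their parameters are hypothetical), not the pickled samples loaded above.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (NOT the no_queso / queso samples)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=20.0, scale=3.0, size=1000)
group_b = rng.normal(loc=21.0, scale=3.0, size=1000)

# Two-sided, two-sample t-test assuming equal variances
result = stats.ttest_ind(group_a, group_b, equal_var=True)
t_stat, p_value = result.statistic, result.pvalue

# Compare the p-value to the chosen significance level
alpha = 0.05
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

`ttest_ind` is two-sided by default, which matches a "more or less" style question; a one-sided question would need a different alternative.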

# Code here

"""
Written answer here
"""

Part 3: Bayesian Statistics [Suggested time: 15 minutes]

Bayes' Theorem

A medical test is designed to diagnose a certain disease. The test has a false positive rate of 10%, meaning that 10% of people without the disease will get a positive test result. The test has a false negative rate of 2%, meaning that 2% of people with the disease will get a negative result. Only 1% of the population has this disease.

3.1) What is the probability of receiving a positive test result? Show how you arrive at your answer.

Assume that the person being tested is randomly selected from the broader population. You can show your work using text, code, or both.

"""
Written answer with probability notation here
"""

# Code to calculate the probability here

3.2) If a patient receives a positive test result, what is the probability that they actually have the disease? Show how you arrive at your answer.

Hint: Use your answer to the previous question to answer this one. You can show your work using text, code, or both.

"""
Written answer with probability notation here
"""

# Code to calculate the probability here
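The two steps above (total probability of a positive result, then Bayes' theorem) can be sketched as follows. The prevalence and error rates are hypothetical and deliberately differ from the ones in this challenge.

```python
# Hypothetical rates for illustration (NOT the values from this challenge)
prevalence = 0.05            # P(disease)
false_positive_rate = 0.20   # P(positive | no disease)
false_negative_rate = 0.10   # P(negative | disease)

# Sensitivity is the complement of the false negative rate: P(positive | disease)
sensitivity = 1 - false_negative_rate

# Law of total probability:
# P(positive) = P(positive | disease) P(disease) + P(positive | no disease) P(no disease)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem:
# P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(positive) = {p_positive:.4f}")                              # 0.2350
print(f"P(disease | positive) = {p_disease_given_positive:.4f}")      # 0.1915
```

Even with a fairly sensitive test, a low prevalence keeps the posterior probability modest because most positive results come from the much larger group without the disease.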

Part 4: Linear Regression [Suggested time: 20 minutes]

In this section, you'll be using the Advertising data to run regression models. In this dataset, each row represents a different product, and we have a sample of 200 products from a larger population of products. We have three features - TV, radio, and newspaper - that describe how many thousands of advertising dollars were spent promoting the product. The target, sales, describes how many millions of dollars in sales the product had.

The relevant modules have already been imported at the beginning of this notebook. We'll load and prepare the dataset for you below.

# Run this cell without changes

data = pd.read_csv('data/advertising.csv').drop('Unnamed: 0', axis=1)
data.describe()

               TV       radio   newspaper       sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000

# Run this cell without changes

X = data.drop('sales', axis=1)
y = data['sales']

Simple Linear Regression

4.1) Use StatsModels' `ols` function to run a linear regression model using `TV` to predict `sales`.

Required output: the summary of this regression model.

# Code here

4.2) Can we infer that products with higher TV advertising spend tend to have greater sales? Explain how you determined this from the model output.

This question is asking you to use your findings from the sample in your dataset to make an inference about the relationship between TV advertising spend and sales in the broader population.
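As a minimal sketch of how a fitted slope and its p-value support an inference about the population, the example below fits a simple regression with the StatsModels formula API on synthetic data. The column names `spend` and `revenue` and the data-generating values are hypothetical, not the Advertising dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only (NOT the Advertising dataset)
rng = np.random.default_rng(42)
spend = rng.uniform(0, 100, size=200)
revenue = 5.0 + 0.3 * spend + rng.normal(0, 2.0, size=200)
df = pd.DataFrame({"spend": spend, "revenue": revenue})

# Fit revenue ~ spend with ordinary least squares
model = ols(formula="revenue ~ spend", data=df).fit()

# Slope estimate and the p-value for H0: slope = 0
slope = model.params["spend"]
slope_p = model.pvalues["spend"]

print(model.summary())
print(f"slope = {slope:.3f}, p-value = {slope_p:.3g}")
```

A positive slope with a p-value below the chosen significance level is the usual evidence that the relationship also holds in the broader population, subject to the regression assumptions.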

"""
Written answer here
"""

Multiple Linear Regression

4.3) Compute a correlation matrix for `X`. Given these correlation coefficients, would there be any issue if you included all of these features in one regression model?
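The correlation check described here can be sketched with `DataFrame.corr()`. The features `a`, `b`, and `c` are synthetic, and the 0.7 cutoff for flagging a pair is an assumed rule of thumb rather than a value taken from this challenge.

```python
import itertools
import numpy as np
import pandas as pd

# Synthetic features for illustration only (NOT the Advertising dataset)
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = 0.9 * a + rng.normal(scale=0.3, size=300)   # built to track a closely
c = rng.normal(size=300)                        # independent of a and b
features = pd.DataFrame({"a": a, "b": b, "c": c})

# Pairwise Pearson correlations between the features
corr = features.corr()
print(corr.round(3))

# Flag feature pairs whose absolute correlation exceeds an assumed threshold
threshold = 0.7
high_pairs = [
    (i, j, corr.loc[i, j])
    for i, j in itertools.combinations(corr.columns, 2)
    if abs(corr.loc[i, j]) > threshold
]
print(high_pairs)
```

Strongly correlated predictors like `a` and `b` point to multicollinearity, which makes individual coefficient estimates unstable and harder to interpret.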

# Code here

"""
Written answer here
"""

4.4) Use StatsModels' `ols` function to run a multiple linear regression model with `TV`, `radio`, and `newspaper` as independent variables and `sales` as the dependent variable.

Required output: the summary of this regression model.

# Code here

4.5) Does this model do a better job of explaining sales than the previous model using only the `TV` feature? Explain how you determined this based on the model output.
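One common way to compare a smaller and a larger regression model is adjusted R-squared, which penalizes added predictors. The sketch below assumes that criterion and uses synthetic data with hypothetical columns `x1`, `x2`, and `y`, not the Advertising dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only (NOT the Advertising dataset)
rng = np.random.default_rng(7)
n = 200
x1 = rng.uniform(0, 100, size=n)
x2 = rng.uniform(0, 50, size=n)
y = 3.0 + 0.05 * x1 + 0.2 * x2 + rng.normal(0, 1.0, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Fit a one-predictor model and a two-predictor model on the same data
simple = ols(formula="y ~ x1", data=df).fit()
full = ols(formula="y ~ x1 + x2", data=df).fit()

# Plain R-squared never decreases when predictors are added,
# so adjusted R-squared is the fairer comparison
print(f"simple: R2 = {simple.rsquared:.3f}, adj R2 = {simple.rsquared_adj:.3f}")
print(f"full:   R2 = {full.rsquared:.3f}, adj R2 = {full.rsquared_adj:.3f}")
```

In a `summary()` table these values appear as "R-squared" and "Adj. R-squared", so the same comparison can be read directly from the two model outputs.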

"""
Written answer here
"""
