Mock Belt Exam Revisited - For Class

05/05/05

Original Instructions

Data Enrichment Mock Exam

API results:

https://drive.google.com/file/d/10iWPhZtId0R9RCiVculSozCwldG-V3eH/view?usp=sharing

Read in the json file
Separate the records into 4 tables each a pandas dataframe
Transform In this case remove dollar signs from funded amount in the financials records and convert to numeric datatype
Create a database with SQLAlchemy and add the tables to the datbase

Perform a hypothesis test to determine if there is a signficant difference between the funded amount when it is all males and when there is at least one female in the group.

Follow-Up Hypothesis to Test (if there's time)

If there is time, perform an additional hypothesis test to determine if there is a significant difference in the funded amount for different sectors.

ETL of JSON File

import json
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats


import pymysql
pymysql.install_as_MySQLdb()

from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists

Extract

## Loading json file
with open('Mock_Crowdsourcing_API_Results.json') as f:
    results = json.load(f)
results.keys()

## explore each key 
type(results['meta'])

## display meta
results['meta']

## display data
type(results['data'])

## preview the dictionary
# results['data']

## preview just the keys
results['data'].keys()

## what does the crowd key look like?
# results['data']['crowd']

## checking single entry of crowd
results['data']['crowd'][0]

## making crowd a dataframe
crowd = pd.DataFrame(results['data']['crowd'])
crowd

## making demographics a dataframe
demo = pd.DataFrame(results['data']['demographics'])
demo

## making financials a dataframe
financials = pd.DataFrame(results['data']['financials'])
financials

## making use a dataframe
use = pd.DataFrame(results['data']['use'])
use

Transform

## fixing funded amount column
financials['funded_amount'] = financials['funded_amount'].str.replace('$','')
financials['funded_amount'] = pd.to_numeric(financials['funded_amount'])
financials

Load

## loading mysql credentials
with open('/Users/codingdojo/.secret/mysql.json') as f:
    login = json.load(f)
login.keys()

## creating connection to database with sqlalchemy
connection_str  = f"mysql+pymysql://{login['user']}:{login['password']}@localhost/mock-belt-exam"
engine = create_engine(connection_str)

## Check if database exists, if not, create it
if database_exists(connection_str) == False: 
    create_database(connection_str)
else: 
    print('The database already exists.')

## saving dataframes to database
financials.to_sql('financials', engine, index=False, if_exists = 'replace')
use.to_sql('use', engine, index=False, if_exists = 'replace')
demo.to_sql('demographics', engine, index=False, if_exists = 'replace')
crowd.to_sql('crowd',engine, index=False, if_exists = 'replace')

## checking if tables created
q= '''SHOW TABLES;'''
pd.read_sql(q,engine)

Hypothesis Testing

Follow the Guide: Choosing the Right Hypothesis Test from the LP.

1. State the Hypothesis & Null Hypothesis

$H_0$ (Null Hypothesis):
$H_A$ (Alternative Hypothesis):

coding-dojo-data-science / data-enrichment-hypothesis-testing-codealong

Mock Belt Exam Revisited - For Class

Original Instructions

Follow-Up Hypothesis to Test (if there's time)

ETL of JSON File

Extract

Transform

Load

Hypothesis Testing

1. State the Hypothesis & Null Hypothesis

About

Languages