DS/ML review

import pandas as pd 
import numpy as np
from sklearn.preprocessing import LabelEncoder

file_name = 'input/wine-reviews/winemag-data-130k-v2.csv'

stock_data = pd.read_csv('something.csv')

import with the first column used as the index

stock_data = pd.read_csv('something.csv', index_col=0)

basic dataframe info

stock_data.dtypes # dtype of each column, OR stock_data.info() for dtypes plus non-null counts

an array of the unique names/values
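The snippet for this heading is missing from the notes. A minimal sketch, assuming a hypothetical `ticker` column on a small stand-in for `stock_data`:

```python
import pandas as pd

# Hypothetical stand-in for stock_data; the 'ticker' column is an assumption
stock_data = pd.DataFrame({'ticker': ['TSLA', 'AAPL', 'TSLA', 'MSFT']})

# unique() returns an array of the distinct values, in order of first appearance
unique_tickers = stock_data['ticker'].unique()
print(unique_tickers)
```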


a list of unique values and how often they occur

stock_data.sell_date.value_counts() # how many stocks were sold each day

a different way of doing the previous


or get the cheapest wine in each region
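The two groupby snippets are missing from the notes. One possible reconstruction, on made-up sample rows that reuse the `region_1` and `price` column names from the wine reviews data:

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the wine dataset
wine_reviews = pd.DataFrame({
    'region_1': ['Napa Valley', 'Napa Valley', 'Bordeaux', 'Bordeaux'],
    'price': [45.0, 30.0, 60.0, 25.0],
})

# Same counts as value_counts(), but ordered by group key instead of by frequency
reviews_per_region = wine_reviews.groupby('region_1')['region_1'].count()

# Cheapest price within each region
cheapest_per_region = wine_reviews.groupby('region_1')['price'].min()

print(reviews_per_region)
print(cheapest_per_region)
```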


access individual columns


indexed by numbers


indexed by ticker
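The column-access snippets are missing here. A rough sketch of all three cases, assuming hypothetical `ticker` and `price` columns:

```python
import pandas as pd

# Hypothetical stand-in for stock_data
stock_data = pd.DataFrame({'ticker': ['TSLA', 'AAPL'], 'price': [250.0, 190.0]})

# access an individual column (stock_data.price is equivalent)
prices = stock_data['price']

# indexed by numbers: the default index is 0, 1, 2, ...
first_price = stock_data['price'][0]

# indexed by ticker: once ticker is the index, look values up by label
by_ticker = stock_data.set_index('ticker')
tesla_price = by_ticker['price']['TSLA']
```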


pandas indexing with iloc and loc

iloc - selects by integer position - a single number grabs the entire row, and a second indexer can grab columns

stock_data.iloc[25] # row of Tesla stock

loc - selects by label - rows first, then columns (here: all rows, two named columns)

stock_data.loc[:, ['price', 'sell_date']]

change the index
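The snippet is missing here; `set_index` is the usual way to do this. A small sketch, assuming a hypothetical `ticker` column:

```python
import pandas as pd

# Hypothetical stand-in for stock_data
stock_data = pd.DataFrame({'ticker': ['TSLA', 'AAPL'], 'price': [250.0, 190.0]})

# set_index returns a new dataframe; the original is left unchanged
indexed = stock_data.set_index('ticker')

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
```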


selecting data based on a condition

stock_data.q3_growth > 0 # boolean Series: True for stocks that appreciated in the 3rd quarter
stock_data.loc[stock_data.q3_growth > 0] # dataframe of just those stocks
stock_data.loc[(stock_data.q3_growth > 0) & (stock_data.q2_growth > 0)] # stocks that appreciated in both quarters

filter based on a condition or conditions

reviews.loc[(reviews.country == 'Italy') | (reviews.country == 'France')]
reviews.loc[reviews.country.isin(['Italy', 'France'])]

filter out nulls


filter for just the nulls
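Both null filters are missing from the notes. A sketch with `notnull()` and `isnull()`, on made-up rows using the `price` column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows with one missing price
wine_reviews = pd.DataFrame({
    'country': ['Italy', 'France', 'Spain'],
    'price': [20.0, 15.0, np.nan],
})

# filter out nulls: keep rows where price is present
priced = wine_reviews.loc[wine_reviews.price.notnull()]

# filter for just the nulls: keep rows where price is missing
unpriced = wine_reviews.loc[wine_reviews.price.isnull()]
```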


assigning or building out new data - common practice for feature engineering

wine_reviews['top_rated_regions'] = wine_reviews['country'] + ' - ' + wine_reviews['region_1']

recast a column as a different data type

wine_reviews.points.astype('float64') # convert from int to float; returns a new Series, so assign it back to keep the change

rename a column

wine_reviews.rename(columns={'points': 'score'}) # returns a new dataframe; assign it to keep the change

if we collected data based on country and the data was identically formatted

pd.concat([us_wine_reviews, france_wine_reviews])

various pandas join methods for combining dataframes
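No join example survives in the notes. A short sketch of `merge` with three common `how` options; the two dataframes and the `winery` key are made up for illustration:

```python
import pandas as pd

# Hypothetical tables that share a 'winery' key
reviews = pd.DataFrame({'winery': ['A', 'B', 'C'], 'points': [90, 85, 88]})
wineries = pd.DataFrame({'winery': ['A', 'B', 'D'], 'country': ['Italy', 'France', 'Spain']})

# inner: only keys present in both tables
inner = reviews.merge(wineries, on='winery', how='inner')

# left: every row of reviews; unmatched rows get NaN for the right-hand columns
left = reviews.merge(wineries, on='winery', how='left')

# outer: every key from either table
outer = reviews.merge(wineries, on='winery', how='outer')
```

`DataFrame.join` does the same kind of thing but matches on the index by default.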

figure out what percentage of the values are missing

how many total missing values do we have?

total_cells = np.prod(wine_reviews.shape) # np.product was removed in NumPy 2.0

get the number of missing data points per column

missing_values_count = wine_reviews.isnull().sum()
total_missing = missing_values_count.sum()

percent of data that is missing

percent_missing = (total_missing/total_cells) * 100
print(f'Percent missing: {percent_missing}')

if it's an extremely small percentage of data with NaNs, drop those rows

wine_reviews.dropna(subset=['variety'], inplace=True)

or remove a singular problematic column that's rife with NaNs or errors

wine_reviews = wine_reviews.drop(['region_2'], axis=1) # drop returns a new dataframe

if there's a date present, it's a good idea to check if that column is being recognized as a date dtype


if not - parse the likely string type into a datetime object

if it's a standard format like 2/8/18 or 23-10-1998

stock_data['sale_date_2'] = pd.to_datetime(stock_data['sale_date'], format='%m/%d/%y') # matches 2/8/18; use '%d-%m-%Y' for 23-10-1998

then double check the reformat
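The dtype checks before and after parsing are missing from the notes. One way to do both, on a made-up `sale_date` column:

```python
import pandas as pd

# Hypothetical stand-in for stock_data with dates stored as strings
stock_data = pd.DataFrame({'sale_date': ['2/8/18', '10/23/18']})

# before parsing: a string/object dtype means pandas is not treating it as a date
print(stock_data['sale_date'].dtype)
was_datetime = pd.api.types.is_datetime64_any_dtype(stock_data['sale_date'])

# parse, then check again: the dtype should now be datetime64
stock_data['sale_date_2'] = pd.to_datetime(stock_data['sale_date'], format='%m/%d/%y')
print(stock_data['sale_date_2'].dtype)
is_datetime = pd.api.types.is_datetime64_any_dtype(stock_data['sale_date_2'])
```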


if you run into multiple date formats, try inferring - infer_datetime_format=True (deprecated since pandas 2.0; format='mixed' parses each value individually instead)

select and plot the day of the month that stocks were sold

import seaborn as sns
stock_data_sell_dates = stock_data['sale_date_2'].dt.day
sns.histplot(stock_data_sell_dates, bins=31) # histplot replaces the deprecated distplot(kde=False)

if strings - particularly names of something (country, city, etc...) - are similar

use something like the fuzzywuzzy library to correct this
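The fuzzywuzzy snippet is missing from the notes. The sketch below swaps in `difflib` from the standard library to show the same idea without an extra install. It is not the fuzzywuzzy API. The `normalize` helper, the sample names and the 0.8 cutoff are all illustrative assumptions:

```python
from difflib import get_close_matches

# Known-good spellings and some messy inputs (both made up for illustration)
canonical = ['Germany', 'France', 'Italy']
raw = ['germany', 'Frnace', 'Italy ', 'Itly']

def normalize(name, choices, cutoff=0.8):
    """Hypothetical helper: map a messy name to its closest canonical spelling."""
    cleaned = name.strip().title()  # fix whitespace and casing first
    matches = get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    # keep the cleaned value when nothing is similar enough
    return matches[0] if matches else cleaned

fixed = [normalize(name, canonical) for name in raw]
print(fixed)
```

Check what got merged before trusting it: a cutoff that is too low will fold genuinely different names together.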


if you have datetime data and you're trying to model, try separating them into hour-day-month-year
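A minimal sketch of that split using the `.dt` accessor; the `sale_time` column and the new column names are assumptions:

```python
import pandas as pd

# Hypothetical datetime column
stock_data = pd.DataFrame({
    'sale_time': pd.to_datetime(['2018-02-08 09:30:00', '2018-10-23 15:45:00'])
})

# each component becomes its own numeric feature
stock_data['sale_hour'] = stock_data['sale_time'].dt.hour
stock_data['sale_day'] = stock_data['sale_time'].dt.day
stock_data['sale_month'] = stock_data['sale_time'].dt.month
stock_data['sale_year'] = stock_data['sale_time'].dt.year
```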

if you have categorical data - you'll prob need to one-hot-encode (multiple columns) or label-encode (single column)

may need to drop NaN's before transforming

wine_reviews.dropna(subset=['country'], inplace=True)

Apply the label encoder to each column

cat_features = ['country']
encoder = LabelEncoder() #from scikit-learn

encoded = wine_reviews[cat_features].apply(encoder.fit_transform) # integer codes for each column
wine_reviews[cat_features] = encoded # overwrite the original columns with the codes
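For the one-hot alternative mentioned above, `pd.get_dummies` is probably the quickest route. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample rows
wine_reviews = pd.DataFrame({'country': ['Italy', 'France', 'Italy'], 'points': [90, 85, 88]})

# one new indicator column per category; the original 'country' column is dropped
one_hot = pd.get_dummies(wine_reviews, columns=['country'])
print(one_hot.columns.tolist())
```

Label encoding imposes an arbitrary order on the categories, which tree models tolerate but linear models can misread. One-hot avoids that at the cost of more columns.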

interactions or combining categorical columns/variables is a great way to feature engineer

wine_reviews['country-region'] = wine_reviews['country'] + ' - ' + wine_reviews['region_1'] # needs the original string columns, so build it before label-encoding

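When selecting and narrowing features for a model, there are 2 general approaches: univariate methods, which consider only one feature at a time, or selecting features all at once by fitting a linear model with L1 (Lasso regression) or L2 (Ridge regression) regularization

  • L1 - penalizes the absolute value of the coefficients - can shrink some to exactly zero, which drops those features
  • L2 - penalizes the square of the coefficients - shrinks them toward zero but rarely to exactly zero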
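A rough sketch of L1-based selection using `SelectFromModel` around a `Lasso` model. The synthetic data (only the first two of five features matter) and `alpha=0.1` are assumptions chosen to make the effect visible:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: y depends only on features 0 and 1; features 2-4 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty pushes the coefficients of unhelpful features to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)

# SelectFromModel keeps the features whose coefficients survived
selector = SelectFromModel(lasso, prefit=True)
mask = selector.get_support()
X_selected = selector.transform(X)

print(lasso.coef_)
print(mask)
```

A larger `alpha` removes more features; on real data it is usually tuned with cross validation.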

visualize the data with the right type of graph (seaborn or matplotlib)


SQL query examples

query_1 = """
        SELECT COUNT(consecutive_number) AS num_accidents,
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """

query_2 = """
            WITH time AS
            (
                SELECT DATE(block_timestamp) AS trans_date
                FROM `bigquery-public-data.crypto_bitcoin.transactions`
            )
            SELECT COUNT(1) AS transactions,
                   trans_date
            FROM time
            GROUP BY trans_date
            ORDER BY trans_date
            """


modeling steps: Define - Fit - Predict - Evaluate

Define

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
melbourne_model = DecisionTreeRegressor(random_state=1)

Fit (after train - test split)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
melbourne_model.fit(train_X, train_y)

Predict

val_predictions = melbourne_model.predict(val_X)

Evaluate (loads of metrics for this)

mean_absolute_error(val_y, val_predictions)


  • Decision Trees - parameters to play with - size of a node (min_samples_leaf) - depth of tree (max_depth)
  • Random forest - makes many trees and averages their predictions (distribution of sorts)
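A small sketch of the random forest point, on synthetic data from `make_regression`. The dataset and the parameter values are assumptions; the check at the end confirms that the forest prediction is the average of its trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf are the same knobs a single decision tree has
forest = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_leaf=2, random_state=1)
forest.fit(train_X, train_y)

forest_preds = forest.predict(val_X)
forest_mae = mean_absolute_error(val_y, forest_preds)

# The forest prediction is the mean of the individual tree predictions
tree_avg = np.mean([tree.predict(val_X) for tree in forest.estimators_], axis=0)
print(forest_mae, np.allclose(forest_preds, tree_avg))
```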

Narrowing the number of features

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
# pick the k-number of features
selector = SelectKBest(score_func=f_regression, k=15)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# TODO: Which features were selected?
selected_mask = selector.get_support()
all_names = X_train.columns
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

# OR
print('Features selected:')
for name in selected_names:
    print(name)

Example for loop comparing how many features to select

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, len(X_train.columns)+1):
    print(f'{k} features')

    selector = SelectKBest(score_func=f_regression, k=k)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test Mean Absolute Error: ${mae:,.0f} \n')

Cross validation visual
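The visual is missing here. As a stand-in, a sketch of 5-fold cross validation with `cross_val_score`; the synthetic data and the model choice are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = DecisionTreeRegressor(max_depth=5, random_state=1)

# cv=5: split into 5 folds; each fold takes one turn as the validation set
# scikit-learn scorers are "higher is better", so MAE comes back negated
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
mae_per_fold = -scores

print(mae_per_fold)
print(mae_per_fold.mean())
```

The mean across folds is a steadier estimate than a single train-test split, at the cost of fitting the model five times.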

Metrics for assessing and scoring models

Scikit-learn metrics
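A few common regression metrics from `sklearn.metrics`, computed on made-up values so the numbers can be checked by hand:

```python
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # average absolute error, in the target's units
mse = mean_squared_error(y_true, y_pred)   # average squared error, punishes large misses
rmse = math.sqrt(mse)                      # back in the target's units
r2 = r2_score(y_true, y_pred)              # share of variance explained, 1.0 is perfect

print(mae, mse, rmse, r2)
```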
