Dipankar-Medhi / adult_dataset_analysis

This repository contains exploratory data analysis (EDA), data preprocessing, and ML model training and evaluation for the Adult dataset.

Adult Dataset Data Analysis

Data analysis of the Adult dataset.

EDA

Univariate Analysis

Histogram:

histogram


Box plots:

boxplot


Barplot of categorical features:

barplot

Bivariate Analysis

Pairplot:

pairplot

Barplot for numerical vs categorical features:

barplot

Data Preprocessing

Removing outliers and missing values

IQR:

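import numpy as np

def remove_outlier_IQR(df, field_name):
    # Whisker length: 1.5 times the interquartile range (Q3 - Q1)
    iqr = 1.5 * (np.percentile(df[field_name], 75) -
                 np.percentile(df[field_name], 25))
    # Drop rows above Q3 + 1.5 * IQR (modifies df in place)
    df.drop(df[df[field_name] > (
        iqr + np.percentile(df[field_name], 75))].index, inplace=True)
    # Drop rows below Q1 - 1.5 * IQR (modifies df in place)
    df.drop(df[df[field_name] < (np.percentile(
        df[field_name], 25) - iqr)].index, inplace=True)
    return df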

df2 = remove_outlier_IQR(df,'final-wt')
df_final = remove_outlier_IQR(df2, 'hours-per-week')
df_final.shape

(36312, 15)
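The function above drops rows from the DataFrame it receives. As a minimal sketch of the same 1.5 × IQR rule that leaves the input untouched, the variant below builds a boolean mask and returns a filtered copy. The name `filter_iqr`, the `k` parameter, and the toy `hours` column are illustrative assumptions and are not part of the repository.

```python
import pandas as pd


def filter_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df without rows outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; the input DataFrame is not modified.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    spread = k * (q3 - q1)
    # between() is inclusive on both ends by default
    mask = df[column].between(q1 - spread, q3 + spread)
    return df.loc[mask].copy()


# Toy data (assumed, not taken from the Adult dataset): 99 is an obvious outlier
toy = pd.DataFrame({"hours": [38, 40, 40, 41, 42, 99]})
filtered = filter_iqr(toy, "hours")
print(filtered["hours"].tolist())
```

Both quartiles are computed once, before any rows are removed, so the upper and lower bounds come from the same distribution.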

Boxplot after outlier removal

outliers_boxplot

Encoding categorical features

  • using dummy variables.
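As a rough sketch of dummy-variable encoding, `pandas.get_dummies` can expand each categorical column into one indicator column per category. The toy rows below are assumptions for illustration; the resulting `income_<=50K` and `income_>50K` column names match the ones used later in this README, but the exact call used in the notebook is not shown here.

```python
import pandas as pd

# Toy rows (assumed) with one numerical and two categorical columns
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Each listed column is replaced by one indicator column per category,
# named "<column>_<category>"
data = pd.get_dummies(toy, columns=["workclass", "income"])
print(data.columns.tolist())
```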

Data preparation for training and testing

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = data.drop(columns=['income_<=50K', 'income_>50K'])
y = data['income_<=50K']

scaler = StandardScaler()
scaled_df = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    scaled_df, y, test_size=0.3)
print("X train shape: {} and y train shape: {}".format(
    X_train.shape, y_train.shape))
print("X test shape: {} and y test shape: {}".format(X_test.shape, y_test.shape))

X train shape: (25418, 108) and y train shape: (25418,)
X test shape: (10894, 108) and y test shape: (10894,)
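The code above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. A common alternative, sketched below on synthetic data, is to split first and fit `StandardScaler` on the training rows only. The random arrays, `random_state`, and `stratify` arguments are assumptions added for the example and are not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and binary target (assumed)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```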

Model Training and Evaluation

Random Forest Classifier

rfc

Logistic Regression

lgr

K Nearest Neighbors

knn

Naive Bayes

naiv
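The results above are shown as images only. As a minimal sketch of how the four model families could be trained and compared with scikit-learn, the loop below fits each one on synthetic data and records its test accuracy. The synthetic dataset, the hyperparameters, and the choice of `GaussianNB` as the Naive Bayes variant are assumptions; the notebook's actual settings and scores are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the Adult dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One estimator per model family listed in this section (default settings)
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy alone can be misleading when the income classes are imbalanced, so a confusion matrix or `classification_report` from `sklearn.metrics` may be worth adding alongside it.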


