EDA-of-Financial-and-Insurance-Datasets

ipynb and HTML files: EDA Case Studies

This repository contains several case studies on financial and insurance data. Each case involves exploratory data analysis to understand the dataset and feature engineering to prepare it for modeling. The cases span data analysis, feature engineering, supervised learning, and unsupervised learning. All of them share a common goal: identifying anomalies.

Here is a brief explanation of each case and its category:

Data Analysis

EDA 1: Analyzing the Goldman Sachs stock price from 2009 to 2012 to identify anomalies.
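
One common way to look for anomalies in a price series is to compare each daily return against its recent history. The sketch below is illustrative only and is not taken from the EDA 1 notebook: the function name, the window of 5 days, the threshold of 3 standard deviations, and the price values are all assumptions.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(prices, window=5, threshold=3.0):
    """Return indices into `prices` whose daily return is far from the trailing window."""
    # Simple daily returns; returns[i] is the move from prices[i] to prices[i + 1].
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    flagged = []
    for i in range(window, len(returns)):
        history = returns[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            continue  # a flat window gives no scale to compare against
        z = (returns[i] - mean(history)) / sigma
        if abs(z) > threshold:
            flagged.append(i + 1)  # index of the day the unusual move landed on
    return flagged


# Made-up closing prices for illustration, not real Goldman Sachs data.
prices = [100.0, 101.0, 100.5, 101.2, 100.8, 101.1, 100.9, 115.0, 114.5, 114.8]
anomalies = rolling_zscore_anomalies(prices)
print(anomalies)
```

With these made-up prices, only the jump to 115.0 is flagged. The days after the jump are not flagged because the jump itself inflates the trailing standard deviation, which is a known limitation of this kind of rolling rule.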

Feature Engineering

EDA 3: Analyzing a dataset of credit card transactions to create new variables that help us identify the anomalies in the set.
EDA 4: Analyzing a dataset of aggregated healthcare data to create new variables that help us identify anomalies in the set.
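
As a rough illustration of the kind of variable these cases create, the sketch below adds a ratio of each transaction amount to the average amount within its group. The field names `merchant_category` and `amount`, the helper name, and the sample rows are hypothetical and do not come from the notebooks.

```python
from collections import defaultdict
from statistics import mean


def add_ratio_to_group_mean(rows, group_key="merchant_category", value_key="amount"):
    """Return copies of the rows with an extra `amount_to_group_mean` feature."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row[value_key])
    # Assumes every group has a non-zero average amount.
    group_means = {group: mean(values) for group, values in by_group.items()}
    return [
        {**row, "amount_to_group_mean": row[value_key] / group_means[row[group_key]]}
        for row in rows
    ]


# Hypothetical transactions for illustration only.
transactions = [
    {"merchant_category": "office", "amount": 40.0},
    {"merchant_category": "office", "amount": 60.0},
    {"merchant_category": "office", "amount": 500.0},
    {"merchant_category": "travel", "amount": 300.0},
    {"merchant_category": "travel", "amount": 300.0},
]
enriched = add_ratio_to_group_mean(transactions)
for row in enriched:
    print(row)
```

A ratio well above 1 marks a transaction that is large relative to its peers, which is often more informative for anomaly detection than the raw amount.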

Supervised Learning

EDA 2: Create a gains table from a logistic regression model to understand the model's lift and impact.
EDA 9: Build supervised learning models (random forest and gradient boosting) using the H2O library.
EDA 10: Build supervised learning models (GLM and AutoML) using the H2O library and benchmark models to select the best model.
EDA 11: Using SHAP Values to understand the results of an algorithm.
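
For readers unfamiliar with the gains table mentioned in EDA 2, here is a minimal sketch of how one can be computed from model scores and true labels. It is not the notebook's code: the scores below stand in for the output of any classifier (such as a logistic regression), and the function name, the bin count, and the data are assumptions.

```python
def gains_table(scores, labels, n_bins=5):
    """Rank by score, split into equal-sized bins, and report cumulative gain and lift."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_pos = sum(label for _, label in ranked)  # assumes at least one positive
    rows, cum_pos = [], 0
    for b in range(n_bins):
        start = b * total // n_bins
        end = (b + 1) * total // n_bins
        chunk = ranked[start:end]
        positives = sum(label for _, label in chunk)
        cum_pos += positives
        cum_gain = cum_pos / total_pos  # share of all positives captured so far
        cum_share = end / total         # share of the population contacted so far
        rows.append({
            "bin": b + 1,
            "count": len(chunk),
            "positives": positives,
            "cum_gain": cum_gain,
            "cum_lift": cum_gain / cum_share,
        })
    return rows


# Hypothetical model scores and true labels (1 = positive class).
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
table = gains_table(scores, labels)
for row in table:
    print(row)
```

A cumulative lift above 1 in the top bins means the model concentrates positives among its highest-scored observations better than random ordering would. The lift always returns to 1 in the last bin, because the whole population has been covered by then.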

Unsupervised Learning

EDA 5: Using the modified HealthCare dataset with all the new variables, create clusters of observations and identify the clusters that show the most obvious anomalies.
EDA 6: Using the modified Credit Card dataset and some new features, cluster observations using DBSCAN and MeanShift clustering techniques to identify anomalies in the dataset.
EDA 7: Outlier detection using the autoencoder algorithm in the PyOD library.
EDA 8: Outlier detection using the iForest algorithm in the PyOD library.
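
To show why a density-based method such as DBSCAN (used in EDA 6) lends itself to anomaly detection, the sketch below implements the algorithm in plain Python and treats the points it labels as noise as candidate anomalies. This is a teaching sketch under assumed parameters (`eps`, `min_samples`) and made-up points. The notebooks presumably rely on a library implementation rather than code like this.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE (-1) if it belongs to no cluster."""
    def neighbors(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels


# Two tight groups and one isolated point (made-up data).
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 1.0)]
cluster_labels = dbscan(points, eps=0.5, min_samples=3)
outliers = [p for p, label in zip(points, cluster_labels) if label == NOISE]
print(cluster_labels, outliers)
```

The isolated point ends up in no cluster, which is the property that makes DBSCAN noise labels usable as an anomaly flag. The result is sensitive to `eps` and `min_samples`, so both would need tuning on real data.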

Extras

Word2Vec: Using word2vec to capture the similarity between merchants and narrow down the categories.
R&D Attrition: Side project analyzing attrition in the R&D department using a mock dataset.

vlaskinvlad / EDA-of-Financial-and-Insurance-Datasets