R Portfolio

R Portfolio of data science projects from either original work or revised for a study and learning purpose. Portfolio in this repo is presented in the form of .R and .Rmd(R-markdown) files.

Each folder represents the fields of application (i.e. Timeseries, Deeplearning, MachineLearning, etc)

For a detailed code example and images, please refer to .readme file presented below.

Note: Data used in the projects is for learning and demo purposes only

Motivation / Thought Process

These days R is less preferrable in industory for various reasons (i.e less production-ready, non-scalable). However, I think R is still a very powerful language. I personally am fond of and use R for everyday analysis from simple EDA to creating stunning visualizations and building a complex ML/DL models. I think R has its strong advantage in looking at codes and results at a controlled enviornmenets.

This repository was origianlly to have a record of project progress and my own learning process, but I found that it would be helpful to who wants to improve data-science skills to next-level using R language, as it contains a numerious real-life data science example and notebooks created by @hyunjoonbok and codes borrowed from authors who produced state-of-the-art results.

I tried to include the usage of packages and methods that have been consistently used in actual industries, in order to to solve the problems (even if it's a toy example). The repo contatins use-cases that can be readily applied to many of real-world datasets.

Projects
Setup
To-Do
Contact

Projects

(3) Time Series Forecasting with Modeltime (Advanced)

This workbook covers complete advanced steps to create a SOTA time-series forecasting model at scale. We use Walmart M4 Kaggle competition dataset to create foreacst for (7) different time-series. It introduces latest functions in Modeltime and techniques in R, which load data, preprocess, modelling, fitting, calibration, ensembling, and visualization. The codes are experiment-ready to be applied to any of custom time-series dataset.

(2) Time Series Forecasting with Modeltime (+Nest) (Basic)

Often times, it's necceary for business who are performing any kind of time-series forecast model to scale it's model. This examples leverages "Nest" function to create several time-series at the same time in a single dataset, where the best-chosen ML algorithem is applied to create a forecast for entire groups. The possibility is endless. The model can be scaled to create thousands of models in pararell, with the help of "Nest" function.

(1) Time Series Forecasting with ModelTime - Walkthrough (Basic)

The file walk-through key processes that need to be performed to generate time-series in high-level. We are looking bike_sharing_daily time series data from 2011 to 2013 to predict the sales of it for the next 3 months. We set aside last 3-months of data as the testing set, and levere modeltime package to build different SOTA timeseires models including, ARIMA, Prophet, XGBoost, randomforest. Then we evaluate the Model by refitting data from the errors we got from initial models, and eventually multi-visualize the model.

Customer Segmentation & Clustering with K-means

Looking at custmer transcation data to segment cusotmers into groups to better statify the business strategy. Use a K-means clustering Building and Bootstrap Evaluation to effectively group cusotmers, and create points of strategy to be possibly discussed with business stakeholders.

Reference: Diego Usai's Website

Machine Learning

Time Series ML using H2O

Predicting a future beer sale number using a historical data. Using H20's AUTOML feature to easily obtint the state-of-the-art ensemble results, and plot the errors to improve.

ML Model Interpretability with DALEX

2nd DALEX R file

Often times, ML models are critized as being black-box (untracakble complex inside that magiaclaly solves the problem). Here we look at the problem of predicting the apartment prices using Linear Regression, SVM, Random Forest, and get the pacakge DALEX to help look how much each variables affect this prediction.

Employee Churn Modeling

ML Model with Caret to predict Employee Chrun

ML Model with LIME to understand and prevent Employee Chrun

With the help of powerful Caret pacakge that help build ML model . Has complete steps to pre-process, fine-tune, train, and get ROC curve. Then, I use LIME (Local Interpretable Model-Agnostic Explanation) to understand ML model created. Use H2O to initiate modeling, and with the help of LIME, it gives both global and local interpretation of predictor variables. It gives a clear visual explanation of variable importance and how model is affected by those.

Naïve Bayes Classifier

The Naïve Bayes classifier is a simple probabilistic classifier which is based on Bayes Theorem but with strong assumptions regarding independence. Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. Here, I built a simple classification model with Caret and H2O.

Predict Bank's Term Deposit (Classification) using H2O

Build a simple ML from H2O to predict which customers more likely to enroll in Bank's Term Deposit. Shows how random Grid Search combined with Stacked Ensembles is a very powerful combination

The Ultimate XGBoost Guide

Contatins a complete steps in model-building with XGBoost in R. From CV, grid-serach, hyperparameter tuning to feature selection, optimization, training/evaluation and Prediction. Solves a real-world binary classification problem.

Machine Learning Problem Solving Guide (data not included)

Contatins a complete steps in model-building and explanation of what's actaully going on in ML. Using 4 different method/packages (PDP, ICE, LIME, Shapley), it shows how Machine Learning can be explainable in some sense.

Predict Airplane arrival delay

Looking at a toy example here to see how we could use H2O to predict arrival delay using historical airline data with Destination to Chicago Airport. Give a easy glance how easily H2O package could be utilized in a simple ML problem.

Deep Learning

Basic Concept of CNN using Keras

How to stack Keras layers to basic CNN model. Solve MNIST using CNN in just a few lines of code.

Cat vs Dog Image classifier using CNN in R (Production)

A Complete steps to load images, generate, compile, train, and test model in CNN to solve the famous Cat vs Dog image classification

Predict Telco Customer Chrun using Keras

A real-world example to predict customer retention / churn using Telecommunication company's data. Data Pre-processing, modeling, evaluating, prediction, checking performance, model explanation, feature importance visualization.

Tensorflow Estimator API to build Regression Model

Use Tensorflow's low level API to build both a linear regression model and Deep NeuralNet with the toy dataset. Activation of Tensorboard in R interface is also introduced.

High performance KERAS LSTM Algorithm

Developt a State-of-the-Art Keras LSTM algorithm to predict a sunspot by connecting to the R TensorFlow backend. Perform Time Series Cross Validation using Backtesting with the rsample package rolling forecast origin resampling.

Predict the number of hostipal opening & closure

Use real compeitition-type tabular data to predict the number of hospital opening/closure. Followed by initial data pre-processing, Boruta package to perform feature selection and eventually use H2O's radnom grid-serach and deep-learning algorithm to build and evaluate the model.

IMDB Movie Rating prediction using Keras RNN and LSTM in R

The famous Movie rating prediction problem tackled by Keras RNN and LSTM layers in R interface

IMDB Movie Review Sentiment Analysis using Keras RNN and LSTM in R

2-way polarity (positive, negative) classification system for Movie Review texts. Goes through word-tokenizatino, modeling and evaluation using Keras.

Time Series Forecasting & Anomaly Detection

ARIMA Time-series Forecast

ARIMA model to understand and predict value in time series. Has steps to Decomspose, Stationary, Autocorrelations, Model Fitting/Evaluation

Anomaly Detection in R

A very simple but powerful Anomaly Dection model with the toy package download data.

3 Simple TimeSeries forecast Model

Breif introduction on choosing forecast model. Exponential State Smoothing, ARIMA and TBATS and their comparison is shown.

TimeSeries Machine Learning

TimeSeries Machine Learning using TIMETK

Tidying up TimeSeries Machine Learning using Sweep

Time series machine learning to forecast time series data of beer sales. Augmentation on the data is supported. Then we clean the model (i.e. retrieve the model coefficients, residuals). Modeling / Error Investigation are followed.

3 Simple TimeSeries forecast Model

Breif introduction on choosing forecast model. Exponential State Smoothing, ARIMA and TBATS and their comparison is shown.

Database & Pararell Computing

Google BigQuery Connection with R

Analyzing Google Analytics data (built-in as sample data) with BigQuery using R interface. It shows how we can locally connect to BigQuery using DBI pacakge.

Database Fundamentals in R

Connection to BigQuery, usage of dplyr commands, Calculate k-means inside the data, and fianlly visualization of data using ggplot.

Parallel Computing in R

R provides a number of convenient facilities for parallel computing. This script shows how to setup and run a parallel process on your current multi-core device, without need for additional hardware.

SparklyR Complete Guide

Introduces a R interface for Apache Spark. Connecting to Spark from a local machine. Learn to use distributed computing by fully utilizing Spark's engine, as Hadoop-based Data Lake is becoming a common practice at companies.

Text Mining / Social Media Analysis

Text-Mining using SparklyR

For real-world text data that goes beyond GB/TB in file size, it's necessary to leverage Spark engine load and transform data. Eventally genearate a list of the most used words, and create basic wordcloud.

Complete Text-Mining Guide

Looking at the Jane Austen Book's text to learn a full function of Text-Mining (tidying up data, Sentiment analysis, word-frequnecy, TF-IDF, Wordcloud, Tokenizing by n-gram, Topic-modeling). Ready-to-be used in any real-world datasets.

Twitter Analysis and Visualization

Learn to serach tweets by length, location or any criteria set. Retrieve a list of all the accounts a user follows. Then plot the frequency of tweets for each user over time.

Visualization (ggplot2)

Ready-to-Use ggplot2

ggplot2 code 1

ggplot2 code 2

A few curated list of ggplot codes that generates beautiful plot with examples. Basic understanding of ggplot codes is required.

Statistic Concepts with real-world examples

Logistic Regression

Concept of Logistic Regression displayed in R code. Solves a binary classification problem.

Mutinomial Regression

Multinomial regression is similar to logistic regression, but fits better when the response variable is a categorical variable with more than 2 levels.

Ordinal Logistic Regression

Ordinal logistic regression can be used to model a ordered factor response. Here, we use ordered logistic regression to predict the car evaluation.

Ridge Regression

Ridge Regression is a commonly used technique to address the problem of "multi-collinearity". We looks at the result of Linear Regression vs Ridge Regression

Network Analysis and Manipulation

Social Network Analysis is a set of methods used to visualize networks, describe specific characteristics of overall network structure, and build mathematical and statistical models of network structures and dynamics

Setup

Simply click one of the R files above and copy/paste on your own R scripts.
- Please make sure to change the working directory!

TO-DOs

List of features ready and TO-DOs for future development

Introduction to R Shiny Apps : in progress
More Kaggle examples : in progress
Data cleaning .ipynbs : in progress --> In Python Portfolio

Contact

Created by @hyunjoonbok - feel free to contact me!

R Portfolio

Motivation / Thought Process

Table of contents

Projects

Machine Learning

Employee Churn Modeling

Machine Learning Problem Solving Guide (data not included)

Deep Learning

Time Series Forecasting & Anomaly Detection

TimeSeries Machine Learning

Database & Pararell Computing

Text Mining / Social Media Analysis

Visualization (ggplot2)

Ready-to-Use ggplot2

Statistic Concepts with real-world examples

Setup

TO-DOs

Contact

About

Languages