zcdliuwei / Projects

Data Science Portfolio


This portfolio consists of several projects illustrating the work I have done in order to further develop my data science skills.

Projects

Vectors of Locally Aggregated Concepts (VLAC)

Repository | Published Paper

  • It leverages clusters of word embeddings (i.e., concepts) to create features from a collection of documents allowing for classification of documents
  • Inspiration was drawn from VLAD, which is a feature generation method for image classification
  • The article was published in ECML-PKDD 2019
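
As a rough illustration of the idea, the sketch below builds a VLAD-style feature vector for a single document: each word vector is assigned to its nearest concept centroid, and the residuals are summed per concept. The toy 2-D embeddings, the centroids, and the function name `vlad_style_features` are assumptions made for this example; the published VLAC method may differ in its details.

```python
from math import dist


def vlad_style_features(word_vectors, centroids):
    """Concatenate, per centroid, the summed residuals of the word vectors assigned to it."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for vec in word_vectors:
        # Assign the word to its nearest centroid (Euclidean distance).
        nearest = min(range(len(centroids)), key=lambda k: dist(vec, centroids[k]))
        for d in range(dim):
            residuals[nearest][d] += vec[d] - centroids[nearest][d]
    # Flatten into one feature vector of length len(centroids) * dim.
    return [value for row in residuals for value in row]


# Hypothetical data: two "concepts" and a three-word document.
centroids = [[0.0, 0.0], [10.0, 10.0]]
document = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
features = vlad_style_features(document, centroids)
```

The resulting vector has a fixed length regardless of document length, so it can be fed to any standard classifier.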


NLP: Analyzing WhatsApp Messages

Repository | Github | nbviewer

  • Created a package that allows in-depth analyses of WhatsApp conversations
  • Analyses were initially done on WhatsApp messages between me and my fiancée, to surprise her with at our wedding
  • Visualizations were done in such a way that it would make sense for someone not familiar with data science
  • Methods: Sentiment Analysis, TF-IDF, Topic Modeling, Wordclouds, etc.


Board Game Exploration

Github | site

  • Created an application for exploring board game matches that I tracked over the last year
  • The application was created for two reasons:
    • First, I wanted to surprise my wife with this application as we played mostly together
    • Second, the data is relatively simple (5 columns) and small (~300 rows) and I wanted to demonstrate the possibilities of analyses with simple data
  • Dashboard was created with streamlit and the deployment of the application was through Heroku


Optimizing Emté Routes

Github | nbviewer

  • Project for the course Business Analytics in my master's program
  • Optimization of managers visiting a set of cities
  • Total of 133 cities, max distance 400km with time and capacity constraints
  • Thus, a vehicle routing problem
  • Methods: Integer Linear Programming, Tabu Search, Simulated Annealing, Ant Colony Optimization, Python


Exploring Explainable ML

Repository | Notebook | nbviewer | Medium

  • Explored several methods for opening the black boxes that are tree-based prediction models
  • Methods included PDP, LIME, and SHAP


Deploying a Machine Learning Model

Repository | Medium

  • Developed a set of steps necessary to quickly deploy your machine learning model
  • Used a combination of FastAPI, Uvicorn and Gunicorn to lay the foundation of the API
  • The repository contains all code necessary (including dockerfile)


Statistical Cross-Validation Techniques

Repository | Notebook | nbviewer | Medium

  • Dived into several more in-depth techniques for validating a model
  • Statistical methods were explored for comparing models including the Wilcoxon signed-rank test, McNemar's test, 5x2CV paired t-test and 5x2CV combined F test
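
To make one of these tests concrete, the sketch below computes the test statistic of the 5x2CV paired t-test. It assumes you already have the score difference between two models on each fold of five 2-fold cross-validation runs; the numbers used here are made up, and `five_by_two_cv_t` is a hypothetical helper name, not code from the repository.

```python
from math import sqrt


def five_by_two_cv_t(differences):
    """Return the 5x2CV paired t statistic.

    differences: five (d1, d2) pairs, where d1 and d2 are the score
    differences between the two models on fold 1 and fold 2 of one
    2-fold cross-validation replication.
    """
    variances = []
    for d1, d2 in differences:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    # The numerator is the difference from the first fold of the first replication.
    first_difference = differences[0][0]
    return first_difference / sqrt(sum(variances) / len(variances))


# Hypothetical accuracy differences (model A minus model B).
differences = [(0.02, 0.04), (0.03, 0.01), (0.02, 0.04), (0.01, 0.03), (0.04, 0.02)]
t_statistic = five_by_two_cv_t(differences)
```

The statistic is compared against a t distribution with 5 degrees of freedom, so at a two-sided 5% level the absolute value would need to exceed roughly 2.571 to call the difference significant.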


Cluster Analysis: Create, Visualize and Interpret Customer Segments

Repository | Notebook | nbviewer | Medium

  • Explored several methods for creating customer segments: k-Means (cosine and Euclidean distance) vs. DBSCAN
  • Applied PCA and t-SNE for the 3-dimensional exploration of clusters
  • Used variance between averages of clusters per variable to detect important differences between clusters
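
The last point can be sketched as follows: compute the mean of each variable within each cluster, then rank variables by the variance of those cluster means. This is a minimal sketch that assumes the variables are on comparable scales (for example, standardized beforehand); the data and the helper name `rank_variables_by_cluster_spread` are hypothetical.

```python
from statistics import mean, pvariance


def rank_variables_by_cluster_spread(rows, labels):
    """Rank variables by the variance of their per-cluster means, largest first."""
    clusters = sorted(set(labels))
    spread = {}
    for variable in rows[0]:
        cluster_means = [
            mean(row[variable] for row, label in zip(rows, labels) if label == cluster)
            for cluster in clusters
        ]
        spread[variable] = pvariance(cluster_means)
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)


# Hypothetical customers with a cluster label each.
rows = [
    {"spend": 1.0, "age": 30},
    {"spend": 3.0, "age": 32},
    {"spend": 10.0, "age": 31},
    {"spend": 12.0, "age": 31},
]
labels = [0, 0, 1, 1]
ranking = rank_variables_by_cluster_spread(rows, labels)
```

Here `spend` ranks first because its cluster means differ, while `age` has the same mean in both clusters and therefore does not help to tell them apart.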


Exploring Advanced Feature Engineering Techniques

Repository | Notebook | nbviewer | Medium

  • Several methods are described for advanced feature engineering including:
    • Resampling Imbalanced Data using SMOTE
    • Creating New Features with Deep Feature Synthesis
    • Handling Missing Values with the Iterative Imputer and CatBoost
    • Outlier Detection with IsolationForest


Predicting and Optimizing Auction Prices

Github | nbviewer

  • Data received from an auction house and therefore not made public
  • Prediction of value at which an item will be sold to be used as an objective measure
  • Optimize starting price such that predicted value will be as high as possible
  • Methods: Classification (KNN, LightGBM, RF, XGBoost, etc.), LOO-CV, Genetic Algorithms, Python


Statistical Analysis using the Hurdle Model

Github | nbviewer

  • Used Apple Store data to analyze which business model aspects influence performance of mobile games
  • Two groups were identified and compared, namely early and later entrants of the market
  • The impact of entry timing and the use of technological innovation was analyzed on performance
  • Methods: Zero-Inflated Negative Binomial Regression, Hurdle model, Python, R


Predict and optimize demand

Github | nbviewer

  • Part of the course Data-Driven SCM
  • Optimizing order quantity based on predicted demand using machine learning methods
  • Simulation model was created to check performance of method calculating expected demand
  • Additional weather features were included
  • Methods: Regression (XGBoost, LightGBM and CatBoost), Bayesian Optimization (Skopt)


Analyzing Google Takeout Data

Github | nbviewer

  • Analyzing my own data provided by Google Takeout
  • Location data, browser search history, mail data, etc.
  • Code to analyze browser history is included
  • Code to create animation will follow


Cars Dashboard

Github

  • Created a dashboard for the cars dataset using Python, HTML and CSS
  • It allows for several crossfilters


Qwixx Visualization

Github | nbviewer


Academic Journey Visualization

Github | nbviewer

  • A visualization of my 9-year academic journey
  • It compares my grades and average across six educational programs with the average of all other students in the same classes.
  • Unfortunately, no data on the students’ average grade was available for classes in my bachelor's program.
  • Grades range from 0 to 10 with 6 being a passing grade.
  • Lastly, Cum Laude is the highest academic distinction for the master’s program I followed.


Neural Style Transfer

  • For the course Deep Learning, I worked on a paper researching the optimal selection of hidden layers to create the most appealing images in neural style transfer while speeding up the optimization process
  • Code is not yet provided, as most of it was taken from here and I merely explored different layers


Predicting Housing Prices

Github | nbviewer

  • Project for the course Machine Learning
  • Participated in the Kaggle competition House Prices: Advanced Regression Techniques
  • We were graded on leaderboard scores, and I scored 1st in the class, at position 33 on the leaderboard
  • I used a weighted average of XGBoost, Lasso, ElasticNet, Ridge and Gradient Boosting Regressor
  • The focus of this project was mostly on feature engineering
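
Blending by weighted average can be sketched as below. The model names come from the list above, but the predictions and weights are invented for illustration and do not reflect the values used in the competition; `weighted_average` is a hypothetical helper.

```python
def weighted_average(predictions, weights):
    """Blend per-model prediction lists into one list using the given weights."""
    total_weight = sum(weights[model] for model in predictions)
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[model] * predictions[model][i] for model in predictions) / total_weight
        for i in range(n_samples)
    ]


# Hypothetical house price predictions (in thousands) from two of the models.
predictions = {"xgboost": [200.0, 300.0], "lasso": [220.0, 280.0]}
weights = {"xgboost": 3.0, "lasso": 1.0}
blended = weighted_average(predictions, weights)
```

Dividing by the total weight means the weights do not have to sum to one.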


Analyzing FitBit Data

Repository | Github | nbviewer

  • My first data science project
  • I looked at improving the quality of FitBit's heart rate measurement, as I realized it wasn't all that accurate
  • Used a simple 10-fold CV with regression techniques to fill missing heart rates
  • There are many things that I would do differently now, like using other validation techniques that are not as sensitive to overfitting on time series
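
One example of such a technique is an expanding-window (forward-chaining) split, in which every test fold lies strictly after its training data, so the model is never validated on the past using the future. This is a minimal sketch of the general idea, not the approach used in the project; `expanding_window_splits` is a hypothetical helper.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs where training always precedes testing."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold_size))
        test = list(range(k * fold_size, (k + 1) * fold_size))
        yield train, test


# Eight time-ordered samples, three splits.
splits = list(expanding_window_splits(8, 3))
```

With eight samples and three splits, the training window grows from two to six samples while each test fold covers the next two.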


Contact

If you would like to contact me personally, please do so via e-mail or LinkedIn.
