zcdliuwei / Projects

Data Science Portfolio

This portfolio consists of several projects illustrating the work I have done in order to further develop my data science skills.


Vectors of Locally Aggregated Concepts (VLAC)

Repository | Published Paper

  • Leverages clusters of word embeddings (i.e., concepts) to create features from a collection of documents, which can then be used to classify those documents
  • Inspiration was drawn from VLAD, which is a feature generation method for image classification
  • The article was published in ECML-PKDD 2019
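
The idea of aggregating word embeddings around cluster centres can be sketched as below. This is a minimal VLAD-style illustration, not the published VLAC implementation: the function names, the toy 2-dimensional "embeddings", and the choice of summing residuals per cluster are all assumptions made for the example.

```python
from math import dist


def nearest_centre(vector, centres):
    """Return the index of the centre closest to `vector` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda k: dist(vector, centres[k]))


def aggregate_document(word_vectors, centres):
    """Sum the residuals (vector - centre) per cluster and concatenate them."""
    dim = len(centres[0])
    residuals = [[0.0] * dim for _ in centres]
    for vec in word_vectors:
        k = nearest_centre(vec, centres)
        for j in range(dim):
            residuals[k][j] += vec[j] - centres[k][j]
    # Flatten into one feature vector of length n_clusters * dim.
    return [value for cluster in residuals for value in cluster]


# Hypothetical cluster centres ("concepts") and one document of three word vectors.
centres = [[0.0, 0.0], [10.0, 10.0]]
document = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
features = aggregate_document(document, centres)
print(features)
```

The resulting fixed-length vector can be fed to any standard classifier, regardless of how many words the document contains.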

NLP: Analyzing WhatsApp Messages

Repository | Github | nbviewer

  • Created a package that allows in-depth analyses of WhatsApp conversations
  • Analyses were initially done on WhatsApp messages between me and my fiancée, as a surprise for her at our wedding
  • Visualizations were designed to make sense to someone not familiar with data science
  • Methods: Sentiment Analysis, TF-IDF, Topic Modeling, Wordclouds, etc.

Board Game Exploration

Github | site

  • Created an application for exploring board game matches that I tracked over the last year
  • The application was created for two reasons:
    • First, I wanted to surprise my wife with this application as we played mostly together
    • Second, the data is relatively simple (5 columns) and small (~300 rows) and I wanted to demonstrate the possibilities of analyses with simple data
  • The dashboard was built with Streamlit and the application was deployed on Heroku

Optimizing Emté Routes

Github | nbviewer

  • Project for the course Business Analytics in my master's program
  • Optimized the routes of managers visiting a set of cities
  • Total of 133 cities, max distance 400km with time and capacity constraints
  • Thus, a vehicle routing problem
  • Methods: Integer Linear Programming, Tabu Search, Simulated Annealing, Ant Colony Optimization, Python

Exploring Explainable ML

Repository | Notebook | nbviewer | Medium

  • Explored several methods for opening the black boxes that are tree-based prediction models
  • Methods included PDP, LIME, and SHAP

Deploying a Machine Learning Model

Repository | Medium

  • Developed a set of steps necessary to quickly deploy your machine learning model
  • Used a combination of FastAPI, Uvicorn and Gunicorn to lay the foundation of the API
  • The repository contains all necessary code (including the Dockerfile)

Statistical Cross-Validation Techniques

Repository | Notebook | nbviewer | Medium

  • Dived into several more in-depth techniques for validating a model
  • Statistical methods were explored for comparing models including the Wilcoxon signed-rank test, McNemar's test, 5x2CV paired t-test and 5x2CV combined F test
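
As an illustration of one of these tests, the 5x2CV paired t-test statistic can be computed as below. This is a sketch under stated assumptions: the function name and the example score differences are invented for illustration, and the input is assumed to already hold the per-fold score differences (model A minus model B) from five replications of 2-fold cross-validation.

```python
from math import sqrt


def five_by_two_cv_t(differences):
    """Compute the 5x2CV paired t statistic.

    `differences` holds five (fold_1, fold_2) pairs, each the difference in
    score between two models on the two folds of one 2-fold CV replication.
    """
    if len(differences) != 5:
        raise ValueError("expected 5 replications of 2-fold cross-validation")
    variances = []
    for d1, d2 in differences:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    pooled = sum(variances) / 5
    if pooled == 0:
        raise ValueError("all replications have zero variance; statistic undefined")
    # The numerator is the difference on the first fold of the first replication.
    return differences[0][0] / sqrt(pooled)


# Made-up score differences, for illustration only.
diffs = [(0.02, 0.04), (0.01, 0.03), (0.03, 0.05), (0.02, 0.02), (0.04, 0.02)]
t_stat = five_by_two_cv_t(diffs)
# Compare |t_stat| with a t distribution with 5 degrees of freedom
# (two-sided critical value at alpha = 0.05 is about 2.571).
print(round(t_stat, 4))
```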

Cluster Analysis: Create, Visualize and Interpret Customer Segments

Repository | Notebook | nbviewer | Medium

  • Explored several methods for creating customer segments: k-Means (Cosine & Euclidean) vs. DBSCAN
  • Applied PCA and t-SNE for the 3-dimensional exploration of clusters
  • Used variance between averages of clusters per variable to detect important differences between clusters
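
The last point can be sketched as follows: compute the mean of each variable within each cluster, then take the variance of those means across clusters. Variables with a high value differ most between clusters. The function name, variable names, and toy data are assumptions for illustration; in practice the variables would likely need to be scaled first so that their variances are comparable.

```python
from collections import defaultdict
from statistics import mean, pvariance


def between_cluster_variance(rows, labels, names):
    """Return, per variable, the variance of the cluster means."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    result = {}
    for j, name in enumerate(names):
        cluster_means = [mean(r[j] for r in members) for members in grouped.values()]
        result[name] = pvariance(cluster_means)
    return result


# Toy data: two clusters that differ on "spend" but not on "visits".
rows = [[1.0, 5.0], [3.0, 5.0], [11.0, 5.0], [13.0, 5.0]]
labels = [0, 0, 1, 1]
importance = between_cluster_variance(rows, labels, ["spend", "visits"])
print(importance)
```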

Exploring Advanced Feature Engineering Techniques

Repository | Notebook | nbviewer | Medium

  • Several methods are described for advanced feature engineering including:
    • Resampling Imbalanced Data using SMOTE
    • Creating New Features with Deep Feature Synthesis
    • Handling Missing Values with the Iterative Imputer and CatBoost
    • Outlier Detection with IsolationForest

Predicting and Optimizing Auction Prices

Github | nbviewer

  • Data received from an auction house and therefore not made public
  • Prediction of value at which an item will be sold to be used as an objective measure
  • Optimize starting price such that predicted value will be as high as possible
  • Methods: Classification (KNN, LightGBM, RF, XGBoost, etc.), LOO-CV, Genetic Algorithms, Python

Statistical Analysis using the Hurdle Model

Github | nbviewer

  • Used Apple Store data to analyze which business model aspects influence performance of mobile games
  • Two groups were identified and compared, namely early and later entrants of the market
  • The impact of entry timing and the use of technological innovation was analyzed on performance
  • Methods: Zero-Inflated Negative Binomial Regression, Hurdle model, Python, R

Predict and optimize demand

Github | nbviewer

  • Part of the course Data-Driven SCM
  • Optimizing order quantity based on predicted demand using machine learning methods
  • Simulation model was created to check performance of method calculating expected demand
  • Additional weather features were included
  • Methods: Regression (XGBoost, LightGBM and CatBoost), Bayesian Optimization (Skopt)

Analyzing Google Takeout Data

Github | nbviewer

  • Analyzing my own data provided by Google Takeout
  • Location data, browser search history, mail data, etc.
  • Code to analyze browser history is included
  • Code to create animation will follow

Cars Dashboard


  • Created a dashboard for the cars dataset using Python, HTML and CSS
  • It allows for several crossfilters

Qwixx Visualization

Github | nbviewer

Academic Journey Visualization

Github | nbviewer

  • A visualization of my 9-year (!) academic journey
  • Shows my grades and average across six educational programs versus the average of all other students in the same classes.
  • Unfortunately, no data on the students’ average grade was available for classes in my bachelor's program.
  • Grades range from 0 to 10 with 6 being a passing grade.
  • Lastly, Cum Laude is the highest academic distinction for the master’s program I followed.

Neural Style Transfer

  • For the Deep Learning course, I worked on a paper researching the optimal selection of hidden layers to create the most appealing images in neural style transfer while speeding up the optimization process
  • Code is not yet provided, as I reused most of the code from an existing implementation and merely explored different layers

Predicting Housing Prices

Github | nbviewer

  • Project for the course Machine Learning
  • Participated in the Kaggle competition House Prices: Advanced Regression Techniques
  • We were graded on leaderboard scores; I placed 1st in the class, at position 33 on the leaderboard
  • I used a weighted average of XGBoost, Lasso, ElasticNet, Ridge and Gradient Boosting Regressor
  • The focus of this project was mostly on feature engineering
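
A weighted average of model predictions can be sketched as below. The model names match the list above, but the weights and predicted prices are invented for illustration; the actual weights used in the project are not stated here.

```python
def blend(predictions, weights):
    """Return the weighted average of per-model predictions.

    `predictions` maps a model name to its list of predictions;
    `weights` maps the same names to non-negative weights.
    """
    total = sum(weights[name] for name in predictions)
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n_samples)
    ]


# Hypothetical predictions (in thousands) and hypothetical weights.
predictions = {"xgboost": [200.0, 100.0], "lasso": [220.0, 120.0]}
weights = {"xgboost": 0.75, "lasso": 0.25}
blended = blend(predictions, weights)
print(blended)
```

Dividing by the sum of the weights means they do not have to add up to 1.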

Analyzing FitBit Data

Repository | Github | nbviewer

  • My first data science project
  • I looked at improving the quality of FitBit's heart rate measurement, as I realized it wasn't all that accurate
  • Used a simple 10-fold CV with regression techniques to fill missing heart rates
  • There are many things that I would do differently now, like using other validation techniques that are not as sensitive to overfitting on time series
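
One common alternative to plain k-fold CV for time series is an expanding-window (forward-chaining) split, in which the model is always trained on observations that precede the test observations. The sketch below is a stdlib-only illustration of that idea, not the approach used in the project; the function name and split sizes are assumptions.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training data always first in time."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold
        # The last split absorbs any leftover samples.
        test_end = train_end + fold if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=4))
for train, test in splits:
    print(train, test)
```

Because every test index is later than every training index, the model is never evaluated on data that precedes what it was trained on.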


If you are looking to contact me personally, please do so via E-mail or LinkedIn.

