er1czz / data_challenges

Data Challenge Practices (prior to Insight 2020C)

Data challenge log

Data challenge at Insight link

4. Mercedes-Benz Greener Manufacturing - a regression problem and its solution

Situation:

  • Assembled automobiles need to be tested to ensure safety and reliability.
  • Testing is a time-consuming process.
  • Different cars have different configurations/features.

Task: how to cut the testing time using an algorithmic approach?

Action: using regression models to identify the key features that affect testing time.

Results: regression models quantified how strongly each feature correlates with testing time.

Takeaways:

  • Key features were identified; optimization effort should be prioritized on those features.
  • The top three features (ID, X314, and X315) together account for more than 40% of the testing time.
  • Feature X314 alone accounts for 35.8% of the testing time, about 36 seconds on average.
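The feature-ranking step above can be sketched with a random forest regressor and its impurity-based feature importances. This is an illustrative assumption, not the repo's actual notebook: the data below is synthetic (the real set uses anonymized names such as X314), and scikit-learn's `RandomForestRegressor` stands in for whichever regression model was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the Mercedes-Benz data: a few anonymized
# features and a testing-time target driven mostly by features 0 and 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by impurity-based importance (importances sum to 1.0),
# mirroring the "which features dominate testing time" analysis.
ranked = sorted(enumerate(model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for idx, imp in ranked:
    print(f"feature {idx}: {imp:.3f}")
```

With importances in hand, the share of testing time attributable to the top features falls out of a simple cumulative sum over the ranked list.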

3. IEEE-CIS Fraud Detection - a classification problem and its solution

Situation:

  • Credit card fraud is a common form of financial fraud, especially during the pandemic.
  • Shopping online is the new norm.

Task: how to maximize transaction security with minimal hassle to clients?

Action: developing a predictive model based on binary-classification machine learning algorithms.

Results: maximized the detection rate of fraudulent activities while minimizing the number of false alarms (false-positive events).

Takeaways:

  • For fraud detection, both precision and recall need to be considered for evaluating model performance.
  • High recall - fewer missed frauds (false negatives) - less financial loss - favorable for a small bank with a limited number of transactions.
  • High precision - fewer false flags (false positives) - better user experience - favorable for a large bank.
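The precision/recall trade-off above can be made concrete with a small toy example, assuming scikit-learn's metric functions (the labels below are invented for illustration, not taken from the IEEE-CIS data):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = fraud, 0 = legitimate transaction.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Model misses one fraud (a false negative) and raises
# two false alarms (false positives).
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
print(precision, recall)
```

Raising the decision threshold would trade the false alarms for more missed frauds, moving precision up and recall down; the right balance depends on which error is costlier for the bank.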

Version 4 (Latest) and corresponding hyperparameter analysis

  • Improvement: data normalization, model optimization
  • Note: due to the large size of the data set, computationally demanding steps were not performed, including cross-validation, learning curves, and fine-tuning of model hyperparameters.
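The data-normalization improvement can be sketched as standardizing each feature to zero mean and unit variance, a common preprocessing step before many classifiers. This assumes scikit-learn's `StandardScaler`; the exact normalization used in the notebook may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features on very different scales, e.g. transaction amount
# vs. a small categorical code (values are illustrative).
X = np.array([[100.0, 1.0],
              [250.0, 2.0],
              [300.0, 3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per-column: subtract mean, divide by std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Fitting the scaler on the training split only (then reusing it on the test split) avoids leaking test-set statistics into training.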

Previous versions:
Version 3

  • Improvement: feature selection
  • To do: model optimization, data normalization, learning curve, cross-validation

Version 2

  • Improvement: data cleaning
  • To do: feature selection

Version 1

  • To do: data cleaning, feature selection

2. Ames House Price Prediction (model fitting practice) - analysis - regression

  • Random forest regression, RMSE score 0.18125.
  • RMSE (Root Mean Squared Error): lower score is better, testing score provided by Kaggle.
  • To do: exploratory data analysis and feature selection
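The RMSE metric used to score this challenge can be written out in a few lines. As a hedged note: Kaggle evaluates this competition on log-transformed sale prices, so the example below applies the metric in log space; the helper function and sample prices are illustrative, not from the repo.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative sale prices, compared in log space as Kaggle does.
y_true = np.log([200000, 150000, 300000])
y_pred = np.log([210000, 140000, 310000])
print(round(rmse(y_true, y_pred), 4))
```

Because the metric is computed on logarithms, equal *relative* errors on cheap and expensive houses are penalized equally.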

1. RMS Titanic Survival Prediction (testing the waters) - analysis - classification

  • Random forest classification, accuracy score 0.77033.
  • Accuracy score: higher score is better, testing score provided by Kaggle.
  • To do: use Monte Carlo simulation to impute missing data, especially passenger age.

Sources:

Data sets from kaggle.com

Stock Photos from unsplash.com
