er1czz / data_challenges

Data Challenge Practices (prior to Insight 2020C)

Data challenge log

Data challenge at Insight link

4. Mercedes-Benz Greener Manufacturing - a regression problem and its solution

Situation:

  • Assembled automobiles need to be tested to ensure safety and reliability.
  • Testing is a time-consuming process.
  • Different cars have different configurations/features.

Task: how to cut the testing time using an algorithmic approach?

Action: using regression models to identify the key features that affect testing time.

Results: regression models quantified how strongly each feature correlates with testing time.

Takeaways:

  • Key features were identified; optimization effort should be prioritized on those features.
  • The top three features (ID, X314, and X315) together account for more than 40% of the testing time.
  • Feature X314 alone accounts for 35.8% of the testing time, about 36 seconds on average.
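The feature-ranking step above can be sketched with a random forest regressor and its impurity-based feature importances. This is an illustrative assumption, not the repo's actual notebook: the data below is synthetic (the real set uses anonymized names such as X314), and scikit-learn's `RandomForestRegressor` stands in for whichever regression model was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the Mercedes-Benz data: a few anonymized
# features and a testing-time target driven mostly by features 0 and 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by impurity-based importance (importances sum to 1.0),
# mirroring the "which features dominate testing time" analysis.
ranked = sorted(enumerate(model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for idx, imp in ranked:
    print(f"feature {idx}: {imp:.3f}")
```

With importances in hand, the share of testing time attributable to the top features falls out of a simple cumulative sum over the ranked list.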

3. IEEE-CIS Fraud Detection - a classification problem and its solution

Situation:

  • Credit card fraud is a common form of financial fraud, especially during the pandemic.
  • Shopping online is the new norm.

Task: how to maximize transaction security with minimal hassle to clients?

Action: developing a predictive model based on binary-classification machine learning algorithms.

Results: maximized the detection rate of fraudulent activities while minimizing the number of false alarms (false-positive events).

Takeaways:

  • For fraud detection, both precision and recall need to be considered for evaluating model performance.
  • High recall - fewer missed frauds (false negatives) - less financial loss - favorable for a small bank with a limited number of transactions.
  • High precision - fewer false flags (false positives) - better user experience - favorable for a large bank.
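The precision/recall trade-off above can be made concrete with a small toy example, assuming scikit-learn's metric functions (the labels below are invented for illustration, not taken from the IEEE-CIS data):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = fraud, 0 = legitimate transaction.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Model misses one fraud (a false negative) and raises
# two false alarms (false positives).
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
print(precision, recall)
```

Raising the decision threshold would trade the false alarms for more missed frauds, moving precision up and recall down; the right balance depends on which error is costlier for the bank.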

Version 4 (Latest) and corresponding hyperparameter analysis

  • Improvement: data normalization, model optimization
  • Note: due to the large size of the data set, computationally demanding steps were not performed, including cross-validation, learning curves, and fine-tuning of model hyperparameters.
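The data-normalization improvement can be sketched as standardizing each feature to zero mean and unit variance, a common preprocessing step before many classifiers. This assumes scikit-learn's `StandardScaler`; the exact normalization used in the notebook may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features on very different scales, e.g. transaction amount
# vs. a small categorical code (values are illustrative).
X = np.array([[100.0, 1.0],
              [250.0, 2.0],
              [300.0, 3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per-column: subtract mean, divide by std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Fitting the scaler on the training split only (then reusing it on the test split) avoids leaking test-set statistics into training.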

Previous versions:
Version 3

  • Improvement: feature selection
  • To do: model optimization, data normalization, learning curve, cross-validation

Version 2

  • Improvement: data cleaning
  • To do: feature selection

Version 1

  • To do: data cleaning, feature selection

2. Ames House Price Prediction (model fitting practice) - analysis - regression

  • Random forest regression, RMSE score 0.18125.
  • RMSE (Root Mean Squared Error): lower score is better, testing score provided by Kaggle.
  • To do: exploratory data analysis and feature selection
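The RMSE metric used to score this challenge can be written out in a few lines. As a hedged note: Kaggle evaluates this competition on log-transformed sale prices, so the example below applies the metric in log space; the helper function and sample prices are illustrative, not from the repo.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative sale prices, compared in log space as Kaggle does.
y_true = np.log([200000, 150000, 300000])
y_pred = np.log([210000, 140000, 310000])
print(round(rmse(y_true, y_pred), 4))
```

Because the metric is computed on logarithms, equal *relative* errors on cheap and expensive houses are penalized equally.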

1. RMS Titanic Survival Prediction (testing the waters) - analysis - classification

  • Random forest classification, accuracy score 0.77033.
  • Accuracy score: higher score is better, testing score provided by Kaggle.
  • To do: use Monte Carlo simulation to impute missing data, especially passenger age.

Sources:

Data sets from kaggle.com

Stock Photos from unsplash.com
