Fraud Detection

Credit card fraud detection based on a Kaggle dataset. The data was modeled and tested with clustering, Logistic Regression, Random Forest, and XGBoost, along with some sampling techniques for balancing the classes.

Some Tips

Features V1 to V28 are principal components obtained with PCA, so they are already scaled. Only the Time and Amount features still need to be scaled.
The F1-score is a good scoring metric for imbalanced data when the positive (fraud) class needs the most attention, because it combines precision and recall on that class. That makes it suitable for measuring model performance here.
The dataset is highly imbalanced, so take care that the model does not overfit to the Non-Fraud majority class. The main techniques used were random under-sampling of the majority class and SMOTE for over-sampling the minority class.
Be aware that Fraud transactions can be natural outliers compared to Non-Fraud transactions. Be careful with anomaly detection, especially outlier removal, because it can discard the very fraud cases the model needs to learn.
Split the data into train and test sets before applying any sampling technique, and apply sampling only to the train data.
Finally, be cautious when combining sampling with cross-validation. If the data is resampled before the folds are created instead of inside each training fold, information leaks into the validation folds.
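
The sampling and cross-validation tips above can be illustrated with a minimal sketch. It is not the code from this repository: the helper names `undersample` and `cv_f1` are hypothetical, the data is a small synthetic stand-in for the Kaggle file, Logistic Regression is used only because it is quick to fit, and for simplicity every column is scaled rather than only Time and Amount. The point it shows is the order of operations: resampling and scaler fitting happen on the training fold only, and the F1-score is computed on a validation fold that keeps its original class ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler


def undersample(X, y, rng):
    """Randomly drop majority-class (label 0) rows until both classes have equal size."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, keep_neg]))
    return X[idx], y[idx]


def cv_f1(X, y, n_splits=5, seed=0):
    """Return one F1-score per fold, resampling and scaling on the training fold only."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        # Balance only the training part of this fold.
        X_tr, y_tr = undersample(X[train_idx], y[train_idx], rng)
        # Fit the scaler on training rows only, then reuse it for validation rows.
        scaler = StandardScaler().fit(X_tr)
        model = LogisticRegression(max_iter=1000)
        model.fit(scaler.transform(X_tr), y_tr)
        # The validation fold is never resampled.
        y_pred = model.predict(scaler.transform(X[val_idx]))
        scores.append(f1_score(y[val_idx], y_pred))
    return scores


# Synthetic stand-in data (an assumption, not the real dataset): about 5% positives.
demo_rng = np.random.default_rng(42)
y_demo = (demo_rng.random(2000) < 0.05).astype(int)
X_demo = demo_rng.normal(size=(2000, 4)) + 2.0 * y_demo[:, None]

scores = cv_f1(X_demo, y_demo)
print("F1 per fold:", np.round(scores, 3))
```

If SMOTE is preferred over under-sampling, the same structure applies: the over-sampler would take the place of `undersample` inside the loop, so that synthetic samples are generated from training rows only.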