Detect credit card fraud using Python outlier detection tools such as KNN and Isolation Forest.
The dataset contains one record per transaction, including the user who made the transaction, the agency name, the merchant category code, and the date and time of purchase.
The purpose is to create features that capture user behavior so that abnormal patterns in transactions can be identified. Unsupervised learning techniques are then used to identify similar users and, in particular, the set of users whose transactions could be fraudulent.
Features are generated to cover three major aspects of fraud detection in the finance world (recency, frequency, and monetary value), along with several behavioral features derived from them:
- Recency (how recent are the transactions)
- Frequency (how often the transactions are made)
- Monetary (how much money is involved)
- Time granularity, through features such as year, month, day, time, and weekday of the transaction
- Time elapsed between a user's successive transactions
- Number of daily, weekly, monthly, and yearly transactions, to capture each user's frequency habits
- Fraction of a user's total spend accounted for by each transaction (a roughly even distribution is expected when the user follows a regular pattern)
- Money spent per merchant category code per user
- Amount difference between successive transactions
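A few of the features listed above can be sketched with pandas as follows. This is only an illustration on toy data: the column names (`user`, `mcc`, `amount`, `timestamp`) are assumptions and are not taken from the actual dataset.

```python
import pandas as pd

# Toy transactions; all column names and values here are hypothetical.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "mcc": ["5411", "5411", "5812", "5411", "5812"],
    "amount": [20.0, 35.0, 400.0, 15.0, 15.0],
    "timestamp": pd.to_datetime([
        "2020-01-03 09:00", "2020-01-03 18:30", "2020-02-10 23:15",
        "2020-01-05 12:00", "2020-01-19 12:00",
    ]),
})
# Successive-transaction features need each user's rows in time order.
tx = tx.sort_values(["user", "timestamp"]).reset_index(drop=True)

# Time granularity features.
tx["year"] = tx["timestamp"].dt.year
tx["month"] = tx["timestamp"].dt.month
tx["day"] = tx["timestamp"].dt.day
tx["hour"] = tx["timestamp"].dt.hour
tx["weekday"] = tx["timestamp"].dt.dayofweek  # Monday = 0

# Time and amount differences between a user's successive transactions
# (the first transaction of each user has no predecessor, so it gets NaN).
tx["hours_since_prev"] = (
    tx.groupby("user")["timestamp"].diff().dt.total_seconds() / 3600
)
tx["amount_diff"] = tx.groupby("user")["amount"].diff()

# Number of transactions per user per calendar month.
tx["monthly_tx_count"] = (
    tx.groupby(["user", "year", "month"])["amount"].transform("count")
)

# Fraction of the user's total spend accounted for by each transaction.
tx["spend_fraction"] = tx["amount"] / tx.groupby("user")["amount"].transform("sum")

# Money spent per merchant category code per user.
tx["user_mcc_spend"] = tx.groupby(["user", "mcc"])["amount"].transform("sum")
```

Using `transform` keeps every aggregate aligned with the original transaction rows, so the result can be fed to a model without an extra merge step.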
Aggregations are key to generalizing each user's spending behavior. Using aggregations such as those by merchant category and by time period, I came up with a coefficient of deviation that quantifies how far each user deviates from the baseline value obtained by the aggregation.
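The exact formula for the coefficient of deviation is not given here, so the following is only one plausible reading: the ratio of each transaction amount to the mean amount of its aggregation group. The helper name `deviation_from_baseline` and the column names are hypothetical.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "a", "b", "b"],
    "mcc": ["5411", "5411", "5411", "5812", "5411", "5411"],
    "amount": [20.0, 30.0, 250.0, 40.0, 10.0, 10.0],
})

def deviation_from_baseline(df, keys, value="amount"):
    """Ratio of each value to the mean of its aggregation group.

    ASSUMPTION: a simple ratio to the group mean stands in for the
    coefficient of deviation, whose definition is not stated in the text.
    """
    baseline = df.groupby(keys)[value].transform("mean")
    return df[value] / baseline

# Deviation from the user's baseline within a merchant category,
# and from the user's overall baseline.
tx["dev_user_mcc"] = deviation_from_baseline(tx, ["user", "mcc"])
tx["dev_user"] = deviation_from_baseline(tx, ["user"])
```

Under this reading, a value near 1 means the transaction matches the baseline, while the 250.0 transaction of user `a` stands out at 2.5 times that user's usual amount in the category.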
Since the variables differ in scale, the features are scaled before modeling, which matters most for distance-based methods such as KNN. The StandardScaler transformation was used.
The dataset was split into training and test sets: the KNN model is trained on the training set and then tried out on the test set.
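As a minimal sketch of this step, assuming scikit-learn's `StandardScaler` is the transformation meant, each feature column is shifted to zero mean and scaled to unit variance. The small matrix below is made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: an amount-like column and a count-like column
# on very different scales.
X = np.array([
    [20.0, 1.0],
    [35.0, 2.0],
    [400.0, 30.0],
    [15.0, 1.5],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per column: (x - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```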
A 3-nearest-neighbor approach was used to train the initial model. The model generates decision scores, from which each transaction is labeled 0 or 1, where 1 means anomaly. The first model identified 10 percent of the data as anomalous.
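The split and the 3-nearest-neighbor scoring can be sketched as follows. The library actually used is not named here, so this sketch makes several assumptions: it uses scikit-learn's `NearestNeighbors`, takes the distance to the third nearest neighbor as the decision score, labels the top 10 percent of training scores as anomalies, and runs on synthetic data in place of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled features: a dense cluster plus a few far points.
X = np.vstack([
    rng.normal(0, 1, size=(480, 4)),
    rng.normal(6, 1, size=(20, 4)),
])
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

k = 3
nn = NearestNeighbors().fit(X_train)

# Decision score = distance to the k-th nearest training point (an assumption).
# Calling kneighbors() without X scores the training points and excludes
# each point from its own neighbor list.
train_dist, _ = nn.kneighbors(n_neighbors=k)
test_dist, _ = nn.kneighbors(X_test, n_neighbors=k)
train_scores = train_dist[:, -1]
test_scores = test_dist[:, -1]

# Label the top 10 percent of training scores as anomalies (assumed fraction)
# and apply the same cut-off to the test scores.
contamination = 0.10
threshold = np.quantile(train_scores, 1 - contamination)
train_labels = (train_scores > threshold).astype(int)  # 1 = anomaly
test_labels = (test_scores > threshold).astype(int)

print("flagged in train:", train_labels.mean())
print("flagged in test:", test_labels.mean())
```

With a fixed contamination fraction, the share of flagged training points is set by that parameter rather than learned from the data, which is why a further thresholding step is useful.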
Since a single KNN model can be sensitive to outliers and to the chosen number of neighbors, combination methods are used: the scores from several KNN models, each fit with a different number of nearest neighbors, are combined by taking their average or maximum.
A threshold value is then chosen, and transactions scoring above it are identified as anomalous. The threshold is picked by looking at the histogram of the scores and their percentile values.
Based on the two steps above, the initial 9 percent of transactions flagged as anomalous is narrowed to just 2 percent of the whole data, which can then be reported for further investigation.
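A minimal sketch of the average and maximum combination, assuming the kth-neighbor distance as the score and a hypothetical grid of neighbor counts. Each model's scores are standardized first so that no single value of k dominates the combination. The data is synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Synthetic stand-in for the scaled training features.
X_train = np.vstack([
    rng.normal(0, 1, size=(300, 3)),
    rng.normal(5, 1, size=(10, 3)),
])

k_values = [3, 5, 10, 20]  # hypothetical grid of neighbor counts

# One neighbor search is enough: distances come back sorted ascending,
# with each training point excluded from its own neighbor list.
nn = NearestNeighbors().fit(X_train)
dist, _ = nn.kneighbors(n_neighbors=max(k_values))

# One score column per k: the distance to the k-th nearest neighbor.
scores = np.column_stack([dist[:, k - 1] for k in k_values])

# Standardize each column so scores from different k are comparable.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

avg_score = z.mean(axis=1)  # "average" combination
max_score = z.max(axis=1)   # "maximum" combination
```

The average smooths out the disagreement between models, while the maximum flags a point as soon as any one model finds it unusual.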
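The threshold selection can be sketched with NumPy as follows. The scores are synthetic, and the 98th percentile is an assumed cut-off chosen only for illustration, not the value used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for the combined anomaly scores:
# a bulk of ordinary scores plus a small high-scoring tail.
combined = np.concatenate([rng.normal(0, 1, 980), rng.normal(6, 1, 20)])

# Percentile values show where the tail starts.
for p in (90, 95, 98, 99):
    print(f"{p}th percentile: {np.percentile(combined, p):.2f}")

# Histogram counts give the same picture in binned form
# (they could be plotted with any plotting library).
counts, edges = np.histogram(combined, bins=20)

# Assumed cut-off: flag everything above the 98th percentile.
threshold = np.percentile(combined, 98)
flagged = combined > threshold
print("flagged fraction:", flagged.mean())
```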
For Isolation Forest, the parameter to tune is the maximum number of samples rather than the number of nearest neighbors. Initially I go with 50 max samples to see the results. This model identifies 30 percent of the data as anomalous, so further tuning and a stricter threshold are needed.
Isolation Forest, unlike KNN, needs more hyperparameter tuning, testing more samples in each iteration, to arrive at a fairly evaluated model. Again, combination methods are used: models are trained on different sample sizes, and the average or maximum of the scores from these ensemble models is used for anomaly identification.
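A minimal sketch of this first Isolation Forest model, assuming scikit-learn's `IsolationForest` with `max_samples=50` and synthetic data in place of the real features. The fraction it flags depends on the data and on the contamination setting, so the output is not expected to match the 30 percent reported here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)

def make_data(n_normal, n_far):
    """Synthetic stand-in: a dense cluster followed by a few far points."""
    return np.vstack([
        rng.normal(0, 1, size=(n_normal, 4)),
        rng.normal(6, 1, size=(n_far, 4)),
    ])

X_train = make_data(480, 20)
X_test = make_data(100, 10)  # last 10 rows are the far points

# Each tree is built on a random subsample of 50 training points.
iforest = IsolationForest(max_samples=50, random_state=0)
iforest.fit(X_train)

# scikit-learn's score_samples is higher for normal points, so it is
# negated here to make higher values mean "more anomalous".
test_scores = -iforest.score_samples(X_test)

# predict() returns -1 for anomalies and 1 for inliers; map to 1/0 labels.
test_labels = (iforest.predict(X_test) == -1).astype(int)

print("flagged in test:", test_labels.mean())
```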
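The same combination idea can be sketched for Isolation Forest by varying `max_samples`. The grid of sample sizes is hypothetical, scikit-learn's `IsolationForest` is assumed, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
# Synthetic stand-in for the scaled training features; last 10 rows are far points.
X_train = np.vstack([
    rng.normal(0, 1, size=(390, 3)),
    rng.normal(5, 1, size=(10, 3)),
])

sample_sizes = [50, 100, 200]  # hypothetical grid for max_samples

# One score column per model; scores are negated so higher = more anomalous.
scores = np.column_stack([
    -IsolationForest(max_samples=m, random_state=0)
    .fit(X_train)
    .score_samples(X_train)
    for m in sample_sizes
])

# Standardize each model's scores before combining them.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

avg_score = z.mean(axis=1)  # "average" combination
max_score = z.max(axis=1)   # "maximum" combination
```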
As with KNN, a threshold value is chosen, and transactions scoring above it are identified as anomalous. The threshold is picked by looking at the histogram of the scores and their percentile values.
Based on the two steps above, the initial 30 percent of transactions flagged as anomalous is narrowed to just 1.25 percent of the whole data, which can then be reported for further investigation. This is an even smaller subset than the 2 percent obtained from the KNN analysis.
In this analysis, we see that KNN and Isolation Forest are two effective unsupervised learning techniques for identifying anomalies in a credit card dataset. Isolation Forest does better than KNN at limiting the investigation to a smaller subset of the data, but it needs more tuning.