Outlier-Detection-using-PyOD-tools

Detect credit card fraud using Python outlier detection tools from PyOD, such as KNN and Isolation Forest.

About Data

The dataset contains one record per transaction, including the user who made the transaction, the agency name, the merchant category code, and the date and time of purchase.

Purpose

The purpose is to create features that capture user behavior so that abnormal patterns in transactions can be identified. Unsupervised learning techniques are then used to identify similar users and, especially, the set of users whose transactions could be fraudulent.

Feature Engineering

Features are generated to cover three major aspects of fraud detection in the finance world:

Recency (how recent are the transactions)
Frequency (how often the transactions are made)
Monetary (how much money is involved)

Time related (recency)

Granularity in time, from features such as the year, month, day, time, and weekday of each transaction.
Time elapsed between a user's successive transactions.
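The time-related features can be sketched with pandas as follows. This is a minimal illustration on toy data; the column names `user` and `timestamp` and the derived feature names are assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 12:00", "2020-01-03 12:00",
        "2020-02-10 08:30", "2020-02-10 08:45",
    ]),
})

# Order each user's transactions chronologically before differencing.
tx = tx.sort_values(["user", "timestamp"]).reset_index(drop=True)

# Calendar granularity of each transaction.
tx["year"] = tx["timestamp"].dt.year
tx["month"] = tx["timestamp"].dt.month
tx["day"] = tx["timestamp"].dt.day
tx["hour"] = tx["timestamp"].dt.hour
tx["weekday"] = tx["timestamp"].dt.dayofweek  # Monday = 0

# Hours since the same user's previous transaction (NaN for a user's first one).
tx["hours_since_prev"] = (
    tx.groupby("user")["timestamp"].diff().dt.total_seconds() / 3600
)
```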

Frequency related

Number of daily, weekly, monthly, and yearly transactions, to get an idea of how frequently each user transacts.
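One way to compute such counts is a grouped `transform`, which attaches the per-period count back to every transaction row. The sketch below covers the monthly and daily cases on toy data; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-15", "2020-02-03", "2020-01-20", "2020-01-20",
    ]),
})

# Period keys to count within.
tx["year_month"] = tx["timestamp"].dt.to_period("M")
tx["date"] = tx["timestamp"].dt.normalize()

# Number of transactions the user made in the same month / on the same day.
tx["tx_per_month"] = tx.groupby(["user", "year_month"])["timestamp"].transform("count")
tx["tx_per_day"] = tx.groupby(["user", "date"])["timestamp"].transform("count")
```

Weekly and yearly counts follow the same pattern with a different period key.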

Money related (monetary)

Fraction of the user's total transaction amount spent on a particular transaction (a fairly even distribution is generally expected when a user follows a pattern).
Money spent per merchant category code per user.
Amount difference between successive transactions.
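The three monetary features above could be derived along these lines. The column names `user`, `mcc`, and `amount` are assumptions, and the rows are assumed to already be in chronological order per user.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "mcc": [5411, 5411, 5812, 5411, 5812],
    "amount": [10.0, 30.0, 60.0, 50.0, 50.0],
})

# Share of the user's total spend that this transaction accounts for.
tx["frac_of_user_total"] = tx["amount"] / tx.groupby("user")["amount"].transform("sum")

# Total spend of the user within this merchant category code.
tx["user_mcc_spend"] = tx.groupby(["user", "mcc"])["amount"].transform("sum")

# Change in amount versus the user's previous transaction (NaN for the first one).
tx["amount_diff"] = tx.groupby("user")["amount"].diff()
```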

Creating features for baseline comparisons

Aggregations are key to generalizing each user's spending behavior. Using aggregations by merchant category, by time period, and so on, I came up with a coefficient of deviation that quantifies how far each user deviates from the baseline value obtained by aggregation.
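The exact formula for the coefficient of deviation is not given here, so the sketch below assumes one plausible form: the relative difference between a transaction amount and the user's mean amount in the same merchant category. Both the formula and the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
tx = pd.DataFrame({
    "user": ["a", "a", "a", "a"],
    "mcc": [5411, 5411, 5411, 5812],
    "amount": [10.0, 20.0, 60.0, 40.0],
})

# Baseline: the user's mean amount within each merchant category code.
baseline = tx.groupby(["user", "mcc"])["amount"].transform("mean")

# Assumed definition: relative deviation of each amount from its baseline.
# 0 means "exactly typical"; 1.0 means "twice the usual amount".
tx["dev_from_mcc_baseline"] = (tx["amount"] - baseline) / baseline
```

The same pattern applies to other baselines, for example grouping by user and month instead of user and merchant category.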

Standardization of Data

Because the variables differ in scale, the features are scaled to improve the effectiveness of the modeling techniques that follow. The StandardScaler transformation was used.
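A minimal sketch of this step with scikit-learn's `StandardScaler`, on a toy feature matrix whose two columns deliberately differ in scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two features on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

Scaling matters particularly for distance-based detectors such as KNN, where an unscaled large-valued feature would otherwise dominate the distance.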

Splitting the data

The dataset was split into train and test sets so that the KNN model could be trained on one and tested on the other.
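A split like this can be done with scikit-learn's `train_test_split`. The 25 percent test size and the random seed below are assumptions; the ratio actually used is not stated. Since the approach is unsupervised, only the feature matrix is split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical scaled feature matrix with 20 transactions and 2 features.
X = np.arange(40, dtype=float).reshape(20, 2)

# Hold out 25% of the rows for testing (assumed ratio).
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)
```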

PyOD Tool: KNN

A 3-nearest-neighbor approach was used to train the initial model. The model generates decision scores, from which each transaction is labeled 0 or 1, where 1 means anomaly. The first model identified 10 percent of the data as anomalous.

KNN: Limitation and how to overcome

Since KNN can be sensitive to outliers, combination methods are used to take the average or the maximum of the scores from a set of KNN models built with different numbers of nearest neighbors.

KNN: Choosing a threshold boundary

A threshold value is chosen, and transactions scoring above it are identified as anomalous. The value is chosen by looking at the histogram of the scores and at their percentile values.
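The thresholding step can be illustrated with NumPy alone. The scores below are synthetic, and the 98th percentile cutoff is an assumption chosen to mirror a "flag about 2 percent" outcome; in practice the cutoff would come from inspecting the actual score distribution.

```python
import numpy as np

# Synthetic stand-in for combined outlier scores (right-skewed, like typical scores).
rng = np.random.default_rng(0)
scores = rng.exponential(size=1000)

# Histogram counts, to see where the bulk of the scores sits and where the tail starts.
counts, bin_edges = np.histogram(scores, bins=10)

# Candidate cutoffs at a few upper percentiles.
for q in (90, 95, 98, 99):
    print(f"{q}th percentile: {np.percentile(scores, q):.3f}")

# Assumed choice: flag everything above the 98th percentile.
threshold = np.percentile(scores, 98)
is_suspicious = scores > threshold
```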

KNN: Refined results

Based on the two steps above, the initial 9 percent of transactions flagged as anomalous is refined to just 2 percent of the whole data, which can then be reported for further investigation.

PyOD Tool: Isolation Forest

Instead of the number of nearest neighbors, the parameter to tune is now the maximum number of samples. Initially I go with 50 max samples to see the results. This model identifies 30 percent of the data as anomalous, so further tuning is needed to obtain a stricter threshold for flagging anomalies.

Isolation forest: Limitation and how to overcome

Isolation Forest, unlike KNN, needs more hyperparameter tuning, with more sample sizes tested in each iteration, to arrive at a fairly evaluated model. Again, combination methods are used: models are trained on different sample sizes, and the average or maximum of the scores from this ensemble is used for anomaly identification.

Isolation forest: Choosing a threshold boundary

A threshold value is chosen, and transactions scoring above it are identified as anomalous. The value is chosen by looking at the histogram of the scores and at their percentile values.

Isolation forest: Refined results

Based on the two steps above, the initial 30 percent of transactions flagged as anomalous is refined to just 1.25 percent of the whole data, which can then be reported for further investigation. This is an even narrower set than the 2 percent obtained from the KNN analysis.

Conclusion

In this analysis, we see that KNN and Isolation Forest are two effective unsupervised learning techniques for identifying anomalies in a credit card dataset. Isolation Forest does better than KNN at limiting the investigation to a smaller subset of the data, but it needs more tuning.

MB4511 / Anomaly-Detection---Credit-Card-Fraud
