SameeranP / Fraud-Detection-Using-Machine-Learning

This repository demonstrates basic fraud detection for the banking and financial sector, the insurance sector, law enforcement, and government agencies.

Fraud Detection Using Machine Learning

Introduction

This notebook contains exploratory data analysis and a predictive machine learning model for fraud detection. Fraud detection is valuable to many industries, including banking and finance, insurance, law enforcement, and government agencies.

In recent years fraud attempts have increased sharply, which makes fraud detection both important and challenging. Despite countless efforts and human supervision, hundreds of millions are lost to fraud. Fraud takes many forms, e.g. stolen credit cards, misleading accounting, and phishing emails. Because fraud cases are rare within a large population of transactions, they are hard to detect.

Data mining and machine learning help to anticipate and quickly identify fraud so that action can be taken to limit costs. With data mining tools, a huge number of transactions can be searched for patterns that distinguish fraudulent transactions.


The data does not contain any NULL values.

step              False
type              False
amount            False
nameOrig          False
oldbalanceOrg     False
newbalanceOrg     False
nameDest          False
oldbalanceDest    False
newbalanceDest    False
isFraud           False
isFlaggedFraud    False
dtype: bool

First five rows of the data:

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrg     nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
0     1   PAYMENT   9839.64  C1231006815       170136.0      160296.36  M1979787155             0.0             0.0        0               0
1     1   PAYMENT   1864.28  C1666544295        21249.0       19384.72  M2044282225             0.0             0.0        0               0
2     1  TRANSFER    181.00  C1305486145          181.0           0.00   C553264065             0.0             0.0        1               0
3     1  CASH_OUT    181.00   C840083671          181.0           0.00    C38997010         21182.0             0.0        1               0
4     1   PAYMENT  11668.14  C2048537720        41554.0       29885.86  M1230701703             0.0             0.0        0               0
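The NULL check above can be reproduced with a minimal sketch like the following. It assumes pandas and rebuilds only the five preview rows by hand; the variable names `df` and `has_nulls` are illustrative, and the real notebook would load the full dataset instead.

```python
import pandas as pd

# First five rows of the dataset, copied from the preview above.
df = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145",
                 "C840083671", "C2048537720"],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrg": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "nameDest": ["M1979787155", "M2044282225", "C553264065",
                 "C38997010", "M1230701703"],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
    "isFraud": [0, 0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0],
})

# One boolean per column: True if the column has at least one missing value.
has_nulls = df.isnull().any()
print(has_nulls)
```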

The data contains financial transactions together with the target variable isFraud, which records whether a transaction is actually fraudulent. isFlaggedFraud indicates whether the simulation flagged the transaction based on some threshold value.

Minimum value of Amount, Old/New Balance of Origin/Destination:

amount            0.0
oldbalanceOrg     0.0
newbalanceOrg     0.0
oldbalanceDest    0.0
newbalanceDest    0.0
dtype: float64

Maximum value of Amount, Old/New Balance of Origin/Destination:

amount            9.244552e+07
oldbalanceOrg     5.958504e+07
newbalanceOrg     4.958504e+07
oldbalanceDest    3.560159e+08
newbalanceDest    3.561793e+08
dtype: float64
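A summary like the minimum and maximum listings above can be obtained in one call. This is a small sketch assuming pandas, run on the monetary columns of the five preview rows only, so its numbers differ from the full-dataset values shown above.

```python
import pandas as pd

# Monetary columns from the first five rows shown earlier.
df = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrg": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# One row per statistic, one column per monetary field.
summary = df.agg(["min", "max"])
print(summary)
```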

Data Analysis


Since there are no missing or garbage values, no data cleaning is needed. We still need to analyze the data, because the values vary widely in scale across columns. Normalization can also improve the overall accuracy of the machine learning model.
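The notebook does not show which normalization it applies, so the following is only one possible sketch: min-max scaling to the range [0, 1], written with pandas. The helper name `min_max_scale` is hypothetical.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column of a numeric frame to the [0, 1] range."""
    low = frame.min()
    # Replace a zero range by 1 so constant columns do not divide by zero.
    span = (frame.max() - low).replace(0, 1)
    return (frame - low) / span


# Values taken from the first five rows of the dataset.
sample = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
})

scaled = min_max_scale(sample)
print(scaled)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this step matters more if a distance-based or linear model is tried.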


[Figure: transactions by type]

The graph above shows that TRANSFER and CASH_OUT are the two most used transaction types, and they are also the only types in which fraud occurs. We will therefore focus on these two types.
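The observation about transaction types can be checked and acted on with a short sketch like this one. It assumes pandas and uses only the five preview rows; `fraud_by_type`, `fraud_types`, and `subset` are illustrative names.

```python
import pandas as pd

# Type, amount, and label for the first five rows of the dataset.
df = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "isFraud": [0, 0, 1, 1, 0],
})

# Number of fraud transactions per transaction type.
fraud_by_type = df.groupby("type")["isFraud"].sum()

# Types in which at least one fraud occurred.
fraud_types = fraud_by_type[fraud_by_type > 0].index.tolist()

# Keep only the two types the analysis focuses on.
subset = df[df["type"].isin(["TRANSFER", "CASH_OUT"])]
print(fraud_by_type)
print(subset)
```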

[Figure: correlation heatmap of the features]

Things we can conclude from this heatmap:

  • OldbalanceOrg and NewbalanceOrg are highly correlated.
  • OldbalanceDest and NewbalanceDest are highly correlated.
  • Amount is correlated with isFraud (the target variable).

Beyond these, the features show little correlation with each other, so we need to understand whether the relationships between them depend on the transaction type and amount. To do so, we look at separate heatmaps for fraud and non-fraud transactions.
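The matrices behind such per-class heatmaps can be computed as sketched below. The data here is synthetic and generated with NumPy purely for illustration (the frame `toy` and its random labels are assumptions, not the notebook's data); a plotting library would then render each matrix as a heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: balances and amounts with a random fraud label.
rng = np.random.default_rng(0)
n = 200
old_org = rng.uniform(0, 1e5, n)
amount = rng.uniform(1, 1e4, n)
toy = pd.DataFrame({
    "amount": amount,
    "oldbalanceOrg": old_org,
    "newbalanceOrg": np.clip(old_org - amount, 0, None),
    "isFraud": rng.integers(0, 2, n),
})

features = ["amount", "oldbalanceOrg", "newbalanceOrg"]

# Pearson correlation matrices computed separately for each class.
corr_fraud = toy.loc[toy["isFraud"] == 1, features].corr()
corr_legit = toy.loc[toy["isFraud"] == 0, features].corr()
print(corr_fraud.round(2))
print(corr_legit.round(2))
```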

[Figure: correlation heatmaps for fraud and non-fraud transactions]

Two flags stand out and are worth a closer look: the isFraud and isFlaggedFraud columns. By hypothesis, isFraud marks the transactions that are actually fraudulent, whereas isFlaggedFraud marks the transactions that the system blocked because some threshold was triggered. The heatmap above shows some relation between the other columns and isFlaggedFraud, which suggests that those columns are related to isFraud as well.

The total number of fraud transactions is 8213.
The total number of fraud transactions which are marked as fraud is 16.
Ratio of fraud transactions vs non-fraud transactions is 1:773.

Thus roughly 1 in every 773 transactions is a fraud transaction.
Amount lost due to these fraud transactions is $12056415427.
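Figures of this kind can be derived from the label columns as in the sketch below. It assumes pandas and uses only the five preview rows, so the printed values are not the full-dataset numbers reported above; taking the ratio as non-fraud count divided by fraud count is also an assumption about how the notebook defines it.

```python
import pandas as pd

# Amount and both flags for the first five rows of the dataset.
df = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "isFraud": [0, 0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0],
})

is_fraud = df["isFraud"] == 1

n_fraud = int(is_fraud.sum())
# Fraud transactions that the rule-based system also flagged.
n_flagged = int((is_fraud & (df["isFlaggedFraud"] == 1)).sum())
# Non-fraud transactions per fraud transaction.
ratio = (len(df) - n_fraud) / n_fraud
# Total amount moved by fraud transactions.
amount_lost = df.loc[is_fraud, "amount"].sum()

print(n_fraud, n_flagged, ratio, amount_lost)
```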

[Figures: fraud transactions and how many of them were flagged]

The plot above clearly shows the need for a fast and reliable system that marks fraudulent transactions. The current system lets fraud transactions pass through without labeling them as fraud. Some data exploration helps to check for relations between the features.

Data Exploration

[Figures: five exploratory plots]

Data Cleaning

   step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud
0     1     1   9839.64       170136.0       160296.36             0.0             0.0        0
1     1     1   1864.28        21249.0        19384.72             0.0             0.0        0
2     1     2    181.00          181.0            0.00             0.0             0.0        1
3     1     3    181.00          181.0            0.00         21182.0             0.0        1
4     1     1  11668.14        41554.0        29885.86             0.0             0.0        0
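Comparing this table with the raw preview suggests that the name columns and isFlaggedFraud were dropped, newbalanceOrg was renamed to newbalanceOrig, and type was encoded as an integer. A sketch of such a cleaning step, assuming pandas, might look like this. The codes 1 to 3 match the table above; the codes for DEBIT and CASH_IN are assumptions, since those types do not appear in the preview.

```python
import pandas as pd

# First five raw rows of the dataset.
raw = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145",
                 "C840083671", "C2048537720"],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrg": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "nameDest": ["M1979787155", "M2044282225", "C553264065",
                 "C38997010", "M1230701703"],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
    "isFraud": [0, 0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0],
})

# Codes 1-3 follow the cleaned table; 4 and 5 are assumed for the other types.
TYPE_CODES = {"PAYMENT": 1, "TRANSFER": 2, "CASH_OUT": 3, "DEBIT": 4, "CASH_IN": 5}

clean = (
    raw.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
       .rename(columns={"newbalanceOrg": "newbalanceOrig"})
       .assign(type=lambda d: d["type"].map(TYPE_CODES))
)

# Feature matrix and target used by the model.
X = clean.drop(columns="isFraud")
y = clean["isFraud"]
print(clean)
```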

Machine Learning Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# X is assumed to hold the cleaned feature columns and y the isFraud target.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=121)

clf = RandomForestClassifier(n_estimators=15)
clf.fit(train_X, train_y.values.ravel())

# predict() returns hard 0/1 labels, so the score below is computed on labels, not probabilities.
predictions = clf.predict(test_X)
print(average_precision_score(test_y, predictions))
0.7687057112224541
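Average precision is usually computed from continuous scores rather than hard labels. The sketch below shows that variant with `predict_proba`, plus a stratified split to keep the class ratio equal in both parts. It runs on synthetic data from `make_classification`, which is only a stand-in for the transaction data, so its score is not comparable to the value printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=7, weights=[0.95, 0.05], random_state=121
)

# stratify=y keeps the share of positives the same in train and test.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=121, stratify=y
)

clf = RandomForestClassifier(n_estimators=15, random_state=121)
clf.fit(train_X, train_y)

# Probability of the positive class for every test row.
fraud_scores = clf.predict_proba(test_X)[:, 1]
ap = average_precision_score(test_y, fraud_scores)
print(f"Average precision: {ap:.3f}")
```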

Save the Model

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Persist the trained classifier to disk.
joblib.dump(clf, 'RandomForestClassifier.pkl')
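A saved model can later be restored with `joblib.load` and used for prediction. The round trip is sketched below on a small synthetic model written to a temporary directory; the data and the temporary path are assumptions made so the example runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained fraud classifier.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
clf = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

# Write the model to a temporary location, then read it back.
path = os.path.join(tempfile.mkdtemp(), "RandomForestClassifier.pkl")
joblib.dump(clf, path)
restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X) == clf.predict(X)).all())
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.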

Check Fraud

example
   index  step  type    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud
0      2     1     2    181.00          181.0            0.00             0.0             0.0        1
1      3     1     3    181.00          181.0            0.00         21182.0             0.0        1
2    251     1     2   2806.00         2806.0            0.00             0.0             0.0        1
3    252     1     3   2806.00         2806.0            0.00         26202.0             0.0        1
4    680     1     2  20128.00        20128.0            0.00             0.0             0.0        1
5      0     1     1   9839.64       170136.0       160296.36             0.0             0.0        0
6      1     1     1   1864.28        21249.0        19384.72             0.0             0.0        0
7      4     1     1  11668.14        41554.0        29885.86             0.0             0.0        0
8      5     1     1   7817.71        53860.0        46042.29             0.0             0.0        0
9      6     1     1   7107.77       183195.0       176087.23             0.0             0.0        0
display(form)

[Figure: fraud-check form]

Conclusion

  • The existing rule-based system cannot detect all fraud transactions.
  • Machine learning can be used to detect fraud transactions.
  • The predictive model achieves a good average precision score (about 0.77 on the test split) and is capable of detecting fraud transactions.
