Fraud Detection Using Machine Learning

Introduction

This notebook contains an exploratory data analysis and a predictive machine learning model for fraud detection. Fraud detection is valuable to many industries, including banking and finance, insurance, law enforcement, and government agencies.

In recent years there has been a huge increase in fraud attempts, making fraud detection both important and challenging. Despite countless efforts and human supervision, hundreds of millions are lost to fraud. Fraud can take many forms, e.g., stolen credit cards, misleading accounting, and phishing emails. Because fraud cases are rare within a large population of transactions, detecting them is difficult.

Data mining and machine learning help to anticipate and quickly identify fraud so that action can be taken promptly to limit costs. Using data mining tools, a huge number of transactions can be searched to spot patterns and identify fraudulent transactions.

The data does not contain any NULL values.

step              False
type              False
amount            False
nameOrig          False
oldbalanceOrg     False
newbalanceOrg     False
nameDest          False
oldbalanceDest    False
newbalanceDest    False
isFraud           False
isFlaggedFraud    False
dtype: bool

	step	type	amount	nameOrig	oldbalanceOrg	newbalanceOrg	nameDest	oldbalanceDest	isFraud
0	1	PAYMENT	9839.64	C1231006815	170136.0	160296.36	M1979787155	0.0	0
1	1	PAYMENT	1864.28	C1666544295	21249.0	19384.72	M2044282225	0.0	0
2	1	TRANSFER	181.00	C1305486145	181.0	0.00	C553264065	0.0	1
3	1	CASH_OUT	181.00	C840083671	181.0	0.00	C38997010	21182.0	1
4	1	PAYMENT	11668.14	C2048537720	41554.0	29885.86	M1230701703	0.0	0

The provided data contains financial transaction records along with the target variable isFraud, which is the actual fraud status of each transaction. isFlaggedFraud is the indicator the simulation uses to flag a transaction based on some threshold value.

Minimum value of Amount, Old/New Balance of Origin/Destination:

amount            0.0
oldbalanceOrg     0.0
newbalanceOrg     0.0
oldbalanceDest    0.0
newbalanceDest    0.0
dtype: float64

Maximum value of Amount, Old/New Balance of Origin/Destination:

amount            9.244552e+07
oldbalanceOrg     5.958504e+07
newbalanceOrg     4.958504e+07
oldbalanceDest    3.560159e+08
newbalanceDest    3.561793e+08
dtype: float64

Data Analysis

Since there are no missing or garbage values, no data cleaning is needed, but we still need to analyze the data because the values vary hugely across columns. Normalization may also improve the overall accuracy of the machine learning model.
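The text does not say which normalization was applied, so the following is only a minimal sketch, assuming min-max scaling onto [0, 1]. The helper name `min_max_scale` is hypothetical, and the values are the `amount` entries from the first five sample rows shown earlier.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread to rescale; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# 'amount' values from the first five rows of the sample table.
amounts = [9839.64, 1864.28, 181.00, 181.00, 11668.14]
scaled_amounts = min_max_scale(amounts)
print([round(v, 4) for v in scaled_amounts])
```

After scaling, the smallest amount maps to 0 and the largest to 1, so columns with very different ranges become comparable.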

The graph shows that TRANSFER and CASH_OUT are the two most used transaction modes, and they are also the only transaction types in which fraud occurs. We will therefore focus on these two types.
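Restricting the data to these two types could look roughly like the sketch below. It is dependency-free and uses only the first five sample rows from the table earlier; the names `transactions`, `FRAUD_PRONE_TYPES`, `subset`, and `fraud_by_type` are illustrative and do not come from the notebook.

```python
from collections import Counter

# First five rows of the sample table, reduced to the columns needed here.
transactions = [
    {"type": "PAYMENT", "amount": 9839.64, "isFraud": 0},
    {"type": "PAYMENT", "amount": 1864.28, "isFraud": 0},
    {"type": "TRANSFER", "amount": 181.00, "isFraud": 1},
    {"type": "CASH_OUT", "amount": 181.00, "isFraud": 1},
    {"type": "PAYMENT", "amount": 11668.14, "isFraud": 0},
]

FRAUD_PRONE_TYPES = {"TRANSFER", "CASH_OUT"}

# Keep only the transaction types in which fraud was observed.
subset = [row for row in transactions if row["type"] in FRAUD_PRONE_TYPES]

# Count fraudulent rows per type to confirm where fraud occurs in this sample.
fraud_by_type = Counter(row["type"] for row in transactions if row["isFraud"] == 1)
print(len(subset), dict(fraud_by_type))
```

In a pandas workflow the same filter would typically be a boolean mask built with `Series.isin` on the `type` column.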

Things we can conclude from this heatmap:

oldbalanceOrg and newbalanceOrg are highly correlated.
oldbalanceDest and newbalanceDest are highly correlated.
amount is correlated with isFraud (the target variable).
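Each cell of such a heatmap is a correlation coefficient between two columns. As a rough illustration, the sketch below computes a Pearson coefficient by hand for the origin balances in the first five sample rows. Five rows are far too few to say anything about the full dataset, and the helper name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Origin balances before and after the transaction, from the sample table.
old_balance_orig = [170136.0, 21249.0, 181.0, 181.0, 41554.0]
new_balance_orig = [160296.36, 19384.72, 0.0, 0.0, 29885.86]

r = pearson(old_balance_orig, new_balance_orig)
print(round(r, 3))
```

A value close to 1 means the two columns move together almost linearly, which is what "highly correlated" refers to in the list above.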

Apart from these, the features show little correlation with one another, so we need to check whether their relationships depend on the transaction type and amount. To do so, we look at separate heatmaps for fraudulent and non-fraudulent transactions.

Two flags stand out and are worth a closer look: the isFraud and isFlaggedFraud columns. By hypothesis, isFraud marks the transactions that are actually fraudulent, whereas isFlaggedFraud marks the transactions the system blocked because some threshold was triggered. The heatmap shows some relation between isFlaggedFraud and the other columns, which suggests that those columns may also be related to isFraud.

The total number of fraudulent transactions is 8213.
The total number of fraudulent transactions flagged as fraud is 16.
The ratio of fraudulent to non-fraudulent transactions is 1:773.

Thus roughly 1 in every 773 transactions is fraudulent.
The amount lost to these fraudulent transactions is $12056415427.
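The arithmetic behind these figures can be sketched as follows. The fraud count, flagged count, and amount lost are taken from the text above. The total row count is NOT stated in the text. It is an assumption based on the column names matching the public PaySim dataset, so treat the ratio line as illustrative only.

```python
fraud_count = 8213             # reported above
flagged_fraud_count = 16       # reported above
amount_lost = 12_056_415_427   # reported above, in dollars

# ASSUMPTION: total row count of the public PaySim dataset; not stated in the text.
assumed_total_rows = 6_362_620

non_fraud_count = assumed_total_rows - fraud_count

# Whole number of non-fraudulent transactions per fraudulent one.
ratio = non_fraud_count // fraud_count

# Share of actual fraud that the existing flag caught.
flagged_share = flagged_fraud_count / fraud_count

# Mean amount lost per fraudulent transaction.
average_loss = amount_lost / fraud_count

print(f"fraud : non-fraud = 1:{ratio}")
print(f"share of fraud flagged: {flagged_share:.2%}")
print(f"average loss per fraud: ${average_loss:,.0f}")
```

Under that assumption the integer ratio reproduces the 1:773 figure, and the flagged share shows that only a tiny fraction of the actual fraud was caught by the existing flag.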

The plot clearly shows the need for a fast and reliable system to mark fraudulent transactions, since the current system lets fraudulent transactions pass through without labeling them as fraud. Some data exploration can help check for relations between features.

Data Exploration

Data Cleaning

	step	type	amount	oldbalanceOrg	newbalanceOrig	oldbalanceDest	isFraud
0	1	1	9839.64	170136.0	160296.36	0.0	0
1	1	1	1864.28	21249.0	19384.72	0.0	0
2	1	2	181.00	181.0	0.00	0.0	1
3	1	3	181.00	181.0	0.00	21182.0	1
4	1	1	11668.14	41554.0	29885.86	0.0	0
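Comparing the raw sample with the cleaned table above, the identifier columns appear to have been dropped and the transaction type replaced with an integer code. The sketch below is one way this could be done. The mapping is inferred only from the rows shown, and the names `TYPE_CODES`, `DROPPED_COLUMNS`, and `clean_row` are hypothetical.

```python
# Mapping inferred by comparing the raw and cleaned sample tables.
# ASSUMPTION: codes for any other transaction types are not shown in the text,
# so they are deliberately left out rather than guessed.
TYPE_CODES = {"PAYMENT": 1, "TRANSFER": 2, "CASH_OUT": 3}

# Account identifiers are absent from the cleaned table.
DROPPED_COLUMNS = {"nameOrig", "nameDest"}


def clean_row(row):
    """Drop identifier columns and replace the type label with its integer code."""
    cleaned = {key: value for key, value in row.items() if key not in DROPPED_COLUMNS}
    cleaned["type"] = TYPE_CODES[row["type"]]
    return cleaned


# Row 2 of the raw sample table.
raw_row = {
    "step": 1, "type": "TRANSFER", "amount": 181.00,
    "nameOrig": "C1305486145", "oldbalanceOrg": 181.0, "newbalanceOrg": 0.00,
    "nameDest": "C553264065", "oldbalanceDest": 0.0, "isFraud": 1,
}
cleaned_row = clean_row(raw_row)
print(cleaned_row)
```

A type that is missing from the mapping raises a `KeyError`, which makes unseen categories visible instead of silently encoding them.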

Machine Learning Model

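from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for evaluation; the fixed seed makes the split reproducible.
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=121)

# Small random forest with 15 trees.
clf = RandomForestClassifier(n_estimators=15)
clf.fit(train_X, train_y.values.ravel())

# predict() returns hard 0/1 class labels, not probabilities.
predictions = clf.predict(test_X)

# Average precision computed on the hard labels.
print(average_precision_score(test_y, predictions))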

0.7687057112224541
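To make the reported metric more concrete, here is a small dependency-free sketch of average precision, using the step-wise definition (sum of recall increments weighted by precision at each score threshold). The labels and scores are invented purely for illustration. The sketch also shows that the value depends on whether continuous scores or hard 0/1 predictions are passed in, which matters because the code above passes the output of `predict`.

```python
def average_precision(y_true, scores):
    """Sum of (recall increment) x (precision) over descending score thresholds."""
    total_positives = sum(y_true)
    ap = 0.0
    previous_recall = 0.0
    # Visit each distinct score as a decision threshold, strictest first.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
        false_pos = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical labels and model scores, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

# Hard 0/1 predictions obtained by thresholding the scores at 0.5.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

ap_from_scores = average_precision(y_true, scores)
ap_from_labels = average_precision(y_true, hard_labels)
print(round(ap_from_scores, 4), round(ap_from_labels, 4))
```

With hard labels only two thresholds exist, so the metric collapses to a single precision/recall operating point plus the base rate. Passing class probabilities instead lets it reflect the full ranking.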

Save the Model

import joblib

# Persist the trained classifier to disk.
with open('RandomForestClassifier.pkl', 'wb') as model_file:
    joblib.dump(clf, model_file)

Check Fraud

example

	index	step	type	amount	oldbalanceOrg	newbalanceOrig	oldbalanceDest	isFraud
0	2	1	2	181.00	181.0	0.00	0.0	1
1	3	1	3	181.00	181.0	0.00	21182.0	1
2	251	1	2	2806.00	2806.0	0.00	0.0	1
3	252	1	3	2806.00	2806.0	0.00	26202.0	1
4	680	1	2	20128.00	20128.0	0.00	0.0	1
5	0	1	1	9839.64	170136.0	160296.36	0.0	0
6	1	1	1	1864.28	21249.0	19384.72	0.0	0
7	4	1	1	11668.14	41554.0	29885.86	0.0	0
8	5	1	1	7817.71	53860.0	46042.29	0.0	0
9	6	1	1	7107.77	183195.0	176087.23	0.0	0

display(form)

Conclusion

The existing rule-based system is not capable of detecting all fraudulent transactions.
Machine learning can be used to detect fraudulent transactions.
The predictive model achieves an average precision score of about 0.77 and is capable of detecting fraudulent transactions.

About

This is a repository for detecting fraud at a basic level in the banking/financial sector, the insurance sector, law enforcement, or government agencies.
