
Security Analysis

June 26, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Use Case – Money Laundering

The dataset is a table that records, for each transaction, the type of transfer, the source ID, the destination ID, the amount, the date, whether it is fraudulent, and the type of laundering.

Data Understanding

Typeofaction: The type of transaction conducted. This is a categorical column with two values:

Transfer
Cash-in

Sourceid: The origin of the money. This is in numeric format.

Destinationid: The final point where the money is sent. This is in numeric format.

Amountofmoney: The amount of money transferred. This is also in numeric format.

Date: When the transaction was initiated. This is in datetime format.

Isfraud: Whether fraud was committed; 0 means no fraud, 1 means fraud. This is in binary categorical format.

Typeoffraud: Which type of fraud was committed. There are 4 categories:

Type I fraud
Type II fraud
Type III fraud
None, i.e. no fraud

Exploratory Data Analysis

EDA is essential because the data must be understood before an ML model is built. EDA reveals the structure and content of the data, helps assess its quality, and shows how any modification affects it. The pandas-profiling library is used for EDA here. This auto-EDA library produces the initial analysis without manual coding and saves it as an HTML report.

Let's start by examining the different columns. According to the EDA report, there are four categorical columns and three numeric columns. There are 2340 records and 7 variables in total, with no missing data.

Typeofaction is a binary categorical column with two categories, ‘cash-in’ and ‘transfer’. Transfer has 1580 records and cash-in has 760 records, so transfers make up 67.5% of the dataset and cash-in makes up 32.5%.
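The shares quoted above follow directly from the two counts. A minimal sketch in plain Python, using only the counts reported in the EDA summary (the variable names are illustrative):

```python
# Counts reported in the EDA summary for the typeofaction column.
counts = {"transfer": 1580, "cash-in": 760}
total = sum(counts.values())  # number of records in the dataset

# Share of each category as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```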

Sourceid and destinationid are unique identifiers for financial institutions. As such, there isn’t much need to analyse these two columns.

Amountofmoney represents the amount of money transferred. It has 939 distinct values. The following tables give the key statistics for this column.

Quantile Statistics

Minimum	13332
Q1	335914
Median	1162353.5
Q3	4686559.25
5th percentile	215500
95th percentile	7259213
Maximum	7952497
Range	7939165
IQR	4350645.25

Descriptive Statistics

Standard Deviation	2560433.61
Kurtosis	-0.988
Mean	2508582.891
Skewness	0.74457
Variance	6.555 × 10^12
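Several of the figures in these tables are derived from the others, which gives a quick way to sanity-check the report. The sketch below recomputes the range, the IQR, and the variance from the reported minimum, maximum, quartiles, and standard deviation; nothing here reads the actual dataset.

```python
# Values copied from the EDA report for amountofmoney.
minimum, maximum = 13332, 7952497
q1, q3 = 335914, 4686559.25
std_dev = 2560433.61

value_range = maximum - minimum   # spread between the two extremes
iqr = q3 - q1                     # spread of the middle 50% of values
variance = std_dev ** 2           # variance is the squared standard deviation

print(value_range, iqr, f"{variance:.3e}")
```

The range and IQR match the table exactly, and the squared standard deviation lands close to the reported 6.555 × 10^12.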

Based on the EDA, this column has a very large range, so a normalization technique needs to be applied to bring it onto a comparable scale with the other input.

The date column has 2309 distinct values across 2340 records, so it is almost unique per record and will be ignored at the model-building stage.
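Why the range matters for a distance-based model can be seen with two made-up transactions. In the sketch below, the two points are hypothetical, the encoded typeofaction values 0 and 1 are an assumption about how the category is represented, and only the minimum and maximum come from the EDA report.

```python
import math

# Two hypothetical transactions as (typeofaction code, amountofmoney).
a = (0, 335914.0)   # assumed code 0 = cash-in
b = (1, 336914.0)   # assumed code 1 = transfer; amount differs by only 1000

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# On raw values the amount difference swamps the category difference.
raw_distance = euclidean(a, b)

# Rescale the amount with the reported minimum and maximum.
lo, hi = 13332.0, 7952497.0

def scale(amount):
    return (amount - lo) / (hi - lo)

# After scaling, the category difference is what separates the points.
scaled_distance = euclidean((a[0], scale(a[1])), (b[0], scale(b[1])))
print(raw_distance, scaled_distance)
```

Before scaling, the distance is driven almost entirely by the amount; after scaling, both inputs contribute on a 0 to 1 footing.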

Isfraud has two values, 0 meaning no fraud and 1 meaning fraud. There are 1399 fraud entries and 941 no-fraud entries.

Typeoffraud has four possible values: type1, type2, type3, and none. Type1 has 423 entries, type2 has 465 entries, type3 has 511 entries, and none has 941 entries.

In terms of correlation, no pair of columns shows a high degree of collinearity.
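The two label columns can be cross-checked against each other, and the class counts also give a useful reference point: the accuracy a model would reach by always predicting the most common class. The sketch below is plain arithmetic on the reported counts; the idea of a majority-class baseline is an addition here, not something the EDA report computes.

```python
# Counts reported for the two label columns.
isfraud_counts = {1: 1399, 0: 941}
typeoffraud_counts = {"type1": 423, "type2": 465, "type3": 511, "none": 941}

total = sum(isfraud_counts.values())

# The three fraud types together should equal the number of fraud records.
fraud_from_types = sum(n for label, n in typeoffraud_counts.items() if label != "none")

# Accuracy of always predicting the most frequent class, for each target.
isfraud_baseline = round(max(isfraud_counts.values()) / total, 3)
typeoffraud_baseline = round(max(typeoffraud_counts.values()) / total, 3)

print(fraud_from_types, isfraud_baseline, typeoffraud_baseline)
```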

Method Setup

The K-Nearest Neighbour (KNN) approach, which classifies a data point according to the labels of the data points closest to it in the dataset, will be used to build the model in Python. This is a supervised learning model, meaning that it is trained on records whose outcome is already known.
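The neighbour-voting idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the scikit-learn implementation used later; the function name `knn_predict` and the six toy points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Pair every training point's distance to the query with its label,
    # then sort so that the closest points come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up points: (typeofaction code, scaled amount) with a fraud flag.
train_points = [(0, 0.05), (0, 0.10), (0, 0.12), (1, 0.80), (1, 0.85), (1, 0.90)]
train_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_points, train_labels, query=(1, 0.82), k=3)
print(prediction)
```

The query sits next to the three points labelled 1, so those three neighbours decide the vote.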

Model building

#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("ML.csv")
df.head()
df['typeoffraud'].unique()
df['isfraud'].unique()
df.describe()
df.info()
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file="output.html")
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

# Min-max scaling: maps each column of i onto the 0-1 range
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
df['typeofaction'] = enc.fit_transform(df['typeofaction'])
df['typeoffraud'] = enc.fit_transform(df['typeoffraud'])
df.head()
norm_money = norm_func(df[['amountofmoney']])
norm_money
X = df[['typeofaction']]
Y = df['typeoffraud']
X
X = pd.concat([X, norm_money], axis = 1)
X
X = X.values
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
df['typeoffraud'].value_counts()
Y_train.value_counts()
from sklearn.neighbors import KNeighborsClassifier
acc = []
# Try odd values of K from 1 to 23 and record train/test accuracy for each
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc.append([train_acc, test_acc])
# Training accuracy in red, test accuracy in blue
plt.plot(np.arange(1, 25, 2), [i[0] for i in acc], "ro-")
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc], "bo-")
plt.show()
print(acc)

# Now we will take isfraud as the output column
Y = df['isfraud']
Y
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
acc_1 = []
# Repeat the same sweep over odd K with isfraud as the target
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc_1.append([train_acc, test_acc])
# Training accuracy in red, test accuracy in blue
plt.plot(np.arange(1, 25, 2), [i[0] for i in acc_1], "ro-")
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc_1], "bo-")
plt.show()
print(acc_1)

Explanation

KNN is used in the code above. The file output.html contains the auto-EDA report. The most important preprocessing step here is the normalisation of the amountofmoney column. The column is made scale-free and unitless using min-max scaling: the column minimum is subtracted from each value, and the result is divided by the difference between the maximum and the minimum. As a result, every value lies between 0 and 1. A distance-based method such as KNN behaves better when no single column dominates simply because of its range.
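The min-max formula can be tried on the summary values reported for amountofmoney. The sketch below scales the minimum, Q1, median, Q3, and maximum; because the list contains the true minimum and maximum, each result is what the same value would receive when the whole column is scaled. The helper name `min_max_scale` is illustrative.

```python
def min_max_scale(values):
    """Rescale values to the 0-1 range: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, Q1, median, Q3, and maximum from the EDA report.
amounts = [13332, 335914, 1162353.5, 4686559.25, 7952497]
scaled = min_max_scale(amounts)

for raw, s in zip(amounts, scaled):
    print(f"{raw:>12} -> {s:.4f}")
```

The minimum maps to 0, the maximum maps to 1, and the median lands well below 0.5, which reflects the right skew noted in the descriptive statistics.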

Label encoding converts categorical data into integer codes. Since both the typeofaction and typeoffraud columns hold categorical data, it is applied to both. LabelEncoder numbers the classes in sorted order, so typeofaction becomes 0 for cash-in and 1 for transfer. For typeoffraud, assuming the labels are spelled none, type1, type2, and type3, the codes 0, 1, 2, and 3 stand for none, type1, type2, and type3 respectively. The dataset is now ready for the model-building stage.
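What label encoding does can be mimicked without any library. The sketch below is a plain-Python stand-in, not scikit-learn's code: like LabelEncoder, it numbers the distinct labels in sorted order. The function name `label_encode` and the sample values are made up for the example.

```python
def label_encode(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# A few made-up typeofaction values.
actions = ["transfer", "cash-in", "transfer", "cash-in", "cash-in"]
codes, mapping = label_encode(actions)
print(mapping, codes)
```

Because the numbering follows sorted order, the code a label receives depends on how it is spelled in the data, so it is worth inspecting the fitted encoder's `classes_` attribute before interpreting the integers.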

The amount of money and the type of action are used as the input columns, and typeoffraud is used as the first output column. K takes the odd values from 1 to 23, because range(1, 25, 2) starts at 1 and advances in steps of 2; odd values are a common choice in KNN since an odd K avoids tied votes in two-class problems. Below are the training and test accuracies:

As seen, accuracy decreases as more neighbours are factored in. Test accuracy is shown in blue and training accuracy in red. K = 13 is chosen as the number of neighbours, since test accuracy is around 96% there and falls off afterwards; training accuracy at this point is around 95%.
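Picking K from such a sweep amounts to pairing each candidate K with its recorded accuracies and taking the one with the best test score. The sketch below shows that bookkeeping; the accuracy pairs in `toy_acc` are invented for illustration and are not the results of this study.

```python
# The candidate values of K produced by range(1, 25, 2).
k_values = list(range(1, 25, 2))

# Hypothetical [train_accuracy, test_accuracy] pairs, one per K (made up).
toy_acc = [
    [1.00, 0.90], [0.98, 0.92], [0.97, 0.93], [0.96, 0.94],
    [0.96, 0.95], [0.95, 0.94], [0.95, 0.94], [0.94, 0.93],
    [0.94, 0.93], [0.93, 0.92], [0.93, 0.92], [0.92, 0.91],
]

# Choose the K whose test accuracy is highest.
best_k, (best_train, best_test) = max(
    zip(k_values, toy_acc), key=lambda pair: pair[1][1]
)
print(k_values)
print(best_k, best_train, best_test)
```

Since the train/test split in the listing is random and unseeded, the curve and the chosen K can shift from run to run.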

Next, isfraud is chosen as the output column, with the input columns kept the same. Below are the results.

In this scenario, K = 13 is chosen again: test accuracy peaks there at nearly 96%, and training accuracy is nearly the same.

Conclusions

Money laundering is a significant issue, and as the world appears to be entering a recession, stopping it is becoming more and more important. Artificial intelligence and machine learning play a key role in reducing it, and their importance will increase as data science continues to expand.

The use case study comes to an end here. The dataset is quite simple, yet KNN is able to categorise and identify money laundering in it. To discover more about money laundering, how it operates, and how it is being stopped in the modern world, readers are urged to read this article. Additionally, readers can experiment with other datasets or apply other machine learning techniques to the dataset described above.
