
Security Analysis

  • June 26, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Here, we'll look at one use case from the financial security domain: detecting money laundering. We'll examine how the use case is implemented from beginning to end and how it benefits the domain.

Money laundering is the process of hiding the illegitimate origin of money generated from criminal activities such as human trafficking, drug trafficking, embezzlement, corruption, or any other form of illegal activity. The goal is similar to washing clothes: you clean up the money you have received and make it appear legitimate on paper. Money laundering occurs in three phases:

  • Placement: illegally earned money is moved into a financial institution.
  • Layering: the money is passed through multiple transactions across various financial institutions to hide its source.
  • Integration: the money is ready to be used as if it came from a legitimate source.

Estimates of the amount of money laundered each year range from USD 800 billion to USD 2 trillion, which represents about 2-5% of global GDP.

Identifying money laundering is becoming more and more important. Laundered money is used to finance terrorism and other crime, and governments all around the world lose much-needed tax revenue. Numerous financial institutions and regulatory organisations, such as the SEC in the United States and the RBI in India, are developing strategies to combat it. Because launderers tend to follow recurring patterns, pattern identification is vital, but finding those patterns in ever-growing volumes of data is an ongoing challenge. As in every other industry, the amount of data being created is increasing rapidly. As a result, artificial intelligence and machine learning have begun to play a significant role in the detection and prevention of money laundering.


Use Case – Money Laundering

The dataset is a table that records, for each transaction, the type of transfer, the source ID, the destination ID, the amount, the date, whether the transaction is fraudulent, and the type of fraud.


Data Understanding

Typeofaction: the type of transaction conducted. This is a categorical column with two values:

  • Transfer
  • Cash-in

Sourceid: the origin of the money, in numeric format.

Destinationid: the final point to which the money is sent, in numeric format.

Amountofmoney: the amount of money transferred, also in numeric format.

Date: when the transaction was initiated, in datetime format.

Isfraud: whether fraud was committed; 0 means no fraud and 1 means fraud. This is a binary categorical column.

Typeoffraud: the type of fraud committed. There are four categories:

  • Type I fraud
  • Type II fraud
  • Type III fraud
  • None, i.e. no fraud
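The column layout described above can be sketched as a small pandas DataFrame. This is only an illustration: the three rows are invented, and the lowercase names sourceid, destinationid, and date are assumed to follow the same convention as the column names that appear in the model code.

```python
import pandas as pd

# Invented rows for illustration only; they are not taken from the real dataset.
sample = pd.DataFrame(
    {
        "typeofaction": ["cash-in", "transfer", "transfer"],
        "sourceid": [101, 102, 103],
        "destinationid": [201, 202, 203],
        "amountofmoney": [15000, 250000, 4000000],
        "date": ["2019-01-05 10:30:00", "2019-02-11 16:45:00", "2019-03-20 09:15:00"],
        "isfraud": [0, 1, 1],
        "typeoffraud": ["none", "type1", "type2"],
    }
)

# Parse the date strings and mark the categorical columns explicitly.
sample["date"] = pd.to_datetime(sample["date"])
for col in ["typeofaction", "isfraud", "typeoffraud"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

Checking the dtypes early makes it easier to confirm which columns are numeric and which are categorical before any encoding is applied.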

Exploratory Data Analysis

EDA is crucial because the data must be understood before an ML model is created. With the help of EDA, it is possible to determine the structure and content of the data, to assess its quality, and to see how any modifications affect it. The pandas-profiling library is used for EDA here. This Auto-EDA library performs the initial EDA and produces a report without manual coding.

Let's start by examining the columns. According to the EDA report, there are 7 variables in total, four categorical and three numeric, and 2340 records with no missing data.

Typeofaction is a binary categorical column with two categories, ‘cash-in’ and ‘transfer’. Transfer has 1580 records and cash-in has 760 records, so transfers make up 67.5% of the dataset and cash-ins 32.5%.
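The percentages above follow directly from the counts. A minimal sketch, rebuilding the column from the reported counts rather than reading the real file:

```python
import pandas as pd

# Rebuild the column from the counts reported in the EDA (1580 transfers, 760 cash-ins).
typeofaction = pd.Series(["transfer"] * 1580 + ["cash-in"] * 760, name="typeofaction")

counts = typeofaction.value_counts()
shares = (typeofaction.value_counts(normalize=True) * 100).round(1)

print(counts.to_dict())
print(shares.to_dict())
```

On the real data, the same two value_counts calls on df['typeofaction'] would give the class balance.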


Sourceid and destinationid are unique identifiers for financial institutions. As such, these columns need little further analysis.

Amountofmoney represents the amount of money transferred. It has 939 distinct values. The following table gives key statistics for this column:

Quantile Statistics

Statistic            Value
Minimum              13332
Q1                   335914
Median               1162353.5
Q3                   4686559.25
5th percentile       215500
95th percentile      7259213
Maximum              7952497
Range                7939165
IQR                  4350645.25

Descriptive Statistics

Statistic            Value
Standard deviation   2560433.61
Kurtosis             -0.988
Mean                 2508582.891
Skewness             0.74457
Variance             6.555 × 10^12

Based on the EDA, this column has a very large range, so normalization needs to be applied to bring it into a range comparable with the other input.
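To see why the range matters for a distance-based model, consider three hypothetical transactions built from the statistics above. This is a sketch under the assumption that typeofaction is encoded as 0 for cash-in and 1 for transfer; the points a, b, and c are made up for illustration.

```python
import numpy as np

# Columns: [typeofaction (0 = cash-in, 1 = transfer), amountofmoney]
a = np.array([0.0, 335914.0])    # cash-in at the Q1 amount
b = np.array([1.0, 335914.0])    # transfer at the same amount
c = np.array([0.0, 1162353.5])   # cash-in at the median amount

raw_ab = np.linalg.norm(a - b)   # the two points differ only in action type
raw_ac = np.linalg.norm(a - c)   # the two points differ only in amount

# Min-max scale the amount with the reported minimum and maximum.
lo, hi = 13332.0, 7952497.0

def scale(p):
    return np.array([p[0], (p[1] - lo) / (hi - lo)])

scaled_ab = np.linalg.norm(scale(a) - scale(b))
scaled_ac = np.linalg.norm(scale(a) - scale(c))

print(raw_ab, raw_ac)        # the amount gap dwarfs the action-type gap
print(scaled_ab, scaled_ac)  # after scaling, both features are on a 0-1 scale
```

Before scaling, a difference in action type contributes 1 to the distance while a difference in amount contributes hundreds of thousands, so the action type would have almost no influence on which neighbours are nearest.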

The date column has 2309 distinct values among the 2340 records, so almost every value is unique and the column is ignored at the model building stage.

Isfraud has two values: 0 means no fraud and 1 means fraud. There are 1399 fraud entries and 941 no-fraud entries.


Typeoffraud has four possible values: type1, type2, type3, and none. Type1 has 423 entries, type2 has 465 entries, type3 has 511 entries, and none has 941 entries. The three fraud types together account for the 1399 fraud entries.
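The reported counts can be cross-checked against the isfraud totals. The sketch below rebuilds the typeoffraud column from the counts and assumes, as the counts suggest, that isfraud is 1 exactly when typeoffraud is not none; the real file may of course differ.

```python
import pandas as pd

counts = {"type1": 423, "type2": 465, "type3": 511, "none": 941}
typeoffraud = pd.Series(
    [label for label, n in counts.items() for _ in range(n)], name="typeoffraud"
)

# Assumption: a record is fraudulent exactly when its fraud type is not "none".
isfraud = (typeoffraud != "none").astype(int).rename("isfraud")

table = pd.crosstab(isfraud, typeoffraud)
print(table)
print("fraud records:", int(isfraud.sum()))
```

Running pd.crosstab(df['isfraud'], df['typeoffraud']) on the real data would confirm whether the two label columns agree in the same way.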

In terms of correlation, there aren’t any columns displaying a high degree of collinearity.


Method Setup

The model is built in Python using the K-Nearest Neighbour (KNN) approach, which classifies a data point based on its closeness to other data points in the dataset. KNN is a supervised learning model, meaning that it learns from records whose outcome is already known.
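The idea behind KNN can be sketched in a few lines of NumPy. This is an illustrative from-scratch version with made-up points, not the scikit-learn implementation used later; the function name knn_predict is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training points: [encoded action type, scaled amount]
X_train = np.array([[0.0, 0.10], [0.0, 0.15], [1.0, 0.80], [1.0, 0.90], [1.0, 0.85]])
y_train = np.array([0, 0, 1, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([0.0, 0.12]), k=3)
pred_high = knn_predict(X_train, y_train, np.array([1.0, 0.88]), k=3)
print(pred_low, pred_high)
```

Each prediction only looks at the labels of the closest stored points, which is why the choice of K and the scaling of the inputs both matter.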

Model building

#!/usr/bin/env python
# coding: utf-8

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Newer releases of this package are published as ydata_profiling
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Load the data and take a first look
df = pd.read_csv("ML.csv")
df.head()
df['typeoffraud'].unique()
df['isfraud'].unique()
df.describe()
df.info()

# Auto-EDA: write the profiling report to output.html
prof = ProfileReport(df)
prof.to_file(output_file="output.html")

# Min-max normalisation: rescale values to the 0-1 range
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

# Label-encode the categorical columns
enc = LabelEncoder()
df['typeofaction'] = enc.fit_transform(df['typeofaction'])
df['typeoffraud'] = enc.fit_transform(df['typeoffraud'])
df.head()

norm_money = norm_func(df[['amountofmoney']])
norm_money

# Inputs: type of action and normalised amount; output: type of fraud
X = df[['typeofaction']]
Y = df['typeoffraud']
X = pd.concat([X, norm_money], axis=1)
X = X.values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
df['typeoffraud'].value_counts()
Y_train.value_counts()

# Fit KNN for odd K from 1 to 23 and record training and test accuracy
acc = []
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc.append([train_acc, test_acc])

plt.plot(np.arange(1, 25, 2), [i[0] for i in acc], "ro-")  # training accuracy (red)
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc], "bo-")  # test accuracy (blue)
plt.show()
print(acc)

# Now we will take isfraud as the output column
Y = df['isfraud']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

acc_1 = []
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc_1.append([train_acc, test_acc])

plt.plot(np.arange(1, 25, 2), [i[0] for i in acc_1], "ro-")  # training accuracy (red)
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc_1], "bo-")  # test accuracy (blue)
plt.show()
print(acc_1)

Explanation

The code above uses KNN. The Auto-EDA output is saved to the file output.html. The most important preprocessing step here is the normalisation function applied to the amountofmoney column. It uses a technique known as min-max scaling, which makes the column scale-free and unitless: the minimum value of the column is subtracted from each value, and the result is divided by the difference between the maximum and minimum values. As a result, the data is rescaled to the range 0–1. Because KNN is distance-based, reducing the range in this way keeps the large amounts from dominating the typeofaction column.
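In formula form, min-max scaling maps each value x to (x - min) / (max - min). The sketch below applies the article's norm_func to five amounts taken from the quantile table and compares the result with scikit-learn's MinMaxScaler, which is offered here only as an equivalent alternative, not as what the original code uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def norm_func(i):
    # Same formula as in the model code: (x - min) / (max - min)
    return (i - i.min()) / (i.max() - i.min())

# Minimum, Q1, median, Q3, and maximum from the quantile statistics
amounts = pd.DataFrame(
    {"amountofmoney": [13332.0, 335914.0, 1162353.5, 4686559.25, 7952497.0]}
)

manual = norm_func(amounts)
sk = MinMaxScaler().fit_transform(amounts)

print(manual["amountofmoney"].round(4).tolist())
print(np.allclose(manual.to_numpy(), sk))
```

One practical difference is that a fitted MinMaxScaler remembers the training minimum and maximum, so the same transformation can later be applied to new transactions with transform.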

Label encoding converts categorical data into numeric codes. It is applied to both the typeofaction and typeoffraud columns, since both contain categorical data. For typeofaction, it returns 0 for cash-in and 1 for transfer. For typeoffraud, it returns the codes 0, 1, 2, and 3. LabelEncoder assigns codes in sorted order of the label strings, so assuming the labels are spelled none, type1, type2, and type3, none maps to 0 and type1, type2, and type3 map to 1, 2, and 3. The dataset is now ready for the model building stage.
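How LabelEncoder assigns its codes can be checked directly. This sketch assumes the fraud labels are the lowercase strings none, type1, type2, and type3; if the real file spells them differently, the mapping will differ accordingly.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings, listed in arbitrary order
labels = ["type1", "none", "type3", "type2", "none"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(enc.classes_)}

print(mapping)
print(codes.tolist())
```

Because the model code reuses one encoder object for two columns, classes_ only reflects the column fitted last. Keeping a separate encoder per column is one way to preserve both mappings for inverse_transform.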

The amount of money and the type of action are used as the input columns, and typeoffraud is used first as the output column. K takes the odd values from 1 to 23, since range(1, 25, 2) starts at 1 and steps by 2; odd values of K are customary in KNN because they reduce the chance of tied votes. Below are the training and test accuracy results:

[Figure: training and test accuracy against K, with typeoffraud as the output column]

As seen in the plot, accuracy decreases as more neighbours are factored in. Test accuracy is shown in blue and training accuracy in red. K = 13 is chosen as the number of neighbours: test accuracy is around 96% at this point and falls off afterwards, while training accuracy is around 95%.
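Picking K by looking at test accuracy can make the final estimate optimistic, because the test set has then influenced the model. A common alternative, shown here as a sketch and not as the method used in this article, is to choose K by cross-validation on the training split. The data below is synthetic (generated with make_classification), so the chosen K and the accuracy say nothing about the laundering dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the article's ML.csv): 2 features, 2 classes
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Try the same odd K values as the article, scored by 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 25, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_test, y_test)  # held-out accuracy of the refitted best model
print(best_k, round(test_acc, 3))
```

Fixing random_state and stratifying the split also makes the experiment repeatable, which the unseeded train_test_split call does not guarantee.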

Then isfraud is chosen as the output column with input columns being kept the same. Below are the results.

[Figure: training and test accuracy against K, with isfraud as the output column]

In this scenario, K = 13 is chosen again: test accuracy is at its highest there at nearly 96%, and training accuracy is nearly the same.
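An accuracy figure is easier to judge next to a baseline. With 1399 fraud and 941 no-fraud records, always predicting fraud would already be right about 60% of the time. The sketch below computes that baseline from the reported counts and then shows, on a handful of made-up predictions, how a confusion matrix separates the two kinds of error.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Majority-class baseline from the reported isfraud counts
fraud, no_fraud = 1399, 941
total = fraud + no_fraud
baseline_acc = max(fraud, no_fraud) / total
print(round(baseline_acc, 3))

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)      # rows: true 0/1, columns: predicted 0/1
recall = recall_score(y_true, y_pred)      # share of real fraud that was caught
precision = precision_score(y_true, y_pred)  # share of fraud alerts that were real
print(cm.tolist(), recall, precision)
```

For fraud detection, recall on the fraud class is often the number of most interest, since a missed fraudulent transaction is usually costlier than a false alert.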

Conclusions

Money laundering is a significant problem, and with the world appearing to enter a recession, stopping it is becoming more and more crucial. Artificial intelligence and machine learning play a key role in reducing it, and their importance will grow as data science continues to expand.

The use case study comes to an end here. Although the dataset is quite simple, it shows how KNN can be used to classify and identify money laundering. Readers are encouraged to learn more about how money laundering operates and how it is being stopped in the modern world. Readers can also experiment with other datasets or apply other machine learning techniques to the dataset used above.
