Security Analysis
Here, we'll look at one data science use case in the security domain: detecting money laundering. We'll examine how the use case is implemented from beginning to end and how it benefits the domain.
Money laundering is the process of hiding the illegitimate origin of money generated from criminal activities such as human trafficking, drug trafficking, embezzlement, corruption, or any other form of illegal activity. Much like washing clothes, the goal is to clean up the money received so that it appears legitimate on paper. Money laundering occurs in three phases:
- Placement: Illegally earned money is moved into a financial institution.
- Layering: The money is passed through multiple transactions across various financial institutions to hide its source.
- Integration: The money is ready to be used as if it came from a legitimate source.
Estimates of the amount of money laundered range from USD 800 billion to USD 2 trillion every year, which represents about 2-5% of global GDP.
Identifying money laundering is becoming more and more important. Laundered money is used to finance terrorism and other crime, and governments all around the world lose much-needed tax revenue. Numerous financial institutions and regulatory organisations, such as the SEC in the United States or the RBI in India, are developing strategies to combat it. Because launderers frequently follow recurring patterns, pattern identification is vital, but finding those patterns in ever-growing volumes of data is an ongoing challenge. As in every other industry, the amount of data created keeps increasing. As a result, artificial intelligence and machine learning have begun to play a significant role in the detection and prevention of money laundering.
Use Case – Money Laundering
The dataset is a table recording the type of transfer, the source ID, the destination ID, the amount, the date, whether or not the transaction is fraudulent, and the type of laundering.
Data Understanding
Typeofaction: The type of transaction conducted. This is categorical, with two values:
- Transfer
- Cash-in
Sourceid: The origin of the money. This is numeric.
Destinationid: The final point where the money is sent. This is numeric.
Amountofmoney: The amount of money transferred. This is numeric.
Date: When the transaction was initiated. This is in datetime format.
Isfraud: Whether fraud was committed: 0 means no fraud, 1 means fraud. This is binary categorical.
Typeoffraud: The type of fraud committed. There are 4 categories:
- Type I fraud
- Type II fraud
- Type III fraud
- None, i.e. no fraud
Exploratory Data Analysis
EDA is crucial because the data must be understood before an ML model is created. EDA reveals the structure and content of the data, helps assess its quality, and shows how modifications to the data affect it. The pandas-profiling library is used for EDA here. This auto-EDA library generates the initial EDA report without requiring manual analysis code.
Let's start by examining the different columns. According to the EDA report, there are four categorical columns and three numerical columns, giving 7 variables in total. There are 2340 records with no missing data.
Typeofaction is a binary categorical column with two categories, 'cash-in' and 'transfer'. There are 1580 transfer records and 760 cash-in records, so transfers make up 67.5% of the dataset and cash-ins 32.5%.
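The percentages above follow directly from the record counts. A minimal check, using only the counts quoted from the EDA report (the variable names are illustrative):

```python
# Record counts for typeofaction, as quoted from the EDA report.
counts = {"transfer": 1580, "cash-in": 760}

total = sum(counts.values())  # total number of records

# Share of each category as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)
print(shares)
```

On the real DataFrame, `df['typeofaction'].value_counts(normalize=True)` would give the same proportions as fractions.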
Sourceid and destinationid are unique identifiers for financial institutions. As such, there isn't much need for analysis of these columns.
Amountofmoney represents the amount of money transferred. It has 939 distinct values. The following table gives the key statistics for this column:
Statistic | Value |
Minimum | 13332 |
5th percentile | 215500 |
Q1 | 335914 |
Median | 1162353.5 |
Q3 | 4686559.25 |
95th percentile | 7259213 |
Maximum | 7952497 |
Range | 7939165 |
IQR | 4350645.25 |
Mean | 2508582.891 |
Standard Deviation | 2560433.61 |
Variance | 6.555 × 10^12 |
Skewness | 0.74457 |
Kurtosis | -0.988 |
Based on the EDA, this column has a very large range, so normalisation needs to be applied to bring it into a comparable range.
The date column has 2309 distinct values across 2340 records, so almost every value is unique. This column will therefore be ignored at the model building stage.
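The derived rows of the table can be cross-checked from the others. A small sketch, using only values copied from the table (the variable names are illustrative):

```python
# Values copied from the amountofmoney statistics table.
minimum, maximum = 13332, 7952497
q1, q3 = 335914, 4686559.25
std_dev = 2560433.61

value_range = maximum - minimum  # spread between the two extremes
iqr = q3 - q1                    # spread of the middle 50% of records
variance = std_dev ** 2          # variance is the square of the standard deviation

print(value_range)
print(iqr)
print(f"{variance:.3e}")
```

The range spans several orders of magnitude more than a 0/1 encoded column would, which is why the amounts are rescaled before a distance-based model is fitted.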
Isfraud has two outputs, 0 meaning no fraud and 1 meaning fraud. Fraud has 1399 entries and no fraud has 941 entries.
Typeoffraud has 4 possible options, type1, type2, type3, and none. Type1 has 423 entries, type2 has 465 entries, type3 has 511 entries, and none has 941 entries.
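The two target columns should agree with each other: the three fraud types together should account for every record flagged in isfraud. A quick consistency check, using only the counts quoted above (names are illustrative):

```python
# Counts quoted from the EDA report.
typeoffraud_counts = {"type1": 423, "type2": 465, "type3": 511, "none": 941}
isfraud_counts = {1: 1399, 0: 941}

# Records with any fraud type, i.e. everything except "none".
fraud_total = sum(n for label, n in typeoffraud_counts.items() if label != "none")
all_records = sum(typeoffraud_counts.values())

# Share of fraudulent records, as a percentage.
fraud_share = round(100 * fraud_total / all_records, 1)

print(fraud_total, all_records, fraud_share)
```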
In terms of correlation, there aren’t any columns displaying a high degree of collinearity.
Method Setup
The K-Nearest Neighbour (KNN) approach, which classifies a data point based on its closeness to other data points in the dataset, will be used to build the model in Python. This is a supervised learning model, meaning that it is trained on records whose outcome (label) is already known.
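To illustrate the idea of classifying by closeness, here is a minimal from-scratch sketch in NumPy. The function name `knn_predict` and the toy points are made up for illustration; this is not the article's dataset, nor how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features, two well-separated classes.
X_toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                  [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([0.9, 0.9]), k=3)
print(pred_low, pred_high)
```

Because the prediction depends entirely on distances, features with very different ranges need to be brought onto a similar scale first.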
Model building
#!/usr/bin/env python
# coding: utf-8
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("ML.csv")
df.head()
df['typeoffraud'].unique()
df['isfraud'].unique()
df.describe()
df.info()
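# Note: pandas-profiling has since been renamed ydata-profiling; with newer
# releases the import is: from ydata_profiling import ProfileReport
from pandas_profiling import ProfileReport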
prof = ProfileReport(df)
prof.to_file(output_file="output.html")
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
# Min-max normalisation: rescale values to the 0-1 range
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
df['typeofaction'] = enc.fit_transform(df['typeofaction'])
df['typeoffraud'] = enc.fit_transform(df['typeoffraud'])
df.head()
norm_money = norm_func(df[['amountofmoney']])
norm_money
X = df[['typeofaction']]
Y = df['typeoffraud']
X
X = pd.concat([X, norm_money], axis = 1)
X
X = X.values
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
df['typeoffraud'].value_counts()
Y_train.value_counts()
from sklearn.neighbors import KNeighborsClassifier
acc = []
# Try odd values of K from 1 to 23 and record train/test accuracy for each
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc.append([train_acc, test_acc])
# Red: training accuracy, blue: test accuracy
plt.plot(np.arange(1, 25, 2), [i[0] for i in acc], "ro-")
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc], "bo-")
plt.show()
print(acc)
# Now we will take isfraud as the output column
Y = df['isfraud']
Y
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.25)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
acc_1 = []
# Repeat the same search over odd K with isfraud as the target
for i in range(1, 25, 2):
    neighbors = KNeighborsClassifier(n_neighbors=i)
    neighbors.fit(X_train, Y_train)
    train_acc = np.mean(neighbors.predict(X_train) == Y_train)
    test_acc = np.mean(neighbors.predict(X_test) == Y_test)
    acc_1.append([train_acc, test_acc])
# Red: training accuracy, blue: test accuracy
plt.plot(np.arange(1, 25, 2), [i[0] for i in acc_1], "ro-")
plt.plot(np.arange(1, 25, 2), [i[1] for i in acc_1], "bo-")
print(acc_1)
Explanation
KNN is used in the code above. The file output.html contains the report generated by the auto-EDA library. The most important step here is the normalisation function applied to the amountofmoney column. The column is made scale-free and unitless using a technique known as min-max scaling. The formula subtracts the column's minimum value from each value, then divides the result by the difference between the maximum and minimum values. As a result, the data is rescaled to the 0-1 range. Because KNN is based on distances between points, this prevents the amounts, with their very large range, from dominating the 0/1 typeofaction column.
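The min-max formula can be sketched on a few sample values. This is an illustrative NumPy version of the same calculation as `norm_func`, not a replacement for it; the three sample amounts are the minimum, Q1, and maximum quoted in the EDA table.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)


# Sample amounts: minimum, Q1 and maximum from the EDA table.
amounts = [13332, 335914, 7952497]
scaled = min_max_scale(amounts)
print(scaled)
```

Note that the formula divides by zero if every value in the column is identical, so a constant column would need separate handling.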
Label encoding converts categorical data into integer codes. Since both the typeofaction and typeoffraud columns contain categorical data, it is applied to both. LabelEncoder assigns codes in sorted order of the labels, so typeofaction becomes 0 for cash-in and 1 for transfer. The four typeoffraud categories (type1, type2, type3, and none) become the codes 0 to 3; the exact assignment follows the sorted order of the label strings and can be read from enc.classes_ after fitting. The dataset is now prepared for the model building stage.
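A short sketch of how scikit-learn's `LabelEncoder` assigns its codes. The label strings below are assumptions made for illustration; the spellings in the actual dataset may differ, which would change the mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; the real dataset's spellings may differ.
actions = ["transfer", "cash-in", "transfer", "cash-in"]
frauds = ["type2", "none", "type1", "type3"]

# One encoder per column, so each mapping stays available afterwards.
action_enc = LabelEncoder()
fraud_enc = LabelEncoder()

action_codes = action_enc.fit_transform(actions)
fraud_codes = fraud_enc.fit_transform(frauds)

# classes_[i] is the original label that was assigned code i.
print(list(action_enc.classes_), list(action_codes))
print(list(fraud_enc.classes_), list(fraud_codes))

# inverse_transform maps codes back to the original labels.
print(list(fraud_enc.inverse_transform(fraud_codes)))
```

Keeping a separate encoder per column, rather than reusing one, means the codes can still be translated back to labels when the predictions are interpreted.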
The amount of money and the type of action are used as the input columns, and the type of fraud (typeoffraud) is used first as the output column. K is varied from 1 to 23 in steps of 2, so only odd values are tried; odd values of K are the usual choice in KNN because they reduce the chance of tied votes. Below are the training and test accuracy results:
As seen, the accuracy decreases as more neighbours are factored in. Test accuracy is shown in blue and training accuracy in red. K = 13 is chosen as the ideal number of neighbours: test accuracy is around 96% there and falls off afterwards, while training accuracy at this point is around 95%.
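Instead of reading K off the plot, the value with the highest test accuracy can be picked programmatically. The sketch below assumes a list laid out like `acc` in the code, with one `[train_acc, test_acc]` pair per K; the numbers in `acc_demo` are invented placeholders, not the article's results.

```python
import numpy as np

# The K values tried in the loop: 1, 3, ..., 23.
k_values = list(range(1, 25, 2))

# Hypothetical [train_acc, test_acc] pairs, one per K (made-up numbers).
acc_demo = [[1.00, 0.93], [0.98, 0.94], [0.97, 0.95], [0.97, 0.95],
            [0.96, 0.95], [0.96, 0.955], [0.95, 0.96], [0.95, 0.95],
            [0.94, 0.94], [0.94, 0.94], [0.93, 0.93], [0.93, 0.92]]

test_scores = np.array([pair[1] for pair in acc_demo])

# argmax returns the first position of the highest test accuracy.
best_index = int(np.argmax(test_scores))
best_k = k_values[best_index]
print(best_k, test_scores[best_index])
```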
Next, isfraud is chosen as the output column, with the input columns kept the same. Below are the results.
In this scenario, K = 13 is chosen again: test accuracy is at its highest there, at nearly 96%, and training accuracy is nearly the same.
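The code above calls `train_test_split` without a fixed seed, so each run draws a different random split and the reported accuracies will vary slightly. A minimal sketch of a reproducible, stratified split with K = 13 follows; it runs on synthetic stand-in data (an assumption, since the dataset is not included here), so its accuracy says nothing about the real results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the two input columns (hypothetical data):
# column 0 mimics the encoded typeofaction, column 1 the scaled amount.
n = 400
X_demo = np.column_stack([rng.integers(0, 2, size=n), rng.random(n)])
# Synthetic binary target tied to the scaled amount.
y_demo = (X_demo[:, 1] > 0.4).astype(int)

# random_state fixes the split; stratify keeps class proportions equal
# in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = KNeighborsClassifier(n_neighbors=13)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(train_acc, test_acc)
```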
Conclusions
Money laundering is a significant issue, and as the world appears to be entering a recession, stopping it is becoming more and more crucial. Artificial intelligence and machine learning are key to reducing it, and their importance will increase as data science continues to expand.
The use case study comes to an end here. Although the dataset is quite simple, it shows how KNN can be used to classify transactions and identify money laundering. To discover more about money laundering, how it operates, and how it is being stopped in the modern world, readers are urged to read this article. Additionally, readers can experiment with other datasets or apply other machine learning techniques to the dataset described above.