Graduation thesis: An approach for fraud detection in financial transactions using machine learning methods


DOCUMENT INFORMATION

Title: Fraud Detection in Financial Transactions
Authors: Nguyen Thi My Lan, Le Ngoc Uyen Vy
Advisor: Dr. Cao Thi Nhan
University: University of Information Technology
Major: Information Systems
Type: Thesis
Year: 2020
City: Ho Chi Minh City
Pages: 95
File size: 35.55 MB

Contents


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

ADVANCED PROGRAM IN INFORMATION SYSTEMS

NGUYEN THI MY LAN - 16520651

LE NGOC UYEN VY - 16521472

AN APPROACH FOR FRAUD DETECTION IN

FINANCIAL TRANSACTIONS

USING MACHINE LEARNING METHODS

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR

Dr. CAO THI NHAN

HO CHI MINH CITY, 2020


ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision No. ……, dated ……, by the Rector of the University of Information Technology.

1. …………………… - Chairman
2. …………………… - Secretary
3. …………………… - Member


ACKNOWLEDGMENTS

This thesis topic would not have been completed without assistance, so the authors implementing the topic gratefully acknowledge the support and motivation they received during the graduation thesis project.

First of all, we would like to thank the Lecturers of the University of Information Technology, as well as the Lecturers of the Information Systems Faculty, who taught and provided solid background knowledge throughout our time studying at school. This knowledge is an important basis for us to complete our graduation thesis.

In particular, we would like to express our endless thanks and gratefulness to our supervisor, Dr. Cao Thi Nhan. Her dedicated support and constant advice have helped us to complete our thesis well. Her words of encouragement and comments have greatly enriched and improved our work. Without her guidance, the thesis would not have been done effectively.

During the implementation of the thesis, we have tried to apply effectively what we have learned, as well as to learn new technologies, to be able to complete the thesis in the best way. However, in the process of implementation, because of limited knowledge, experience, and study time, it is difficult to avoid shortcomings. Therefore, we hope to receive the comments of teachers, as well as the necessary knowledge and skills, to improve the thesis.

Thank you so much!

Authors
Nguyen Thi My Lan and Le Ngoc Uyen Vy

TABLE OF CONTENTS

Chapter 1 INTRODUCTION ................................................. 1
1.1 Problems ........................................................... 1
1.2 Aims and Objectives ................................................ 2
1.3 Languages, Tools and Libraries ..................................... 2
Chapter 2 BACKGROUND ................................................... 4
2.1 Fraudulent transaction definition .................................. 4
2.2 Fraud Detection Approach ........................................... 4
2.3 Techniques ......................................................... 5
2.4 Related Problems ................................................... 5
2.4.1 Imbalanced dataset problems ...................................... 5
2.4.2 Methods to solve the imbalanced problems ......................... 7
Chapter 3 MACHINE LEARNING FOR FRAUD DETECTION ........................ 10
3.1 K Nearest Neighbor ................................................ 10
3.1.1 Applying K Nearest Neighbor (KNN) ............................... 11
3.1.2 Advantages and disadvantages .................................... 13
3.2 Logistic Regression ............................................... 13
3.2.1 Logistic Regression model ....................................... 13
3.2.2 Apply Logistic Regression step-by-step .......................... 15
3.2.3 Advantages and disadvantages .................................... 15
3.3 Support Vector Machine ............................................ 16
3.3.1 Advantages and Disadvantages .................................... 16
3.3.2 SVM model ....................................................... 17
3.4 Model evaluation .................................................. 18
3.4.1 Confusion matrix ................................................ 19
3.4.2 Precision and Recall ............................................ 20
3.4.3 F-1 Score ....................................................... 21
3.4.4 Receiver Operating Characteristic curve ......................... 22
Chapter 4 EXPERIMENTS ................................................. 24
4.1 Data Analytics Pipeline ........................................... 24
4.2 Credit Card Dataset ............................................... 24
4.2.1 Describing Data ................................................. 24
4.2.3 Preprocessing ................................................... 25
4.2.4 Results ......................................................... 30
4.2.5 Discussion ...................................................... 46
4.3 Synthetic Financial Datasets ...................................... 48
4.3.1 Describing Data ................................................. 48
4.3.3 Preprocessing ................................................... 56
4.3.4 Results ......................................................... 58
4.3.5 Discussions ..................................................... 73
Chapter 5 CONCLUSIONS AND FUTURE WORKS ................................ 76
5.1 Conclusions ....................................................... 76
5.1.1 The results achieved ............................................ 76
5.1.2 The limitations ................................................. 76
5.2 Future works
REFERENCES

LIST OF FIGURES

Figure 1: Identity theft reports in the United States ................. xii
Figure 2: Most common types of identity theft ........................ xiii
Figure 3: Credit card fraud reports by year ........................... xiv
Figure 3.1: Applying the K Nearest Neighbor (KNN) algorithm step-by-step [8] ... 11
Figure 3.2: Common distance metrics [9] ................................ 12
Figure 3.3: Graph of Logistic curve where α=0 and β=1 [11] ............. 14
Figure 3.4: Graph of Sigmoid function [12] ............................. 14
Figure 3.5: Confusion Matrix ........................................... 19
Figure 3.6: Precision and Recall ....................................... 20
Figure 3.7: Receiver Operating Characteristic curve model .............. 22
Figure 4.1: Data analytics pipeline
Figure 4.2: The original dataset
Figure 4.3: Dataset class distribution ................................. 27
Figure 4.4: Transactions amount distribution ........................... 27
Figure 4.5: Transaction Time distribution .............................. 28
Figure 4.6: Dataset after scaling features "Time" and "Amount" ......... 29
Figure 4.7: Confusion matrix of testing data when applying KNN algorithm in the original Credit Card dataset ... 31
Figure 4.8: Confusion matrix of testing data when applying LR algorithm in the original Credit Card dataset ... 32
Figure 4.9: ROC curve of LR in the imbalanced Credit Card dataset ...... 33
Figure 4.10: Confusion matrix of testing data when applying SVM algorithm in the original Credit Card dataset ... 34
Figure 4.11: ROC curve of SVM in the imbalanced Credit Card dataset .... 34
Figure 4.12: Class distribution in the subsample after using Random Undersampling ... 35
Figure 4.13: Confusion matrix of testing data when applying KNN with undersampling in the Credit Card dataset ... 36
Figure 4.14: Confusion matrix of testing data when applying LR with undersampling in the Credit Card dataset ... 37
Figure 4.15: ROC curve of LR in the balanced Credit Card dataset using Random Undersampling ... 38
Figure 4.16: Confusion matrix of testing data when applying SVM with undersampling in the Credit Card dataset ... 39
Figure 4.17: ROC curve of SVM in the balanced Credit Card dataset using Random Undersampling ... 40
Figure 4.18: Class distributions of the Credit Card dataset after applying SMOTE ... 41
Figure 4.19: Confusion matrix of testing data when applying KNN with SMOTE in the Credit Card dataset ... 42
Figure 4.20: Confusion matrix of testing data when applying LR with SMOTE in the Credit Card dataset ... 43
Figure 4.21: ROC curve of LR in the balanced Credit Card dataset using SMOTE ... 44
Figure 4.22: Confusion matrix of testing data when applying SVM with SMOTE in the Credit Card dataset ... 45
Figure 4.23: ROC of SVM in balanced Credit Card dataset using SMOTE .... 46
Figure 4.24: The comparison chart of the "accuracy" between different algorithms on the Credit Card dataset ... 47
Figure 4.25: The chart compares the "F1-score" of the algorithms on the Credit Card datasets ... 48
Figure 4.26: Number of transactions which are actual fraud per transaction type ... 52
Figure 4.27: Distribution of types ..................................... 52
Figure 4.28: Distribution of Fraud ..................................... 53
Figure 4.29: Distribution of the feature FlaggedFraud .................. 54
Figure 4.30: Original Paysim dataset ................................... 56
Figure 4.31: Data "isFraud" distribution ............................... 57
Figure 4.32: Data Clean ................................................ 58
Figure 4.33: Confusion matrix of testing data when applying K Nearest Neighbors algorithm in the original Paysim dataset ... 59
Figure 4.34: Confusion matrix of testing data when applying LR algorithm in the original Paysim dataset ... 60
Figure 4.35: ROC curve of LR in the imbalanced Paysim dataset .......... 61
Figure 4.36: Confusion matrix of testing data when applying SVM algorithm in the original Paysim dataset ... 62
Figure 4.37: ROC curve of SVM in the imbalanced Paysim dataset ......... 63
Figure 4.38: Class distribution in the subsample after using Random Undersampling
Figure 4.39: Confusion matrix of testing data when applying KNN with undersampling in the Paysim dataset
Figure 4.40: Confusion matrix of testing data when applying LR with undersampling in the Paysim dataset
Figure 4.41: ROC curve of LR in the balanced Paysim dataset using Random Undersampling ... 67
Figure 4.42: Confusion matrix of testing data when applying SVM with undersampling in the Paysim dataset ... 68
Figure 4.43: ROC curve of SVM in the balanced Paysim dataset using Random Undersampling ... 69
Figure 4.44: Confusion matrix of testing data when applying LR with SMOTE in the Paysim dataset ... 70
Figure 4.45: ROC curve of LR in the balanced Paysim dataset using SMOTE ... 71
Figure 4.46: Confusion matrix of testing data when applying SVM with SMOTE in the Paysim dataset ... 72
Figure 4.47: ROC curve of SVM in the balanced Paysim dataset using SMOTE ... 73
Figure 4.48: The comparison chart of the "accuracy" between different algorithms on the Paysim datasets ... 74
Figure 4.49: The chart compares the "F1-score" of the algorithms on the Paysim datasets

LIST OF TABLES

Table 3.1: Commonly used Kernel functions .............................. 18
Table 4.1: Credit Card Fraud Detection Dataset description ............. 24
Table 4.2: Number of columns and records from the dataset .............. 25
Table 4.3: Class distribution of the Credit Card Fraud Detection Dataset ... 25
Table 4.4: Dataset check missing value ................................. 26
Table 4.5: The dataset after scaling ................................... 30
Table 4.6: The result when using KNN in original Credit Card dataset ... 30
Table 4.7: The result when using LR in original Credit Card dataset .... 32
Table 4.8: The result when using SVM in original Credit Card dataset ... 33
Table 4.9: The result when running KNN with balanced Credit Card dataset using Random Undersampling ... 36
Table 4.10: The result when running LR with balanced Credit Card dataset using Random Undersampling ... 37
Table 4.11: The result when running SVM with balanced Credit Card dataset using Random Undersampling ... 39
Table 4.12: The dataset classes before and after using SMOTE ........... 40
Table 4.13: The result when running KNN with balanced Credit Card dataset using SMOTE ... 41
Table 4.14: The result when running LR with balanced Credit Card dataset using SMOTE ... 43
Table 4.15: The result when running SVM with balanced Credit Card dataset using SMOTE ... 44
Table 4.17: The comparison "F1-Score" between algorithms in the Credit Card dataset ... 48
Table 4.18: Dataset Description ........................................ 49
Table 4.19: Paysim Data Types .......................................... 50
Table 4.20: Check missing value in Paysim dataset ...................... 51
Table 4.21: Quantity statistics by transaction type .................... 51
Table 4.22: Class "isFraud" distribution of the Paysim Dataset ......... 53
Table 4.23: Statistics of the number of transactions on the isFlaggedFraud feature ... 53
Table 4.24: The result when using K Nearest Neighbors in original Paysim dataset ... 58
Table 4.25: The result when using LR in original Paysim dataset ........ 60
Table 4.26: The result when using SVM in original Paysim dataset ....... 61
Table 4.27: The result when running KNN with balanced Paysim dataset using Random Undersampling ... 64
Table 4.28: The result when running LR with balanced Paysim dataset using Random Undersampling ... 66
Table 4.29: The result when running SVM with balanced Paysim dataset using Random Undersampling ... 67
Table 4.30: Number of transactions in Paysim dataset before and after using SMOTE
Table 4.34: The comparison "F1-score" between algorithms in Paysim dataset ... 75

LIST OF ABBREVIATIONS

No. | Full name                                 | Abbreviation
4   | Support Vector Machine                    | SVM
5   | Synthetic Minority Oversampling Technique | SMOTE

The financial industry has always dealt with fraud-related problems, such as missing and damaged transactions. In the United States, there are over 270,000 reports, which makes credit card fraud the most common type of identity theft. The number of frauds doubled from 2017 to 2019 [1].

Figure 2: Most common types of identity theft² (bar chart of report counts: credit card fraud, other identity theft, loan or lease fraud, phone or utilities fraud, bank fraud, employment or tax-related fraud, government documents or benefits fraud)

Nowadays, more and more delicate techniques are used by criminals to steal money from user accounts. As a result, detecting fraudulent transactions is becoming more difficult, because many illegal transactions look like normal ones. In addition, the number of fraudulent transactions is higher than in the last few years.

² Source: https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics

Figure 3: Credit card fraud reports by year³

One of the most common challenges when facing fraudulent transactions is the skewed distribution of classes. The proportion of the fraud class is usually many times smaller than that of the non-fraud class. Though classifiers should be inclined towards the minority group (fraud), they will focus on the majority group because of its regular appearance. To deal with this, in this research we use Undersampling, Oversampling and One-class Classification techniques.

Another problem that arises with an imbalanced dataset is how to choose the performance measures used to evaluate models. In this research, we choose the F1-score, the confusion matrix, precision and recall to evaluate accuracy.

³ Source: https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics


Chapter 1 INTRODUCTION

1.1 Problems

For many decades, fraud in financial transactions has caused a lot of serious damage to the economy and to the development of many businesses over the world. Therefore, enterprises always spend a large share of their resources on detecting fraud in transactions.

Nowadays, online transactions have gradually become more popular; especially in the time of the current COVID-19 epidemic, more than 50% of global transactions use digital payment. Consequently, the way criminals steal money from transactions has changed to hacking the victims' accounts. In addition, as our lives rely increasingly on modern technology, frauds become more complex and sophisticated, and fraud detection becomes more difficult. Guilty behavior like stealing money from transactions happens regularly and recklessly. Criminals take advantage of any defects in a system that they can find with advanced technologies. Before the financial enterprises even notice, they have already been robbed of millions of dollars. This is a hard problem that many businesses have to face.

When a fraudulent transaction appears, it is hard to decide whether that transaction is fraud or not, because it looks like a legal one. Furthermore, businesses have difficulty in detecting fraud due to the lack of related documents and public datasets. Banks and financial enterprises have to keep their data secret because of privacy concerns. This puts financial businesses in the position of having to find an effective way to detect fraud.

Preventing and removing fraud in financial transactions are critical problems for every business. There are many methods for dealing with this, but one of the most effective is applying machine learning algorithms. Normally, the datasets of transactions are so large that machine learning algorithms have to be applied to automatically analyze the behavior of users, forecast future fraud, and find more effective methods to deal with fraudulent transactions.

Due to the importance of solving the fraud detection problem in the real world, we decided to study how to apply machine learning algorithms to classification problems like finding fraudulent transactions. Then, to prove our study, we built a simple tool which takes a sample dataset as input and produces a description of the fraud results as output.

1.2 Aims and Objectives

There are two things that we need to gain from this research. The first is understanding the algorithms used for financial transactions. The second is being able to apply the algorithms we research on a number of sample datasets.

The purpose of this study is to detect fraud in financial transactions using machine learning algorithms and to apply them to the sample datasets.

The datasets we use for this research are Credit Card Fraud Detection and Synthetic Financial Datasets for Fraud Detection, both from Kaggle.

In this research, we use two popular techniques, Undersampling and Oversampling, to solve the imbalanced dataset problem. In addition, we conduct experiments by applying three algorithms to build models: K Nearest Neighbor, Logistic Regression, and Support Vector Machine.

1.3 Languages, Tools and Libraries

• Pandas: an open-source library for analyzing data.

• NumPy: a library for working with arrays in Python.

• Matplotlib: used for plotting (usually combined with NumPy for analyzing data).

• Sklearn: a strong library for machine learning in Python. It provides a ton of useful tools for machine learning and statistical modeling.
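As an illustrative sketch, a typical import block for this stack looks like the following; the exact modules used vary by experiment in Chapter 4:

```python
# Illustrative import block for the stack above (sketch only).
import numpy as np                # arrays and numerical routines
import pandas as pd               # data loading and analysis
import matplotlib.pyplot as plt   # plotting

from sklearn.model_selection import train_test_split    # data splitting
from sklearn.linear_model import LogisticRegression     # one of the models used
from sklearn.metrics import confusion_matrix, f1_score  # evaluation metrics
```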


Chapter 2 BACKGROUND

2.1 Fraudulent transaction definition

According to creditcard.com, a "fraudulent transaction is one unauthorized by the credit card holder. Such transactions are categorized as lost, stolen, not received, issued on a fraudulent application, counterfeit, fraudulent processing of transactions, account takeover or other fraudulent conditions as defined by the card company or the member company."

2.2 Fraud Detection Approach

Information from financial datasets, such as the time and amount of a transaction, is used by researchers to determine whether a suspicious transaction is fraud or not, or to identify outliers in the data.

Scientists have applied many different classification techniques for detecting financial fraud and predicting business failures: neural networks, support vector machines, k nearest neighbors, logistic regression, decision trees and so on. The results show that logistic regression and support vector machines outperform the other algorithms.

• Data-driven Fraud Detection Approach

The year 2020 is the era of data. Many types of data are collected from everywhere in our lives: digital technology, transactions, social media and so on. According to IDC, the volume of data collected every day in the world will accumulate to 175 Zettabytes by 2025 [2]. Data is characterized by volume, velocity and variety, which leads us to the concept of "Big Data and data-driven applications".

The explosion of data, especially in the financial field, opens new opportunities and challenges for fraud detection in financial transactions when approaching the problem with the support of Big Data tools.

Analysing fraud with data-driven methods is a highly promising approach for the three following reasons [3]:

• Precision: Normally, fraud cases are checked using a human-driven approach, but this has many potential risks. Humans cannot review all of the data day by day under the best conditions; this leads to unexpected results and lowered precision. The quality of data is very important because it decides the accuracy of a model. Data-driven fraud analytics works towards a system with higher precision when working with inspected fraud cases.

• Operational Efficiency: A data-driven system is built from a combination of various domains including machine learning, deep learning, mathematics and statistics. Thanks to this, fraud detection can be done more effectively and faster when associated with real-time techniques.

• Cost Efficiency: An effective fraud detection system that is developed and maintained by good experts is really challenging and costly. By automating the exposure of fraudulent data with the involvement of data-driven methodologies, the cost will be reduced significantly.

2.3 Techniques

• Classification in fraud detection

We use machine learning to forecast whether transactions are fraud or not based on data from the past. The input includes some information about the transaction, such as the amount of money, the type of transaction and so on; the output of this problem is whether that transaction is fraud or not, as a binary result: 0 for not fraud and 1 for fraud.

2.4 Related Problems

2.4.1 Imbalanced dataset problems

A dataset is considered imbalanced when the distributions of its classes are unequal. In this case, our imbalanced dataset has two classes, fraud and non-fraud, where the non-fraud class is the majority and occurs more frequently than the fraud class (the minority).

An imbalanced dataset is one of the biggest challenges in fraud detection problems, because most machine learning algorithms normally just focus on the occurrences of the majority classes.

The problem with imbalanced datasets is that naive classifiers always achieve the best results simply by predicting the majority class, so more complex algorithms can end up with a lower accuracy score than the naive classifiers. Most of the more complex algorithms will require modification to prevent prediction based on the majority class in all cases. Seriously imbalanced data makes us uncertain about the accuracy of the model. In this case, there are several methods to eliminate naive behaviors, such as using the Confusion Matrix, ROC, and AUROC.

• Accuracy: In cases where the accuracy is too high, this is the accuracy of the majority class; the accuracy of the minority class cannot be determined.

• Precision and Recall:

  - High Precision + High Recall: the class is perfectly handled by the model.

  - High Precision + Low Recall: the model cannot detect the class well, but when it does, the prediction is highly trustable.

  - Low Precision + High Recall: the class is detected well, but the model also includes points of the other class in it.

  - Low Precision + Low Recall: the class is poorly handled by the model.

Similarly, the F1-score can only be calculated for the majority class, not for the minority class.

In this case, the confusion matrix helps us look at the model again, think about the goal of using the model, and eliminate useless algorithms as well as naive classifiers, because the confusion matrix basically shows how many "real" data points belong to a class and how many are predicted to belong to a class.

The ROC curve is a curve described by the set of points created as a given threshold changes from 1 to 0. The curve starts at (0, 0) and ends at (1, 1). A good model will have a curve that increases rapidly from 0 to 1. Based on the ROC curve, we can compute the AUROC, the area under the curve. The AUROC tends towards 1 for the best models and towards 0.5 for the worst.

It depends on the purpose of use. If the goal is to achieve the best accuracy, then the naive classifier will always give the best results, because it always answers with the majority class. Our goal is to find an effective model, which requires a variety of evaluation methods and conclusions.

Descriptions of the evaluation models are detailed in Chapter 3, Section 3.4 Model Evaluation.

Random resampling is a technique that randomly creates a new sample of the dataset. This is a simple way to get a more balanced dataset to solve imbalanced data problems.

There are two common approaches to random resampling for dealing with skewed data problems: undersampling and oversampling.

Importantly, the change of distribution when applying resampling is used only for the training dataset. The technique is not applied to the test dataset, which is used to evaluate the performance of the model.
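A minimal sketch of this rule with scikit-learn (the file and column names are hypothetical): the data is split first, and only the training portion is resampled.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical transactions file with a binary "Class" label (1 = fraud).
df = pd.read_csv("transactions.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

# Split FIRST: the test set keeps the original, imbalanced distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Resample only the training portion (here: a simple random undersample).
train = pd.concat([X_train, y_train], axis=1)
majority, minority = train[train["Class"] == 0], train[train["Class"] == 1]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
train_balanced = pd.concat([majority_down, minority])
```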

2.4.2 Methods to solve the imbalanced problems

• Undersampling

Undersampling techniques remove examples of the majority class from the training dataset. The result is a better balanced dataset. This process can be repeated many times until we get the expected class distribution.

Undersampling methods are usually applied together with oversampling, in order to get better performance than using only undersampling or oversampling alone on the training dataset.

One of the simplest and most effective ways to apply undersampling is choosing examples from the majority class randomly and deleting them from the training dataset. This technique is called random undersampling [4].

Despite the great effects this technique presents, it also has a limitation: it can discard potentially useful data from the training dataset, which leads to decreased classification performance.
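For reference, random undersampling is available off the shelf in the imbalanced-learn library; a minimal sketch, assuming `X_train` and `y_train` come from a prior train/test split:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes are the same size.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print("Class counts after undersampling:", Counter(y_resampled))
```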

• Synthetic Minority Oversampling Technique (SMOTE)

In contrast to undersampling techniques, oversampling is a method that adds examples of the minority class to the training dataset. A well-known oversampling method is the Synthetic Minority Oversampling Technique (SMOTE). This technique was described by Nitesh Chawla and colleagues in their paper published in 2002, "SMOTE: Synthetic Minority Over-sampling Technique".

This oversampling approach works by creating "synthetic" examples instead of duplicating existing ones [5].

2.4.2.1 SMOTE step-by-step:

Step 1: Choose an instance x from the minority class sample A. For each x ∈ A, calculate the k nearest neighbors using the Euclidean distance between x and the other samples in set A.

Step 2: For each x in A, randomly choose N examples from its k nearest neighbors (i.e. x1, x2, x3, ..., xN); this set will be called set B.

Step 3: For each example xk ∈ B (k = 1, 2, 3, ..., N), we use the following formula to create new examples:

x_new = x + rand(0, 1) · (xk − x)

where rand(0, 1) is a random number drawn uniformly between 0 and 1.
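This interpolation is what the SMOTE implementation in the imbalanced-learn library performs internally; a minimal usage sketch, again assuming a pre-split `X_train` and `y_train`:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# k_neighbors matches the k nearest neighbors chosen in Step 1.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print("Class counts after SMOTE:", Counter(y_resampled))
```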

Another technique to handle the imbalanced problem is one-class classification.

The one-class classification technique can be used for binary imbalanced classification problems, where negative cases (the target class, 0) are considered normal and positive cases (the outlier class, 1) are considered exceptional, or outliers.

With this technique, instead of focusing on the majority class like the other algorithms in machine learning, the model treats the positive cases as outliers. Therefore, one-class classification ignores the discrimination step and focuses on the expected, negative classes.
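A minimal sketch of this idea using scikit-learn's OneClassSVM, fit only on normal (negative) transactions; the variable names and parameter values are illustrative assumptions, not taken from the thesis:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Fit only on normal (class 0) training transactions.
X_normal = X_train[y_train == 0]
ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X_normal)

# predict() returns +1 for inliers (normal) and -1 for outliers;
# map the outliers to the fraud label 1.
y_pred = np.where(ocsvm.predict(X_test) == -1, 1, 0)
```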


Chapter 3 MACHINE LEARNING FOR FRAUD DETECTION

There are several methods of machine learning that are used for the fraud detection problem. Machine learning algorithms can process millions of data objects quickly, and link instances from seemingly unrelated datasets to detect suspicious transactions.

3.1 K Nearest Neighbor

K Nearest Neighbor (KNN) is a simple supervised algorithm in machine learning. It can be used for solving both classification problems and regression problems. In this research, KNN is used to solve a classification problem: fraud detection in financial transactions.

K Nearest Neighbor (KNN) is a non-parametric and lazy learning algorithm. That means the model structure depends on the dataset itself instead of on any specific mathematical theory, and all the training data are used at testing time. This is very helpful because most real-world models are not based on mathematical and analytical theory. However, lazy learning makes the model spend higher costs in memory and time, because it puts more burden on testing to scan all the data points [7].

Definition: "The K Nearest Neighbor (KNN) algorithm supposes that similar things exist close to each other."

In other words, K Nearest Neighbor (KNN) calculates the distance between the query example and the current examples from the dataset, and the query example is assigned to the group with the shortest distance. K is the number of neighbors used to find the distance; it is usually an odd number if the number of classes is even.


3.1.1 Applying K Nearest Neighbor (KNN)

Figure 3.1: Applying the K Nearest Neighbor (KNN) algorithm step-by-step [8] (illustration panels: 0. look at the data; 1. calculate distances; 2. find neighbours; 3. vote on labels)

The following steps represent how to apply the K Nearest Neighbor algorithm (figure 3.1):

1. Load the data.

2. Initialize k to the chosen number of neighbors.

3. For each example in the dataset:


3.1. Calculate the distance between the query example and the current example from the dataset.

Minkowski distance: d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)

Figure 3.2: Common distance metrics [9]

There are several ways to calculate the distance (figure 3.2), but the most common is the Euclidean distance. In this research we use the Euclidean distance.

3.2. Add the distance and the index of the example to an ordered collection.

4. Sort the collection by distance, from smallest to largest.

5. Pick the first k entries from the collection.

6. Get the labels of the selected k entries.

7. Return the mode of the k labels (classification).
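In practice, these steps are implemented by scikit-learn's KNeighborsClassifier; a minimal sketch, assuming a prepared train/test split (k = 5 is an illustrative choice):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# metric="minkowski" with p=2 is exactly the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)        # "lazy" learning: this only stores the data

y_pred = knn.predict(X_test)     # distances are computed at prediction time
print(classification_report(y_test, y_pred, digits=4))
```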

K Nearest Neighbor is one of the best classifier algorithms for detecting fraudulent transactions. However, there is one thing we need to get right in order to obtain a more effective result: how to choose the optimal number of neighbors (k). There is no single best k suitable for all datasets; the parameter k depends on the requirements of each dataset. One way to solve this problem is running the algorithm several times with different values of k until we get the expected result.

• Decreasing the value of k towards 1 makes the predictions less stable.

• Increasing the value of k makes the predictions more stable, but the number of errors will increase.

• Choose k as an odd number when taking a majority vote.

3.1.2 Advantages and disadvantages

Advantages:

• Simple and easy to implement.
• No need to build a model.
• Flexible: the algorithm can be used for classification, regression and even search.

Disadvantages:

• The algorithm becomes slower as the volume of data increases.

3.2 Logistic Regression

Logistic Regression (LR) is one of the most popular and most used machine learning algorithms, specifically for classification. The general statistical model was originally developed and popularized by Joseph Berkson, starting with Berkson (1944), where he coined the term "logit" [10]. Models trained using the Logistic Regression algorithm can be used to describe relationships between data variables, whether binary, continuous, or categorical. Predictions from Logistic Regression can be used to predict whether certain things will happen or not. With the help of this model, we can estimate the probability that a variable belongs to a class or not.

3.2.1 Logistic Regression model

Logistic regression is one of the classification algorithms, used to predict a binary value from a certain set of independent variables (1/0, Yes/No, True/False).

The binary dependent variable takes the values 0 and 1, and the predicted value is a probability. Logistic Regression uses a "logistic curve" [11] to represent the relationship between the independent variable and the dependent variable, bounded between 0 and 1. As the independent variable increases, the predicted value increases along the curve and approaches 1, but never equals 1. Likewise, at the lowest level, the probability approaches 0, but never equals 0 (figure 3.3).

Figure 3.3: Graph of the logistic curve exp(x)/(1 + exp(x)), where α = 0 and β = 1 [11]

The formula for the univariate logistic curve is:

p = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))    (3.1)

To explain Logistic Regression we start with the logistic function. The logistic function is a sigmoid function: for any real input t, it outputs a value between 0 and 1 (figure 3.4).

Figure 3.4: Graph of the Sigmoid function [12]


Logistic Regression can be understood simply as:

y = 1 if β0 + β1 x + ε > 0, and y = 0 otherwise    (3.2)

where β are the parameters that best fit the data, and ε is an error distributed by the standard logistic distribution.

The logistic function (σ: R → (0, 1)) [11]:

σ(t) = e^t / (1 + e^t) = 1 / (1 + e^(−t))    (3.3)

where σ(t) is the predicted output, β0 is the bias or intercept term, and β1 is the coefficient for the single input value x. Let t be the linear function:

t = β0 + β1 x    (3.4)

The general logistic function (p: R → (0, 1)):

p(x) = σ(t) = 1 / (1 + e^(−(β0 + β1 x)))    (3.5)

where p(x) is the probability of the dependent variable.

3.2.2 Apply Logistic Regression step-by-step

1. Visualize the data.

2. Build the input dataset, divided into trainData and testData.

3. Set up the model.

4. Find the parameters.

5. Predict the new data with the model just found.
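A minimal sketch of these five steps using scikit-learn; the split ratio and solver settings are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Steps 1-2: (after visualizing the data) build trainData and testData.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Steps 3-4: set up the model and fit the parameters (the betas above).
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Step 5: predict new data; predict_proba returns the probability p(x).
y_pred = lr.predict(X_test)
p_fraud = lr.predict_proba(X_test)[:, 1]
```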

3.2.3 Advantages and disadvantages

Advantages:


• There is no need to assume two data classes are linearly separable.
• Simple to implement.
• Low variance.
• Provides probabilities for outcomes.
• High reliance on proper presentation of data.

Disadvantages:

• High bias.
• Logistic Regression requires data points to be created independently of each other.
• Not a very powerful algorithm; it can be easily outperformed by other algorithms.
• Poor performance on non-linear data.

3.3 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm. Developed at AT&T Bell Laboratories by Vapnik together with colleagues, SVMs close to their current form were first introduced in a paper at the COLT 1992 conference (Boser, Guyon and Vapnik, 1992) [13, 14].

SVM can be used for both classification and regression. However, it is mainly used in classification problems.

3.3.1 Advantages and Disadvantages

Advantages:

• Handles high-dimensional data well.
• Works with both linear and non-linear boundaries, depending on the kernel used.
• Memory savings.

Disadvantages:

• Susceptible to over-fitting, depending on the kernel.
• Different kernels have different uses (no clear winner).


3.3.2 SVM model

Model: focus on the two parallel hyperplanes

w^T x + b = 1
w^T x + b = −1    (3.6)

We separate the two classes of data by preventing samples from falling into the margin:

w^T x_i + b ≥ 1 for y_i = 1, and w^T x_i + b ≤ −1 for y_i = −1    (3.7)

which can be rewritten as

y_i (w^T x_i + b) ≥ 1 for all 1 ≤ i ≤ n    (3.8)

The aim of SVM is simply to find an optimal hyperplane that separates the two classes of data points with the widest margin.

SVM uses kernel methods, which are a class of algorithms for sample analysis or recognition. The kernel function is given as follows:

(x_i, x_j) → k(x_i, x_j)    (3.9)

• Linear kernel function: the basic kernel function.

k(x_i, x_j) = ⟨x_i, x_j⟩    (3.10)

• Polynomial kernel function: a non-stationary kernel function.

• RBF kernel function: the most accurate to use in SVM-based applications. The RBF kernel function internally uses γ as the Gaussian width, which controls the width of the RBF kernel function [15].

k(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)    (3.12)

If γ is underestimated, the training data will be noisy with a highly sensitive decision boundary. On the other hand, if γ is overestimated, the exponential will behave almost linearly and the high-dimensional projection will start losing its non-linear power.

Table 3.1 shows the commonly used kernel functions.

Table 3.1: Commonly used Kernel functions

Kernel name | Expression K(x, y) | Comments
Linear      | x^T y              | No parameter

Applying SVM step-by-step:

1. Read the sample dataset.

2. Split the data into training and testing subsets.

3. Select one of three kernels (Linear, Quadratic, and RBF).

4. Predict the class.

5. Evaluate the prediction.
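A minimal sketch of these steps using scikit-learn's SVC; the RBF kernel and its parameters are illustrative choices:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Step 3: select a kernel; "rbf" here, with "linear" and "poly" as alternatives.
svm = SVC(kernel="rbf", gamma="scale", C=1.0)
svm.fit(X_train, y_train)

# Steps 4-5: predict the class and evaluate the prediction.
y_pred = svm.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
```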

3.4 Model evaluation

Each model serves a particular purpose. Therefore, we use a number of model evaluation methods to evaluate whether a model is effective and to compare the models' capabilities.


Several commonly used evaluation methods are the accuracy score, the confusion matrix, True/False Positives/Negatives, and Precision and Recall.

3.4.1 Confusion matrix

The confusion matrix is considered the easiest method to evaluate performance: with its help, we can also visualize how many data instances are classified correctly. A short description of the confusion matrix is shown in Figure 3.5.

                      Actual Values
                      Positive    Negative
Predicted  Positive   TP          FP
Values     Negative   FN          TN

Figure 3.5: Confusion Matrix

• TP = True Positives; FP = False Positives
• FN = False Negatives; TN = True Negatives

The confusion matrix represents:

• True Positive values, which mean the actual class of the data matches the predicted class of the data.

• False Positive represents that the actual class of the data was 0 but the model predicted it to be 1.

• False Negative represents that the actual class of the data was 1 but the model predicted it to be 0.

• True Negative values, which mean the actual negative class matches the predicted negative class.

3.4.2 Precision and Recall

For classification problems where the class datasets are very different from each other and accuracy seems misleading, an effective and commonly used measure is Precision-Recall, which can give a proper evaluation of the model. Figure 3.6 shows a description of Precision and Recall.

Figure 3.6: Precision and Recall⁴

⁴ Source: https://en.wikipedia.org/wiki/Precision_and_recall


3.4.2.1 Precision

Precision demonstrates the ability of the model to correctly predict the X label.

We can calculate precision if TP, TN, FN and FP are available. Precision is the ratio of correct positive observations (TP) to all predicted positive observations (TP + FP).

The formula for precision is as follows:

Precision = TP / (TP + FP)    (3.13)

In the formula, the element that causes Precision to increase or decrease is not TP, but FP. Therefore, when Precision is high, FP is small, which means the number of examples mistakenly predicted as label X is low.

High precision means the accuracy of the points found is high.

3.4.2.2 Recall

Recall is another metric for evaluating the prediction. Recall is defined as the ratio of predicted True Positives over the total number of points initially labeled as Positive.

The formula for recall is as follows:

Recall = TP / (TP + FN)    (3.14)

Recall demonstrates the ability of the model not to miss the X label. Just like Precision, Recall depends on FN, in other words, on the possibility that the model predicts the X label incorrectly.

A high Recall means a high True Positive Rate, i.e. the ratio of actual positive points that are missed is low.

3.4.3 F-1 Score

With only Precision or Recall, the quality of the model cannot be evaluated. Even in the case of Precision = 1 or Recall = 1, we cannot say the model is good. The F1-score is then used: F-1 is the harmonic mean of precision and recall.

The F-1 score can be calculated if the precision and recall values are available.


The formula for the F1-score is as follows:

F1 = 2 / (1/Recall + 1/Precision) = 2 · Recall · Precision / (Recall + Precision)    (3.15)

However, in some cases, experts find that the importance of Precision and Recall is different, so they assign weights to them in order to calculate accordingly. The F1-score is changed as follows:

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)    (3.16)
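All of the measures above are single calls in scikit-learn; a minimal sketch, given true labels y_test and predictions y_pred:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, fbeta_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)    # TP / (TP + FP), eq. (3.13)
recall = recall_score(y_test, y_pred)          # TP / (TP + FN), eq. (3.14)
f1 = f1_score(y_test, y_pred)                  # harmonic mean,  eq. (3.15)
f2 = fbeta_score(y_test, y_pred, beta=2)       # recall-weighted, eq. (3.16)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} F2={f2:.4f}")
```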

3.4.4 Receiver Operating Characteristic curve

The Receiver Operating Characteristic (ROC) curve is a graph that illustrates the performance of a binary classification system as the classification threshold is varied. The curve is generated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings. The ROC curve represents the relationship between the sensitivity and the fall-out function (figure 3.7).

Here, the True Positive Rate (TPR) is the Recall, and the False Positive Rate (FPR) is the rate of false alarms:

TPR = TP / (TP + FN)    (3.17)

FPR = FP / (FP + TN)    (3.18)

Based on the ROC curve, we can show whether a model is effective or not. An efficient model has a low FPR and a high TPR, meaning that there exists a point on the ROC curve close to the point with coordinates (0, 1) on the graph (upper left corner). The closer the curve is to that point, the more efficient the model.

Drawing the ROC curve leads to the calculation of the Area Under the Curve (AUC). The AUC ranges from 0 to 1; the higher the value, the better the model.
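A minimal sketch of plotting the ROC curve and computing the AUC with scikit-learn, assuming a fitted model lr that exposes class probabilities:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Scores are the predicted probabilities of the positive (fraud) class.
scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```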


Chapter 4 EXPERIMENTS

In our research, we decided to use two datasets provided by Kaggle: a synthetic dataset generated by an emulator called PaySim, and the Credit Card Fraud Detection dataset. In order to build models for detecting fraud, we decided to apply three algorithms: K Nearest Neighbor, Logistic Regression and Support Vector Machine.

4.1 Data Analytics Pipeline

Figure 4.1: Data analytics pipeline

In our problem, the two datasets are used independently. Therefore, to make the explanation simpler and easier to understand, we perform this data analytics pipeline on each dataset in turn.

4.2 Credit Card Dataset

4.2.1 Describing Data

4.2.1.1 Descriptions

The dataset used was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group of the Université Libre de Bruxelles on big data mining and fraud detection. The dataset contains credit card transactions made by European cardholders over two days in September 2013.

4.2.1.2 Data Exploration

The dataset's description, types and columns are shown in Table 4.1.

Table 4.1: Credit Card Fraud Detection Dataset description

Feature | Type    | Description
Time    | float64 | Number of seconds between this transaction and the first transaction in the dataset
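As a minimal sketch, the dataset can be loaded and inspected with Pandas; the file name below is the Kaggle default and may differ locally:

```python
import pandas as pd

# Kaggle's Credit Card Fraud Detection dataset; "creditcard.csv" is the
# default file name and may differ locally.
df = pd.read_csv("creditcard.csv")

print(df.shape)                                   # number of rows and columns
print(df.dtypes.head())                           # Time is float64, as in Table 4.1
print(df["Class"].value_counts(normalize=True))   # fraud vs. non-fraud share
```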
