VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN THI MY LAN - 16520651
LE NGOC UYEN VY - 16521472
AN APPROACH FOR FRAUD DETECTION IN
FINANCIAL TRANSACTIONS
USING MACHINE LEARNING METHODS
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR
Dr CAO THI NHAN
HO CHI MINH CITY, 2020
ASSESSMENT COMMITTEE
The Assessment Committee is established under Decision No. ............, dated ............, by the Rector of the University of Information Technology.
1. ............ - Chairman
2. ............ - Secretary
3. ............ - Member
ACKNOWLEDGMENTS
This thesis could not have been completed without the assistance of many people, so the authors would like to gratefully acknowledge their support and motivation throughout the graduation thesis project.

First of all, we would like to thank the lecturers of the University of Information Technology, and of the Information Systems Faculty in particular, who taught us and provided the solid background knowledge throughout our time studying at the university. This knowledge is an important basis for us to complete our graduation thesis.

In particular, we would like to express our endless thanks and gratitude to our supervisor, Dr. Cao Thi Nhan. Her dedicated support and constant advice have helped us to complete our thesis well. Her words of encouragement and comments have greatly enriched and improved our work. Without her guidance, the thesis would not have been done effectively.

During the implementation of the thesis, we have tried to apply effectively what we have learned, as well as to learn new technologies, in order to complete the thesis in the best way. However, because of our limited knowledge, experience, and study time, it is difficult to avoid shortcomings. Therefore, we hope to receive comments from our teachers so that we can improve the thesis and gain the necessary knowledge and skills.

Thank you so much!
Authors
Nguyen Thi My Lan - Le Ngoc Uyen Vy
TABLE OF CONTENTS

Chapter 1 INTRODUCTION
  1.1 Problems
  1.2 Aims and Objectives
  1.3 Languages, Tools and Libraries
Chapter 2 BACKGROUND
  2.1 Fraudulent transaction definition
  2.2 Fraud Detection Approach
  2.3 Techniques
  2.4 Related Problems
    2.4.1 Imbalanced dataset problems
    2.4.2 Methods to solve the imbalanced problems
Chapter 3 MACHINE LEARNING FOR FRAUD DETECTION
  3.1 K Nearest Neighbor
    3.1.1 Applying K Nearest Neighbor (KNN)
    3.1.2 Advantages and disadvantages
  3.2 Logistic Regression
    3.2.1 Logistic Regression model
    3.2.2 Apply Logistic Regression step-by-step
    3.2.3 Advantages and disadvantages
  3.3 Support Vector Machine
    3.3.1 Advantages and Disadvantages
    3.3.2 SVM model
  3.4 Model evaluation
    3.4.1 Confusion matrix
    3.4.2 Precision and Recall
    3.4.3 F-1 Score
    3.4.4 Receiver Operating Characteristic curve
Chapter 4 EXPERIMENTS
  4.1 Data Analytics Pipeline
  4.2 Credit Card Dataset
    4.2.1 Describing Data
    4.2.3 Preprocessing
    4.2.4 Results
    4.2.5 Discussion
  4.3 Synthetic Financial Datasets
    4.3.1 Describing Data
    4.3.3 Preprocessing
    4.3.4 Results
    4.3.5 Discussions
Chapter 5 CONCLUSIONS AND FUTURE WORKS
  5.1 Conclusions
    5.1.1 The results achieved
    5.1.2 The limitations
  5.2 Future works
REFERENCES
LIST OF FIGURES

Figure 1: Identity theft reports in the United States
Figure 2: Most common types of identity theft
Figure 3: Credit card fraud reports by year
Figure 3.1: Applying the K Nearest Neighbor (KNN) algorithm step-by-step [8]
Figure 3.2: Common distance metrics [9]
Figure 3.3: Graph of the Logistic curve where α=0 and β=1 [11]
Figure 3.4: Graph of the Sigmoid function [12]
Figure 3.5: Confusion Matrix
Figure 3.6: Precision and Recall
Figure 3.7: Receiver Operating Characteristic curve model
Figure 4.1: Data analytics Pipeline
Figure 4.2: The original dataset
Figure 4.3: Dataset Class Distribution
Figure 4.4: Transactions amount distribution
Figure 4.5: Transaction Time distribution
Figure 4.6: Dataset after scaling features "Time" and "Amount"
Figure 4.7: Confusion matrix of testing data when applying KNN algorithm in the original Credit Card dataset
Figure 4.8: Confusion matrix of testing data when applying LR algorithm in the original Credit Card dataset
Figure 4.9: ROC curve of LR in the imbalanced Credit Card dataset
Figure 4.10: Confusion matrix of testing data when applying SVM algorithm in the original Credit Card dataset
Figure 4.11: ROC curve of SVM in the imbalanced Credit Card dataset
Figure 4.12: Class distribution in the subsample after using Random Undersampling
Figure 4.13: Confusion matrix of testing data when applying KNN with undersampling in the Credit Card dataset
Figure 4.14: Confusion matrix of testing data when applying LR with undersampling in the Credit Card dataset
Figure 4.15: ROC curve of LR in the balanced Credit Card dataset using Random Undersampling
Figure 4.16: Confusion matrix of testing data when applying SVM with undersampling in the Credit Card dataset
Figure 4.17: ROC curve of SVM in the balanced Credit Card dataset using Random Undersampling
Figure 4.18: Class distributions of the Credit Card dataset after applying SMOTE
Figure 4.19: Confusion matrix of testing data when applying KNN with SMOTE in the Credit Card dataset
Figure 4.20: Confusion matrix of testing data when applying LR with SMOTE in the Credit Card dataset
Figure 4.21: ROC curve of LR in the balanced Credit Card dataset using SMOTE
Figure 4.22: Confusion matrix of testing data when applying SVM with SMOTE in the Credit Card dataset
Figure 4.23: ROC of SVM in the balanced Credit Card dataset using SMOTE
Figure 4.24: The comparison chart of the "accuracy" between different algorithms on the Credit Card dataset
Figure 4.25: The chart compares the "F1-score" of the algorithms on the Credit Card datasets
Figure 4.26: Number of transactions which are the actual fraud per transaction type
Figure 4.27: Distribution of types
Figure 4.28: Distribution of Fraud
Figure 4.29: Distribution of the feature FlaggedFraud
Figure 4.30: Original Paysim dataset
Figure 4.31: Data "isFraud" distribution
Figure 4.32: Data Clean
Figure 4.33: Confusion matrix of testing data when applying K Nearest Neighbors algorithm in the original Paysim dataset
Figure 4.34: Confusion matrix of testing data when applying LR algorithm in the original Paysim dataset
Figure 4.35: ROC curve of LR in the imbalanced Paysim dataset
Figure 4.36: Confusion matrix of testing data when applying SVM algorithm in the original Paysim dataset
Figure 4.37: ROC curve of SVM in the imbalanced Paysim dataset
Figure 4.38: Class distribution in the subsample after using Random Undersampling
Figure 4.39: Confusion matrix of testing data when applying KNN with undersampling in the Paysim dataset
Figure 4.40: Confusion matrix of testing data when applying LR with undersampling in the Paysim dataset
Figure 4.41: ROC curve of LR in the balanced Paysim dataset using Random Undersampling
Figure 4.42: Confusion matrix of testing data when applying SVM with undersampling in the Paysim dataset
Figure 4.43: ROC curve of SVM in the balanced Paysim dataset using Random Undersampling
Figure 4.44: Confusion matrix of testing data when applying LR with SMOTE in the Paysim dataset
Figure 4.45: ROC curve of LR in the balanced Paysim dataset using SMOTE
Figure 4.46: Confusion matrix of testing data when applying SVM with SMOTE in the Paysim dataset
Figure 4.47: ROC curve of SVM in the balanced Paysim dataset using SMOTE
Figure 4.48: The comparison chart of the "accuracy" between different algorithms on the Paysim datasets
Figure 4.49: The chart compares the "F1-score" of the algorithms on the Paysim dataset
LIST OF TABLES

Table 3.1: Commonly used Kernel functions
Table 4.1: Credit Card Fraud Detection Dataset description
Table 4.2: Number of columns and records from the dataset
Table 4.3: Class distribution of the Credit Card Fraud Detection Dataset
Table 4.4: Dataset check missing value
Table 4.5: The dataset after scaling
Table 4.6: The Result when using KNN in original Credit Card dataset
Table 4.7: The Result when using LR in original Credit Card dataset
Table 4.8: The Result when using SVM in original Credit Card dataset
Table 4.9: The result when running KNN with balanced Credit Card dataset using Random Undersampling
Table 4.10: The result when running LR with balanced Credit Card dataset using Random Undersampling
Table 4.11: The result when running SVM with balanced Credit Card dataset using Random Undersampling
Table 4.12: The dataset classes before and after using SMOTE
Table 4.13: The result when running KNN with balanced Credit Card dataset using SMOTE
Table 4.14: The result when running LR with balanced Credit Card dataset using SMOTE
Table 4.15: The result when running SVM with balanced Credit Card dataset using SMOTE
Table 4.17: The comparison "F1-Score" between algorithms in the Credit Card dataset
Table 4.18: Dataset Description
Table 4.19: Paysim Data Types
Table 4.20: Check missing value in Paysim dataset
Table 4.21: Quantity statistics by transaction type
Table 4.22: Class "isFraud" distribution of the Paysim Dataset
Table 4.23: Statistics of the number of transactions on the isFlaggedFraud feature
Table 4.24: The Result when using K Nearest Neighbors in original Paysim dataset
Table 4.25: The Result when using LR in original Paysim dataset
Table 4.26: The Result when using SVM in original Paysim dataset
Table 4.27: The result when running KNN with balanced Paysim dataset using Random Undersampling
Table 4.28: The result when running LR with balanced Paysim dataset using Random Undersampling
Table 4.29: The result when running SVM with balanced Paysim dataset using Random Undersampling
Table 4.30: Number of transactions in Paysim dataset before and after using SMOTE
Table 4.34: The comparison "F1-score" between algorithms in the Paysim dataset
LIST OF ABBREVIATIONS

4. Support Vector Machine: SVM
5. Synthetic Minority Oversampling Technique: SMOTE
The financial industry has always dealt with fraud-related problems, such as missing and damaged transactions. In the United States, there are over 270,000 reports, which makes credit card fraud the most common type of identity theft. The number of frauds nearly doubled from 2017 to 2019 [1].
[Figure omitted: bar chart of identity theft report counts by category - credit card fraud, other identity theft, loan or lease fraud, phone or utilities fraud, bank fraud, employment or tax-related fraud, government documents or benefits fraud]
Figure 2: Most common types of identity theft²
Nowadays, criminals use more and more sophisticated techniques to steal money from user accounts. As a result, detecting fraudulent transactions is becoming more difficult, because many illegal transactions look like normal ones. In addition, the number of fraudulent transactions is higher than in the last few years.
² Source: https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics
Trang 16@ Credit card fraud reports
Figure 3: Credit card fraud reports by year?
One of the most common challenges when dealing with fraudulent transactions is the skewed distribution of classes. The proportion of the fraud class is usually many times smaller than that of the non-fraud class. Although the classifiers should be inclined towards the minority group (fraud), they tend to focus on the majority group because of its regular appearance. To deal with this, in this research we use Undersampling, Oversampling and One-class Classification techniques.

Another problem that arises with an imbalanced dataset is how to choose the performance measures used to evaluate the models. In this research, we choose the F1-score, the confusion matrix, precision and recall to evaluate the models.
³ Source: https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics
Chapter 1 INTRODUCTION
1.1 Problems
For many decades, fraud in financial transactions has caused a lot of serious damage to the economy and to the development of many businesses around the world. Therefore, enterprises always spend a large part of their resources on detecting fraud in transactions.

Nowadays, online transactions have gradually become more popular; especially during the current COVID-19 epidemic, more than 50% of global transactions use digital payment. Consequently, the way criminals steal money from transactions has shifted to hacking the accounts of their victims. In addition, as our lives rely on increasingly modern technology, frauds become more complex and sophisticated, and fraud detection becomes more difficult. Criminal behavior such as stealing money from transactions happens regularly and recklessly. Criminals take advantage of any defects in a system that they can find with advanced technologies. Before the financial enterprises recognize it, millions of dollars have already been stolen. This is a hard problem that many businesses have to face.

When a fraudulent transaction appears, it is hard to decide whether that transaction is fraud or not, because it looks like a legal one. Furthermore, businesses have difficulty in detecting fraud due to the lack of related documents and public datasets. Banks and financial enterprises have to keep their data secret because of privacy concerns. This puts financial businesses in the position of having to find an effective way to detect fraud.
Preventing and removing fraud in financial transactions is a critical problem for every business. There are many methods for dealing with this, but one of the most effective ways is applying machine learning algorithms. Normally, the datasets of transactions are so large that machine learning algorithms have to be applied to automatically analyze the behavior of users, forecast fraud in the future, and find better methods to deal with fraudulent transactions.

Due to the importance of solving the fraud detection problem in the real world, we decided to study how to apply machine learning algorithms to classification problems such as finding fraudulent transactions. Then, to demonstrate our study, we built a simple tool which takes a sample dataset as the input and produces a description of the fraud results as the output.
1.2 Aims and Objectives
There are two things that we aim to gain from this research. The first is understanding the algorithms used for financial transactions. The second is being able to apply the algorithms we research to a number of sample datasets.

The purpose of this study is to detect fraud in financial transactions using machine learning algorithms and to apply them to the sample datasets.

The datasets we use for this research are the Credit Card Fraud Detection and Synthetic Financial Datasets for Fraud Detection datasets from Kaggle.

In this research, we use two popular techniques, Undersampling and Oversampling, to solve the imbalanced dataset problem. In addition, we conduct experiments by applying three algorithms to build models: K Nearest Neighbor, Logistic Regression and Support Vector Machine.
1.3 Languages, Tools and Libraries
• Pandas: an open-source library for analyzing data.
• NumPy: a library for working with arrays in Python.
• Matplotlib: used for plotting (usually combined with NumPy for analyzing the data).
• Sklearn (scikit-learn): a strong library for machine learning in Python. It provides many useful tools for machine learning and statistical modeling.
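As a minimal, purely illustrative sketch, the snippet below simply imports these libraries and prints their versions so that the experiments can be reproduced with the same toolchain; it assumes the packages are already installed.

```python
# Illustrative check of the toolchain used in this thesis.
import matplotlib
import numpy as np
import pandas as pd
import sklearn

for name, module in [("pandas", pd), ("numpy", np),
                     ("matplotlib", matplotlib), ("scikit-learn", sklearn)]:
    print(name, module.__version__)
```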
Chapter 2 BACKGROUND
2.1 Fraudulent transaction definition
According to creditcard.com, a "fraudulent transaction is one unauthorized by the credit card holder. Such transactions are categorized as lost, stolen, not received, issued on a fraudulent application, counterfeit, fraudulent processing of transactions, account takeover or other fraudulent conditions as defined by the card company or the member company."
2.2 Fraud Detection Approach
Information from the financial dataset, such as the time and the amount of a transaction, is used by researchers to determine whether a suspicious transaction is fraudulent or not, or to identify outliers in the data.

Researchers have applied many different classification techniques for detecting financial fraud and predicting business failures: neural networks, support vector machines, k nearest neighbors, logistic regression, decision trees and so on. The results show that logistic regression and support vector machines outperform the other algorithms.
• Data-driven Fraud Detection Approach

The year 2020 is the era of data. Many types of data are collected from everywhere in our lives: digital technology, transactions, social media and so on. According to IDC, the volume of data collected in the world will accumulate to 175 Zettabytes by 2025 [2]. Data is characterized by volume, velocity and variety, which leads us to the concept of "Big Data" and data-driven applications.

The explosion of data, especially in the financial field, opens new opportunities and challenges for fraud detection in financial transactions, approaching the problem with the support of Big Data tools.

Analyzing fraud with data-driven methods is a highly promising approach for the three following reasons [3]:
- Precision: Normally, fraud cases are checked using a human-driven approach, but this approach has many potential risks. Humans cannot inspect all of the data day by day under the best conditions, which leads to unexpected results and lower precision. The quality of data is very important because it decides the accuracy of a model. Data-driven fraud analytics aims towards a system with higher precision when working with the inspected fraud cases.
- Operational Efficiency: A data-driven system is built from a combination of various domains, including machine learning, deep learning, mathematics and statistics. Thanks to this, fraud detection can be done more effectively and faster when combined with real-time techniques.
- Cost Efficiency: An effective fraud detection system that is developed and maintained by good experts is really challenging and costly. By automating the exposure of fraudulent data with data-driven methodologies, the cost can be reduced significantly.
2.3 Techniques
• Classification in fraud detection

We use machine learning to forecast which transactions are fraudulent based on data from the past. The input includes some information about the transaction, such as the amount of money, the type of transaction and so on; the output of this problem is whether that transaction is fraud or not, as a binary result: 0 for not fraud and 1 for fraud.
2.4 Related Problems
2.4.1 Imbalanced dataset problems
A dataset is considered imbalanced when the distributions of the different classes are unequal. In this case, our imbalanced dataset has two classes, fraud and non-fraud, where the non-fraud class is the majority and occurs much more frequently than the fraud class (the minority).

An imbalanced dataset is one of the biggest challenges in fraud detection problems, because most machine learning algorithms normally just focus on the occurrences of the majority class.

The problem with imbalanced datasets is that naive classifiers that always predict the majority class achieve a very high accuracy, so more complex algorithms can obtain a lower accuracy score than the naive classifiers. Most of the more complex algorithms will require modification to prevent predictions based on the majority class in all cases. Seriously imbalanced data makes the accuracy of a model misleading. In this case, there are several methods to expose such naive behavior, such as using the Confusion Matrix, ROC, and AUROC.
• Accuracy: In cases where the accuracy is very high, this is mostly the accuracy on the majority class; the accuracy on the minority class cannot be determined from it.
• Precision and Recall:
- High Precision + High Recall: the class is perfectly handled by the model.
- High Precision + Low Recall: the model cannot detect the class well, but when it does detect it, the prediction is highly trustworthy.
- Low Precision + High Recall: the class is detected well, but the model also includes points of the other class in it.
- Low Precision + Low Recall: the class is poorly handled by the model.
Similarly, with such a naive classifier the F1-score can only be calculated for the majority class, not for the minority class.
In this case, the confusion matrix helps us look at the model again, think about the goal of using the model, and eliminate useless algorithms as well as naive classifiers, because the confusion matrix basically shows, for each class, how many data points actually belong to it and how many are predicted to belong to it.
The ROC curve is described by the set of points created when a given classification threshold changes from 1 to 0. The curve starts at (0, 0) and ends at (1, 1). A good model has a curve that increases rapidly from 0 to 1. Based on the ROC curve, we can compute the AUROC, the area under the curve. An AUROC close to 1 is best, while a value close to 0.5 is worst.
It depends on the purpose of use: if the goal is simply to achieve the best accuracy, then the naive classifier will always give the best results, because it always answers the majority class. Our goal, however, is to find an effective model, which requires a variety of evaluation methods and careful conclusions.
Descriptions of the evaluation methods are detailed in Chapter 3, section 3.4 Model Evaluation.
Random resampling is a technique that creates a new sample of the dataset randomly. This is a simple way to get a more balanced dataset and to address imbalanced data problems.

There are two common approaches to random resampling for dealing with skewed data problems: undersampling and oversampling.

Importantly, the change of distribution when applying resampling is only used for the training dataset. This technique is not applied to the test dataset, which is used to evaluate the performance of the model.
2.4.2 Methods to solve the imbalanced problems
• Undersampling

Undersampling techniques remove examples of the majority class from the training dataset. The final result is a better balanced dataset. This process can be repeated many times until we get the expected class distribution.

Undersampling methods are usually applied together with oversampling in order to get better performance than using only undersampling or only oversampling on the training dataset.

One of the simplest and most effective ways to apply undersampling is to choose examples from the majority class randomly and delete them from the training dataset. This technique is called random undersampling [4].

Despite its great effects, this technique also has a limitation: it can discard potentially useful data from the training dataset, which leads to decreased classification performance.
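As a minimal sketch of how random undersampling could be applied to a training split, assuming the imbalanced-learn package is available (the arrays X_train and y_train below are toy placeholders, not the thesis data):

```python
# Sketch only: random undersampling of the majority class on the training split.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.RandomState(42)
X_train = rng.normal(size=(1000, 5))            # toy features
y_train = (rng.rand(1000) < 0.02).astype(int)   # roughly 2% "fraud" labels

sampler = RandomUnderSampler(random_state=42)
X_res, y_res = sampler.fit_resample(X_train, y_train)

print("before:", np.bincount(y_train))          # heavily imbalanced
print("after: ", np.bincount(y_res))            # classes roughly equal
```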
• Synthetic Minority Oversampling Technique (SMOTE)

In contrast to undersampling techniques, oversampling is a method that duplicates existing examples of the minority class and inserts them into the training dataset. One such method is the Synthetic Minority Oversampling Technique (SMOTE). This technique was described by Nitesh Chawla and others in their paper published in 2002, "SMOTE: Synthetic Minority Over-sampling Technique".

This oversampling approach works by creating "synthetic" examples instead of duplicating existing examples by sampling with replacement [5].
2.4.2.1 SMOTE step-by-step:
Step 1: Take the minority class sample set A. For each instance x ∈ A, compute the Euclidean distances between x and the other samples in A and find its k nearest neighbors.

Step 2: For each x ∈ A, randomly choose N examples from its k nearest neighbors (i.e. x1, x2, x3, ..., xN); this set is called B.
Step 3: For each example xk ∈ B (k = 1, 2, 3, ..., N), create a new synthetic example using the following formula:

x_new = x + rand(0, 1) × (xk - x)

where rand(0, 1) is a random number between 0 and 1.
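The short NumPy sketch below illustrates these three steps; it is written only for this description (the function name smote_sample, its parameters and the toy data are our own choices, not part of any library):

```python
import numpy as np

def smote_sample(A, N, k=5, seed=0):
    """Illustrative SMOTE: A holds the minority-class examples, one per row."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in A:
        # Step 1: Euclidean distances from x to every sample in A, keep k nearest
        d = np.linalg.norm(A - x, axis=1)
        neighbours = A[np.argsort(d)[1:k + 1]]            # skip x itself
        # Step 2: randomly pick N of those neighbours
        picks = neighbours[rng.integers(0, len(neighbours), size=N)]
        # Step 3: interpolate between x and each chosen neighbour
        for xk in picks:
            lam = rng.random()                            # rand(0, 1)
            synthetic.append(x + lam * (xk - x))
    return np.asarray(synthetic)

# Toy usage: 20 minority points in two dimensions, 2 synthetic points per original.
A = np.random.default_rng(1).normal(size=(20, 2))
print(smote_sample(A, N=2, k=5).shape)                    # (40, 2)
```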
Another technique that can be applied to the imbalanced dataset problem is one-class classification.
One-class classification can be used for binary imbalanced classification problems, where negative cases (the target class, 0) are considered normal and positive cases (the outlier class, 1) are considered exceptional, i.e. outliers.

With this technique, instead of focusing on the majority class like other machine learning algorithms, the model treats the positive cases as outliers. Therefore, one-class classification skips the discrimination step and focuses on the expected, negative class.
Chapter 3 MACHINE LEARNING FOR FRAUD DETECTION
There are several machine learning methods that are used for the fraud detection problem. Machine learning algorithms can process millions of data objects quickly, and link instances from seemingly unrelated datasets to detect suspicious transactions.

3.1 K Nearest Neighbor
K Nearest Neighbor (KNN) is a simple supervised algorithm in machine learning. It can be used for solving both classification and regression problems. In this research, KNN is used to solve a classification problem, namely fraud detection in financial transactions.

KNN is a non-parametric, lazy learning algorithm. That means the model structure depends on the dataset itself rather than on any specific mathematical assumption, and all the training data is used at prediction time. This is very helpful because most real-world problems do not follow simple mathematical and analytical assumptions. However, lazy learning makes the model spend more memory and time, because the burden is shifted to prediction time, when all the data points must be scanned [7].

Definition: "The K Nearest Neighbor (KNN) algorithm supposes that similar things exist close to each other."

In other words, KNN calculates the distances between the query example and the existing examples in the dataset, and the query example is assigned to the group with the shortest distances. K is the number of neighbors used; it is usually chosen as an odd number when the number of classes is even.
3.1.1 Applying K Nearest Neighbor (KNN)
[Figure omitted: kNN infographic with four panels - 0. Look at the data; 1. Calculate distances; 2. Find neighbours; 3. Vote on labels]
Figure 3.1: Applying the K Nearest Neighbor (KNN) algorithm step-by-step [8]
The following steps represent how to apply the K Nearest Neighbor algorithm (figure 3.1):
1. Load the data.
2. Initialize k to the chosen number of neighbors.
3. For each example in the dataset:
3.1. Calculate the distance between the query example and the current example from the dataset.
[Figure omitted: formulas for common distance metrics, including the Minkowski distance]
Figure 3.2: Common distance metrics [9]
There are several ways to calculate the distance (figure 3.2), but the most common one is the Euclidean distance. In this research we use the Euclidean distance.
3.2. Add the distance and the index of the example to an ordered collection.
4. Sort the collection by distance, from smallest to largest.
5. Pick the first k entries from the sorted collection.
6. Get the labels of the selected k entries.
7. Return the mode of the k labels (classification).
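A minimal NumPy sketch of these steps, written only for illustration (the function knn_predict and the toy data below are our own, not taken from a library):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=5):
    """Illustrative KNN classification following steps 1-7 above."""
    # Step 3: Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - query, axis=1)
    # Steps 4-5: sort by distance and keep the indices of the k nearest examples
    nearest = np.argsort(distances)[:k]
    # Steps 6-7: collect their labels and return the most common one (the mode)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with random data; class 1 plays the role of "fraud".
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (rng.random(200) < 0.1).astype(int)
print(knn_predict(X_train, y_train, query=np.zeros(3), k=5))
```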
K Nearest Neighbor is one of the best classification algorithms for detecting fraudulent transactions. However, to get an effective result we need to pay attention to how to choose the optimal number of neighbors (k). There is no single best k for all datasets; the parameter k depends on the requirements of each dataset. One way of solving this problem is to run the algorithm several times with different values of k until we get the expected result.
• Decreasing the value of k towards 1 makes the predictions less stable.
• Increasing the value of k makes the predictions more stable, but beyond some point the number of errors will increase.
• Choose k as an odd number when taking a majority vote.
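As a sketch of this search for k, the loop below evaluates several candidate values with cross-validated F1-scores; the data here is a random placeholder, and in practice the training split of the real dataset would be used:

```python
# Sketch: trying several values of k and comparing cross-validated F1-scores.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))              # placeholder features
y_train = (rng.random(500) < 0.3).astype(int)    # placeholder labels

for k in [1, 3, 5, 7, 9]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_train, y_train, cv=5, scoring="f1").mean()
    print(f"k={k}: mean F1 = {score:.3f}")
```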
3.1.2 Advantages and disadvantages
Advantages:
• Simple and easy to implement.
• No need to build a model.
• Flexible: the algorithm can be used for classification, regression and even search.
Disadvantages:
• The algorithm becomes slower as the volume of data increases.
3.2 Logistic Regression
Logistic Regression (LR) is one of the most popular and most widely used machine learning algorithms for classification. It is a general statistical model originally developed and popularized by Joseph Berkson, starting with Berkson (1944), where he coined the term "logit" [10]. Models trained using logistic regression can be used to describe relationships between data variables, whether binary, continuous, or categorical. Predictions from logistic regression can be used to forecast whether certain things will happen or not. With the help of this model, we can estimate the probability that a variable belongs to a given class.
3.2.1 Logistic Regression model
Logistic regression is a classification algorithm used to predict a binary value from a certain set of independent variables (1/0, Yes/No, True/False).

The binary dependent variable has values 0 and 1, and the predicted value is a probability. Logistic regression uses a "logistic curve" [11] to represent the relationship between the independent variable and the dependent variable, and the relationship is bounded by 0 and 1. If the independent variable increases, the predicted value increases along the curve and approaches 1 but never equals 1. Likewise, at the lowest level, the probability approaches 0 but never equals 0 (figure 3.3).
[Figure omitted: plot of the logistic curve exp(x)/(1 + exp(x))]
Figure 3.3: Graph of the Logistic curve where α=0 and β=1 [11]
The formula for the univariate logistic curve is:
p = e^(c0 + c1·x1) / (1 + e^(c0 + c1·x1))    (3.1)
To explain logistic regression, we start with the logistic function. The logistic function is a sigmoid function: for any real input t, it outputs a value between 0 and 1 (figure 3.4).
[Figure omitted: plot of the sigmoid function]
Figure 3.4: Graph of the Sigmoid function [12]
Logistic Regression can be understood simply as:

y = 1 if β0 + β1x + ε > 0, and y = 0 otherwise    (3.2)

where β denotes the parameters of the best fit and ε is an error term distributed according to the standard logistic distribution.

The logistic function (σ: R → (0, 1)) [11]:

σ(t) = e^t / (1 + e^t) = 1 / (1 + e^(-t))    (3.3)
where σ(t) is the predicted output, β0 is the bias or intercept term, and β1 is the coefficient for the single input value x. Let t be a linear function of x:

t = β0 + β1x    (3.4)

The general logistic function (p: R → (0, 1)) is then:

p(x) = σ(t) = 1 / (1 + e^(-(β0 + β1x)))    (3.5)

where p(x) is the probability of the dependent variable.
3.2.2 Apply Logistic Regression step-by-step
1. Visualize the data.
2. Build the input dataset and divide it into trainData and testData.
3. Set up the model.
4. Find the parameters.
5. Predict the new data with the model just found.
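A minimal scikit-learn sketch of these steps on randomly generated placeholder data (not the thesis datasets):

```python
# Sketch of the logistic regression steps on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

# Step 2: build the input dataset, divided into trainData and testData
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Steps 3-4: set up the model and fit its parameters (the coefficients)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: predict the new data with the model just found
print(classification_report(y_test, model.predict(X_test)))
```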
3.2.3 Advantages and disadvantages
Advantages:
• There is no need to assume the two data classes are linearly separable.
• Simple to implement.
• Low variance.
• Provides probabilities for outcomes.
• High reliance on proper presentation of data.
Disadvantages:
• High bias.
• Logistic Regression requires data points to be independent of each other.
• Not a very powerful algorithm; it can easily be outperformed by other algorithms.
• Poor performance on non-linear data.
3.3 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm. Developed at AT&T Bell Laboratories by Vapnik and colleagues, SVMs close to their current form were first introduced in a paper at the COLT 1992 conference (Boser, Guyon and Vapnik, 1992) [13, 14].

SVM can be used for both classification and regression. However, it is mainly used for classification problems.
3.3.1 Advantages and Disadvantages
Advantages
• Handles high-dimensional data well.
• Works with both linear and non-linear boundaries, depending on the kernel used.
• Memory efficient.
Disadvantages
• Susceptible to over-fitting, depending on the kernel.
• Different kernels have different uses (no clear winner).
3.3.2 SVM model
Model: Focus on two parallel hyperplanes:

w^T x + b = 1
w^T x + b = -1    (3.6)

We separate the two classes of data by preventing samples from falling into the margin:

w^T x_i + b ≥ 1   if y_i = 1     (3.7)
w^T x_i + b ≤ -1  if y_i = -1    (3.8)

The aim of SVM is simply to find an optimal hyperplane that separates the two classes of data points with the widest margin.
SVM uses kernel methods, which are a class of algorithms for pattern analysis or recognition.

The kernel function is given as follows:

(x_i, x_j) → k(x_i, x_j)    (3.9)
• Linear kernel function: the basic kernel function.

K(x_i, x_j) = x_i^T x_j    (3.10)
• Polynomial kernel function: a non-stationary kernel function.
• RBF kernel function: often the most accurate kernel in SVM-based applications. The RBF kernel internally uses γ as the Gaussian width, which controls the width of the RBF kernel function [15].

k(x_i, x_j) = exp(-γ ||x_i - x_j||^2)    (3.12)

If the Gaussian width is underestimated, the decision boundary becomes highly sensitive to noise in the training data. On the other hand, if it is overestimated, the exponential behaves almost linearly and the high-dimensional projection starts losing its non-linear power.
Table 3.1 shows the commonly used kernel functions.

Table 3.1: Commonly used Kernel functions
Kernel name | Expression K(x, y) | Comments
Linear      | x^T y              | No parameter
Applying SVM to a dataset involves the following steps:
1. Read the sample dataset.
2. Split the data into training and testing subsets.
3. Select one of three kernels (Linear, Quadratic, or RBF).
4. Predict the class.
5. Evaluate the prediction.
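A minimal scikit-learn sketch of these steps on toy data; the data, kernel and parameter choices here are placeholders rather than the thesis setup:

```python
# Sketch of the SVM steps on randomly generated placeholder data.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 4))
y = (np.sum(X[:, :2] ** 2, axis=1) > 2.0).astype(int)   # non-linear toy labels

# Split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Select a kernel: "linear", "poly" (quadratic with degree=2) or "rbf"
model = SVC(kernel="rbf", gamma="scale", C=1.0)
model.fit(X_train, y_train)

# Predict the class and evaluate the prediction
print("F1-score:", f1_score(y_test, model.predict(X_test)))
```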
3.4 Model evaluation

Each model is built for a particular purpose. Therefore, we use a number of model evaluation methods to evaluate whether a model is effective and to compare the models' capabilities.
Several commonly used evaluation methods are the accuracy score, the confusion matrix, True/False Positives/Negatives, Precision and Recall.
3.4.1 Confusion matrix
The confusion matrix is considered the easiest method to evaluate performance; with the help of a confusion matrix we can also visualize how many data instances are classified correctly. A short description of the confusion matrix is shown in Figure 3.5.
[Figure omitted: 2x2 matrix of Predicted Values (Positive/Negative) against Actual Values (Positive/Negative), with cells TP, FP, FN, TN]
Figure 3.5: Confusion Matrix
• TP = True Positive; FP = False Positive
• FN = False Negative; TN = True Negative
The confusion matrix represents:
• True Positive values, which means the actual class of the data matches the predicted class of the data (both are 1).
• False Positive represents that the actual class of the data was 0 but the model predicted it to be 1.
• False Negative represents that the actual class of the data was 1 but the model predicted it to be 0.
• True Negative represents that the actual class of the data was 0 and the model also predicted it to be 0.
3.4.2 Precision and Recall
For classification problems where the class sizes are very different from each other and accuracy can be misleading, an effective and commonly used measure is Precision-Recall, which gives a proper evaluation of the model. Figure 3.6 illustrates Precision and Recall.

Figure 3.6: Precision and Recall⁴
⁴ Source: https://en.wikipedia.org/wiki/Precision_and_recall
3.4.2.1 Precision
Precision demonstrates the ability of the model to correctly predict the X label.
We can calculate precision if TP, TN, FN and FP are available. Precision is the ratio of correct positive observations (TP) to all predicted positive observations (TP + FP).
The formula for precision is as follows:
Precision = True Positives (TP) / (True Positives (TP) + False Positives (FP))    (3.13)
In the formula, the element that causes Precision to increase or decrease is not the TP but the FP. Therefore, when Precision is high, the FP is small, i.e. the number of samples mistakenly predicted as label X is low.
High precision means the accuracy of the points found is high.
3.4.2.2 Recall
Recall is another metric for evaluating predictions. Recall is defined as the ratio of predicted True Positives over the total number of points that are actually positive.
The formula for the recall is as follows:
Recall = True Positives (TP) / (True Positives (TP) + False Negatives (FN))    (3.14)
Recall demonstrates the ability of the model not to miss the X label. Analogously to Precision, Recall depends on the FN, in other words, on how often the model incorrectly rejects the X label.

A high recall means a high True Positive Rate, i.e. the ratio of actual positive points that are missed is low.
3.4.3 F-1 Score
With only Precision or only Recall, the quality of the model cannot be fully evaluated; even in the case of Precision = 1 or Recall = 1 we cannot say the model is good. Then the F1-score is used. F1 is the harmonic mean of precision and recall.

The F1-score can be calculated when the precision and recall values are available.
The formula for the F1-score is as follows:

F1 = ((Recall^(-1) + Precision^(-1)) / 2)^(-1) = 2 · (Recall · Precision) / (Recall + Precision)    (3.15)
However, in some cases experts find that the importance of Precision and Recall is different, so it is necessary to assign weights accordingly. The F1-score then generalizes to the weighted Fβ-score:

Fβ = (1 + β^2) · (Precision · Recall) / (β^2 · Precision + Recall)    (3.16)

where β > 1 gives more weight to Recall and β < 1 gives more weight to Precision.
3.4.4 Receiver Operating Characteristic curve
The Receiver Operating Characteristic (ROC) curve is a graph that illustrates the performance of a binary classification system as the classification threshold changes. The curve is generated by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings. The ROC curve represents the relationship between sensitivity and fall-out (figure 3.7).
Here, the True Positive Rate (TPR) is the Recall, and the False Positive Rate (FPR) is the rate of false alarms.
True Positive Rate (TPR) = True Positives (TP) / (True Positives (TP) + False Negatives (FN))    (3.17)

False Positive Rate (FPR) = False Positives (FP) / (False Positives (FP) + True Negatives (TN))    (3.18)
Based on the ROC curve, we can tell whether a model is effective or not. An effective model has a low FPR and a high TPR, meaning that there exists a point on the ROC curve that is close to the point with coordinates (0, 1) on the graph (the upper left corner). The closer the curve gets to that corner, the more effective the model.
Drawing the ROC curve leads to the calculation of the Area Under the Curve (AUC). The AUC ranges from 0 to 1; the higher the value, the better the model.
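A minimal sketch of computing these evaluation measures with scikit-learn; the label and score arrays below are hypothetical examples, not results from the thesis experiments:

```python
# Sketch: evaluation measures of sections 3.4.1-3.4.4 on hypothetical predictions.
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])    # actual classes (1 = fraud)
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])    # predicted classes
y_score = np.array([0.1, 0.2, 0.7, 0.3, 0.9, 0.4, 0.2, 0.8, 0.1, 0.3])  # scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))   # area under the ROC curve
```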
Chapter 4 EXPERIMENTS
In our research, we decided to use two datasets provided by Kaggle: the synthetic dataset generated by an emulator called PaySim and the Credit Card Fraud Detection dataset. In order to build models for detecting fraud, we applied three algorithms: K Nearest Neighbor, Logistic Regression and Support Vector Machine.
4.1 Data Analytics Pipeline
[Figure omitted: data analytics pipeline diagram with stages including data preparation, modelling and discussion]
Figure 4.1: Data analytics Pipeline
In our problem, the two datasets are independent. Therefore, to make the explanation simpler and easier to understand, we perform this data analytics pipeline on each dataset in turn.
4.2 Credit Card Dataset
4.2.1 Describing Data
4.2.1.1 Descriptions
The dataset we use was collected and analyzed in a research collaboration between Worldline and the Machine Learning Group of the Université Libre de Bruxelles on big data mining and fraud detection. The dataset contains credit card transactions made by European cardholders over two days in September 2013.
4.2.1.2 Data Exploration
The dataset's description, types and columns are shown in Table 4.1.

Table 4.1: Credit Card Fraud Detection Dataset description
Features | Type    | Description
Time     | float64 | Number of seconds between this transaction and the first transaction in the dataset