FACULTY OF INFORMATION SYSTEMS
FINAL PROJECT REPORT
MACHINE LEARNING IN BUSINESS ANALYTICS
TOPIC: CREDIT CARD FRAUD DETECTION
Lecturer:
1 Ho Trung Thanh, Ph.D.
Group 8:
1 Le Quang Thanh Tai
2 Nguyen Nghia
3 Nguyen Thi Thuy Hien
4 Nguyen Thi Yen Nhi
Ho Chi Minh City, Month, 2023
NO | FULL NAME | STUDENT ID | POINT / 10 (INDIVIDUAL CONTRIBUTION) | SIGNATURE
Acknowledgements
This research was supported by lecturer Ho Trung Thanh, who generously provided our team with knowledge and useful techniques for the course "Machine Learning in Business Analytics" and guided us throughout the course to complete the study. Any mistakes are our own, for not accomplishing what he expected, and we sincerely express our gratitude to him.
We would also like to thank the team members, who were enthusiastic and active, worked hard, gave late-night comments, and supported each other mentally throughout the course.
Group 8
Commitment
This study was carried out by the members of Group 8 under the guidance of Lecturer Ho Trung Thanh. If any evidence of fraud is found in this research paper, our team will take full responsibility and accept penalties at all levels. In addition, the paper references a few articles on related topics.
Ho Chi Minh City, March 21, 2023
Group 8
Members of Group 8
Acknowledgements
Commitment
Table of Content
List of Tables
List of Figures
List of Acronyms
GANTT CHART
ABSTRACT
Project overview and business issues understanding
Introduction to project
Objectives
Objects
Business challenges
Business values
Methods/Models
Tools and Programming language
Structure of project
Proposed research model
Chapter 1 – Data Understanding and Collection
1.2 Data understanding
1.2.1 Target column
1.2.2 Correlation
1.2.3 Missing data
Chapter 2 - Data preparation
Overview of Chapter 2
2.1 EDA
2.1.1 Descriptive Statistics
2.1.2 Visualization
2.2 Data preprocessing
Chapter 3 – Modelling
Chapter 4 - Visualizing results and sharing the findings – Recommendation making
4.1 Visualizing results
4.2 Business solution
List of Tables
Table 4.1 Experimental results without SMOTE
Table 4.2 Experimental results with SMOTE
List of Figures
Figure 0.1 Research model
Figure 1.1 Positive Correlations
Figure 1.2 Negative Correlations
Figure 1.3 Missing data
Figure 2.1 Descriptive Statistics
Figure 2.2 Visualization
List of Acronyms
DB Digital Business
MIS Management Information System
SMOTE Synthetic Minority Over-sampling Technique
ABSTRACT
Financial fraud is a persistent issue in the financial industry with severe consequences. Data mining has become essential in detecting credit card fraud in online transactions. However, credit card fraud detection is challenging due to the constantly evolving profiles of normal and fraudulent behaviors, as well as highly skewed data sets. The effectiveness of fraud detection in credit card transactions depends on various factors such as the sampling approach used on the data set, the selection of variables, and the detection techniques employed. This paper investigates the performance of Random Forest, Logistic Regression, Decision Tree, and Support Vector Machine on highly skewed credit card fraud data. The performance of the models is evaluated using the confusion matrix, and the results are presented through visualization and analysis. Our findings indicate that our approach can yield satisfactory results in detecting credit card fraud with a reduced computational cost, providing valuable insights for further research and practical implementation in the financial industry.
Introduction to project
Financial fraud is a growing threat with far-reaching consequences in the financial industry. Data mining has been critical in detecting credit card fraud in online transactions (Awoyemi et al., 2017). Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card or other similar payment mechanism as a fraudulent source of funds in a transaction, and it is a growing problem in the credit card industry. Detecting credit card fraud is difficult with standard processes, so the development of credit card fraud detection models is now important to both academic and business organizations.
Credit card fraud occurs when a fraudster uses a credit card for their own purposes while the owner of the credit card is unaware. In 2016, fraudulent transactions involving credit cards acquired worldwide totaled €1.8 billion (Patidar & Sharma, 2011). Fraud costs businesses and institutions billions of dollars each year, and fraudsters are constantly looking for new ways to commit illegal acts. The good news is that fraud tends to follow certain patterns, and that such patterns, and thus fraud, can be detected.
According to the European Central Bank (European Central Bank, 2014), the total level of fraud in the Single Euro Payments Area reached 1.33 billion euros in 2012, representing a 14.8% increase over 2011. Furthermore, payments made through non-
on Kaggle with 307511 records.
Objectives
The business objective of this project on machine learning in business analytics is to use the insights gained from analyzing our dataset to improve the efficiency and effectiveness of a credit fraud detection system. By building a predictive model that can identify individuals who are likely to commit credit fraud, businesses can reduce the number of false positives and false negatives in their fraud detection system. This can save them time and money by minimizing the number of unnecessary investigations while also enabling them to focus their efforts on high-risk cases. Additionally, by detecting credit fraud more accurately and efficiently, businesses can protect their reputation and maintain the trust of their customers. Overall, the objective of this project is to help businesses make better-informed decisions, improve their risk management capabilities, and ultimately increase their profitability and competitiveness.
Data Collection and Preprocessing: Collecting the relevant data and preprocessing it by cleaning, transforming, and normalizing it to make it suitable for analysis.
Exploratory Data Analysis (EDA): Conducting EDA to understand the data, find patterns, identify outliers, and gain insights into the data.
Feature Engineering: Selecting the relevant features and engineering them to create new features that can improve the accuracy of the predictive model.
Model Selection: Selecting a suitable machine learning algorithm to train the predictive model.
Model Training: Training the predictive model using the preprocessed data and selected algorithm.
Model Evaluation: Evaluating the predictive model using suitable metrics such as accuracy, precision, recall, and F1 score.
Hyperparameter Tuning: Tuning the hyperparameters of the model to improve its performance.
monitoring its performance.
Model Maintenance: Monitoring the model's performance and making necessary updates and improvements to maintain its accuracy over time.
Objects
Classifying customer fraud in credit cards based on the customer information
Scopes: Space scope: Sample dataset from Kaggle
Business challenges
Due to the volume, adaptability, and individuality of each fraud and the requirement for real-time or nearly real-time assessments, credit card fraud detection is quite challenging (requiring automated identification, classification, and annotation). According to many researchers, the challenge is tough, if not impossible, due to constraints including the tremendously uneven, highly skewed nature of the data and the lack of true data sets due to sensitivity, secrecy, and privacy concerns.
Data imbalance: The amount of data is constrained due to privacy concerns. The few datasets that are readily available are also unbalanced: they contain far more honest than dishonest transactions. It is the minority class (frauds) that is of interest, and the scarcity of these fraud cases frequently hinders the classifier's capacity to train, leading to unsatisfactory models.
human experts may find it difficult to differentiate the issue.
Lack of appropriate evaluation metrics: Traditional accuracy scores are not informative on heavily imbalanced data. Unfortunately, there is also a lack of uniform evaluation criteria or metrics for assessing and contrasting the effectiveness and quality of fraud detection systems. Hence, it is hard to compare different detection models and methods, which highlights the absence of benchmarking as a further major problem. Researchers disagree about which models perform better and which metric is optimal for evaluation.
Business values
- Create a List of What Your Business Should Offer: Credit Card Fraud Detection
- Maintain Your Business and Its Relationships: Relying on anti-fraud technology, businesses maintain tight and highly secure operations. When partnering with many other parties, this ensures prestige and quality.
- Retain Your Best Employees and Customers: When the system is properly operated, employees will trust the accuracy of the business and work with greater loyalty and dedication. Moreover, customers are also more trusting when working with businesses that have long-term partners.
- Maintain Your Business's Financial Health: Detecting fraud cases of telecommunications charges to prevent loss of revenue for telecommunications service providers.
- Educate Your Employees and Customers: Increased training programs for employees and customers on being on the lookout for fraud cases. Applying this technology as the basis for providing optimal solutions with science and logic.
- Look for Opportunities to Expand: Through a successful credit card fraud prevention tool, we can use this mechanism for other areas such as anti-hacking, anti-technology theft, and anti-app fraud.
Methods/Models
Every day there are billions of transactions worldwide, and with such huge amounts of data it is impossible to classify them manually. In addition, classification accuracy is a major issue, as many factors affect whether a transaction is fraudulent. Therefore, a solution is needed to classify this large amount of data efficiently.
Many methods have been proposed, and machine learning is said to be one of the most effective for detecting credit fraud. Many machine learning and deep learning algorithms, such as ANN, Random Forest, etc., are applied in practice. However, these models are computationally expensive and perform better on large datasets. This approach can lead to great results, with some papers reporting up to 90% accuracy, but what if similar or even better results are achievable with fewer resources? Our main goal is to show that different machine learning algorithms can yield good results with proper preprocessing. The authors of most of the articles mentioned used undersampling techniques, and that was the impetus to use another approach: oversampling. With this in mind, we build basic machine learning models such as SVM, RF, DT, KNN… and suitable preprocessing methods to detect credit fraud. To that end, an experiment was conducted.
Tools and Programming language
To conduct the research, we use the Scikit-learn library available in the Python
Figure 0.1 Research model
The figure illustrates the proposed model for detecting credit card fraud with an imbalanced dataset. The process includes four steps: data preprocessing, model training, model evaluation, and visualization and analysis.
Firstly, in data preprocessing, the data is cleaned to optimize the performance of machine learning models for classification. Additionally, data augmentation is applied to increase the number of instances in minority classes by interpolating new instances between existing ones.
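The interpolation idea described above can be illustrated with a minimal sketch. This is a hand-written, SMOTE-style illustration using only the Python standard library, not the implementation used in the experiments; the function name `interpolate_minority`, the parameter `k`, and the toy points are assumptions made for this example.

```python
import random


def interpolate_minority(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by squared Euclidean distance.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points (hypothetical values, two features each).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = interpolate_minority(minority, n_new=3)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already occupied by the minority class, which is why this kind of oversampling adds variety without simply duplicating rows.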
contrast their effectiveness. If data augmentation is not included, the model is tuned using class weights.
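As a rough illustration of class weighting, the sketch below computes weights inversely proportional to class frequency, using the formula n_samples / (n_classes * class_count). This matches the "balanced" heuristic documented for scikit-learn's `class_weight` option, but whether the experiments used exactly this setting is an assumption; the helper name `balanced_class_weights` and the toy labels are also hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Toy labels with roughly the 92% / 8% split reported for the dataset.
labels = [0] * 92 + [1] * 8
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on a minority-class row costs the model more than an error on a majority-class row, which counteracts the imbalance without changing the data itself.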
In model evaluation, the confusion matrix is used to validate the performance of the models. If the results are unsatisfactory, the model training stage is revisited to develop an improved model.
Finally, the classification models' performance is presented through visualization and analyzed to compare their effectiveness across various approaches.
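The confusion-matrix evaluation mentioned above can be sketched as follows. This is a minimal standard-library illustration for binary labels where 1 marks the positive class; the function names and the toy predictions are assumptions for this example rather than outputs of the models in this report.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def precision_recall_f1(tn, fp, fn, tp):
    """Derive precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for ten rows.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tn, fp, fn, tp)
print(tn, fp, fn, tp, precision, recall, f1)
```

Recall is usually the figure to watch in this setting, since a false negative is a positive case that the model missed.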
Chapter 1 – Data Understanding and Collection
Understanding data is the process of discovering, analyzing, and interpreting data to gain a better understanding of its meaning, quality, and structure. This is an important step in the data mining process, which includes collecting, cleaning, and transforming data to make it useful for analysis. In this chapter, we determine parameters such as the quantity, attributes, and values of the data.
1.1 Dataset
The dataset is taken from Kaggle: Credit Card Fraud Detection.
Because of credit card fraud, the business has collected customer data that may be relevant to the fraud. From there, we will take appropriate measures to find the most influential attributes. The dataset consists of 122 columns: 120 features, one identifier column, and one target column (TARGET).
The dataset contains 307511 rows. Understanding and identifying the data types in the dataset is essential to building an effective machine learning application. Qualitative variables and quantitative variables are the two main categories of variables in statistics. In the dataset, there are 53 quantitative variables and 69 qualitative variables, and we need to encode the qualitative variables as quantitative ones.
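One common way to turn a qualitative variable into quantitative ones is one-hot encoding, sketched below with the standard library only. The report does not state which encoding was applied, so this choice is an assumption, and the column names `income` and `contract_type` are hypothetical placeholders rather than names taken from the dataset.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with one quantitative and one qualitative column.
rows = [
    {"income": 1000.0, "contract_type": "cash"},
    {"income": 2500.0, "contract_type": "revolving"},
    {"income": 1800.0, "contract_type": "cash"},
]
encoded = one_hot_encode(rows, "contract_type")
print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category, which matters when many of the qualitative variables have a large number of distinct values.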
1.2 Data understanding
1.2.1 Target column
In this column, the value 0 appears in 282686 rows, or 91.93% of the column, while the value 1 appears in only 24825 rows, or 8.07%. The data is therefore imbalanced: about 92% of loans are paid on time, and only about 8% are not. This imbalance needs to be handled.
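The class proportions above can be checked with a few lines of arithmetic, which also show why plain accuracy is misleading here. The counts are the ones reported in this section; the "always predict 0" baseline is a hypothetical classifier introduced only for illustration.

```python
n_negative = 282686  # rows with TARGET = 0, as reported above
n_positive = 24825   # rows with TARGET = 1, as reported above
total = n_negative + n_positive

share_negative = n_negative / total
share_positive = n_positive / total

# A hypothetical classifier that always predicts 0 is correct on every
# negative row and wrong on every positive row.
baseline_accuracy = n_negative / total
baseline_recall = 0 / n_positive  # it never finds a single positive case

print(f"total rows: {total}")
print(f"share of 0: {share_negative:.2%}, share of 1: {share_positive:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}, recall: {baseline_recall:.2%}")
```

A model that learns nothing already scores about 92% accuracy, so any reported accuracy has to be read against that baseline and alongside recall on the minority class.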
1.2.2 Correlation
Correlation analysis examines the strength and direction of the relationship between two variables using techniques such as Pearson correlation or Spearman correlation.
(10 positive and 10 negative). Based on those variables, we can form a more detailed measure for predicting the output of the model.
Figure 1.1 Positive Correlations
Figure 1.2 Negative Correlations
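For reference, the two correlation measures named in this section can be sketched with the standard library, followed by a ranking of features by their correlation with a binary target. This assumes the correlations in question are computed against the TARGET column; the feature names `feat_a` and `feat_b` and all values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations.
    Note: a constant column has zero variance and would divide by zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotone but non-linear relationship: Spearman is 1, Pearson is lower.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)

# Rank hypothetical features by their Pearson correlation with the target.
target = [0, 0, 0, 1, 1, 1]
features = {"feat_a": [1, 2, 3, 4, 5, 6], "feat_b": [6, 5, 4, 3, 2, 1]}
ranked = sorted(
    ((name, pearson(col, target)) for name, col in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(r_pearson, r_spearman, ranked)
```

Taking the first and last entries of such a ranking gives the most positively and most negatively correlated features.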
1.2.3 Missing data
Overview of Chapter 2
Chapter 2 outlines the data preprocessing process, including the preprocessing methods that our team applies and describes. Besides, there are contents related to EDA.
2.1 EDA
The overall objective of EDA is to develop a thorough understanding of the data, use this understanding to guide our modeling choices, and enhance the accuracy as well as reliability of our predictions. When it comes to machine learning, EDA can assist us in finding potential problems or biases in our data, choosing relevant features, and identifying trends or patterns that can guide our modeling decisions. This is crucial when dealing with complex or noisy data, as well as when dealing with vast amounts of data that may be difficult to comprehend without the use of visual aids or statistical analyses.
The first step of EDA is statistical analysis, which is the process of using mathematical models and techniques to describe and summarize data, and to identify any significant differences or relationships between variables.
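A minimal sketch of such a statistical summary, using only Python's `statistics` module, is given below. The helper name `describe` and the sample values are assumptions for illustration, and the quartiles use the module's default method, which can differ slightly from the defaults of other libraries.

```python
import statistics


def describe(values):
    """Summarize a numeric column: count, mean, spread, and quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical amounts with one extreme value.
amounts = [10.0, 12.0, 11.0, 13.0, 250.0]
summary = describe(amounts)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick way to spot skew: here the single extreme value pulls the mean far above the median, which is the kind of pattern descriptive statistics are meant to surface before modeling.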