UNIVERSITY OF ECONOMICS AND LAW
FACULTY OF INFORMATION SYSTEMS
3 Nguyen Thi Thuy Hien
4 Nguyen Thi Yen Nhi
Ho Chi Minh City, Month, 2023
Members of Group 8
No. | Student | Student ID | Points / 10 (contribution)
3 | Nguyen Thi Thuy Hien | K214061756 | 10/10
We would also like to thank the team members, who were enthusiastic and active in their work, gave comments late into the night, and supported each other throughout the course.
Group 8
Commitment
This study was carried out by the members of Group 8 under the guidance of Lecturer Ho Trung Thanh. If any evidence of fraud is found in this research paper, our team will take full responsibility and accept penalties at all levels. In addition, the paper refers to a few articles on related topics.
Ho Chi Minh City, March 21, 2023
Group 8
Table of Contents
Business challenges
Methods/Models
Tools and Programming language
Structure of project
Proposed research model
Chapter 1 - Data Understanding and Collection
1.1 Dataset
1.2 Data understanding
1.2.1 Target column
Chapter 2 - Data preparation
Overview of Chapter 2
2.1 EDA
2.1.1 Descriptive Statistics
2.1.2 Visualization
Chapter 3 - Modelling
Chapter 4 - Visualizing results and sharing the findings - Recommendation making
4.1 Visualizing results
4.2 Business Solution
Chapter 5 - Conclusion and Future Works
5.2 Contributions and Restrictions
5.3 Future
References
List of Tables
Table 4.1 Experimental results without SMOTE
Table 4.2 Experimental results with SMOTE
List of Figures
Figure 0.1 Research model
Figure 1.1 Positive Correlations
Figure 1.2 Negative Correlations
Figure 1.3 Missing data
Figure 2.1 Descriptive Statistics
Figure 2.2 Visualization
List of Acronyms
SMOTE | Synthetic Minority Over-sampling Technique
GANTT CHART
[Figure: Gantt chart for the Credit Card Fraud Detection project]
ABSTRACT
Financial fraud is a persistent issue in the financial industry with severe consequences. Data mining has become essential in detecting credit card fraud in online transactions. However, credit card fraud detection is challenging because the profiles of normal and fraudulent behavior evolve constantly and the data sets are highly skewed. The effectiveness of fraud detection in credit card transactions depends on factors such as the sampling approach applied to the data set, the selection of variables, and the detection techniques employed. This paper investigates the performance of Random Forest, Logistic Regression, Decision Tree, and Support Vector Machine on highly skewed credit card fraud data. The performance of the models is evaluated using the confusion matrix, and the results are presented through visualization and analysis. Our findings indicate that our approach can yield satisfactory results in detecting credit card fraud at a reduced computational cost, providing valuable insights for further research and practical implementation in the financial industry.
Project overview and business issues understanding
Introduction to project
Financial fraud is a growing threat with far-reaching consequences in the financial industry. Data mining has been critical in detecting credit card fraud in online transactions (Awoyemi et al., 2017). Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card or a similar payment mechanism as a fraudulent source of funds in a transaction. It is a growing problem in the credit card industry. Detecting credit card fraud is difficult with standard processes, so the development of credit card fraud detection models has become important to both academic and business organizations.
Credit card fraud occurs when a fraudster uses a credit card for their own purposes while the owner of the card is unaware. In 2016, fraudulent transactions involving credit cards acquired worldwide totaled €1.8 billion (Patidar & Sharma, 2011). Fraud costs businesses and institutions billions of dollars each year, and fraudsters are constantly looking for new ways to commit illegal acts. The good news is that fraud tends to follow certain patterns, and such patterns, and thus fraud, can be detected. According to the European Central Bank (European Central Bank, 2014), the total level of fraud in the Single Euro Payments Area reached 1.33 billion euros in 2012, a 14.8% increase over 2011. Furthermore, payments made through non-traditional channels (mobile, internet, etc.) accounted for 60% of all fraud, up from 46% in 2008. As new fraud patterns emerge, current fraud detection systems become less effective in preventing these frauds (Correa Bahnsen et al., 2016).
Recognizing the urgency of detecting credit card fraud, the team conducted a study on the topic of "Credit Card Fraud Detection" in the hope that banks and businesses in the credit sector can reduce their financial risks and thereby avoid fraud and the losses it causes. The paper uses machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), K-nearest Neighbor (KNN), and Logistic Regression (LR), to analyze the Credit Card Fraud Detection dataset on Kaggle, which contains 30.7511 records.
Objectives
The business objective of this project on machine learning in business analytics is to use the insights gained from analyzing the dataset to improve the efficiency and effectiveness of a credit fraud detection system. By building a predictive model that can identify individuals who are likely to commit credit fraud, businesses can reduce the number of false positives and false negatives in their fraud detection system. This saves time and money by minimizing unnecessary investigations while allowing efforts to be focused on high-risk cases. Additionally, by detecting credit fraud more accurately and efficiently, businesses can protect their reputation and maintain the trust of their customers. Overall, the objective of the project is to help businesses make better-informed decisions, improve their risk management capabilities, and ultimately increase their profitability and competitiveness. The work is organized into the following steps:
- Data Collection and Preprocessing: Collecting the relevant data and preprocessing it by cleaning, transforming, and normalizing it to make it suitable for analysis
- Exploratory Data Analysis (EDA): Conducting EDA to understand the data, find patterns, identify outliers, and gain insights into the data
- Feature Engineering: Selecting the relevant features and engineering them to create new features that can improve the accuracy of the predictive model
- Model Selection: Selecting a suitable machine learning algorithm to train the predictive model
- Model Training: Training the predictive model using the preprocessed data and selected algorithm
- Model Evaluation: Evaluating the predictive model using suitable metrics such as accuracy, precision, recall, and F1 score
- Hyperparameter Tuning: Tuning the hyperparameters of the model to improve its performance
- Deployment: Deploying the predictive model in a real-world environment and monitoring its performance
- Model Maintenance: Monitoring the model's performance and making necessary updates and improvements to maintain its accuracy over time
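The steps from preprocessing through hyperparameter tuning can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the pipeline used in this report: the data is a synthetic, imbalanced stand-in produced by make_classification rather than the Kaggle dataset, Logistic Regression is chosen only as an example model, and the parameter grid is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transaction data (an assumption, not the Kaggle set):
# 2,000 rows, 10 features, roughly 5% positive ("fraud") labels.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# A stratified split keeps the fraud ratio similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing and model live in one pipeline, so the scaler is fitted
# on the training folds only during cross-validation.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Hyperparameter tuning: a small, hypothetical grid over the regularization
# strength C, scored by F1 because accuracy is misleading on skewed data.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=3)
search.fit(X_train, y_train)

# Model evaluation on the held-out test set.
y_pred = search.predict(X_test)
scores = {
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
print(search.best_params_, scores)
```

Swapping the "model" step for another estimator (for example a Random Forest or a Support Vector Machine) leaves the rest of the sketch unchanged, which is the main reason for wrapping the steps in a Pipeline.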
Objects
Classifying credit card fraud based on customer information.
Scope: Spatial scope: sample dataset from Kaggle
Business challenges
Credit card fraud detection is challenging because of the volume, adaptability, and individuality of fraud cases and the requirement for real-time or near real-time assessment, which calls for automated identification, classification, and annotation. According to many researchers, the task is tough, if not impossible, because of constraints such as the highly skewed, imbalanced nature of the data and the lack of real data sets due to sensitivity, secrecy, and privacy concerns.
Data imbalance: The amount of data is constrained by privacy concerns, and the few datasets that are readily available are imbalanced: they contain far more legitimate transactions than fraudulent ones. It is the minority class (frauds) that is of interest, and the scarcity of these fraud cases frequently hinders the classifier's ability to learn, leading to unsatisfactory models.
Lack of data sets: Because of concerns over confidentiality, privacy, secrecy, and legality, card firms are unable to make their data sets available to the public or to researchers. Although there are numerous machine learning algorithms and a great deal of interest, research in this area requires real data to advance.
Fraudulent behavior that is adaptable and innovative: Fraud and normal behavior profiles change frequently and continue to evolve, necessitating ongoing relearning of the new patterns. It is also possible that the system will learn from previous fraudulent transactions that were mistaken for legitimate ones. New scams can be designed to look like everyday transactions, and once one goes undetected, others can follow the same pattern. Even human experts may find it difficult to tell the two apart.
Lack of appropriate evaluation metrics: Traditional accuracy scores are not informative on heavily imbalanced data. Moreover, there are no uniform evaluation criteria or metrics for assessing and comparing the effectiveness and quality of fraud detection systems. Hence, it is hard to compare different detection models and methods, and the absence of benchmarking is a further major problem. Researchers disagree on which models perform better and on what the optimal evaluation metric would be.
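To make the point about accuracy concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts for two hypothetical detectors. The label vectors are invented for illustration (990 legitimate and 10 fraudulent transactions) and are not results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 990 legitimate transactions followed by 10 frauds.
y_true = [0] * 990 + [1] * 10

# Detector A labels every transaction as legitimate.
always_legit = [0] * 1000

# Detector B raises 20 false alarms but catches 8 of the 10 frauds.
detector = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

print("always legitimate:", metrics(y_true, always_legit))
print("detector:", metrics(y_true, detector))
```

Detector A reaches the higher accuracy while catching no fraud at all, which is why recall, precision, and F1 on the fraud class are the more useful readings of the confusion matrix here.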
Business values
- Create a List of What Your Business Should Offer: Credit Card Fraud Detection
- Maintain Your Business and Its Relationships: By relying on anti-fraud technology, businesses keep their operations tight and highly secure, and they preserve their prestige and quality when partnering with other parties
- Retain Your Best Employees and Customers: When the system is operated properly, employees trust the accuracy of the business and work with greater loyalty and dedication. Customers likewise place more trust in businesses that maintain long-term partners
- Maintain Your Business's Financial Health: Detecting fraud cases in telecommunications charges to prevent loss of revenue for telecommunications service providers
- Educate Your Employees and Customers: Expanding training programs that teach employees and customers to watch for fraud cases, and using this technology as the basis for sound, well-reasoned solutions
- Look for Opportunities to Expand: A successful credit card fraud prevention tool provides a mechanism that can be applied to other areas such as anti-hacking, anti-technology theft, and anti-app fraud
Methods/Models
Billions of transactions take place around the world every day, and with such huge amounts of data it is impossible to classify them manually. The accuracy of the classification is also a major issue, because many factors determine whether a transaction is fraudulent. A solution is therefore needed to classify this large amount of data efficiently. Many methods have been proposed, and machine learning is considered one of the most effective for detecting credit fraud. Many machine learning and deep learning algorithms, such as ANN and Random Forest, are applied in practice. However, these models are computationally expensive and perform better on large datasets. This approach can lead to strong results, with some papers reporting up to 90% accuracy, but what if similar or even better results are achievable with fewer resources? Our main goal is to show that different machine learning algorithms can yield good results with proper preprocessing. The authors of most of the articles mentioned used undersampling techniques, which was the impetus for us to use another approach: oversampling. Given these considerations, we build basic machine learning models such as SVM, RF, DT, and KNN, together with suitable preprocessing methods, to detect credit fraud. To that end, an experiment was conducted.
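The idea behind oversampling can be illustrated with a simplified, SMOTE-style routine that creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. This is a teaching sketch in plain Python, not the SMOTE implementation used in the experiments; the function name smote_like, the choice of k, and the two-feature toy data are all assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment from the base point to the neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature data: 20 legitimate points and 4 fraudulent ones.
legit = [(float(i % 5), float(i // 5)) for i in range(20)]
fraud = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

# Generate enough synthetic frauds to match the size of the majority class.
new_fraud = smote_like(fraud, n_new=len(legit) - len(fraud))
balanced_fraud = fraud + new_fraud
print(len(legit), len(balanced_fraud))
```

Unlike undersampling, which discards legitimate transactions, this approach keeps every original record and adds new minority points that lie between existing ones. Oversampling should be applied to the training split only, so that synthetic points never leak into the evaluation data.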
Tools and Programming language
To conduct the research, we use the Scikit-learn library available in the Python programming language.
Structure of project
The report has 5 chapters. The opening section gives an overview of the topic. Chapter 1 describes the data that we must process. Chapter 2 presents the data preprocessing and the plan for building the models. The data is analyzed and evaluated in Chapter 3. Chapter 4 presents the results and the problem-solving approach. Finally, Chapter 5 summarizes the problems and solutions.