
Final Project Report: Machine Learning in Business Analytics. Topic: Credit Card Fraud Detection


UNIVERSITY OF ECONOMICS AND LAW FACULTY OF INFORMATION SYSTEMS

FINAL PROJECT REPORT

MACHINE LEARNING IN BUSINESS ANALYTICS

TOPIC: CREDIT CARD FRAUD DETECTION

Lecturer: Ho Trung Thanh, Ph.D.

Group 8:

1. Le Quang Thanh Tai
2. Nguyen Nghia
3. Nguyen Thi Thuy Hien
4. Nguyen Thi Yen Nhi

Ho Chi Minh City, Month, 2023


Members of Group 8

No. | Student | Student ID | Contribution (point / 10)
3 | Nguyen Thi Thuy Hien | K214061756 | 10/10


Acknowledgements

This research was supported by our lecturer, Ho Trung Thanh, who generously provided our team with the knowledge and techniques needed for the course "Machine Learning in Business Analytics" and guided us throughout the course to complete the study. Any mistake is our own fault for falling short of what he expected, and we sincerely express our gratitude to him.

We would also like to thank our team members, who were enthusiastic and active, worked hard, gave late-night comments, and supported each other mentally throughout the course.

Group 8


Commitment

This study was carried out by the members of Group 8 under the guidance of Lecturer Ho Trung Thanh. If any evidence of fraud is found in this research paper, we will take full responsibility for penalties at all levels. In addition, the paper draws on a few articles on related topics.

Ho Chi Minh City, March 21, 2023
Group 8


Table of Contents

Business challenges
Methods/Models
Tools and Programming language
Structure of project
Proposed research model
Chapter 1 - Data Understanding and Collection
  1.1 Dataset
  1.2 Data understanding
    1.2.1 Target column
Chapter 2 - Data preparation
  Overview of Chapter 2
  2.1 EDA
    2.1.1 Descriptive Statistics
    2.1.2 Visualization
Chapter 3 - Modelling
Chapter 4 - Visualizing results and sharing the findings - Recommendation making
  4.1 Visualizing results
  4.2 Business Solution
Chapter 5 - Conclusion and Future Works
  5.2 Contributions and Restrictions
  5.3 Future works
References


List of Tables

Table 4.1 Experimental results without SMOTE
Table 4.2 Experimental results with SMOTE


List of Figures

Figure 0.1 Research model
Figure 1.1 Positive Correlations
Figure 1.2 Negative Correlations
Figure 1.3 Missing data
Figure 2.1 Descriptive Statistics
Figure 2.2 Visualization


List of Acronyms

SMOTE | Synthetic Minority Over-sampling Technique


ABSTRACT

Financial fraud is a persistent issue in the financial industry with severe consequences. Data mining has become essential in detecting credit card fraud in online transactions. However, credit card fraud detection is challenging because the profiles of normal and fraudulent behavior evolve constantly and the data sets are highly skewed. The effectiveness of fraud detection in credit card transactions depends on factors such as the sampling approach used on the data set, the selection of variables, and the detection techniques employed. This paper investigates the performance of Random Forest, Logistic Regression, Decision Tree, and Support Vector Machine on highly skewed credit card fraud data. The performance of the models is evaluated using the confusion matrix, and the results are presented through visualization and analysis. Our findings indicate that our approach can yield satisfactory results in detecting credit card fraud at a reduced computational cost, providing valuable insights for further research and practical implementation in the financial industry.


Project overview and business issues understanding

Introduction to project

Financial fraud is a growing threat with far-reaching consequences in the financial industry. Data mining has been critical in detecting credit card fraud in online transactions (Awoyemi et al., 2017). Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card or a similar payment mechanism as a fraudulent source of funds in a transaction. It is a growing problem in the credit card industry. Detecting credit card fraud is difficult with standard processes, so the development of credit card fraud detection models is now important to both academic and business organizations.

Credit card fraud occurs when a fraudster uses a credit card for their own purposes while the owner of the card is unaware. In 2016, fraudulent transactions involving credit cards acquired worldwide totaled €1.8 billion (Patidar & Sharma, 2011). Fraud costs businesses and institutions billions of dollars each year, and fraudsters are constantly looking for new ways to commit illegal acts. The good news is that fraud tends to follow certain patterns, and such patterns, and thus fraud, can be detected.

According to the European Central Bank (European Central Bank, 2014), the total level of fraud in the Single Euro Payments Area reached 1.33 billion euros in 2012, a 14.8% increase over 2011. Furthermore, payments made through non-traditional channels (mobile, internet, etc.) accounted for 60% of all fraud, up from 46% in 2008. As new fraud patterns emerge, current fraud detection systems become less effective at preventing them (Correa Bahnsen et al., 2016).

Recognizing the urgency of detecting credit card fraud, the team conducted a study on the topic "Credit Card Fraud Detection" in the hope that banks and businesses in the credit sector can reduce their financial risks and thereby avoid fraud and the losses it causes. The paper uses machine learning algorithms, including Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), K-nearest Neighbor (KNN), and Logistic Regression (LR), to analyze the Credit Card Fraud Detection dataset on Kaggle, which contains 307,511 records.

Objectives

The business objective of this project on machine learning in business analytics is to use the insights gained from analyzing the dataset to improve the efficiency and effectiveness of a credit fraud detection system. By building a predictive model that can identify individuals who are likely to commit credit fraud, businesses can reduce the number of false positives and false negatives in their fraud detection system. This saves time and money by minimizing unnecessary investigations while allowing them to focus their efforts on high-risk cases. Additionally, by detecting credit fraud more accurately and efficiently, businesses can protect their reputation and maintain the trust of their customers. Overall, the objective of the project is to help businesses make better-informed decisions, improve their risk management capabilities, and ultimately increase their profitability and competitiveness.

- Data Collection and Preprocessing: Collecting the relevant data and preprocessing it by cleaning, transforming, and normalizing it to make it suitable for analysis.
- Exploratory Data Analysis (EDA): Conducting EDA to understand the data, find patterns, identify outliers, and gain insights into the data.
- Feature Engineering: Selecting the relevant features and engineering them to create new features that can improve the accuracy of the predictive model.
- Model Selection: Selecting a suitable machine learning algorithm to train the [...]
- Deployment: Deploying the predictive model in a real-world environment and monitoring its performance.
- Model Maintenance: Monitoring the model's performance and making necessary updates and improvements to maintain its accuracy over time.
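
The steps listed above can be sketched end to end with scikit-learn, the library the report names. This is a minimal sketch under stated assumptions, not the project's actual code: the data are synthetic (generated with `make_classification` in place of the Kaggle dataset), Logistic Regression stands in for the model-selection step, and deployment and maintenance are out of scope.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transaction data (an assumption): roughly 2% of
# the 5,000 rows belong to the positive ("fraud") class.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42,
)

# Stratify so the train and test splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Preprocessing (normalization) and model training combined in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
print(cm)
```

Keeping the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into preprocessing.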

Data imbalance: The amount of available data is constrained by privacy concerns, and the few datasets that are readily available are unbalanced: they contain far more honest than dishonest transactions. The minority class (fraud) is the one of interest, and the scarcity of fraud examples frequently hinders the classifier's ability to learn, leading to unsatisfactory models.

Lack of data sets: Because of concerns over confidentiality, privacy, secrecy, and legality, card firms are unable to make their data sets available to the public or to researchers. Although there are numerous machine learning algorithms and a great deal of interest, research in this area requires real data to advance.

Fraudulent behavior that is adaptable and innovative: Fraud and normal behavior profiles change frequently and continue to evolve, necessitating ongoing relearning of new patterns. The system may also learn from past fraudulent transactions that were mistaken for legitimate ones. New scams can be designed to look like everyday transactions, and if one is detected, all others could follow suit. Even human experts may find it difficult to tell them apart.

Lack of appropriate evaluation metrics: Traditional accuracy scores are misleading on highly imbalanced data. There is also no uniform set of evaluation criteria or metrics for assessing and comparing the effectiveness and quality of fraud detection systems. This makes it hard to compare different detection models and methods, and the absence of benchmarking is a major problem in its own right. Researchers disagree about which models perform better and which evaluation metric is optimal.
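
A small worked example shows why accuracy alone is a poor guide on skewed data. The sketch below is illustrative only: the label counts (990 legitimate, 10 fraudulent) and both hypothetical detectors are made up for the example, and the helper names are not from the report.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 990 legitimate (0) followed by 10 fraudulent (1).
y_true = [0] * 990 + [1] * 10

# Detector A never flags anything.
always_legit = summarize(y_true, [0] * 1000)

# Detector B raises 20 false alarms but catches 8 of the 10 frauds.
detector_b = summarize(y_true, [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2)

print(always_legit)
print(detector_b)
```

Detector A reaches 99% accuracy while catching no fraud at all (recall 0), whereas detector B has lower accuracy (97.8%) but a recall of 0.8. This is why confusion-matrix-based metrics are more informative than accuracy here.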

- Maintain Your Business's Financial Health: Detecting fraud cases in telecommunications charges to prevent loss of revenue for telecommunications service providers.

- Educate Your Employees and Customers: Increase training programs that teach employees and customers to be on the lookout for fraud. Apply this technology as a scientific and logical basis for providing optimal solutions.

- Look for Opportunities to Expand: A successful credit card fraud prevention tool can be reused in other areas such as anti-hacking, anti-technology theft, and anti-app fraud.


Methods/Models

Billions of transactions take place around the world every day, and with such huge amounts of data it is impossible to classify them manually. Classification accuracy is also a major issue, because many factors affect whether a transaction is fraudulent. A solution is therefore needed to classify this large amount of data efficiently. Many methods have been proposed, and machine learning is said to be one of the most effective for detecting credit fraud. Many machine learning and deep learning algorithms, such as ANN and Random Forest, are applied in practice. However, these models are computationally expensive and perform better on large datasets. This approach can lead to great results, with some papers reporting up to 90% accuracy, but what if similar or even better results are achievable with fewer resources? Our main goal is to show that different machine learning algorithms can yield good results with proper preprocessing. The authors of most of the articles mentioned used undersampling techniques, and that was the impetus to use another approach: oversampling. With this in mind, we build basic machine learning models such as SVM, RF, DT, and KNN, together with suitable preprocessing methods, to detect credit fraud. To that end, an experiment was conducted.
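
The idea behind oversampling with SMOTE, which the report's tables refer to, is to create synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea and not the implementation used in the project or in any particular package. The function name `smote_like`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        # A random position on the segment between base and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class (fraud) points with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class. Oversampling should be applied to the training split only, so that the test set still reflects the original class ratio.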

Tools and Programming language

To conduct the research, we use the Scikit-learn library available in the Python programming language.

Structure of project

There are 5 chapters in the report. The opening section gives an overview of the topic. Chapter 1 describes the data that we must process. Chapter 2 presents the model building and data preprocessing. The data are analyzed and evaluated in Chapter 3. Chapter 4 presents the results and the problem-solving approach. Finally, Chapter 5 summarizes the problems and solutions.

Date posted: 23/08/2024, 15:26
