
Final Project Report: Machine Learning in Business Analytics. Topic: Credit Card Fraud Detection


DOCUMENT INFORMATION

Basic information

Title: Credit Card Fraud Detection
Authors: Le Quang Thanh Tai, Nguyen Nghia, Nguyen Thi Thuy Hien, Nguyen Thi Yen Nhi
Supervisor: Ho Trung Thanh, Ph.D.
Institution: University of Economics and Law
Major: Information Systems
Document type: Final Project Report
Year of publication: 2023
City: Ho Chi Minh City
Pages: 38
File size: 4.57 MB

Content


FACULTY OF INFORMATION SYSTEMS

FINAL PROJECT REPORT

MACHINE LEARNING IN BUSINESS ANALYTICS

TOPIC: CREDIT CARD FRAUD DETECTION

Lecturer:

1 Ho Trung Thanh, Ph.D.

Group 8:

1 Le Quang Thanh Tai

2 Nguyen Nghia

3 Nguyen Thi Thuy Hien

4 Nguyen Thi Yen Nhi

Ho Chi Minh City, Month, 2023

Members of Group 8

NO | FULL NAME | STUDENT ID | POINT / 10 (INDIVIDUAL CONTRIBUTION) | SIGNATURE

Acknowledgements

This research was supported by lecturer Ho Trung Thanh, who generously provided our team with knowledge and useful techniques to prepare for the course "Machine Learning in Business Analytics" and guided us throughout the course to complete the study. Any mistake is our own fault for not accomplishing what he expected, and we sincerely express our gratitude to him.

We would also like to thank the team members, who were enthusiastic and active in their work, gave late-night comments, and supported each other mentally throughout the course.

Group 8

Commitment

This study was carried out by members of Group 8 under the guidance of Lecturer Ho Trung Thanh. If any evidence of fraud is found in this research paper, our team will take full responsibility for penalties at all levels. In addition, the paper includes references to a few articles on related topics.

Ho Chi Minh City, March 21, 2023

Group 8

Table of Contents

Members of Group 8

Acknowledgements

Commitment

List of Tables

List of Figures

List of Acronyms

GANTT CHART

ABSTRACT

Project overview and business issues understanding

Introduction to project

Objectives

Objects

Business challenges

Business values

Methods/Models

Tools and Programming language

Structure of project

Proposed research model

Chapter 1 – Data Understanding and Collection

1.1 Dataset

1.2 Data understanding

1.2.1 Target column

1.2.2 Correlation

1.2.3 Missing data

Chapter 2 - Data preparation

Overview of Chapter 2

2.1 EDA

2.1.1 Descriptive Statistics

2.1.2 Visualization

2.2 Data preprocessing

Chapter 3 – Modelling

Chapter 4 - Visualizing results and sharing the findings – Recommendation making

4.1 Visualizing results

4.2 Business solution


List of Tables

Table 4.1 Experimental results without SMOTE

Table 4.2 Experimental results with SMOTE

List of Figures

Figure 0.1 Research model

Figure 1.1 Positive Correlations

Figure 1.2 Negative Correlations

Figure 1.3 Missing data

Figure 2.1 Descriptive Statistics

Figure 2.2 Visualization

List of Acronyms

DB: Digital Business

MIS: Management Information System

SMOTE: Synthetic Minority Over-sampling Technique

ABSTRACT

Financial fraud is a persistent issue in the financial industry with severe consequences. Data mining has become essential in detecting credit card fraud in online transactions. However, credit card fraud detection is challenging due to the constantly evolving profiles of normal and fraudulent behaviors, as well as highly skewed data sets. The effectiveness of fraud detection in credit card transactions depends on various factors, such as the sampling approach used on the data set, the selection of variables, and the detection techniques employed. This paper investigates the performance of Random Forest, Logistic Regression, Decision Tree, and Support Vector Machine on highly skewed credit card fraud data. The performance of the models is evaluated using the confusion matrix, and the results are presented through visualization and analysis. Our findings indicate that our approach can yield satisfactory results in detecting credit card fraud at a reduced computational cost, providing valuable insights for further research and practical implementation in the financial industry.

Project overview and business issues understanding

Introduction to project

Financial fraud is a growing threat with far-reaching consequences in the financial industry. Data mining has been critical in detecting credit card fraud in online transactions (Awoyemi et al., 2017). Credit card fraud is a wide-ranging term for theft and fraud committed using a credit card or other similar payment mechanism as a fraudulent source of funds in a transaction. It is a growing problem in the credit card industry. Detecting credit card fraud is difficult with standard processes, so the development of credit card fraud detection models has become important to both academic and business organizations.

Credit card fraud occurs when a fraudster uses a credit card for their own purposes while the owner of the credit card is unaware. In 2016, fraudulent transactions involving credit cards acquired worldwide totaled €1.8 billion (Patidar & Sharma, 2011). Fraud costs businesses and institutions billions of dollars each year, and fraudsters are constantly looking for new ways to commit illegal acts. The good news is that fraud tends to follow certain patterns, and that such patterns, and thus fraud, can be detected.

According to the European Central Bank (European Central Bank, 2014), the total level of fraud in the Single Euro Payments Area reached 1.33 billion euros in 2012, representing a 14.8% increase over 2011. Furthermore, payments made through non- […]

[…] on Kaggle with 307,511 records.

Objectives

The business objective of our project on machine learning in business analytics is to use the insights gained from analyzing the dataset to improve the efficiency and effectiveness of a credit fraud detection system. By building a predictive model that can identify individuals who are likely to commit credit fraud, businesses can reduce the number of false positives and false negatives in their fraud detection system. This can save them time and money by minimizing the number of unnecessary investigations while also enabling them to focus their efforts on high-risk cases. Additionally, by detecting credit fraud more accurately and efficiently, businesses can protect their reputation and maintain the trust of their customers. Overall, the objective of our project is to help businesses make better-informed decisions, improve their risk management capabilities, and ultimately increase their profitability and competitiveness.

- Data Collection and Preprocessing: Collecting the relevant data and preprocessing it by cleaning, transforming, and normalizing it to make it suitable for analysis.

- Exploratory Data Analysis (EDA): Conducting EDA to understand the data, find patterns, identify outliers, and gain insights into the data.

- Feature Engineering: Selecting the relevant features and engineering them to create new features that can improve the accuracy of the predictive model.

- Model Selection: Selecting a suitable machine learning algorithm to train the predictive model.

- Model Training: Training the predictive model using the preprocessed data and selected algorithm.

- Model Evaluation: Evaluating the predictive model using suitable metrics such as accuracy, precision, recall, and F1 score.

- Hyperparameter Tuning: Tuning the hyperparameters of the model to improve its performance.

- […] monitoring its performance.

- Model Maintenance: Monitoring the model's performance and making necessary updates and improvements to maintain its accuracy over time.

Objects

Classifying credit card customer fraud based on customer information.

Scope: Space scope: sample dataset from Kaggle.

Business challenges

Due to the volume, adaptability, and individuality of each fraud, and the requirement for real-time or nearly real-time assessments, credit card fraud detection is quite challenging (requiring automated identification, classification, and annotation). According to many researchers, the challenge is tough, if not impossible, due to constraints including the tremendously uneven, highly skewed nature of the data and the lack of real data sets due to sensitivity, secrecy, and privacy concerns.

Data imbalance: The amount of data is constrained due to privacy concerns. The few datasets that are readily available are also unbalanced: they contain far more honest than dishonest transactions. It is the minority class (frauds) that is of interest, and the scarcity of these fraud cases frequently hinders the classifier's capacity to train, leading to unsatisfactory models.

[…] human experts may find it difficult to differentiate the issue.

Lack of appropriate evaluation metrics: Traditional accuracy scores cannot be meaningfully applied to heavily imbalanced data. It is also noted that there are no uniform evaluation criteria or metrics for assessing and comparing the effectiveness and quality of fraud detection systems. Hence, it is hard to compare different detection models and methods, which highlights the absence of benchmarking as a further major problem. Researchers disagree on which models perform better and on what the optimal evaluation metric would be.
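To illustrate why plain accuracy is misleading here, the following sketch scores a hypothetical classifier that always predicts the majority class. The class counts are the ones reported for the TARGET column in Chapter 1; the function name and the always-predict-0 baseline are our own illustration and are not one of the models evaluated in this report.

```python
# Class counts reported for the TARGET column in this report.
N_NEGATIVE = 282_686  # class 0
N_POSITIVE = 24_825   # class 1 (minority class of interest)


def majority_baseline_scores(n_negative: int, n_positive: int) -> dict:
    """Score a hypothetical classifier that always predicts class 0."""
    total = n_negative + n_positive
    true_negatives = n_negative   # every class-0 row is predicted correctly
    false_negatives = n_positive  # every class-1 row is missed
    true_positives = 0            # the baseline never predicts class 1
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / (true_positives + false_negatives)
    return {"accuracy": accuracy, "recall": recall}


scores = majority_baseline_scores(N_NEGATIVE, N_POSITIVE)
print(f"accuracy = {scores['accuracy']:.4f}, recall = {scores['recall']:.4f}")
# prints: accuracy = 0.9193, recall = 0.0000
```

A model that never flags a single positive case still reaches about 92% accuracy, which is why metrics derived from the confusion matrix are more informative on this kind of data.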

Business values

- Create a List of What Your Business Should Offer: Credit Card Fraud Detection

- Maintain Your Business and Its Relationships: By relying on anti-fraud technology, businesses maintain tight and highly secure operations, and preserve their prestige and quality when partnering with many other parties.

- Retain Your Best Employees and Customers: When the system is operated properly, employees will trust the accuracy of the business and work with greater loyalty and dedication. Customers also have more trust when working with businesses that have long-term partners.

- Maintain Your Business's Financial Health: Detecting fraud cases of telecommunications charges to prevent loss of revenue for telecommunications service providers.

- Educate Your Employees and Customers: Increase training programs for employees and customers on how to watch for fraud cases, and apply this technology as the basis for providing optimal solutions grounded in science and logic.

- Look for Opportunities to Expand: With a successful credit card fraud prevention tool, we can apply the same mechanism to other areas such as anti-hacking, anti-technology theft, and anti-app fraud.

Methods/Models

Every day, billions of transactions take place around the world, and with such huge amounts of data it is impossible to classify them manually. In addition, the accuracy of the classification is also a major issue, as many factors affect whether a transaction is fraudulent. Therefore, a solution is needed to classify this large amount of data efficiently.

Many methods have been proposed, and machine learning is said to be one of the most effective for detecting credit fraud. Many machine learning and deep learning algorithms, such as ANN and Random Forest, are applied in practice. However, these models are computationally expensive and perform better on large datasets. This approach can lead to great results, with some papers reporting up to 90% accuracy, but what if similar or even better results are achievable with fewer resources? Our main goal is to show that different machine learning algorithms can yield good results with proper preprocessing. The authors of most of the articles mentioned used undersampling techniques, and that was the impetus to use another approach: oversampling. With this in mind, we build basic machine learning models such as SVM, RF, DT, KNN… with suitable preprocessing methods to detect credit fraud. To that end, an experiment was conducted.

Tools and Programming language

To conduct the research, we use the Scikit-learn library available in the Python […]

Figure 0.1 Research model

The figure illustrates the proposed model for detecting credit card fraud with an imbalanced dataset. The process includes four steps: data preprocessing, model training, model evaluation, and visualization and analysis.

Firstly, in data preprocessing, the data is cleaned to optimize the performance of machine learning models for classification. Additionally, data augmentation is applied to increase the number of instances in minority classes by interpolating new instances between existing ones.
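The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration of SMOTE-style oversampling under our own assumptions (the function name, the toy points, and the choice of `k` are hypothetical); it is not the implementation used in this project, and a maintained library such as imbalanced-learn would normally be preferred in practice.

```python
import random


def interpolate_minority(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other samples by squared Euclidean distance to `base`.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points with two features each (illustrative values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = interpolate_minority(minority, n_new=6)
print(len(minority) + len(new_points))  # minority class size after oversampling
```

Because every synthetic point lies on a segment between two real minority samples, the new instances stay inside the region already occupied by the minority class rather than duplicating existing rows.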

[…] contrast their effectiveness. If data augmentation is not included, the model is tuned using class weights.
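The report does not state how the class weights are chosen. One common heuristic, which is also what scikit-learn applies when an estimator is given `class_weight="balanced"`, weights each class by `n_samples / (n_classes * class_count)`. The sketch below computes these weights by hand on toy labels; the function name and the 92/8 toy split are our own assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Toy labels with roughly the 92% / 8% split described in this report.
labels = [0] * 92 + [1] * 8
weights = balanced_class_weights(labels)
print(weights)  # the minority class (1) receives the larger weight
```

With these weights, an error on a minority-class row costs the model more than an error on a majority-class row, which counteracts the imbalance without creating synthetic data.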

In model evaluation, the confusion matrix is used to validate the performance of the models. If the results are unsatisfactory, the model training stage is revisited to develop an improved model.
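As a minimal sketch of this evaluation step, the code below builds the four confusion matrix counts for a binary problem and derives precision, recall, and F1 from them. The helper names and the toy labels are illustrative assumptions; scikit-learn's `sklearn.metrics` module offers equivalent functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall and F1; each defaults to 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy ground truth and predictions (illustrative values only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)  # prints: 3 2 1 4
```

In fraud detection, recall (the share of positive cases that are caught) and precision (the share of flagged cases that are truly positive) usually matter more than the overall share of correct predictions.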

Finally, the classification models' performance is presented through visualization and analyzed to compare effectiveness across the various approaches.

Chapter 1 – Data Understanding and Collection

Understanding data is the process of discovering, analyzing, and interpreting data to gain a better understanding of its meaning, quality, and structure. This is an important step in the data mining process, which includes collecting, cleaning, and transforming data to make it useful for analysis. In this chapter, we determine parameters such as the quantity, attributes, and values of the data.

1.1 Dataset

The dataset is taken from Kaggle: Credit Card Fraud Detection.

Because of credit card fraud, the business has collected customer data that may be relevant to the fraud. From there, we will take appropriate measures to find the most influential attributes. The dataset consists of 122 columns, including 120 features, one identifier column, and one target column (TARGET).

The dataset contains 307,511 rows. Understanding and identifying the data types in the dataset is essential to building an effective machine learning application. Qualitative variables and quantitative variables are the two main categories of variables in statistics.

In the dataset, there are 53 quantitative variables and 69 qualitative variables, and we need to encode the qualitative variables into quantitative ones.
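The report does not say which encoding scheme is applied. One widely used option is one-hot encoding, sketched below in plain Python. The column names `income` and `contract_type`, the sample values, and the function name are hypothetical and only serve the illustration.

```python
def one_hot_encode(rows, column):
    """Replace a qualitative column with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per observed category.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with one quantitative and one qualitative column.
rows = [
    {"income": 202500.0, "contract_type": "Cash loans"},
    {"income": 67500.0, "contract_type": "Revolving loans"},
    {"income": 135000.0, "contract_type": "Cash loans"},
]
encoded = one_hot_encode(rows, "contract_type")
print(encoded[0])
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category; for columns with many distinct values, another scheme may be more practical.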

1.2 Data understanding

1.2.1 Target column

In this column, the value 0 occurs 282,686 times (91.93% of all rows), while the value 1 occurs only 24,825 times (8.07%). The data is therefore imbalanced: about 92% of loans are paid on time, and only about 8% are not. This imbalance needs to be handled.

1.2.2 Correlation

This involves examining the strength and direction of the relationship between two variables using techniques such as Pearson correlation or Spearman correlation.

[…] (10 positive and 10 negative). Based on those variables, we can have a more detailed measure to predict the output of the model.

Figure 1.1 Positive Correlations

Figure 1.2 Negative Correlations
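The ranking of features by their correlation with the target can be sketched as follows. This is a plain-Python illustration of the Pearson coefficient and a top-k selection; the feature names, the toy values, and both function names are hypothetical and are not taken from the project's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def rank_by_target_correlation(features, target, k=10):
    """Return the k most positively and k most negatively correlated features."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    positive = [item for item in ordered if item[1] > 0][:k]
    negative = [item for item in reversed(ordered) if item[1] < 0][:k]
    return positive, negative


# Toy data: three hypothetical features and a binary target.
target = [0, 0, 0, 1, 0, 1, 0, 1]
features = {
    "feature_a": [1, 2, 1, 5, 2, 6, 1, 7],  # tends to rise with the target
    "feature_b": [9, 8, 9, 2, 8, 1, 9, 3],  # tends to fall as the target rises
    "feature_c": [4, 4, 4, 4, 4, 4, 4, 4],  # constant, carries no signal
}
positive, negative = rank_by_target_correlation(features, target, k=10)
print(positive)
print(negative)
```

Pearson correlation only captures linear relationships, so a feature with a low coefficient can still be useful to a non-linear model such as a decision tree or random forest.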

1.2.3 Missing data

Chapter 2 - Data preparation

Overview of Chapter 2

Chapter 2 outlines the data preprocessing process, including the preprocessing methods that our team applies and describes. It also covers content related to EDA.

2.1 EDA

The overall objective of EDA is to develop a thorough understanding of the data, use this understanding to guide our modeling choices, and enhance the accuracy as well as the reliability of our predictions. When it comes to machine learning, EDA can assist us in finding potential problems or biases in our data, choosing relevant features, and identifying trends or patterns that can guide our modeling decisions. This is crucial when dealing with complex or noisy data, as well as with vast amounts of data that may be difficult to comprehend without visual aids or statistical analyses.

The first step of EDA is statistical analysis, which is the process of using mathematical models and techniques to describe and summarize data, and to identify any significant differences or relationships between variables.
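A descriptive summary of a single quantitative column can be sketched with the Python standard library as shown below. The column values and the function name are illustrative assumptions; missing entries are represented as `None` and skipped.

```python
import statistics


def describe(values):
    """Summarise one quantitative column, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# Hypothetical column with one missing value and one extreme value.
amounts = [10.0, 12.0, None, 11.0, 250.0, 9.0]
summary = describe(amounts)
print(summary["mean"], summary["median"])  # prints: 58.4 11.0
```

The gap between the mean and the median in this toy column shows how a single extreme value pulls the mean upward, which is the kind of signal that descriptive statistics are meant to surface before modeling.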

Date posted: 23/03/2024, 16:06
