VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
TRUONG MINH KHIET - 19520628
LUONG TIEN THUAN HAI - 19521462
THESIS
CCF represents an escalating threat to financial institutions, as fraudsters continually develop novel methods to exploit vulnerabilities. The need for a robust classifier is paramount to adjust to the ever-changing landscape of fraud. The primary goal of a fraud detection system is to precisely forecast instances of fraud while minimizing the occurrence of false positives.
The effectiveness of machine learning (ML) approaches varies across diverse business scenarios, where the characteristics of input data play a crucial role in determining the suitable ML techniques. In the realm of CCF detection, pivotal factors influencing model performance encompass the quantity of features, transaction volume, and inter-feature correlations.
Deep learning (DL) techniques, exemplified by Convolutional Neural Networks (CNNs) and their layers, are essential for text processing and serve as a foundational model. Applying these techniques to identify fraudulent credit card activities surpasses the capabilities of traditional algorithms. A comparative evaluation of algorithmic performances highlights the CNN with 20 layers and the XGBoost model as the leading methods, achieving an impressive accuracy of 94.36%.
Numerous sampling methods are employed to boost performance on existing examples, yet they exhibit a noticeable decline in efficacy when applied to unseen data. Intriguingly, there is an observed enhancement in performance on unseen data with an increased class imbalance. Future research initiatives might explore the integration of more sophisticated deep learning methods to elevate the overall performance of the model proposed in this study.
INTRODUCTION ................................................................ 10
1.1 Reason for choosing the topic ........................................... 10
1.2 Topic purpose ........................................................... 13
Research scope of the topic ................................................. 16
In this research, our primary objective is to harness machine learning (ML) and deep learning (DL) algorithms for detecting credit card fraud transactions. The study contributes in the following ways:
Algorithms for feature selection are employed to rank the most significant features within the Credit Card Fraud (CCF) transaction dataset, facilitating the prediction of class labels.
A novel deep learning model is introduced by incorporating additional layers. These added layers play a crucial role in feature extraction and the classification of transactions within the identified credit card fraud dataset.
Architectural Analysis of CNN Model:
The performance of the Convolutional Neural Network (CNN) model is scrutinized using different architectures of CNN layers to optimize its effectiveness in credit card fraud detection.
A comparative analysis is carried out between traditional Machine Learning (ML) and Deep Learning (DL) algorithms and our proposed CNN model against a baseline model. The results underscore the superior performance of our proposed method compared to existing approaches.
The accuracy of classifiers is assessed using performance metrics such as accuracy, precision, recall, and F1-score. These evaluations are conducted on the latest credit card dataset.
The structure of the article is organized as follows:
- Section 2: Reviews related works in the field.
- Section 3: Provides a comprehensive description of the proposed model and its methodology.
- Section 4: Outlines the dataset and the metrics used for evaluation, presenting the results of experiments on a real dataset along with analysis.
Table 1.1 - Configuration of the execution computer
Processor: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70 GHz
Installed RAM: 16.0 GB (15.7 GB usable)
System type: 64-bit operating system, x64-based processor
Table 1.2 - Configuration of the Python 3 Google Compute Engine
THEORETICAL BACKGROUND AND RELATED WORKS
The research methodology is a systematic way of solving problems; applied research aims to find practical solutions to those problems. Before conducting experiments in the real world, the research covers the basics by following these steps:
Table 3.1 - Accuracy-based results of deep learning algorithms and ensemble models as baseline models
EXPERIMENTAL STUDY .......................................................... 33
3.1 Experimental environment ................................................ 33
3.2 Experimental data set ................................................... 33
The mainframe transaction table of credit cards shows the essential features, listed in Table 3.2. The structure of the table may vary slightly depending on the card issuer, but the key characteristics are stored in the database and can be used for fraud detection modeling.
For research purposes, the credit card dataset is publicly available. The dataset [11] contains transactions made by cardholders in September 2018 over two days. Out of 284,807 transactions, 492, or 0.172 percent, were fraudulent. To protect the confidentiality of the consumers' transaction details, most of the dataset's features are transformed using principal component analysis (PCA). PCA is a common and widely used method in the relevant literature to reduce the dimensionality of such datasets while preserving interpretability and minimizing information loss [2], [4], [19]. It does this by successively creating new uncorrelated variables that maximize variance. Table 3.3 shows the detail of the dataset with 31 columns, including Time, V1, V2, V3, ..., V28 as PCA-applied features, Amount, and class labels.
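The V1-V28 columns arrive already transformed, so no PCA step is needed in this work. Purely as an illustration of the technique, the following minimal sketch, assuming scikit-learn and synthetic stand-in data, shows how PCA produces uncorrelated, variance-maximizing components:

# Illustrative only: the published CCF dataset already contains the
# PCA-transformed columns V1-V28, so this step is not part of our pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 10))            # synthetic stand-in for raw features

X_std = StandardScaler().fit_transform(X_raw)  # PCA is variance-based, so standardize first
pca = PCA(n_components=5)                      # keep the 5 highest-variance directions
X_reduced = pca.fit_transform(X_std)

# Components are mutually uncorrelated and sorted by explained variance.
print(pca.explained_variance_ratio_)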
Table 3.2 - The list of features available in the CCF dataset
Sr. No. | Name of Feature | Description
3 | Credit Limit | The maximum amount of credit of the associated account
4 | Card Number | Number of the credit card
5 | Transaction Amount | The transaction amount submitted by the ...
6 | Transaction Time | Time of the transaction
7 | Transaction Date | Date of the transaction
8 | Transaction Type | Type of transaction, such as cash withdrawal or purchase
9 | Currency Code | The currency code
10 | Merchant Category Code | The merchant business type number
11 | Merchant Number | The merchant reference number
12 | Transaction Country | The country where the transaction takes place
13 | Transaction City | The city where the transaction takes place
14 | Approval Code | The response to the authorisation request; it means approve or reject
Based on the data set, the following columns will be obtained.
Table 3.3 - Characteristics of the dataset
Feature | Description
Time | The time elapsed between the current transaction and the first transaction
V1, V2, V3, ..., V28 | These 28 columns are the result of a PCA dimensionality reduction applied to protect user identities and sensitive features
Class | Binary class label: 0 for non-fraudulent transactions and 1 for fraudulent transactions
Overview of the original unprocessed data set.
Table 3.4 - Overview of the unprocessed dataset
3 | Column names | V1, V2, V3, ..., V28, Time, Amount, Class
5 | Number of rows with the same label (0: Non-fraud, 1: Fraud) | 0: 284,315; 1: 492
Large size: The dataset is large, with 284,807 rows and 31 columns. This is a fairly large dataset that can be used for classification problems, especially fraud classification.
Large number of columns: The dataset has 31 columns. This is a fairly large number of columns, which can contain a lot of useful information for the classification problem.
No null values: The dataset has no null values. This is a favorable condition for data analysis.
Generic column names: Column names include V1, V2, V3, ..., V28, Time, Amount, Class. These column names are quite generic, which can make them difficult to interpret.
Small number of labels: The dataset has two labels: Non-fraud and Fraud. This is a fairly small number of labels, which can make the classification problem more difficult.
Large difference in the number of samples: The number of samples for the Non-fraud label is 284,315, while the number of samples for the Fraud label is 492. This is a quite large difference, which can make the classification problem more difficult.
Overall, the dataset has many advantages, especially its large size and absence of null values. However, some disadvantages should be noted, such as generic column names, a small number of labels, and a large difference in sample counts. To improve the performance of the classification problem, techniques such as data analytics, advanced machine learning, data balancing, dimensionality reduction, preprocessing, and data integration can be applied.
Prepare the raw data and use Google Drive for storage, then mount the drive to access the data.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g., pd.read_csv)

df = pd.read_csv('/content/drive/MyDrive/KLTN/creditcard.csv')
df.head()
Figure 10 - Connecting to data through Google Drive
Handle null and NaN values by replacing them with the mean value of the corresponding column.
# Replace the NaN values in the data with the column's mean value
df.fillna(df.mean(), inplace=True)

threshold = 0.5  # for example, a threshold value of 0.5
df['Class'] = df['Class'].apply(lambda x: 1 if x > threshold else 0)
df.describe()
[Output of df.describe(): count, mean, std, and min statistics for Time and V1-V8 over the 284,807 rows]
# Count the number of NULL values
df.isnull().sum().max()
Figure 11 - Handling incorrect data
Check the data: there are 31 columns in total, of which 28 are variables (V1 to V28), one is time (Time), one is the amount (Amount), and one is the class (Class).

df.columns
# Address the problem of heavily skewed classes
print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100, 2), '% of the dataset')
No Frauds 99.83 % of the dataset
Figure 12 - Checking the columns in the data
Analyze and show the imbalance in the Class column, with Class 0 representing "no fraud" and Class 1 representing "fraud". The chart shows that there is a large number of non-fraud cases and few cases of fraud.

import seaborn as sns
import matplotlib.pyplot as plt

colors = ["#0101DF", "#DF0101"]
sns.countplot(x='Class', data=df, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)
Text(0.5, 1.0, 'Class Distributions \n (0: No Fraud || 1: Fraud)')
Figure 13 - Chart of Class Distributions
Rearrange some columns so the data can be viewed more easily.
# Redistribute data by Amount and Time
from sklearn.preprocessing import StandardScaler, RobustScaler
# RobustScaler is less prone to outliers.
std_scaler = StandardScaler()
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1, 1))

df.drop(['Time', 'Amount'], axis=1, inplace=True)

scaled_amount = df['scaled_amount']
scaled_time = df['scaled_time']
df.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)
df.insert(0, 'scaled_amount', scaled_amount)
df.insert(1, 'scaled_time', scaled_time)
# Amount and Time have been scaled
Calculate the percentage of labels relative to the data and divide the data.

df.head()

[Output of df.head() showing scaled_amount, scaled_time, V1, V2, V3, V4, ...]
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100, 2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100, 2), '% of the dataset')
X = df.drop('Class', axis=1)
y = df['Class']

sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in sss.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]
# Use the already existing X_train and y_train
Figure 15 - Allocating data for training and testing.
Divide the data into 5 parts, then apply the cross-validation method.
# original_Xtrain, original_Xtest, original_ytrain, original_ytest = \
#     train_test_split(X, y, test_size=0.2, random_state=42)
# Convert into arrays
original_Xtrain = original_Xtrain.values
original_Xtest = original_Xtest.values
original_ytrain = original_ytrain.values
original_ytest = original_ytest.values
# See if the distribution of training labels and test labels is the same
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
print('-' * 100)
print('Label Distributions: \n')
print(train_counts_label / len(original_ytrain))
print(test_counts_label / len(original_ytest))
No Frauds 99.83 % of the dataset
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 30473  30496  31002 ... 113964 113965 113966]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 81609  82400  83053 ... 170946 170947 170948]
Train: [     0      1      2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [     0      1      2 ... 227866 227867 227868] Test: [212516 212644 213992 ... 284804 284805 284806]
Figure 16 - Result of Label Distribution
Shuffle the data based on the previously divided parts.
# Since our classes are highly skewed we should make them equivalent
# in order to have a normal distribution of the classes.
# Let's shuffle the data before creating the subsamples
df = df.sample(frac=1)
# Amount of fraud classes: 492 rows
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:492]

normal_distributed_df = pd.concat([fraud_df, non_fraud_df])
# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)
new_df.head()

[Output of new_df.head() showing scaled_amount, scaled_time, V1, V2, V3, V4, ...]
After applying the subsampling method, the amount of data for each label is balanced.

print('Distribution of the Classes in the subsample dataset')
print(new_df['Class'].value_counts() / len(new_df))

sns.countplot(x='Class', data=new_df, palette=colors)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()
Applied machine learning & ensemble learning techniques
We use and apply the following machine learning and ensemble learning algorithms.
Use the Grid Search Cross-Validation technique to search for optimal parameters for the model. These optimal parameters will help the model achieve the best performance without overfitting.

result_final.update({"DecisionTree": accuracy_DecisionTree})
print(f'Accuracy: {accuracy_DecisionTree}')
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
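The grid-search step that produces grid_search and the "Best Parameters" output below is not visible in the extracted listing. The following is a minimal sketch of how it could look, assuming scikit-learn and the balanced subsample new_df prepared earlier; the parameter grid and split ratio are illustrative, not taken from the thesis:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Split the balanced subsample into training and test sets (illustrative ratio)
X = new_df.drop('Class', axis=1)
y = new_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Search the hyperparameter grid with 5-fold cross-validation
param_grid = {'max_depth': [3, 5, 7, 10], 'min_samples_leaf': [1, 2, 5]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_DecisionTree = accuracy_score(y_test, y_pred)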
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, ...}
Classification Report:
              precision    recall  f1-score   support

           1       0.93      0.90      0.91       134

    accuracy                           0.92       284
   macro avg       0.92      0.92      0.92       284
weighted avg       0.92      0.92      0.92       284
Figure 25 - Score of the Decision Tree model
Table 3.6 - Score table of the Decision Tree
Model: Decision Tree (processed data); overall score: 91.9 %
Prediction | Value
True positive | 141
False positive | 14
True negative | 120
False negative | 9
Normalization brings all features to the same scale, which improves the performance of the distance-based KNN algorithm, especially when the features originally have different scales.
Use Grid Search Cross-Validation to find the best hyperparameters for the model, then use the model with the best hyperparameters to predict on the test set.

result_final.update({"KNN": accuracy_KNN})
print(f'Accuracy: {accuracy_KNN}')
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
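A minimal sketch of the scaling and grid-search steps described above, assuming scikit-learn and the X_train/X_test split from the Decision Tree sketch; the grid values are illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Bring all features to the same scale before distance-based KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

y_pred = grid_search.best_estimator_.predict(X_test_scaled)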
Classification Report: (per-class precision, recall, f1-score, and support; see Figure 26)
Figure 26 - Score of the KNN model

Table 3.7 - Score table of KNN
Model: KNN (processed data); overall score: 91.5 %
Prediction | Value
True positive | 146
False positive | 20
True negative | 114
False negative | 4
Use the Grid Search Cross-Validation technique to search for optimal parameters for the model. These optimal parameters will help the model achieve the best performance without overfitting.
The model consists of a set of decision trees built independently of each other. The number of trees in the forest is controlled by the parameter n_estimators. Each tree is constructed using a random set of features and samples (if bootstrap=True), which enhances the diversity of the forest.
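A minimal sketch of this search, assuming scikit-learn and the existing train/test split; the grid values are illustrative but include the parameters reported in the "Best Parameters" output below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5],
              'min_samples_leaf': [1, 2],
              'bootstrap': [True]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

y_pred = grid_search.best_estimator_.predict(X_test)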
# Evaluate the refined model
accuracy_RandomForest = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

result_final.update({"RandomForest": accuracy_RandomForest})
print('RandomForest: ', accuracy_RandomForest)
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
Best Parameters: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Classification Report:
              precision    recall  f1-score   support

           1       0.96      0.90      0.93       134

    accuracy                           0.93       284
   macro avg       0.94      0.93      0.93       284
weighted avg       0.93      0.93      0.93       284
Table 3.8 - Score table of Random Forest (RF)
Model: RF (processed data); overall score: 93.3 %
Prediction | Value
True positive | 145
False positive | 14
True negative | 120
False negative | 5
Data are standardized before building the SVM model. This often improves the performance of SVM, especially when features have different scales.
The SVM model uses the RBF kernel. The RBF kernel is a popular kernel for SVM applications.
The parameters C and gamma are used to adjust the flexibility of the decision boundary. The larger the C parameter, the more strongly misclassifications are penalized, producing a tighter boundary; the larger the gamma parameter, the more sensitive the decision boundary becomes to individual data points, including outliers.
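A minimal sketch of the standardization and grid search over C and gamma described above, assuming scikit-learn and the existing train/test split; the grid values are illustrative:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Standardize features; SVM with an RBF kernel is sensitive to feature scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1], 'kernel': ['rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

y_pred = grid_search.best_estimator_.predict(X_test_scaled)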
# Evaluation of the refined model
accuracy_SVM = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

result_final.update({"SVM": accuracy_SVM})
print(f'SVC: {accuracy_SVM}')
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Classification Report:
              precision    recall  f1-score   support

           1       0.96      0.90      0.93       134

    accuracy                           0.94       284
   macro avg       0.94      0.93      0.94       284
weighted avg       0.94      0.94      0.94       284
Table 3.9 - Score table of SVM
Model: SVM (processed data); overall score: 93.6 %
Prediction | Value
True positive | 145
False positive | 13
True negative | 121
False negative | 5
Similar to previous models, data were normalized using StandardScaler to improve model performance.
Grid Search Cross-Validation is used to find the best hyperparameters for the model. The hyperparameters tested include (a sketch of this search follows the list):
- C: adjusts the level of regularization, helping to prevent overfitting the model to the training data.
- penalty: determines the type of regularization to be used ('l1' for Lasso and 'l2' for Ridge).
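A minimal sketch, assuming scikit-learn and the existing train/test split; the grid values are illustrative:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Standardize features, as described above
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The liblinear solver supports both 'l1' and 'l2' penalties
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)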
# Print out the best hyperparameters
print("Best Parameters:", grid_search.best_params_)
# Predict on the test set using the fine-tuned model
y_pred = grid_search.best_estimator_.predict(X_test_scaled)
# Evaluation of the refined model
accuracy_LogisticRegression = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

result_final.update({"LogisticRegression": accuracy_LogisticRegression})
print(f'Accuracy: {accuracy_LogisticRegression}')
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
Classification Report:
              precision    recall  f1-score   support

           1       0.96      0.91      0.93       134

    accuracy                           0.94       284
   macro avg       0.94      0.94      0.94       284
weighted avg       0.94      0.94      0.94       284
Figure 29 - Score of the Logistic Regression model
Table 3.10 - Score table of Logistic Regression
Model: Logistic Regression (processed data)
Prediction | Value
Use Grid Search Cross-Validation to find the best hyperparameters for the model.
Hyperparameters tested (a sketch of this search follows the list):
- n_estimators: the number of decision trees in the ensemble.
- max_depth: the maximum depth of each tree.
- learning_rate: the learning rate of the algorithm.
- subsample: the percentage of samples used to train each tree.
- colsample_bytree: the percentage of features used to train each tree.
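A minimal sketch of this search, assuming the xgboost package, scikit-learn, and the existing train/test split; the grid values are illustrative:

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200],
              'max_depth': [3, 5],
              'learning_rate': [0.05, 0.1],
              'subsample': [0.8, 1.0],
              'colsample_bytree': [0.8, 1.0]}
grid_search = GridSearchCV(XGBClassifier(eval_metric='logloss'),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)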
# Prediction on the test set using the fine-tuned model
y_pred = grid_search.best_estimator_.predict(X_test)
# Evaluation of the refined model
accuracy_XGBoost = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

result_final.update({"XGBoost": accuracy_XGBoost})
print(f'Accuracy: {accuracy_XGBoost}')
print(f'Confusion Matrix: \n{conf_matrix}')
print(f'Classification Report: \n{classification_rep}')
Classification Report:
              precision    recall  f1-score   support

           1       0.97      0.91      0.94       134

    accuracy                           0.94       284
   macro avg       0.95      0.94      0.94       284
weighted avg       0.94      0.94      0.94       284
Figure 30 - Score of the XGBoost model
Table 3.11 - Score table of XGBoost
Model: XGBoost (processed data); overall score: 94.3 %
Prediction | Value
True positive | 146
False positive | 12
True negative | 122
False negative | 4
Create a Sequential object that represents the sequential model.
Input layer with 100 neurons uses ReLU.
Hidden layer with 50 neurons uses ReLU.
Output layer with 1 neuron using sigmoid function.
Loss function: Use the binary crossentropy loss function suitable for the binary classification problem.
Optimizer: Use the Adam optimizer to update the weights during training.
Evaluation: Monitor accuracy during training.
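A minimal sketch of the network described above, assuming TensorFlow/Keras; the epoch count, batch size, and validation split are illustrative assumptions, since they are not stated in this section:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(100, activation='relu', input_shape=(X_train.shape[1],)),  # input layer, 100 neurons, ReLU
    layers.Dense(50, activation='relu'),                                    # hidden layer, 50 neurons, ReLU
    layers.Dense(1, activation='sigmoid')                                   # output layer, 1 neuron, sigmoid
])

model.compile(optimizer='adam',            # Adam updates the weights during training
              loss='binary_crossentropy',  # suitable for the binary classification problem
              metrics=['accuracy'])        # monitor accuracy during training

# Illustrative training call; epochs and batch size are assumptions
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)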
Classification Report:
              precision    recall  f1-score   support

           1       0.96      0.93      0.94       134

    accuracy                           0.95       284
   macro avg       0.95      0.95      0.95       284
weighted avg       0.95      0.95      0.95       284
Figure 31 - Score of the ELM model
Table 3.12 - Score table of ELM
Model: ELM (processed data); overall score: 94.7 %
Prediction | Value
True positive | 145
We use and apply the following deep learning algorithms.
In machine learning, a baseline model serves as an uncomplicated initial point of reference for comparison with more intricate models. Its purpose is to set a performance benchmark, enabling the assessment of the effectiveness of advanced algorithms. In the realm of fraud detection, a typical baseline could entail employing uncomplicated techniques such as Logistic Regression or a basic decision tree. These straightforward models offer an initial framework for analysis, enabling practitioners to assess the predictive capabilities of more sophisticated approaches.
Convolutional Neural Networks (CNNs) are crucial in fraud detection, especially with visual data like scanned documents or credit card images. Known for their adept feature extraction from images, CNNs automatically learn patterns, making them valuable for distinguishing fraud. Their adaptability extends to tasks with sequential or temporal data, making them versatile for specific fraud detection challenges. Transfer learning allows pre-trained CNN models, initially built for image classification, to be fine-tuned for fraud detection, leveraging learned features for improved performance. While not the default for tabular data, CNNs are indispensable for fraud detection in contexts dominated by visual or sequential data.
Building the Model: The CNN model in Keras consists of a convolutional layer, a max pooling layer, a dropout layer, a second convolutional layer, a second max pooling layer, a second dropout layer, and two fully connected layers in sequence. Figure 4 shows the input neural network and the output of the dropout layer.

Model Compilation: Categorical Cross-Entropy: We used binary cross-entropy in previous sections and in ML. Now, we use categorical cross-entropy. This means that we have multiple classes. The equation is as follows:
\[
\mathrm{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\right]
\]
Building the Model: Epochs and Batch Size: We used a dataset of 20 samples, a batch size of 2, and decided that the algorithm needed to run for three epochs. Therefore, in each epoch we use ten batches (20/2 = 10); once all batches are processed by the algorithm, we have ten iterations per epoch. This method is often better than the sequential model; the main change comes from the stacked group, with some minor changes in the module of the sequential model.
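A minimal sketch of the architecture described above, assuming TensorFlow/Keras and that each tabular row is reshaped so a 1-D convolution can slide over the features; the filter counts, kernel sizes, and dropout rate are illustrative assumptions. Binary cross-entropy is used here because the fraud labels in this dataset are binary; the categorical form above applies when there are more than two classes:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Reshape each row of tabular features to (n_features, 1) for Conv1D
n_features = X_train.shape[1]
X_train_cnn = np.asarray(X_train).reshape(-1, n_features, 1)

model = keras.Sequential([
    layers.Conv1D(32, kernel_size=2, activation='relu', input_shape=(n_features, 1)),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.2),
    layers.Conv1D(64, kernel_size=2, activation='relu'),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),    # first fully connected layer
    layers.Dense(1, activation='sigmoid')   # second fully connected (output) layer
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Batch size 2 and three epochs, as in the worked example above
model.fit(X_train_cnn, y_train, epochs=3, batch_size=2)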
Building the Model: Traditional methods of evaluating ML classifiers use confusion matrices to measure the difference between the ground truth in the data and the model's predictions. The confusion matrix consists of four elements: TP, TN, FP, and FN, which stand for true positive, true negative, false positive, and false negative, respectively.
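From these four quantities, the metrics reported throughout this chapter (accuracy, precision, recall, and F1-score) follow the standard definitions:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]

For example, the Decision Tree counts in Table 3.6 (TP = 141, FP = 14, TN = 120, FN = 9) give Accuracy = (141 + 120) / 284 ≈ 0.919, matching the 91.9 % reported there.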