Theoretical research methods:
• Analytical-synthetic method: read and synthesize the available documents to draw out the content needed for the thesis of the research paper.
• Modeling method: build a research model based on theory and apply the model to make predictions in order to test its accuracy.
On that theoretical basis, the study applies practical research methods:
• Use Orange software, a fairly intuitive tool for working with machine learning algorithms and popular data mining (DTM) techniques, to analyze the data and clarify the research problems.
• Build predictive models on the available training data sets and compare the obtained results with each other to choose the most suitable model for understanding the influence of financial information on the decision to choose a quality audit company.

1.5 Structure of the research
In addition to the table of contents, the lists of tables and figures, the list of acronyms, the references, and the appendices, the topic is structured into 4 chapters as follows:
CHAPTER 2: THEORETICAL BASIS

2.1 Data mining
2.1.1 Knowledge discovery and data mining
Knowledge Discovery and Data Mining is a rapidly growing academic field that combines database (DTB) management, statistics, and machine learning (ML) with the ultimate aim of extracting useful knowledge from big data sets.
• Knowledge discovery is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• Data mining is a step in the knowledge discovery process. This technique allows us to derive the key knowledge.
The knowledge discovery process consists of several basic steps as follows:
• Data selection: aggregate the data to be mined from a DTB, data warehouse, or web source into a single DTB, and from there select the data needed for the following steps. In this collection step there are often obstacles, because the data is located in many places and in different forms.
• Data preprocessing: when aggregating data, it is easy to encounter errors such as incomplete, noisy, or inconsistent data caused by inconsistencies between DTBs. This step therefore helps to avoid misleading results in data mining.
• Data transformation: after being preprocessed, the data is converted into a form convenient for data mining.
• Data mining: apply techniques to uncover the key knowledge hidden in the data.
• Evaluation of the resulting patterns: define evaluation criteria for finding the necessary knowledge, because not all patterns are useful or meaningful, and some may even be wrong.
Knowledge discovery is the whole process of extracting knowledge from DTBs, in which data mining is the key stage. Data mining is done after the data has been filtered and preprocessed, that is, it extracts meaningful patterns from an appropriately prepared dataset.
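To make the selection, preprocessing, and transformation steps concrete, the following is a minimal illustrative sketch in Python with pandas; the column names and values are hypothetical, not data from this study:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data aggregated from several sources (data selection step)
df = pd.DataFrame({
    "ROA":  [0.05, 0.07, np.nan, 0.05, 0.05],
    "SIZE": [12.1, 13.4, 12.8, 12.1, np.nan],
})

# Data preprocessing: remove duplicate rows, fill missing values with the mean
df = df.drop_duplicates()
df = df.fillna(df.mean(numeric_only=True))

# Data transformation: min-max scale every column to [0, 1] for the mining step
df = (df - df.min()) / (df.max() - df.min())
print(df)
```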
[Figure: process stages including Transformation, Data mining, Interpretation/Evaluation]
Figure 1: Process of data mining
Source: Phantuanduy (2013)

2.1.2 Data mining definition
Data mining refers to the process of analyzing data and using specialized techniques to find characteristic patterns in an extremely large data set. Among professionals, data mining has many different definitions:
• Ferrurra's definition: “Data mining is a set of methods used in the knowledge discovery process to find out the distinct relationships and unknown patterns within data.”
• Parsaye's definition: “Data mining is a decision aid process in which we look for unexpected and unknown patterns of information in large DTBs.”
• Mitchell's definition: “Data mining is the use of existing data to discover rules and make informed decisions.”
• Berry & Linoff's definition: “Data mining is the automatic discovery and analysis of large amounts of data to discover patterns and rules.”
In a nutshell, data mining refers to the process of using data analysis tools and techniques to find patterns from multiple angles, with the aim of discovering associations between the data and objects inside the DTB.
2.1.3 The process of data mining

[Figure: process stages (translated from Vietnamese): identify the task, identify the relevant data, collect and preprocess the data, run the data mining algorithm]
Figure 2: Process of data mining
Source: MS Tran Hùng Cường, MS Ngô Đức Vinh (2011)

The data mining process starts by correctly identifying the problem being addressed and then studying the relevant data that will be used to build the solution. Next, the necessary data is carefully collected and preprocessed into a form that the data mining algorithm can understand. Although it consists of only a few steps, this is not a simple process: some difficulties may arise, such as having to repeat the whole process whenever the model requires the data to be modified, which is time consuming, or having to make multiple copies of the extracted data into files.
Performing data mining is the next step: after choosing the appropriate algorithm, it is applied to find meaningful patterns represented in the corresponding forms.
A pattern is characterized as having to be new (at least for that system). Novelty is usually evaluated through a logic or novelty function and is measured relative to changes in the data (by comparing the found values with desired or previous values) or in knowledge (the relationship between the old method and the new one). In addition, after the patterns are processed, the resulting output must be evaluated through a utility function that measures its potential usefulness.
2.1.4 Data mining techniques

2.1.4.1 Mining of frequent sets and association rules
This technique is intended to determine the relationships between different variables in the database and is used to uncover latent patterns in the data. An association rule X → Y reflects the occurrence of the set Y whenever the set X occurs.
This technique is very commonly used by businesses to analyze shopping behavior, predict trends from the shopping carts of potential customers, and predict consumer behavior in the retail industry, as well as in information technology, notably in machine learning programs.
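As an illustration of frequent sets and the support/confidence of a rule X → Y, the following is a minimal brute-force sketch in Python; the transactions are hypothetical, and a real system would use a pruning algorithm such as Apriori:

```python
from itertools import combinations

# Hypothetical toy transactions (shopping baskets)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

min_support = 0.5  # an itemset must appear in at least 50% of transactions

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate frequent itemsets of size 1 and 2 (brute force; Apriori prunes this)
items = sorted(set().union(*transactions))
frequent = [frozenset(c) for k in (1, 2)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_support]

# Derive rules X -> Y with confidence = support(X ∪ Y) / support(X)
for itemset in (s for s in frequent if len(s) == 2):
    for x in itemset:
        X, Y = frozenset([x]), itemset - {x}
        conf = support(itemset) / support(X)
        print(f"{set(X)} -> {set(Y)}: support={support(itemset):.2f}, confidence={conf:.2f}")
```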
2.1.4.2 Data classification

• The process of classifying a data object into one or more given classes (types) by means of a classification model. This model is built from a previously labeled data set (each record marked with the class it belongs to).
• This technique is used to extract the necessary information from the available data warehouse; therefore, different algorithms are applied depending on the purpose of use.
• This technique also plays an important role in predicting rules and trends by describing the attributes of the object being classified into a specific class.
2.1.4.3 Data clustering

This is the process of clustering/grouping objects/data with similar characteristics into corresponding clusters/groups: objects with similar properties are placed in the same cluster and vice versa. The data used by this technique is unlabeled and is commonly found in practice.
In business, this technique is often applied to manage customer profiles or to segment customers in the field of marketing.
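A minimal clustering sketch, assuming scikit-learn's KMeans and hypothetical customer data (not from this study), to illustrate grouping unlabeled objects by similarity:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical unlabeled customer data: [annual spend, visits per month]
X = np.array([[200, 2], [220, 3], [800, 10], [790, 12], [450, 6], [470, 5]])

# Group the customers into 3 segments with similar behaviour
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each customer
print(kmeans.cluster_centers_)  # centroid (typical profile) of each segment
```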
2.1.5 Applications of data mining

Although DTM still has many limitations that need to be improved, its current potential cannot be denied. This technique attracts the attention of most researchers because of its diverse applications in many different fields, such as:
• Finance and banking: building models to forecast credit and debt risks and supporting decision making when investing in securities. This is also the main research direction of this paper.
• E-commerce: analyzing the shopping attitudes of customers and relying on each type of customer to devise a suitable marketing plan.
• Medicine: detecting disease-cure relationships to find new drugs, and using risk factors to predict the type of disease a patient may have.
• Biology: supporting the collection, storage, and analysis of data on genetics and research on diseases and nutrients, through visualization with tables and graphs.
• Education: analyzing data in the educational environment to determine each student's learning situation and forecast future learning outcomes in order to find appropriate teaching methods.
2.2 Data classification

2.2.1 The data classification process

The data classification process consists of two main steps:
• Step 1: Model building (the “learning” or “training”) phase
The training process aims to build a model describing an existing data set. The input to this process is a preprocessed and labeled sample data set: each data element is assumed to belong to a predefined class, the class being the value of an attribute selected as the labeling (classification) attribute. Each tuple is collectively referred to as a data element, and may also be called a sample, example, object, or instance. The result of this step is the trained classification model.
[Figure 3: Building the classification model (example rule: Car Type = Sports → Risk = High)]
Source: Nguyễn Thị Tùy Linh (2005)
• Step 2: Using the model, divided into 2 sub-steps:
+ Step 2.1: Evaluate the model (check the correctness of the model)
The input is a sample data set selected randomly and independently of the samples in the training dataset, also labeled and preprocessed. However, the labeling attribute is "ignored" when the data is fed into the classification model.
By comparing the labeling attribute of the input data with the classification results from the model, it is easy to determine the correctness of the model. Holdout is a simple technique for estimating this correctness, based on the percentage of samples in the test data set that are correctly classified by the model (relative to reality). If the model is suitable and highly accurate, the result of this step is that the model can be used to classify future data, or data for which the value of the classification attribute is unknown.
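The holdout estimate described above can be sketched as follows, assuming scikit-learn and randomly generated placeholder data (not the thesis data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical preprocessed features and binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

# Holdout: keep 30% of the labeled data aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# The known test labels are "ignored" during prediction,
# then compared with the model's output to estimate correctness
y_pred = model.predict(X_test)
print("Holdout accuracy:", accuracy_score(y_test, y_pred))
```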
Figure 4: Evaluate the accuracy of the classification model
Source: Nguyễn Thị Tùy Linh (2005)

+ Step 2.2: Classify new data
In this step, the input data is data whose class attribute (label) is "missing" and must be predicted. The model automatically classifies (labels) these data objects based on what was learned in Step 1.
Source: Nguyễn Thị Tùy Linh (2005)

2.2.2 Types of classification problems
The task of the classification problem is to classify data objects into n given classes: the classifier is binary if n = 2 and multiclass if n > 2.
The problem is single-label classification if each data object belongs to only one class, and multi-label classification if it can belong to several different classes.
2.2.3 Data classification algorithms used in this study
2.2.3.1 Decision tree

Decision trees are defined in several ways, depending on the aspect considered:
• In the field of data analysis, a decision tree is considered a perfect combination of two aspects, mathematical and computational techniques, to support the description, classification, and generalization of the input data set. The decision tree describes a tree structure in which the leaves represent the classes and the branches represent the combinations of attributes that lead to each classification.
• A data set can be represented by many corresponding decision trees; ultimately, the shortest tree is selected (following the principle of Ockham's Razor).
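A minimal decision tree sketch, assuming scikit-learn and its bundled iris dataset (not the data of this study), showing branches as attribute tests and leaves as classes:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow decision tree on the classic iris dataset
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Branches encode attribute tests; leaves hold the predicted class
print(export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
```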
[Figure: decision tree expression]

2.2.3.2 Support Vector Machine
Support Vector Machine (SVM) is a set of supervised algorithms that receive data and treat each data point as a vector in space. By building a hyperplane in multi-dimensional space as the boundary between data classes, the algorithm classifies the data into two different classes. For the most accurate classification results, the hyperplane must lie as far as possible from the data points of all classes (the margin), because the larger the margin, the smaller the generalization error of the classification technique.
SVM is thus fundamentally a binary classification algorithm: it builds a model that classifies forecast data into the two classes of the training dataset. Currently, SVM has a variety of variants to suit different classification problems and can also be used for regression or other purposes.
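A minimal SVM sketch, assuming scikit-learn and synthetic two-class data (illustrative only, not the thesis data):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for the financial features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# A linear SVM searches for the maximum-margin hyperplane between the two classes
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.predict(X[:5]))  # predicted class for the first five samples
```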
Source: Ong Xuan Hong (2015)

2.2.3.3 Neural Network
A neural network uses a series of complex algorithms to identify and process information and to find the underlying relationships in data sets. It connects simple nodes into a network, much like neurons in the brain.
Whatever the fluctuations of the input data, a neural network can adapt and give the most accurate results while keeping the output criteria.
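A minimal neural network sketch, assuming scikit-learn's MLPClassifier and synthetic data (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic placeholder data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# A small feed-forward network: simple nodes ("neurons") in two hidden layers
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0).fit(X, y)
print(mlp.score(X, y))  # training accuracy
```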
2.2.3.4 Logistic Regression

Logistic regression is a statistical method in which discrete output values are predicted from an input data file. It uses the logistic function, presented in vector form, to predict outcomes or probabilities and thereby infer relationships between the dependent and independent variables.
Logistic regression comes in three different forms:
• Binary: the dependent variable has only two possible outcomes/classes.
• Multinomial: the dependent variable has three or more possible outcomes/classes with no natural order.
• Ordinal: the dependent variable has three or more possible outcomes/classes in a specific order.
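A minimal binary logistic regression sketch, assuming scikit-learn and synthetic data (illustrative only, not from the cited source):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data with a binary dependent variable
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Binary logistic regression: outputs the probability of each class
logit = LogisticRegression().fit(X, y)
print(logit.predict_proba(X[:3]))  # [P(class 0), P(class 1)] per sample
```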
Source: Analytics Vidhya

2.2.4 Methods to evaluate classification models

2.2.4.1 Confusion matrix
The confusion matrix indicates, for a particular class, how many samples actually belong to it and which classes the samples are predicted to fall into. The matrix has size k × k (where k is the number of classes in the data). This is one of the most widely used performance measurement techniques, especially for classification models.
Source: Sang Hà Ngọc (2021)

Assume class A is the positive class and class B is the negative class. The key terms of the confusion matrix are as follows:
• True Positive (TP): a positive sample is predicted as positive.
• False Positive (FP): a negative sample is predicted as positive.
• False Negative (FN): a positive sample is predicted as negative.
• True Negative (TN): a negative sample is predicted as negative.
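These four quantities can be read directly off a confusion matrix; a minimal sketch, assuming scikit-learn and hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels (A = positive = 1, B = negative = 0) and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows = actual class, columns = predicted class (k x k, here k = 2)
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()
print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```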
2.2.4.2 Accuracy

Accuracy can be understood as the ratio between the number of correctly predicted samples and the total number of samples in the data set. However, accuracy has a few downsides: it does not indicate how well the classifier performs on each individual class, for example:
• To which class do most of the correct classifications belong?
• Which class is most often misclassified into another class?
However, accuracy can still help us evaluate the predictive performance of the model on a data set: the higher the accuracy, the more reliable the model.
2.2.4.3 Precision, Recall, and F1-score

Precision indicates the ratio of true positives (TP) among all samples classified as positive (TP + FP).
Recall (coverage) is the ratio between the number of true positives (TP) and the number of samples that are actually positive (TP + FN).
F1-score is the harmonic mean of the two measures Precision and Recall.
F1 takes a value between the Precision and Recall values, and is large only if both Precision and Recall are large, indicating higher model reliability.
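Using the TP/FP/FN/TN notation above, the four measures can be written as (standard definitions):

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \qquad
\text{Recall}     = \frac{TP}{TP + FN}, \\
F_1              &= 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
```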
2.2.4.4 Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC)
The receiver operating characteristic (ROC) is a widely used graph for evaluating binary classification models. The curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. A model is effective when its ROC curve approaches the point (0, 1), that is, when it has high TPR and low FPR; the closer it gets, the more suitable the model.
[Figure 11: example ROC curve (area = 0.85); axes: True Positive Rate vs. False Positive Rate]
Figure 11: Receiver Operating Characteristic (ROC)
Area Under the Curve (AUC) is the area under the ROC curve; it takes a positive value of at most 1, and the larger the AUC, the better the classifier.
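A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn and synthetic data (the AUC obtained here will differ from the 0.85 shown in the figure):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic placeholder data and a simple classifier
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Predicted probabilities are thresholded at many levels to trace the ROC curve
scores = model.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC =", roc_auc_score(y, scores))  # area under the ROC curve, in [0, 1]
```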
[7] Analytics Vidhya. Understanding Logistic Regression.
[8] Sang Hà Ngọc (2021). Confusion Matrix (Ma trận nhầm lẫn / Ma trận lỗi). Accessed 19/08/2021.
[9] ResearchGate. COVID MTNet: COVID-19 Detection with Multi-Task Deep Learning Approaches.
[10] Joakim Warholm (2021). Detecting Unhealthy Comments in Norwegian using BERT. Faculty of Science and Technology, Department of Physics and Technology.
[11]
[12]
[13] Lecture slides provided by the instructor.
[Appendix: prediction result tables with columns Neural Network, P/E, ROA, ROE, SIZE, AUD]
Source: Captured from Orange software