
Scientific research report: Investigating some ensemble methods in machine learning and application in student performance prediction


DOCUMENT INFORMATION

Basic information

Title: Investigating Some Ensemble Methods In Machine Learning And Application In Student Performance Prediction
Author: Nguyễn Thị Thanh Bình
Supervisors: Dr. Nguyen Doan Dong, Dr. Tran Duc Quynh
Institution: Vietnam National University, Hanoi
Major: Business Data Analytics
Document type: Student Research Report
Year: 2024
City: Hanoi
Pages: 56
File size: 1.4 MB

Structure

  • 1. INTRODUCTION
  • 2. LITERATURE REVIEW
    • 2.1 Learning Analytics
    • 2.2 Machine Learning and Deep Learning models
    • 2.3 Ensemble Algorithms
  • 3. RESEARCH METHODOLOGY
    • 3.1. System Model
    • 3.2. Data Collection and data description
    • 3.3. Data Description
    • 3.4. Model Construction
    • 3.5. Model Evaluation
  • 4. EXPERIMENTS
    • 4.1. Data Preprocessing
    • 4.2. Description Analysis
    • 4.3. Description of Experimental Protocol
  • 5. DISCUSSION & RESULT
    • 5.1. Result Analysis
    • 5.2. Discussion
  • 6. CONCLUSION & RECOMMENDATIONS
    • 6.1 Conclusion
    • 6.2 Recommendations
  • 7. APPENDIX
    • 7.1 Linear Regression
    • 7.2 Logistic Regression
    • 7.3 Support Vector Machine Regression
    • 7.4 Random Forest
    • 7.5 AdaBoost
    • 7.6 Gradient Boosting
    • 7.7 XGBoost
    • 7.8 Ensemble Model
  • 8. ABBREVIATION
  • 9. REFERENCES

Contents


INTRODUCTION

The International School, part of Vietnam National University, Hanoi, offers English-language training programs for all majors. The International School is at the forefront of developing new subjects in information technology and data. In addition, the institution provides dual degree programs, allowing students to engage in a vibrant international environment. In recent years, the International School has admitted a high number of students in each enrollment period, resulting in a rapid expansion of the school's teaching scale, which has created both opportunities and challenges. In response, the International School has implemented modern support resources, such as LMS online learning software piloted on the preparatory program, as well as tools to analyze the current training program's strengths and weaknesses. The use of information technology is greatly benefiting the education sector, and educational data analysis provides a wealth of practical evidence to help universities resolve a variety of critical issues. Applying it to teaching and learning data will give schools a solid platform for continuing to care for students and for welcoming future cohorts more mindfully. Student characteristics at the International School vary greatly. To analyze its students effectively, the school must begin the long process of developing such a system immediately.

To enhance education, it is crucial to evaluate the current challenges students face regarding tools (internet access and student data). By identifying the root causes of students dropping out early and proposing situational adjustments, we can improve the educational environment at the International School and in the national education system. This analysis provides valuable insights that help the International School adapt its curriculum to evolving generational needs, ensuring its relevance and effectiveness.

The rising volume of digital data in education has created a strong demand for automated analytics to enhance teaching and learning processes. Learning analytics, the analysis of data from students and learning environments to promote learning, is becoming increasingly significant in education [1]. The assessment of educational information is an age-old practice, yet advancements in computing capabilities, educational tools, and data collection techniques have now paved the path for novel methods to thoroughly analyze extensive amounts of data pertaining to the educational sphere [2]. The use of learning analytics can improve learning outcomes and contribute to educational development, maximizing students' cognitive and non-cognitive education outcomes [1]. Dynamic analyses of student transcript data can highlight learning trends and behaviors and provide predictions. The goal is to provide timely information to educational stakeholders, including teachers, students, curriculum designers, and educational administrators, to support better decision-making [3, 4]. For example, LA technologies can provide dashboards that give instructors and advisers at-a-glance information on students' actions and progress in a course, as well as forecasts of student achievement based on computed probabilities of success or failure in a course or program. These can also present students with messages urging stronger study behaviors and alert teachers and advisors to student behaviors [5]. Furthermore, LA may help curriculum designers and educational professionals develop pedagogical foundations and syllabuses, and modify course materials based on students' needs and the learning environment [6].

The demand for applying machine learning (ML), deep learning (DL) models, and ensemble algorithms in Learning Analytics (LA) is growing due to their ability to analyze large amounts of data and make accurate predictions. According to one study, tree-based methods dominate in both accuracy and uncertainty handling in retail demand prediction. Ensemble algorithms, such as Random Forest, have been found to be more accurate than individual algorithms in predicting student performance in LA. Machine learning models such as Random Forest, XGBoost, and LightGBM have been found effective in predicting student performance in LA [7]. These models can handle large amounts of data and make accurate predictions, making them valuable tools in LA. The aim of this research is to find out whether there are pertinent data trends that can be used to forecast a student's likelihood of continuing in the program at the International School or of dropping out completely. This study specifically examines educational data from information systems to find useful and efficient KPIs that support complicated decision-making processes by creating precise prediction models, and to determine with accuracy the factors that may contribute to a student's dropout status in order to recommend best practices for domestic education in general and international schools in particular. The problem is divided into two phases in this study. In the first phase, identifying pertinent aspects as model variables facilitates better judgments that impact student learning. In this context, the following research questions (RQs) are posed:

• RQ1: Which machine learning model most accurately predicts whether a student who enrolls in the preparatory English program will continue their major at the International School or drop out after a while?

• RQ2: Which factors are associated with student attrition?

• RQ3: Do the requirement of a B2 English certificate and instruction entirely in English affect students' decision to continue or stop their studies?

• RQ4: How well do International School students adjust to a university environment where English is the sole language of instruction?

This research employs ensemble-based and classification methods to predict these outcomes and to achieve the following objectives:

- Helping universities and educational administrators find ways to reduce student dropout.

- Predicting student behavior and identifying early the personal factors that lead students to continue or drop out of school, so that timely intervention, support, and advice can be provided.

- Optimizing International School resource management, such as allocating teachers and advising on support programs, based on the assessment of students' risk of dropping out.

- Improving the effectiveness of the education system in general, and of the International School in particular, by reducing attrition and increasing school completion.

In this study, using data from numerous sources linked to various research topics, the accuracy and efficiency of individual classifiers of various sorts, as well as ensemble classifiers, are studied and compared experimentally. Overall, this research tackles the central query: what set of factors best predicts student performance? It also deepens our knowledge of how learner data can be utilized with integrated methodologies to forecast student learning outcomes, and of how consistently voting, boosting, bagging, and stacking procedures can be applied.

The research is organized as follows. Section 2 covers the literature review, Section 3 outlines the research methodology, Section 4 describes the experiments, Section 5 presents the results and discussion, and Section 6 offers the conclusions and recommendations.

LITERATURE REVIEW

Learning Analytics

Learning analytics (LA) is a field of research and practice that involves the measurement, collection, analysis, and reporting of data about learners and their environments with the objective of understanding and optimizing learning processes [8, 9]. It is influenced by and closely related to several other academic disciplines, such as action analytics, web analytics, business analytics, academic analytics, and educational data mining [10]. It utilizes computational analysis of learning process data to gain insights that can lead to actions aimed at improving learning outcomes. Learning analytics is a multidisciplinary field that sits at the convergence of learning, analytics, and human-centered design, incorporating elements from educational research, learning sciences, data sciences, artificial intelligence, and more [8]. Additionally, it focuses on developing systems that can decrease the time lag between data collection and usage by continuously gathering, reporting, analyzing, and acting upon data in order to modify content, support levels, and other personalized services [10].

Key uses of learning analytics include predicting student academic success, identifying at-risk students, supporting the development of lifelong learning skills, providing personalized feedback to students, fostering important skills like collaboration and critical thinking, enhancing student awareness through self-reflection, and supporting quality learning and teaching by offering empirical evidence on pedagogical innovations.

Learning analytics encompasses diverse methodologies to analyze data and enhance learning experiences Descriptive analytics provides insights into historical data, while diagnostic analytics aims to identify the underlying causes of events Predictive analytics leverages data to forecast future outcomes, enabling educators to anticipate trends and make data-driven decisions Prescriptive analytics goes a step further by recommending specific actions to optimize results, guiding educators in tailoring interventions and improving student learning.

Learning analytics plays a crucial role in providing researchers, educators, instructional designers, and institutional leaders with tools to study teaching and learning, offer timely feedback, gain new insights, and improve the learning process through data-informed decision-making [8, 9].

Machine Learning and Deep Learning models

Machine learning leverages AI to enable computers to synthesize data for predictions, while deep learning employs neural networks to solve complex problems In education, machine learning analyzes student data for personalized feedback and adaptive learning experiences Deep learning, on the other hand, enables content analytics, customized learning paths, and adaptive learning strategies By breaking down problems into parts, machine learning differs from deep learning's end-to-end approach through neural network layers Deep learning techniques in learning analytics facilitate the development of tailored learning plans, adaptive strategies, and the analysis of teaching-learning gaps, ultimately enhancing educational outcomes.

The emphasis on utilizing students' cognitive ability, recorded activities, and demographic attributes to forecast their success and identify at-risk students early on has been a significant focus in educational research. Various machine learning classification techniques are employed for early prediction, given the binary nature of the classification task. These techniques include Bayesian classifiers, decision tree (DT) algorithms, artificial neural networks (ANN), logistic regression (LR), k-nearest neighbor (kNN), support vector machine (SVM), extreme gradient boosting (XGB), adaptive boosting, random forest (RF), semi-supervised learning, interpretable classification rule mining algorithms, genetic programming, evolutionary algorithms, multi-view learning, multi-objective optimization, ensemble models, and deep learning models. The choice of model depends on parameters such as data size, data type, and pedagogy, reflecting the complexity and diversity of educational data analysis [15-17]. Numerous research studies use machine learning models that draw on a range of factors to forecast students' academic performance. For instance, one study predicted students' academic performance and study strategies based on their motivation, using attributes such as intrinsic and extrinsic motivation, autonomy, relatedness, competence, and self-esteem. The study used five machine learning models, including Decision Tree, K-Nearest Neighbour, Random Forest, Linear/Logistic Regression, and Support Vector Machine, and found that tree-based models such as the random forest (with a prediction accuracy of 94.9%) and the decision tree showed the best results [18]. Another study used machine learning models to predict student academic performance in higher education, using data from learning management systems. The study used various supervised learning algorithms, including logistic regression, support vector machine, and conditional random fields, and found that the random forest algorithm achieved the highest accuracy of 88.3% [19].

Utilizing several different machine learning models, such as Decision Tree, K-Nearest Neighbour, Random Forest, Linear/Logistic Regression, and Support Vector Machine, can improve the accuracy of predictions for students' academic performance These models are evaluated using accuracy, precision, error rate, and other metrics to ensure their effectiveness in making predictions.

Ensemble Algorithms

Ensemble algorithms are a class of machine learning techniques that combine the predictions of multiple models to produce a more accurate and robust prediction [20]. They can also reduce the risk of overfitting and underfitting by balancing the trade-off between bias and variance and by using different subsets and features of the data. Furthermore, they can handle different types of data and tasks, such as classification, regression, clustering, and anomaly detection, by using different types of base models and aggregation methods. They also provide more confidence and reliability by measuring the diversity and agreement of the base models, and by providing confidence intervals and error estimates for the predictions [21-23].

Ensemble algorithms are used in learning analytics to improve the accuracy and performance of predictive models in various applications, such as student performance prediction, web usage mining, and individualized treatment effect estimation [24, 25]. For instance, in student performance prediction, ensemble learning has been used to predict the final marks of students in Moodle courses, combining the predictions of multiple algorithms such as Support Vector Machine (SVM), Random Forest, AdaBoost, and Logistic Regression. In that study, the ensemble model achieved higher accuracy than the individual models, demonstrating the benefits of ensemble learning in learning analytics [24].
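The voting-ensemble idea described above can be sketched with scikit-learn. The synthetic dataset, chosen estimators, and parameters below are illustrative assumptions, not the cited study's actual setup:

```python
# Illustrative voting ensemble: combine SVM, Random Forest, and
# Logistic Regression predictions on synthetic classification data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities across the base models
)
ensemble.fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_te, y_te))
```

With `voting="hard"`, the ensemble would instead take a majority vote over the base models' class labels.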

While most existing research focuses on machine learning ensemble learning, one study [26] offers a student academic performance prediction system that employs an ensemble of machine learning and deep learning models. The suggested method employs long short-term memory (LSTM), Random Forest (RF), and Gradient Boosting (GB). The OULAD dataset and a self-formulated dataset are used in the experiments. In comparison with other deep learning and current models, the suggested approach outperforms them with 96% accuracy.

Ensemble algorithms, by combining predictions from multiple models, enhance predictive accuracy and model performance This technique reduces errors and improves robustness, proving effective in diverse applications, particularly in predicting student performance Ensemble learning has demonstrated its ability to elevate the quality of predictive models, making it a valuable tool in the field of learning analytics.

RESEARCH METHODOLOGY

System Model

In the first stage, we process the raw (collected) data into clean data. We remove the rows that do not affect the output labels and use one-hot encoding to convert columns containing categorical values into numeric data suitable for model training. The dataset includes students studying the preparatory English program from Intake 3 to Intake 5, with labels "studying" = 0 and "dropout" = 1. Next, we apply a scaling technique to standardize the data so that variables share the same value range. The number of students dropping out is 26/740 records (only 3.51%), far fewer than the 714/740 records (96.49%) of students currently studying, so our dataset is seriously imbalanced. We therefore applied SMOTE to balance the classes in the training dataset before applying machine learning algorithms; this step plays an important role in improving the accuracy of the model. We then applied supervised learning models to test and predict whether a student will continue studying to complete the preparatory English program or will drop out. The test set is used to evaluate the accuracy and efficiency of the models. We also use 10 x 10-fold cross-validation to optimize the algorithms: the dataset is randomly divided into 10 equal parts, and in turn nine of them are used as the training set while the remaining one serves as the test set. This method helps avoid overfitting and underfitting, because a model evaluated on data it was trained on can appear to perform well while failing to predict new data correctly.
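The SMOTE balancing step can be illustrated with a from-scratch toy version of the interpolation idea (production work would normally use a library implementation such as imbalanced-learn; the feature values below are synthetic stand-ins, not the real student records):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Toy SMOTE: create n_new synthetic minority samples by
    interpolating between a random minority sample and one of
    its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to sample i
        neighbours = np.argsort(d)[1:k + 1]           # k nearest, skip itself
        j = rng.choice(neighbours)
        gap = rng.random()                            # interpolation factor
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synth)

# 26 dropout records, as in the dataset described above (toy features)
rng = np.random.default_rng(1)
dropouts = rng.normal(size=(26, 4))
new_samples = smote_oversample(dropouts, n_new=100)
print(new_samples.shape)  # (100, 4)
```

Because each synthetic point lies on the segment between two real minority samples, the oversampled class stays within the region of the original minority data.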

Data Collection and data description

This study was carried out at the International School - Vietnam National University, Hanoi. It employs quantitative primary data from the university information system to collect all of the variables required for analysis, analyze the relationships between variables in each dataset, and show the dataset's characteristics.

For two consecutive academic years, student enrollment data from the International School was collected for a total of 2017 records, focusing on five categories: student demographics (course number, date of birth, gender, major, place of birth), academic achievements from grade 12 (academic and behavioral results, national high school graduation exam scores), English placement test scores from the International School, pre-English course information, and enrollment details.

Data Description

The International School - Vietnam National University, Hanoi (VNU) offers a total of 12 training majors, with degrees awarded by Vietnam National University itself, by a foreign university, or jointly with a foreign university:

1 Informatics and Computer Engineering (ICE)

3 Accounting, Analyzing and Auditing (AC)

8 Hospitality, Sport and Tourism Management (TROY)

If students do not have a certificate at level B2, they must participate in the school's preparatory training program to improve their English to the required level. The English entrance test covers three basic skills: listening, reading, and writing. This test is held to assign students to classes of different levels: F (foundation), 1, 2, 3, 4, and 5. The pre-English course is described in the figure below:

Figure 1: VNUIS pre-English course

The problem is defined as "predicting student learning outcomes": using data on students' initial English ability and information from the English preparatory courses to predict performance. Effectively addressing this issue requires analyzing students' learning ability and study-time outcomes in the preparatory English course, whether students without an English certificate can pass the B2 English certificate exam, and whether they would drop out of school or continue to pursue their main major, which is taught in English. To enable necessary modifications, the school must be aware of the course's actual impact on students' English proficiency.

For course 19, the International School still applied the standard preparatory English program framework. However, for course 20, due to the COVID-19 outbreak, the International School changed its English entrance assessment from a test of the three skills of listening, writing, and reading to an interview test.

All data is collected in detail and compiled into a complete dataset. The dataset consists of 2017 rows (corresponding to 2017 student records) and 46 columns, described below:

No. | Data feature | Data description | Data type
1 | Date of birth | Including day, month and year |
4 | Major | ICE, MIS, AC, IB, BDA, AIT, BEL, TROY, UEL, Keuka, HELP, Duo-Keuka |
5 | Course number | Including course number |
13 | Status | Enrolled, dropped out of school (decided or undecided), or reserved |
15 | Writing score /30 | Placement test score; /30 is the maximum score of the skill |
18 | Placement test score / Interview score | The overall score on a scale of 10 |
| Intake level, class arrangement and final score of Intakes 3-5 | Intake level and intake class; intake score | Categorical; Numerical
| Enrollment information | Code of subject group for university admission |
35 | Order of admission | Position of the International School in the student's university wish list |
36 | Admission score | Sum of the 3 subject scores | Numerical
37 | Subject 1 | First subject in subject group |
39 | Subject 2 | Second subject in subject group |
41 | Subject 3 | Third subject in subject group |
43 | Admission method | Student admission method: direct entry or university exam |
44 | Conduct | Student attitude at school: good, normal or bad |
45 | Performance | Learning performance at high school: good, normal or bad |
| GPA | The average score of all subjects in grade 12 |

Model Construction

In this section, we applied supervised classification algorithms to train the models so that they can learn the mapping from input features to output labels. The test set is used to evaluate the prediction performance of models trained with completely different algorithms. To improve the accuracy of this section, we used ten-fold cross-validation to train and evaluate our models. We tested the classification algorithms described below:
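The ten-fold cross-validation procedure described above can be sketched as follows; the synthetic dataset and the choice of logistic regression as the demonstration model are assumptions for illustration only:

```python
# Ten-fold cross-validation: train on 9 parts, test on the held-out
# part, and rotate so each part serves once as the test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Repeating this whole procedure ten times with different random splits gives the "10 x 10-fold" scheme mentioned in the methodology.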

Decision trees are one of the algorithms we use in this project. A decision tree is a structural, rule-based algorithm inspired by trees that represents features in a tree form. Each node on the decision tree represents a feature, each connection between nodes represents a decision rule, and the output is represented by the leaf nodes. A decision tree can also be seen as a flow chart in which the flow starts from the root node and ends with predictions made at the leaves. It is a decision support tool, and can be viewed as a tree diagram that displays future predictions derived from a series of splits on multiple features.

The random forest classifier is also known as a decision tree forest. The random forest algorithm is a classification algorithm that reduces the risk of overfitting. Random forests work by embedding randomness across multiple decision trees: the algorithm draws bootstrap samples with replacement and splits the nodes of each decision tree according to the best division found within a random subset of features. The random forest classifier was first introduced in 2001 by Leo Breiman and Adele Cutler. The algorithm has since been considered one of the most efficient, and it requires little data preprocessing effort.
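A minimal sketch of the random forest ideas above (bootstrap sampling, random feature subsets at each split), using scikit-learn with illustrative parameters and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped decision trees
    max_features="sqrt",  # random subset of features tried at each split
    random_state=0,
)
rf.fit(X, y)

# Impurity-based feature importances; they sum to 1 across features.
print(rf.feature_importances_)
```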

The extreme gradient boosting algorithm (XGBoost) has a similar form to the gradient boosting algorithm. However, XGBoost works more efficiently and faster, because the algorithm supports linear models as well as tree models and can perform parallel computations on a single machine.

Trees built using the XGBoost algorithm are constructed in series; each new tree performs a gradient descent step to minimize the loss function. Unlike the Random Forest algorithm, which grows independent trees, XGBoost's parallelism lies within each tree: the statistics contained in each column can be computed in parallel.

Variable importance in XGBoost is calculated in the same way as for random forests: by computing and averaging how much each variable reduces the impurity of the trees at each split.

SVM (Support Vector Machine) is an effective data classification model, widely used in image processing, text classification, and opinion analysis. SVM focuses on finding a hyperplane that separates the dataset into two separate parts. The effectiveness of the separation can be improved through spatial transformations using kernel functions, which make the method more flexible.

The SVM model has some advantages and disadvantages:

• SVM is a model with a simple idea, capable of being explained mathematically, and the generality of the model can be established very conveniently

• The generality of the model can be applied in practice to complex and high-dimensional data

• The SVM model is based on an optimization problem solved with Quadratic Programming; in general, this is a very computationally expensive optimization problem

• When the feature vector x_n has many dimensions, or the number of data points N is large, training also takes a lot of time

The SVM model is first posed for the simplest case of classifying two data classes. Given an initial dataset consisting of two classes that are clearly separated from each other, we should choose the boundary with the highest generalizability. The author of the SVM model, Vladimir Vapnik, argued that the best separating line is the one equidistant from the two data classes, i.e., the maximum-margin boundary.
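A hedged sketch of the SVM classifier described above, using scikit-learn's SVC with an RBF kernel on synthetic data (the kernel choice and parameters are illustrative, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space where a separating hyperplane can be found; scaling the
# features first keeps the kernel well-behaved.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```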

Logistic regression is a supervised learning algorithm used to predict binary outcomes from input features. The logistic regression algorithm is an extension of linear regression, but instead of predicting a continuous outcome, it predicts the probability of a binary outcome. The algorithm works by estimating the coefficients of the input features so as to maximize the likelihood of the observed outcomes given the input data. The predicted probability is then converted to a binary outcome using a threshold value, typically 0.5. Logistic regression is widely used in many applications, for example credit scoring, medical diagnosis, and fraud detection.
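The probability-plus-threshold mechanics can be sketched as follows (synthetic data; the 0.5 cut-off matches the typical default mentioned above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # estimated P(class = 1) per sample
labels = (proba >= 0.5).astype(int)  # apply the default 0.5 threshold

# predict() applies the same 0.5 cut-off internally
print((labels == clf.predict(X)).all())
```

Lowering the threshold below 0.5 trades precision for recall, which matters here because the dropout class is rare.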

Naive Bayes classifiers, based on Bayes' Theorem, assume that data point characteristics are independent of class variables Despite this often-incorrect assumption, Naive Bayes simplifies calculations, making it scalable with linear parameters for classification tasks, typically involving text Despite their simplicity, Naive Bayes classifiers have proven effective in various applications, utilizing probabilistic principles to model class distributions for classification.

Gradient Boosting is an ensemble algorithm that uses boosting to develop machine learning models. It is like AdaBoost, but has some key differences:

• Gradient Boosting builds trees that typically have 8-32 leaves, while AdaBoost builds stumps (single-split trees)

• Gradient Boosting views boosting as an optimization problem: it uses a loss function and tries to minimize the error

Gradient Boosting leverages decision trees to predict residuals, the difference between predicted and actual values Like AdaBoost, it initially constructs a tree to fit the data, prioritizing areas where existing learners exhibit poor performance By focusing on these areas, Gradient Boosting attempts to improve the overall accuracy of the model.

Gradient Boosting also has some great additional features, such as proportional shrinking of leaf nodes, Newton Boosting, and an additional randomization parameter to reduce correlation between trees

Gradient Boosting is widely used in machine learning problems with good results, and it can achieve high accuracy on both training and testing sets.
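A minimal sketch of the gradient boosting setup described above, with scikit-learn and illustrative hyperparameters (small trees, shrinkage via the learning rate):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,   # trees added sequentially, each fitting residuals
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    max_depth=3,        # small trees (at most 8 leaves at this depth)
    random_state=0,
)
gb.fit(X_tr, y_tr)
print("test accuracy:", gb.score(X_te, y_te))
```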

Model Evaluation

Model evaluation is the process of measuring an ML model's performance to determine which model best suits the given problem. It is essential to guarantee that a model operates correctly and optimally when used in a production setting. A model's performance can be assessed using holdout and cross-validation approaches. Once the approach has been decided, metrics must be defined to assess the model's performance.

While there are many evaluation criteria, the most widely used ones include F1-score, recall, accuracy, precision, mean absolute error, and root mean squared error. Testing a model with several metrics is a best practice for model evaluation, as it helps determine whether the model is appropriate for the problem it is intended to address [27].

Accuracy: the ratio of the number of correct predictions to the total number of predictions made. It is a common evaluation metric for classification problems.

Precision: the ratio of true positives (correctly predicted positive instances) to the total predicted positives. It measures the accuracy of a model when the cost of a false positive is high.

Recall: the ratio of true positives to the total number of actual positive instances. It measures the completeness of a model when the cost of a false negative is high.

F1 score: the harmonic mean of precision and recall, used to balance the trade-off between precision and recall.

Mean absolute error (MAE): a metric for regression problems that measures the average absolute difference between the predicted and actual values.

Root mean squared error (RMSE): a metric that measures the average difference between a model's predicted values and the actual values, revealing how tightly the observed data cluster around the predictions.

These formulas use counts from the confusion matrix, where TP stands for True Positive, TN for True Negative, FP for False Positive, and FN for False Negative:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 x Precision x Recall / (Precision + Recall)

- True positives occur when you anticipate that an observation belongs to a specific class and it does [28]

- True negatives occur when you anticipate that an observation will not belong to a class, and it does not [28]

- False positives happen when you anticipate that an observation belongs to a class when it actually does not [28]

- False negatives arise when you anticipate that an observation does not belong to a class, but it actually does [28].
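As a concrete illustration, the four classification metrics can be computed from a toy confusion matrix with scikit-learn. The labels below are invented for illustration only, not drawn from our dataset:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 0 = studying, 1 = dropped out
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# sklearn's confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```

For these toy labels the counts are TP = 3, TN = 3, FP = 1, FN = 1, so all four metrics evaluate to 0.75.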

EXPERIMENTS

Data Preprocessing

To focus on answering the four main research questions, we filtered the data, removing columns and records that do not contribute anything meaningful to the research:

• Columns with substantial missing data that cannot be replaced or supplemented, such as the English certificate exam date (1,330 null out of 1,863 records).

• Students who already hold a B2-level English certificate upon admission, because they are not meaningful for prediction purposes.

• All columns with notes and additional information that carry no meaning for our output, such as certificate date, notes, and certificate scores.

• The column recording when students receive certificate status, because we are interested only in whether they continue to study at the school, not in how long it takes them to obtain an English certificate.

• Personally identifiable information (PII) such as citizen identification numbers, phone numbers, and dates of birth, due to its uniqueness and lack of commonality across individuals.

• The class name within an intake (e.g., F01, 2_01), regardless of which class a student attends.

• Students who pass through only two intakes before advancing to a major (619 of 1,848 records), because these extraneous records would confound the information about students who do not have an English certificate but can still continue studying majors at the International School.

The total number of records after data selection is 740. Therefore, the new dataset consists of 740 rows (corresponding to 740 student records) and 19 columns, described below:

Table 2: Data description after selection (the table lists the 19 retained features with their category, feature name, and data type; e.g., feature 4, Place of birth, is categorical, under certificate information)

We proceed to encode the categorical variables, including Gender, Major, Place of birth, Intake max, Status, Combinatorial code, Academic ability, and Conduct. We use two main methods: Label Encoding and One Hot Encoding.

Label Encoding is a technique for converting categorical columns into numerical ones so that they can be fitted by machine learning models that only accept numerical input. It is a critical pre-processing step in any machine-learning effort. Label Encoding assigns a unique number (beginning with 0) to each class of data, which may cause priority concerns during model training: a label with a high value may be treated as more important than one with a lower value.

In this way, we labeled the Status variable: 0 if the student is studying and 1 if the student has dropped out completely.

One Hot Encoding is a technique for representing categorical variables as numerical values in a machine learning model. It converts categorical data into a format suitable for machine learning algorithms, which often require numerical input. One Hot Encoding is very effective for categorical variables that have no inherent order, as it avoids adding bias to the model. The technique produces a new column for each category in the variable, with a binary value of 1 or 0 indicating whether the category is present or absent for each observation.

We represent the remaining columns using the One Hot Encoder method
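A minimal sketch of both encodings with pandas; the column values below are invented and the real dataset's values differ:

```python
import pandas as pd

# Hypothetical fragment mimicking two of the categorical columns
df = pd.DataFrame({
    "Status": ["studying", "dropped", "studying"],
    "Gender": ["F", "M", "F"],
})

# Label encoding of the target, matching the convention in the text:
# 0 if the student is studying, 1 if the student has dropped out
df["Status"] = df["Status"].map({"studying": 0, "dropped": 1})

# One-hot encoding for an unordered categorical feature: one new
# binary 0/1 column per category
df = pd.get_dummies(df, columns=["Gender"], dtype=int)
```

After this step the frame has numeric columns Status, Gender_F, and Gender_M, ready to be fed to a model.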

- Because the data is entered manually, it is inconsistent: spelling habits vary with the person entering the data, so a single city may appear under many different spellings that carry the same information. In the birthplace column, for example, one may find Thai Binh, Thai Binh city, Thai Binh province, and Thai Binh hospital.

- Missing data, such as unknown values in the Intake 2 and Intake 3 score columns, are filled by averaging the Intake 2 or Intake 3 scores of students at the same level. This imputation is acceptable because only 3 to 5 records have missing values.
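The group-mean imputation described above can be sketched as follows; the column names and values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical intake-2 scores with a few gaps, filled with the mean
# score of students studying at the same level
df = pd.DataFrame({
    "Level": ["A2", "A2", "A2", "B1", "B1"],
    "Intake2": [6.0, 7.0, np.nan, 8.0, np.nan],
})
df["Intake2"] = df["Intake2"].fillna(
    df.groupby("Level")["Intake2"].transform("mean")
)
```

Here the missing A2 score becomes the A2 mean (6.5) and the missing B1 score becomes the B1 mean (8.0).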

After feature selection and data selection, the new dataset contains 740 rows and 19 columns

Data grouping (or data pooling) is an important part of data processing and analysis. We have too many classes, and the data in each class are imbalanced. Instead of deleting classes that contain only 1 or 2 values, grouping lets us summarize and understand the data, classify it into meaningful groups, reduce dimensionality, handle missing and noisy data, and support analysis and visualization. This improves data quality, ensuring consistency and easing data-driven analysis and decision-making.

So we applied data grouping to our dataset. Instead of the place-of-birth column having 40 different classes (38 provinces and cities of Vietnam plus 2 foreign countries), in this step we group the regions together, reducing from 40 classes to 4.

Figure 2: Data group of birth place

Places of birth before grouping included 38 provinces and cities of Vietnam and 2 foreign countries: Russia and Ukraine. Because the International School is located in the North of Vietnam, the majority of students come from the North rather than the Central and South regions.

- Many provinces and cities have only one or very few values, especially provinces in the southern region of Vietnam and a few in the central region, so we group 12 provinces and cities from these regions: Thanh Hoa, Nghe An, Ha Tinh, Quang Tri, Quang Nam, Binh Dinh, Dak Lak, Lam Dong, Binh Duong, Ba Ria-Vung Tau, Ho Chi Minh City, and Dong Thap into a group called "Central and South".

- Next, we grouped the provinces in the Northern Midlands and Mountains into a group of 14 provinces: Dien Bien, Lai Chau, Son La, Hoa Binh, Lao Cai, Yen Bai, Phu Tho, Ha Giang, Tuyen Quang, Lang Son, Bac Kan, Thai Nguyen, Bac Giang, Quang Ninh.

- Hanoi, Ha Tay, and the two foreign countries, Russia and Ukraine, are grouped together as the Hanoi group, because we observe similar English proficiency across these places.

- The remaining group is the Red River Delta group, including Hai Phong, Vinh Phuc, Bac Ninh, Hai Duong, Hung Yen, Thai Binh, Ha Nam, Nam Dinh, and Ninh Binh (Quang Ninh is already assigned to the Northern Midlands and Mountains group above).
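The four-region grouping can be implemented as a simple lookup table. The province-to-region assignments below follow the lists above; Quang Ninh, which the text lists twice, is assigned once here, to the Northern Midlands and Mountains:

```python
# Lookup table collapsing raw birthplace values into 4 region groups
REGION = {}
REGION.update({p: "Central and South" for p in [
    "Thanh Hoa", "Nghe An", "Ha Tinh", "Quang Tri", "Quang Nam",
    "Binh Dinh", "Dak Lak", "Lam Dong", "Binh Duong",
    "Ba Ria-Vung Tau", "Ho Chi Minh City", "Dong Thap"]})
REGION.update({p: "Northern Midlands and Mountains" for p in [
    "Dien Bien", "Lai Chau", "Son La", "Hoa Binh", "Lao Cai",
    "Yen Bai", "Phu Tho", "Ha Giang", "Tuyen Quang", "Lang Son",
    "Bac Kan", "Thai Nguyen", "Bac Giang", "Quang Ninh"]})
REGION.update({p: "Hanoi" for p in ["Hanoi", "Ha Tay", "Russia", "Ukraine"]})
REGION.update({p: "Red River Delta" for p in [
    "Hai Phong", "Vinh Phuc", "Bac Ninh", "Hai Duong", "Hung Yen",
    "Thai Binh", "Ha Nam", "Nam Dinh", "Ninh Binh"]})

# Map a few sample birthplaces to their region group
place_of_birth = ["Thai Binh", "Nghe An", "Hanoi", "Lao Cai"]
regions = [REGION[p] for p in place_of_birth]
```

In practice this mapping would be applied after normalizing the spelling variants noted earlier (e.g., "Thai Binh city" -> "Thai Binh").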

Similarly for majors, the number of students in each major is imbalanced, so we group them:

- Accounting + Analyzing and Auditing + Accounting and Finance = AC

- Business Data Analytics + Informatics and Computer Engineering + Management Information Systems = IT

- Hospitality, Sport and Tourism Management + Dual degree_Marketing + Management + Dual degree_Management/Management = 2D

Description Analysis

The data distribution is displayed in the histograms below. This dataset contains many imbalanced classes, but such diversity is in the nature of a university and is unavoidable in this situation.

Figure 3: Distribution of categorical variable

Figure 4: Distribution of continuous variables1

Looking at the distribution of continuous variables (Figure 4), we see that the English entrance exam scores were initially distributed quite evenly, with the highest concentration around 3-4 points. After each intake (intakes 1, 2, and 3), the English scores trend upward over time and shift to the right of the graph: after intake 1 the high scores concentrate around 7-8 points, after intake 2 around 8 points, and after intake 3 around 10 points.

This demonstrates that, through the International School's preparatory English course, students' English ability improves significantly and they become able to enter a major and study in the English-language program.

Figure 5: Distribution of continuous variables2

4.2.2 Relations between Status and categorical variables

Figure 6: Relations between Status and categorical variables

- Relations between Status and Course number: Course 19 has more students than course 20, but the number of students dropping out of course 20 is equivalent to that of course 19.

- Relations between Status and Combinatorial code: Most students entered the International School through Block D01 (Math - Literature - English), which includes English as an admission subject, followed by Block A00 (Math - Physics - Chemistry), which does not. Yet the number of students leaving school from block D01, who have an English background, is much higher than from block A00, where most students are weaker in English.

Distinctive patterns in student origins are evident, with the majority hailing from the Red River Delta However, despite the Northern Midlands and Mountains region being the second smallest area of origin, it produces almost as many students as the Red River Delta This discrepancy highlights the need for the school to prioritize outreach efforts and support mechanisms for students from this region.

Figure 7: Relations between Status and continuous variables1

Figure 8: Relations between Status and continuous variables2

Description of Experimental Protocol

This section explains the experimental procedure for testing and evaluating the performance of the proposed models for student performance prediction. The implementations were conducted on Windows using Python 3.12.0 and primarily the sklearn library. The machine used is an ASUS Vivobook with the following configuration: 16 GB of RAM, an AMD Ryzen 5 5600H with Radeon Graphics, and an NVIDIA GeForce RTX 3050 Laptop GPU.

DISCUSSION & RESULT

Result Analysis

After running several models, including logistic regression, decision tree, SVM regression, and Naive Bayes, we obtain the model evaluation tables below.

Table 3: Model evaluation for Logistic Regression, Decision Tree, SVM and Naive Bayes

Table 4: Model evaluation for Random Forest, XG Boost and Gradient Boosting

The best machine learning model is Random Forest. It has the highest F1 score, recall, and precision (0.99) for both class labels 0 and 1, which means it is the most accurate model for predicting both positive and negative cases.

Discussion

Predicting student learning outcomes is a research topic of great interest. In this study, we proposed an ensemble model using two-layer stacking to predict students' academic performance. In this model, several algorithms with good predictive performance (SVM, Random Forest, Decision Tree, Naive Bayes, Gradient Boosting, and AdaBoost) were implemented as base learners, and a relatively simple algorithm (logistic regression) was used as the ensemble method to reduce the risk of overfitting; we used 10 × 10-fold cross-validation to avoid label leakage.
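A sketch of this two-layer stacking setup with scikit-learn, run on synthetic data since the student records are private. The fold counts and estimator sizes are reduced here for speed; the study itself uses 10 × 10-fold cross-validation:

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the 740-row student table
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Layer 1: diverse base learners (sizes reduced for speed)
base_learners = [
    ("svm", SVC(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
]
# Layer 2: a simple logistic regression combines the base predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=3)

# Repeated stratified k-fold CV (5 splits x 2 repeats here for speed)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(stack, X, y, cv=cv, scoring="f1")
```

StackingClassifier trains the meta-learner on out-of-fold predictions of the base learners, which is what guards against the label leakage mentioned above.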

According to the available sources, the Random Forest machine learning model is effective at predicting student dropout and retention. In research [16], the Random Forest model was utilized to predict student retention status (dropped out or continued) in a binary prediction task, confirming its appropriateness in such circumstances. The study [29] found that Random Forest, Decision Tree, K-Nearest Neighbors, AdaBoost, Multilayer Perceptron, and Logistic Regression outperformed other machine learning algorithms in predicting student dropout rates. In another study, the Random Forest algorithm was used to identify at-risk students, achieving a 90% accuracy rate in predicting course failure. This research [30] confirms the Random Forest model's usefulness in forecasting student outcomes, making it a viable option for predicting whether a student will continue their major at the International School or drop out after enrolling in the preparatory English program.

Course number partly impacts student dropout rates. Comparing the number of students dropping out of courses 19 and 20, we clearly see that students in course 20 drop out in the same numbers as students in course 19, even though course 20 has fewer students. The rate of college students dropping out in Vietnam has indeed increased over time. According to the sources provided, there has been a concerning trend of more students failing to stay in college until graduation, with the numbers steadily increasing and worrying educators [31]. Additionally, the data show that three in five university students in Vietnam drop out without finishing their first year of study, indicating a significant increase in dropout rates [32].

Gender significantly impacts undergraduate student retention and attrition rates Research indicates that women generally graduate at higher rates than men (Corbett et al., 2008; Hagedorn, 2005) However, in certain contexts such as the International School, women may experience higher dropout rates Furthermore, gender disparities persist in STEM disciplines, with women accounting for a minority of graduates in the European Union (European Union, 2018).

Undergraduate students' retention and dropout rates are influenced by their major. According to Kim's (2020) research [35], students are more likely to drop out of college owing to a job change or major maladjustment if they discover that their major does not fit their career goals or interests.

Birthplace can affect the continued study and dropout rates of undergraduate students. According to one study, students from different ethnic backgrounds and socioeconomic statuses may face varying challenges that impact their educational journey [36]. Students born in the Red River Delta of Vietnam account for a higher number of the International School's dropouts than students from other regions. This region is known for its rich cultural heritage and agricultural economy, and students from this area may face unique challenges in pursuing higher education. While there is limited research specifically on the impact of the Red River Delta on dropout rates, studies have shown that students from rural areas and low-income backgrounds are at higher risk of dropping out of college.

Academic performance in high school has a substantial impact on undergraduate student retention and dropout rates. According to research [37], high school academic performance is strongly correlated with college dropout rates: students who perform well academically in high school are more likely to persist in their college studies and successfully complete their degree programs. However, at the International School, the dropout rate of students with good academic performance in high school is higher than that of students with average performance. Students with good results in high school may drop out of college because of pressure, unsuitable majors, stress, a search for other opportunities, financial problems, or lack of passion.

Research indicates that English proficiency plays a crucial role in undergraduate student retention and attrition Specifically, performance in foundational courses like Freshman English correlates with overall academic success and college completion rates [38] While dropout rates remain consistent across intake periods 1-4, the number of enrolled students varies significantly, highlighting the impact of language proficiency on student retention.

Intakes 3 and 4 exhibit significantly higher student dropout rates compared to other intakes It is hypothesized that either intake 3 or 4 may face specific issues contributing to this disparity Despite intake 5 having the highest enrollment, it maintains a notably low dropout rate, suggesting that the duration of the preparatory English program may not be the primary factor Further investigation is warranted to determine the underlying causes of the elevated dropout rates in intakes 3 and 4.

Our research determined that Conduct (behavior in high school) and Combinatorial code (a technical exam detail) did not influence dropout rates While academic performance and exam-related factors heavily impact educational outcomes, Conduct and Combinatorial code are not typically cited as direct influencers of student persistence or withdrawal Conduct focuses on behavioral aspects rather than academic achievement, and Combinatorial code is a technicality in the exam process that does not directly affect student decisions regarding their educational pursuits.

CONCLUSION & RECOMMENDATIONS

Conclusion

Based on the results obtained after running several models, including logistic regression, decision trees, SVM regression, and Naive Bayes, we have a model evaluation table showing the performance of the different machine learning algorithms in classifying the class labels. The following conclusions can be drawn. Random Forest and Logistic Regression are the two best-performing algorithms, with high precision, recall, and F1 scores for both class labels. Gradient Boosting and XGBoost also perform well, but their accuracy and F1 score for class-1 labels are lower than those of Random Forest and Logistic Regression. Naive Bayes is not effective in classifying samples belonging to class 1.

We found that the factors that can affect the attendance and dropout rate of International School students are gender, major, course number, Intake max, academic ability in high school, and place of birth. Two factors that do not affect the dropout rate are the combinatorial code and the student's high school conduct.

Recommendations

- Use Random Forest or XGBoost to classify class labels in this dataset These two algorithms have the best performance for both class labels

- Gradient Boosting may be considered if the performance of Random Forest or XGBoost is unsatisfactory. However, note that its performance on class-1 labels may be lower.

- Naive Bayes should not be used to classify samples belonging to class 1. This algorithm is not effective for that class.

1. For female students dropping out more than male students:

• Organize support programs and counseling specifically for female students, focusing on addressing their unique challenges such as social pressure, work-life balance, or confidence and interest in their chosen fields of study

• Create a female-friendly learning environment, encouraging their active participation and positive engagement in school activities

2. For students from the Red River Delta region dropping out more than others:

• Conduct research to understand the root causes behind this trend, which may involve economic, social, or cultural factors

• Provide special scholarships and financial assistance programs for students from economically disadvantaged regions

• Establish support programs and counseling services tailored to students from the Red River Delta, helping them overcome the challenges they face in their academic journey

3. For high-achieving students dropping out more than average students:

• Develop advanced academic programs and personal development opportunities for high-achieving students to keep them challenged and engaged in their studies

• Offer opportunities for them to enhance their skills and abilities, such as participating in research projects or joining specialized courses and events designed for them

4. For students enrolled in the third and fourth English preparatory courses dropping out more:

• Place emphasis on providing language support and counseling for students enrolled in English preparatory courses

• Organize activities and events to help students practice and improve their English skills, such as discussions, debate clubs, or extracurricular language classes

5. For the increased dropout rate related to major:

• Curriculum Review and Adaptation: Evaluate the curriculum of the major to ensure it is up-to-date, relevant, and engaging for students Consider integrating practical applications, real-world projects, and hands-on experiences to make the coursework more meaningful and applicable to students' career goals

• Faculty Mentoring and Advising: Equip faculty with resources and training to offer tailored support, and encourage them to act as mentors and advisors. Personalized guidance and encouragement help students navigate academic challenges and explore career paths with confidence, contributing to their academic well-being and future professional endeavors.

• Internship and Work Experience Opportunities: Forge partnerships with industry leaders and organizations to offer internship and work experience opportunities relevant to the major Practical exposure to the field can enhance students' understanding of the subject matter and motivate them to persist in their studies

• Enhanced Student Support Services: Provide tailored academic tutoring, career counseling, mental health resources, and peer support groups to address the specific needs of students within the major. A welcoming and supportive culture fosters a sense of belonging and reduces the feelings of isolation that can contribute to dropout.

• Flexibility in Course Scheduling: Offer flexible course scheduling options to accommodate students' diverse needs and obligations outside of academics This may involve providing evening or weekend classes, online course offerings, or alternative study formats to accommodate students who work or have family responsibilities

• Regular Program Evaluation and Feedback: Implement a system to gather feedback from students, faculty, and industry professionals, and use it to identify areas for improvement. Regularly adjusting the major based on this feedback keeps it aligned with industry needs and improves the quality of the learning experience.

• Promotion of Career Opportunities: Increase awareness among students about the potential career opportunities and pathways available within the major Host career fairs, networking events, and alumni panels to connect students with professionals in the field and showcase the value of their academic pursuits

By addressing these aspects of the major, universities can work to improve student retention and success within the program.

APPENDIX

Linear Regression

Linear regression is a statistical method used in data science and machine learning to establish a linear relationship between an independent variable and a dependent variable. The independent variable serves as the predictor or explanatory variable, while variations in it affect the dependent variable. Linear regression predicts the value of the dependent variable from the known values of the independent variable. The model is represented by the straight line that fits the data points best, called the 'best fit line' or 'regression line' [39-43]. Such a sloped straight line is illustrated in Figure 9.

Figure 9: A slope straight line or ‘regression line’

Line of regression = Best fit line for a model

There are different types of linear regression The two main types of linear regression are simple linear regression and multiple linear regression [44]

- Simple Linear Regression: the most basic type of linear regression, with one independent and one dependent variable. The equation for simple linear regression is [41]: y = β0 + β1X, where y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope coefficient.

- Multiple Linear Regression: This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is [41]: y = β0 + β1X1 + β2X2 + … + βpXp, where X1, X2, …, Xp are the independent variables, β0 is the intercept, and β1, …, βp are the coefficients.

Finding the best-fit line is the main goal of linear regression, which means minimizing the error between the predicted and actual values; the best-fit line has the least amount of error [41].
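A minimal example of fitting a simple linear regression with scikit-learn and recovering the coefficients β0 and β1 from noiseless synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless points on the line y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X.ravel()

model = LinearRegression().fit(X, y)
beta0 = model.intercept_  # recovers 1.0
beta1 = model.coef_[0]    # recovers 2.0
```

Because the data lie exactly on a line, ordinary least squares recovers the true intercept and slope up to floating-point error.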

Logistic Regression

Definition: A statistical model that models the log-odds of an event as a linear combination of one or more independent variables.

Purpose: Used in regression analysis for estimating the parameters of a logistic model.

Outcome variable types: Typically involves a binary dependent variable coded as "0" and "1", but supports categorical and continuous independent variables.

Conversion function: Utilizes the logistic function to convert log-odds to probabilities.

Extensions: Can be generalized to multinomial and ordinal logistic regression for more than two outcome categories.

Estimation method: Parameters are mostly estimated by maximum-likelihood estimation, which lacks a closed-form solution.

Historical significance: Developed and popularized by Joseph Berkson, starting in 1944, who introduced the term "logit".

Figure 10: Key Advantages of Logistic Regression

Logistic regression is a supervised learning algorithm that predicts the probability of a binary outcome (e.g., yes/no, 0/1, true/false) based on independent variables It classifies data into two discrete categories using a logistic function This statistical method is widely used in predictive modeling, estimating the mathematical probability of an instance belonging to a specific category Logistic regression finds applications in various domains, including medical diagnosis (e.g., heart attack prediction), higher education (e.g., university enrollment probability), and fraud detection (e.g., spam email identification) As a baseline model for binary or categorical responses, it offers simplicity and well-established analysis techniques.
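A small illustration of logistic regression on an invented one-feature binary problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented one-feature data: class 0 below zero, class 1 above
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[-3.0], [3.0]])[:, 1]  # P(class 1)
preds = clf.predict([[-3.0], [3.0]])              # hard 0/1 labels
```

The logistic function maps the fitted linear score to a probability; points far below zero get a low P(class 1) and points far above zero get a high one.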

Support Vector Machine Regression

Support Vector Machine Regression (SVR) is a type of Support Vector Machine (SVM) used for regression tasks. It tries to find a function that best predicts the continuous output value for a given input value. SVR can use both linear and non-linear kernels: a linear kernel is a simple dot product between two input vectors, while a non-linear kernel is a more complex function that can capture more intricate patterns in the data. The choice of kernel depends on the data's characteristics and the task's complexity.

The scikit-learn package provides the SVR class for performing Support Vector Regression with different kernel types; the kernel is chosen via the 'kernel' parameter, e.g. 'linear' or 'rbf' (radial basis function). SVR has several hyperparameters that control the behavior of the model. For example, the 'C' parameter controls the trade-off between fitting the training data and keeping the model simple: a larger value of 'C' means the model tries harder to reduce errors beyond the epsilon-insensitive tube, while a smaller value of 'C' makes the model more lenient toward larger errors. As with any machine learning model, it is important to evaluate an SVR model's performance using metrics such as mean squared error (MSE) or mean absolute error (MAE). To fit an SVR model, you can use the fit() method in the scikit-learn package.

Overall, SVR is a powerful regression algorithm that can be used for both linear and non-linear regression tasks. It is particularly useful when dealing with high-dimensional data and when the relationship between the input and output variables is complex.
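A short SVR sketch using scikit-learn's SVR class with a linear kernel on synthetic data; the C and epsilon values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Noiseless linear target fitted with a linear-kernel SVR
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

svr = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X, y)
mse = mean_squared_error(y, svr.predict(X))  # residuals stay near epsilon
```

Swapping kernel="linear" for kernel="rbf" would let the same model fit non-linear targets.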

Random Forest

Random forest is a machine learning algorithm that leverages ensemble learning by constructing a multitude of decision trees during training. It is used for classification, regression, and other tasks: for classification, the output is the class selected by most trees, and for regression, it is the mean or average prediction of the individual trees. Random forest corrects for decision trees' tendency to overfit by combining the predictions of many decorrelated trees.

Random forests, developed by Leo Breiman and Adele Cutler (the foundational paper appeared in 2001), are an ensemble learning algorithm that leverages multiple decision trees to enhance prediction accuracy and reduce variance. Each decision tree poses a series of questions to divide the data, creating decision nodes and leaf nodes that represent individual decisions. The best split for subsetting the data is identified using algorithms such as CART, and metrics such as Gini impurity or MSE assess the quality of a split. By combining these decision trees, random forests aggregate predictions, leading to improved accuracy and robustness.

The random forest algorithm has three main hyperparameters that need to be set before training: node size, the number of trees, and the number of features sampled. The random forest classifier can then be used to solve regression or classification problems.

The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. Of that training sample, one-third is set aside as test data, known as the out-of-bag (oob) sample, which we will come back to later. Another instance of randomness is then injected through feature bagging, adding more diversity to the dataset and reducing the correlation among decision trees. Depending on the type of problem, the determination of the prediction will vary: for a regression task, the individual decision trees' outputs are averaged, and for a classification task, a majority vote (the most frequent categorical variable) yields the predicted class. Finally, the oob sample is used for cross-validation, finalizing the prediction.

Figure 11: Random forest model based on decision tree

Key Benefits of Random Forest:

• Reduced risk of overfitting: Decision trees run the risk of overfitting, as they tend to fit all the samples within the training data tightly. However, with a robust number of decision trees in a random forest, the classifier won't overfit the model, since averaging uncorrelated trees lowers the overall variance and prediction error.

• Provides flexibility: Since random forest can handle both regression and classification tasks with a high degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random forest classifier an effective tool for estimating missing values, as it maintains accuracy when a portion of the data is missing.

• Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to do so. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model's accuracy decreases when a given variable is excluded. Permutation importance, also known as mean decrease accuracy (MDA), is another important measure: MDA identifies the average decrease in accuracy from randomly permuting the feature values in oob samples.

Key Challenges of Random Forest:

Random Forests are powerful ensemble learning algorithms capable of handling extensive datasets, leading to enhanced predictive accuracy However, this computation-intensive process can result in slower data processing due to the independent computation of individual decision trees within the ensemble.

• Requires more resources: Since random forests process larger data sets, they’ll require more resources to store that data

• More complex: The prediction of a single decision tree is easier to interpret when compared to a forest of them.
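The out-of-bag estimate and impurity-based feature importances discussed above can be obtained directly from scikit-learn's RandomForestClassifier; the data here are synthetic:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic data; oob_score=True uses the out-of-bag samples
# described above as a built-in validation set
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)

oob = rf.oob_score_                    # accuracy on out-of-bag samples
importances = rf.feature_importances_  # Gini/MDI importance, sums to 1
```

Ranking `importances` shows which features drive the forest's predictions, which is how the factor analysis in the discussion section can be supported.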

Adaboost

AdaBoost trains each new model on re-weighted data points, so that new models focus more on the patterns that are still being mis-learned, thereby minimizing the model's loss. Specifically, the algorithm proceeds as follows:

• Initialize equal weights (1/N) for each of the N data points.

• Train and add a new weak learner Wi.

• Calculate the loss (error) of the newly trained model, and from it its confidence score Ci.

• Update the main model: W = W + Ci * Wi.

• Finally, re-weight the data points (correctly predicted points get decreased weight, incorrectly predicted points get increased weight).

• Then repeat the loop with the next model, i + 1.
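The steps above can be sketched directly using decision stumps as weak learners. This is a simplified illustration on synthetic data (the dataset, number of rounds, and stump settings are assumptions for demonstration), not the exact procedure used in the experiments:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y_pm = np.where(y == 1, 1, -1)       # work with labels in {-1, +1}
n = len(y)

w = np.full(n, 1.0 / n)              # step 1: equal initial weights 1/N
F = np.zeros(n)                      # running ensemble score W

for i in range(20):
    # step 2: train and add a new weak learner Wi (a decision stump)
    stump = DecisionTreeClassifier(max_depth=1, random_state=i)
    stump.fit(X, y, sample_weight=w)
    pred = np.where(stump.predict(X) == 1, 1, -1)

    # step 3: weighted error and confidence score Ci
    err = max(np.sum(w[pred != y_pm]), 1e-10)
    C = 0.5 * np.log((1 - err) / err)

    F += C * pred                    # step 4: W = W + Ci * Wi

    # step 5: re-weight points (misclassified up, correct down), then normalize
    w *= np.exp(-C * y_pm * pred)
    w /= w.sum()

acc = np.mean(np.sign(F) == y_pm)    # training accuracy of the boosted ensemble
```

The weight update `exp(-C * y * pred)` increases the weight of misclassified points, which is exactly what forces the next stump to focus on the patterns still being mis-learned.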

Gradient Boosting

Gradient boosting is a powerful machine learning technique based on boosting in a functional space, where the targets are pseudo-residuals rather than the residuals used in traditional boosting. It creates a prediction model by combining weak prediction models, such as simple decision trees, into an ensemble. When decision trees are the weak learners, the resulting algorithm is known as gradient-boosted trees, which often outperform random forests. Gradient boosting builds models in a stage-wise manner and allows the optimization of arbitrary differentiable loss functions. The technique originated from Leo Breiman's observation that boosting can be interpreted as an optimization algorithm on a suitable cost function. Gradient Boosting is a generalized form of AdaBoost. Specifically, the original optimization problem is as follows:

• Cn: the confidence score of the nth weak learner (also known as its weight)

First, let us recall a bit of theory that should be familiar from neural networks: the gradient descent update rule for model parameters, which operates in parameter space. Converted to a function-space perspective, the same idea connects directly to the problem at hand: the model is adjusted in the direction in which the loss function decreases.

Quite simply, if we consider the boosted series of models as a function W, then each learner function Wi can be considered a parameter w, and the goal is to minimize the loss function.

At this point, we can see the following relationships:

In short, we can summarize the algorithm as follows:

• Initialize the pseudo-residuals with an equal value for each data point.

• Train a new weak learner Wi on the pseudo-residuals and calculate its confidence score Ci.

• Update the main model: W = W + Ci * Wi.

• Finally, recompute the pseudo-residuals to serve as the labels for the next model.

If you look closely, AdaBoost's way of updating the weights of data points is one special case of Gradient Boosting; Gradient Boosting therefore covers more cases.
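The summary steps above can be sketched from scratch for squared loss, where each new tree fits the pseudo-residuals (the negative gradient) of the current model; the dataset and hyperparameters here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, noise=5.0, random_state=0)

lr = 0.1                              # fixed confidence score / learning rate Ci
F = np.full(len(y), y.mean())         # initial model W: a constant prediction

for i in range(100):
    residuals = y - F                 # pseudo-residuals: negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=i)
    tree.fit(X, residuals)            # the new weak learner Wi fits the residuals
    F += lr * tree.predict(X)         # update: W = W + Ci * Wi

mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - F) ** 2)   # training error shrinks as trees are added
```

Each iteration is one gradient descent step in function space: the tree approximates the loss gradient, and the learning rate plays the role of the step size.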

Figure 13: How Gradient Boosting works

Xgboost

XGBoost (Extreme Gradient Boosting) is an algorithm based on gradient boosting, accompanied by major improvements in algorithmic optimization and a careful combination of software and hardware capabilities, which together yield outstanding results in both training time and memory usage.

Open source, with roughly 350 contributors and 3,600 commits on GitHub, XGBoost demonstrates impressive applicability:

• XGBoost can be used to solve problems ranging from regression, classification, and ranking to user-defined problems.

• XGBoost is supported on Windows, Linux, and OS X.

• It supports all major programming languages, including C++, Python, R, Java, Scala, and Julia.

• It supports AWS, Azure, and YARN clusters, and works well with Flink, Spark, and other ecosystems.

Figure 14: XGBoost and its applications

Ensemble Model

An ensemble model is a machine learning approach in which multiple models work together to make better predictions. It combines predictions from different models to improve accuracy and reliability, much as feedback from a diverse group of people can lead to better decisions. Ensemble modeling is widely used in fields such as computer vision, natural language processing, and financial forecasting.

There are different techniques for building ensemble models, ranging from simple to advanced. Simple techniques include majority voting, where the final prediction is the class label that receives the most votes from the models in the ensemble. Advanced techniques involve more complex methods, such as combining models trained on different data samples, evaluating different factors, or weighting common variables differently.

Ensemble model techniques can be categorized into:

• Bagging: Bagging, or bootstrap aggregating, trains multiple models on different subsets of the dataset and combines their predictions to make a final prediction. An example of a bagging algorithm is the Random Forest algorithm.

• Boosting: Boosting trains models sequentially, with each model learning from the mistakes of the previous one. The final prediction is a weighted combination of the predictions from all the models. Examples of boosting algorithms include AdaBoost and Gradient Boosting.

• Stacking: Stacking trains multiple models on the same dataset and combines their predictions using a meta-model. The meta-model learns to combine the predictions from the base models to make a final prediction.

• Blending: Blending is like stacking, but it combines the predictions from the base models using simple or weighted averaging instead of a meta-model.

These ensemble model techniques help improve the accuracy and reliability of predictions by leveraging the strengths of multiple models and reducing their weaknesses.
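As a sketch of the stacking technique described above, the following uses scikit-learn's StackingClassifier: base models are trained on the same dataset, and a logistic-regression meta-model learns to combine their predictions. The base-model choices and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base models trained on the same dataset...
base = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0))]

# ...and a meta-model that learns how to weight their predictions.
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)        # held-out accuracy of the stacked ensemble
```

Swapping the meta-model for a plain (possibly weighted) average of the base predictions would turn this stacking setup into the blending technique.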

A BBREVIATION

LSTM: Long short-term memory

R EFERENCES

1. Siemens, G. and P. Long, Penetrating the Fog: Analytics in Learning and Education.

2. Broadbent, J., Comparing online and blended learner's self-regulated learning strategies and academic performance. The Internet and Higher Education, 2017. 33: p. 24-32.

3. Hung, J.-L. and K. Zhang, Revealing online learning behaviors and activity patterns and making predictions with data mining techniques in online teaching. MERLOT Journal of Online Learning and Teaching, 2008.

4. Sedrakyan, G., et al., Linking learning behavior analytics and learning science concepts: Designing a learning analytics dashboard for feedback to support learning regulation. Computers in Human Behavior, 2020. 107: p. 105512.

5. Types of Learning Analytics Tools. n.d.; Available from: https://learninganalytics.colostate.edu/types/

6. Dörrenbächer, L. and F. Perels, Self-regulated learning profiles in college students: Their relationship to achievement, personality, and the effectiveness of an intervention to foster self-regulated learning. Learning and Individual Differences.

7. Nasseri, M., et al., Applying Machine Learning in Retail Demand Prediction—A Comparison of Tree-Based Ensembles and Long Short-Term Memory-Based Deep Learning. Applied Sciences, 2023. 13(19): p. 11112.

8. What is Learning Analytics. Available from: https://www.solaresearch.org/about/what-is-learning-analytics/

9. What Is Learning Analytics? Available from: https://www.watershedlrs.com/resources/definition/what-is-learning-analytics/

10. Elias, T., Learning Analytics: Definitions, Processes and Potential. January 2011.

11. Veluri, R.K., et al., Learning analytics using deep learning techniques for efficiently managing educational institutes. Materials Today: Proceedings, 2022. 51: p. 2317-

12. Madnaik, S.S., Predicting Students' Performance by Learning Analytics, in the Departments of Computer Science. May 2020, San Jose State University: ScholarWorks of San Jose State University. p. 51.

13. Team, M., Integrating Machine Learning in Education Technology. August 2017; Available from: https://www.getmagicbox.com/blog/integrating-machine-learning-in-education-technology/

14. Ahad, M.A., G. Tripathi, and P. Agarwal, Learning analytics for IoE based educational model using deep learning techniques: architecture, challenges and applications.

15. Fahd, K. and S.J. Miah, Effectiveness of data augmentation to predict students at risk using deep learning algorithms. Social Network Analysis and Mining, 2023. 13(1).

16. Matz, S.C., et al., Using machine learning to predict student retention from socio-demographic characteristics and app-based engagement metrics. Scientific Reports.

17. Souai, W., et al., Predicting At-Risk Students Using the Deep Learning BLSTM Approach, in 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH). 2022.

18. Orji, F.A. and J. Vassileva, Machine Learning Approach for Predicting Students Academic Performance and Study Strategies based on their Motivation. arXiv preprint arXiv:2210.08186, 2022.

19. Sarwat, S., et al., Predicting Students' Academic Performance with Conditional Generative Adversarial Network and Deep SVM. Sensors (Basel), 2022. 22(13).

20. Brownlee, J., A Gentle Introduction to Ensemble Learning Algorithms. April 27, 2021.

21. Dietterich, T.G., Ensemble Methods in Machine Learning, in Multiple Classifier Systems. 2000, Berlin, Heidelberg: Springer Berlin Heidelberg.

22. Liu, L., et al., Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Medical Informatics and Decision

23. Team, C., Ensemble Methods. n.d.; Available from: https://corporatefinanceinstitute.com/resources/data-science/ensemble-methods/

24. Yan, L. and Y. Liu, An Ensemble Prediction Model for Potential Student Recommendation Using Machine Learning. Symmetry, 2020. 12(5): p. 728.

25. Simplilearn, What Is Ensemble Learning? Understanding Machine Learning.

26. Kukkar, A., et al., Prediction of student academic performance based on their emotional wellbeing and interaction on various e-learning platforms. Education and

27. Comet, T., Why is Model Evaluation Important in Machine Learning? November 2, 2022; Available from: https://www.comet.com/site/blog/why-is-model-evaluation-important-in-machine-learning/

28. Jordan, J., Evaluating a machine learning model. 21 Jul 2017.

29. Mnyawami, Y.N., H.H. Maziku, and J.C. Mushi, Enhanced Model for Predicting Student Dropouts in Developing Countries Using Automated Machine Learning.
