VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
PREDICTIVE MODELING FOR STUDENT PERFORMANCE
IN EDUCATION: A DATA MINING APPROACH
Team Leader: Tran Quoc Dang
Hanoi, 2024
TEAM LEADER INFORMATION
- Program: Informatics and Computer Engineering
- Address: Ha Dong - Hanoi
- Phone no /Email: 0332192968/20070819@vnu.edu.vn
II Academic Results (from the first year to now)
III Other achievements
1 Received the "Five Good Student" award at the university level in 2020-2021
2 Recognized for outstanding contributions to student union and youth movement activities in 2021-2022 & 2022-2023
3 Achieved top 20 placement in the IStartup 2023 entrepreneurship competition
4 Delegate/Representative of Vietnam National University in AUN AELP 2023 held in Singapore
5 Delegate/Representative of Vietnam National University - International School in TF Scale 2024 - held in Singapore and Vietnam
6 Achieved a #638 ranking in the global WiDS Datathon 2021
7 Top 10 placement in the Cybersecurity track of the JunctionX Hackathon 2023
Team Member: (no more than 5 people)
2 Nguyen Ngoc Van
Quynh
Phuong
Advisor
(Sign and write full name)
Tran Thi Oanh
Hanoi, 2024
Team Leader
(Sign and write full name)
Tran Quoc Dang
We are deeply grateful to Mrs. Tran Thi Oanh for her invaluable guidance, support, and insightful contributions throughout the research process. Her expertise and encouragement have played a fundamental role in the successful completion of this research paper.
We would like to express our sincere appreciation for the care and support provided by our teacher. From the initial stages of ideation to the final completion phase, she has been instrumental in guiding us and providing the necessary motivation to overcome challenges along the way.
Without her unwavering support, we would not have been able to accomplish this research. We extend our heartfelt gratitude for her significant contributions and eagerly anticipate future collaborations on upcoming projects.
TABLE OF CONTENTS
CHAPTER 2: INVESTIGATING THE INFLUENCE OF FACTORS ON STUDENT
CHAPTER 3: PRACTICAL APPLICATIONS AND CONSIDERATIONS
LIST OF FIGURES
Figure 1. Predictive Modeling Framework for Student Performance
Figure 4. Educational Data Mining Process diagram
Figure 5. The flowchart of the prediction model
Figure 7. Total score between female and male (Insight 1)
Figure 8. Lunch types comparison (Insight 2)
Figure 10. Random Forest Regressor performance
Figure 14. Average scores for standard vs free/reduced lunch
Figure 15. Average scores grouped by parental education level
Figure 16. Average scores for each race/ethnicity group
Figure 17. Average scores for students with/without test prep
Figure 18. Correlations between math scores and continuous features
LIST OF TABLES
LIST OF ABBREVIATIONS
MAE: Mean Absolute Error
English
Traditional educational settings often lack data-driven tools to proactively support student performance. This study presents a novel predictive modeling framework utilizing machine learning algorithms (e.g., decision trees, random forests) to address this challenge. The model integrates student demographics, academic history, and teaching methodology data to forecast student outcomes in traditional offline classroom settings. Results indicate that a student's parental level of education and participation in test preparation courses were strong predictors of performance, demonstrating the potential of this framework to enhance decision-making and resource allocation in traditional education. These findings highlight the power of predictive modeling to personalize learning approaches and provide data-driven insights to educators working within established offline teaching structures.
Vietnamese (translated into English)
Traditional educational institutions often lack data-driven tools to proactively support students' academic performance. This research project presents a new predictive modeling framework that uses machine learning algorithms (e.g., decision trees, random forests) to address this challenge. The model integrates students' demographic data, academic history, and teaching methodology data to predict learning outcomes in traditional classroom settings. The results indicate that the parents' level of education and participation in test preparation courses are strong predictors of academic outcomes, showing the potential of this framework to improve decision-making and resource allocation in traditional education. These findings underline the power of predictive modeling to personalize learning approaches and to provide data-driven insights for educators working within teaching structures.
Keywords
Predictive Modeling, Data-Driven Decision Making, Offline Learning, Student Performance, Academic History, Teaching Methodology, Parental Education Level, Test Preparation Courses, Resource Allocation, Personalized Learning
Tran Quoc Dang, 20070819, ICE2020A, Informatics and Computer Engineering, 4th
I INTRODUCTION
In this chapter, the study introduces the topic of predictive modeling in education, highlighting its significance in addressing student performance challenges. It outlines the research objectives, focusing on identifying factors influencing student math performance and developing predictive models to aid educational decision-making. By framing the research within the context of educational challenges and the potential of predictive modeling, this chapter sets the stage for subsequent discussions, emphasizing the importance of personalized learning and equitable educational outcomes.
2 Motivation
Addressing these challenges is crucial for both individual student success and the betterment of society as a whole. Predictive modeling offers a powerful tool to tackle this issue, yet much research focuses on online or blended learning environments. This study seeks to address this gap by investigating how predictive modeling can empower educators specifically within traditional offline classroom settings. By understanding the factors influencing student performance in this context, educators can proactively tailor their instructional strategies and support systems.
3 Research Methods
This study employs a quantitative approach, utilizing the "Students Performance in Exams" dataset and focusing on variables such as parental education level, lunch program participation, and test preparation. Decision tree or random forest algorithms will be used to build predictive models for math performance. Model interpretability will be prioritized to provide actionable insights for educators, allowing them to understand not just which students are at risk, but why.
● Algorithms: Decision tree and/or random forest algorithms will be prioritized due to their ability to handle diverse data types and provide insights into feature importance.
5 Object and Scope of the Study
● Scope of Research
The primary focus is predicting performance in mathematics using the "Students Performance in Exams" dataset.
Due to potential dataset limitations, the model may not generalize perfectly to all student populations or educational contexts.
II LITERATURE REVIEW
This chapter provides a comprehensive review of literature relevant to predictive modeling in education, synthesizing theoretical and empirical insights. It examines previous research on student performance prediction, identifying key concepts, methodologies, and findings. By summarizing existing knowledge and identifying gaps in the literature, this chapter establishes the theoretical framework for the study, emphasizing the need for further investigation into the factors influencing student achievement and the efficacy of predictive models in educational settings.
● Introduction
Predictive modeling is gaining traction as a powerful tool within the educational landscape, offering the potential to personalize learning experiences and proactively target student support. By leveraging student data and machine learning algorithms, predictive models can reveal patterns and identify factors that influence academic performance. This literature review explores existing research on predictive modeling in education, focusing on the features commonly influencing student outcomes, the effectiveness of various modeling techniques, and the ethical considerations surrounding their implementation. It specifically examines these trends in light of this study's focus on parental education, socioeconomic status (SES), and their impact on math performance.
● Related Works
Research into the application of predictive modeling within education has expanded significantly in recent years. A substantial body of work underscores the impact of socioeconomic status (SES) and parental education on student outcomes. Decision trees and random forests are particularly popular due to their ability to handle complex relationships and their interpretability. The key trends in this research are examined below:
- Focus on Socioeconomic Factors
Numerous studies have demonstrated the predictive power of SES indicators, such as participation in free or reduced-price lunch programs, on student achievement [Durga et al., 2020]. Additionally, a strong correlation exists between parental education levels and students' academic performance. These studies highlight the need to consider equity and social determinants when developing predictive models.
- Varied Algorithms and Outcomes
Researchers have employed a range of machine learning algorithms in educational prediction tasks. Linear models [Cui et al., 2019], support vector machines, and neural networks have been explored alongside decision trees and random forests. These studies have focused on various outcome variables, including overall GPA, success in specific subjects, and likelihood of dropout [Kurni et al., 2023].
- The Need for Nuance and Interpretability
While predictive models hold promise, it is crucial to go beyond simple prediction and aim for interpretability. Understanding which factors have the greatest influence on student outcomes is essential for designing effective interventions. Some research highlights the potential for bias in predictive models, calling for fairness and transparency in their development [Durga et al., 2020].
- Gaps and Opportunities
While predictive modeling in education has made significant strides, there remain opportunities to enhance its impact on understanding and improving student math performance. Existing research often suffers from limitations in dataset size and diversity, potentially hindering the ability to draw generalizable conclusions across different student populations and school settings. Additionally, there is a need to expand the range of features investigated. Focusing on traditional demographics may overlook other influential factors, such as student engagement or access to outside-of-class resources. Finally, ensuring models are not only accurate but also interpretable is crucial. This allows educators to move beyond predictions and towards actionable insights that guide targeted interventions.
This study directly addresses these gaps by utilizing a comprehensive dataset that aims to represent a broader spectrum of students. The inclusion of variables like test preparation participation may reveal previously underappreciated factors significantly impacting math scores. Most importantly, this research emphasizes model interpretability. This will enable educators to understand the reasoning behind predictions and translate data-driven insights into customized support strategies, ultimately enhancing student success in mathematics.
- Contributions of This Study
This study aims to provide a nuanced understanding of the relationship between parental education, socioeconomic factors (indicated by lunch type), test preparation, and student math performance within traditional offline classrooms. Its key contributions are:
+ Quantifying Impact: By using a comprehensive dataset, the model quantifies the relative influence of these specific factors on math scores. This allows educators to prioritize interventions targeting the areas with the highest potential for improvement.
+ Focus on Equity: Investigating socioeconomic indicators as predictors brings attention to the potential disparities in educational outcomes. Understanding how these factors operate in the model is crucial to promote equitable resource allocation and support systems for disadvantaged students.
+ Actionable Insights: The emphasis on model interpretability provides educators with insight into how predictions are made. These insights allow them to tailor their instruction and support strategies to address specific student needs.
● Factors Influencing Student Performance
Educational research consistently highlights the complex ways in which socioeconomic factors and parental education level shape student outcomes. Students from lower-SES backgrounds often face disadvantages due to limited resources, reduced access to learning opportunities, and less potential for academic support at home. Similarly, parental education level can profoundly impact a student's ability to receive help outside of school, particularly in subjects like mathematics, where conceptual understanding is key. This study seeks to quantify the impact of these specific factors on math performance, providing data-driven insights to inform resource allocation and targeted interventions.
● Predictive Modeling in Education
The application of predictive modeling in education is an evolving field with promising results. Decision trees and random forests are widely employed due to their ability to handle diverse data types and their interpretability, both of which are crucial for understanding the factors driving predictions and ensuring educators can apply insights [Smith & McKenna, 2013]. Other algorithms like linear regression, support vector machines, and neural networks have also been explored [Shahiri et al., 2015]. Reported success rates vary across studies, which underscores the importance of careful dataset selection, feature engineering, and model evaluation for educational applications [Kumar et al., 2017].
● Limitations and Gaps
While predictive modeling holds promise, it is crucial to acknowledge limitations in the existing research. Many studies rely on relatively small datasets, potentially hindering model generalizability to diverse student populations [Romero & Ventura, 2010]. Further, there is a need to move beyond traditional demographics, exploring student motivation and engagement to build more nuanced predictive models [Baker & Inventado, 2014]. Doing so is vital to ensure that predictions and subsequent interventions based on those predictions are equitable and avoid perpetuating existing biases.
● Conclusion
Predictive modeling has the potential to transform educational practice by providing data-driven insights into factors influencing student performance. Existing research highlights the importance of features such as parental education, socioeconomic status, and test preparation. Future work can enhance the field by incorporating larger and more diverse datasets, exploring a wider array of student-level features, and continually refining modeling approaches for fairness and effectiveness. This study contributes directly to this evolution, aiming to provide actionable insights specifically focused on math achievement.
● Future Directions
To enhance the robustness and generalizability of the findings, future work could incorporate a larger and more diverse dataset representing a wider range of student populations.
III METHODOLOGY
The research methodology and data collection process are described in detail. This chapter outlines the research design, data sources, variables, and statistical techniques employed in the study. By providing a clear overview of the study's methodology, this chapter ensures transparency, replicability, and rigor in the research process. It sets the stage for data analysis and interpretation, laying the groundwork for the empirical findings presented in subsequent chapters.
● Modeling Approach
This study aims to predict student math performance, framed here as a classification problem. Random Forest classifiers were employed due to their ability to handle non-linear relationships and provide insights into feature importance.
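To make the classification framing concrete, the sketch below shows one way a continuous math score could be turned into a class label. The pass mark of 50 and the two-class split are assumptions introduced purely for illustration; the report does not state how scores were grouped into classes.

```python
# Hypothetical pass mark: the report does not specify how scores were binned.
PASS_MARK = 50


def to_label(math_score, pass_mark=PASS_MARK):
    """Map a 0-100 math score to a class label: 1 = pass, 0 = at risk."""
    return 1 if math_score >= pass_mark else 0


math_scores = [72, 47, 90, 50, 38]  # made-up example scores
labels = [to_label(score) for score in math_scores]
print(labels)  # [1, 0, 1, 1, 0]
```

A finer split (for example low, medium, and high bands) would follow the same pattern with more thresholds.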
● Data Preprocessing
Data preprocessing involved addressing missing values, encoding categorical features, and standardizing numerical features for model compatibility.
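The three preprocessing steps can be sketched with pandas as follows. The column names, the toy rows, and the choice of mode/median imputation are assumptions for illustration only; the report does not describe the exact strategy used.

```python
import pandas as pd

# Hypothetical toy rows; column names and values are illustrative only.
raw = pd.DataFrame({
    "parental_education": ["bachelor", "high school", None, "master"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "test_prep": ["none", "completed", "none", "completed"],
    "reading_score": [72.0, 55.0, None, 90.0],
})

CATEGORICAL = ["parental_education", "lunch", "test_prep"]
NUMERIC = ["reading_score"]


def preprocess(df):
    out = df.copy()
    # 1. Missing values: most frequent category, median for numbers (assumed strategy).
    for col in CATEGORICAL:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in NUMERIC:
        out[col] = out[col].fillna(out[col].median())
    # 2. Standardize numeric features to zero mean and unit variance.
    out[NUMERIC] = (out[NUMERIC] - out[NUMERIC].mean()) / out[NUMERIC].std(ddof=0)
    # 3. One-hot encode categorical features.
    return pd.get_dummies(out, columns=CATEGORICAL, dtype=float)


features = preprocess(raw)
print(features.columns.tolist())
```

In a real evaluation the imputation values and scaling statistics would be computed on the training folds only, so that no information from held-out students leaks into the model.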
● Model Training and Evaluation
The models were trained and evaluated using k-fold cross-validation to detect overfitting and obtain a more reliable estimate of performance on unseen data. The models' performance was evaluated using accuracy, precision, and recall metrics.
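A minimal sketch of this evaluation procedure with scikit-learn is given below. The data are synthetic stand-ins for the preprocessed student features, and k = 5, the stratified splitting, and the number of trees are assumptions rather than settings reported by the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for preprocessed features and binary performance labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# k = 5 is an assumed value; stratification keeps class balance in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(model, X, y, cv=folds,
                        scoring=["accuracy", "precision", "recall"])

for metric in ("accuracy", "precision", "recall"):
    fold_values = scores[f"test_{metric}"]
    print(f"{metric}: mean={fold_values.mean():.3f}, std={fold_values.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable each metric is on a dataset of this size.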
● Feature Importance Analysis
Feature importance was determined using the built-in feature_importances_ attribute of the Random Forest classifier to identify factors most influential on student math performance.
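The following sketch shows how feature_importances_ can be read and ranked. The data are synthetic and the feature names are hypothetical: the label is constructed to depend only on the first column, so that column should receive most of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # label depends only on the first column

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so a cross-check with sklearn.inspection.permutation_importance on held-out data may be worthwhile before drawing conclusions.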
CHAPTER 1: PREDICTIVE MODELING IN EDUCATION
1 The Need for Predictive Models
Traditional educational systems often face challenges such as the difficulty of identifying at-risk students early, the inefficient allocation of limited resources, and the lack of personalization in interventions. Predictive modeling offers a powerful tool to address these challenges. By leveraging student data and machine learning algorithms, predictive models can uncover patterns, make predictions about future performance, and provide educators with valuable insights to guide proactive support.
Figure 1 Predictive Modeling Framework for Student Performance.
Predictive modeling employs a range of data analysis and statistical techniques to predict future outcomes. Machine learning, a field within artificial intelligence, centers around algorithms that learn patterns from data without being explicitly programmed. This project will explore decision trees and random forests. These methods offer several advantages in the educational context:
● Decision Trees: These models create a tree-like structure of decisions based on student features, leading to a prediction. Their visual nature aids in understanding the factors most influential in determining student performance.
Figure 2 Simple decision tree diagram.
● Random Forests: An ensemble method combining multiple decision trees, resulting in models that generally achieve higher accuracy and are less susceptible to overfitting.
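The two methods above can be compared with a short scikit-learn sketch. It uses synthetic, deliberately noisy data rather than the study's dataset, and the feature names f0 to f9, the tree depth, and the number of trees are all assumptions. It prints a shallow tree as readable rules, then contrasts training and test accuracy for a fully grown tree and a forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with 10% label noise to make overfitting visible.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A shallow tree can be printed as a set of human-readable decision rules.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(shallow, feature_names=[f"f{i}" for i in range(10)]))

# A fully grown tree versus an ensemble of trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

results = {
    "tree": (tree.score(X_train, y_train), tree.score(X_test, y_test)),
    "forest": (forest.score(X_train, y_train), forest.score(X_test, y_test)),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy is the usual sign of overfitting. On noisy data the single unpruned tree typically shows a wider gap than the forest, although the exact numbers depend on the data and the random seed.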