MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY
UEH UNIVERSITY
Subject: Data Science
Instructor: Ngô Tấn Vũ Khanh
Topic: Heart Attack Analysis & Prediction
Group members:
Nguyễn Bảo Long: 31221026514
Ngô Hiểu Linh: 31221020350
Trần Thuỷ Tiên: 31221020520
STUDENTS' CONTRIBUTION
Name | ID | Assessment
Nguyễn Bảo Long | 31221026514 | 100%
Ngô Hiểu Linh | 31221020350 | 100%
Trần Thuỷ Tiên | 31221020520 | 100%
TABLE OF CONTENTS
…
4 Evaluation Metrics
III) Data Collection and Preprocessing
1 Sources
2 Data Preprocessing (ETL) Process Using Orange Software
IV) Data Structure
V) Analysis with Orange Software
1 Overview of Analysis
2 Analysis using k-Means Algorithm
2.1 Number of Clusters
2.2 Analyzing the 3 most impactful factors using box plots
3 Heart Attack Prediction Modeling
3.1 …
3.1.1 …
3.1.2 …
3.1.3 Considered Predictive Models
3.1.4 Cross-Validation Method
3.1.5 …
3.1.6 Result of Model Testing
3.2 Utilizing the Naive Bayes Algorithm for Predictive Modeling
3.2.1 Data Splitting
3.2.2 Model Training
3.2.3 …
3.2.4 Limitations and Considerations
3.2.5 …
VI) …
REFERENCES
ABSTRACT
Myocardial infarctions, commonly known as heart attacks, are a critical medical emergency with far-reaching implications for global public health. The intricate and multifaceted nature of heart disease presents a considerable challenge for researchers and healthcare professionals striving to develop effective management strategies. Within this landscape, data-driven methodologies offer a promising avenue, marking a shift in how we understand, diagnose, and prevent heart disease.
In recent years, the prevalence of heart attacks has become increasingly pronounced, representing a significant public health concern. The incidence of myocardial infarctions has exhibited a concerning upward trend, imposing a greater burden on healthcare systems globally. This rise underscores the urgent need for innovative approaches to the complexities of heart disease and the pivotal role of data-driven solutions in reshaping our understanding of, and response to, this escalating health challenge.
Recognizing the severity of heart attacks and their pervasive impact on public health, our focus centers on "Heart Attack Analysis & Prediction." This research aims to leverage data-driven methodologies to contribute to the prevention, early diagnosis, and intervention of heart disease. By meticulously analyzing datasets, we endeavor to enhance our comprehension of heart attacks, unravel key risk factors, and construct predictive models that facilitate timely detection and intervention.
The link between heart disease and data-driven inference is pivotal in transforming the paradigm of cardiovascular health. Employing data-driven insights allows for a proactive approach to heart disease diagnosis, empowering individuals to assume control of their cardiovascular well-being. Accurate and timely diagnostic information provides a foundation for informed decision-making regarding lifestyle choices, such as adopting heart-healthy habits, engaging in regular physical activity, managing stress levels, and discontinuing smoking. This proactive integration of data into healthcare not only reduces the burden of heart disease but also fosters improved overall health outcomes.
SPECIAL THANKS
We would like to express our gratitude to Mr. Ngô Tấn Vũ Khanh for your guidance during the data science course. Under your instruction, our team has gained a solid understanding of the intricate concepts and methodologies in the field. Your approach to teaching and commitment to fostering an intellectually stimulating environment have significantly contributed to our academic growth.
Throughout the course, your expertise has been instrumental in providing clarity on complex topics and facilitating our exploration of the practical applications of data science. Your dedication to cultivating a collaborative learning atmosphere has allowed us to engage in meaningful discussions and delve into real-world scenarios, enhancing our ability to comprehend and apply theoretical concepts.
Your unwavering support has empowered our team to navigate challenging aspects of the curriculum and extract valuable insights from the course material. The interactive nature of your teaching has not only strengthened our theoretical foundation but has also equipped us with practical skills that will undoubtedly prove beneficial in our future endeavors.
In conclusion, we express our gratitude for your mentorship, which has been pivotal in shaping our understanding of data science. We are confident that the knowledge gained under your guidance will serve as a solid foundation as we continue to explore and contribute to the evolving landscape of this dynamic field.
I) Introduction and Objectives
Heart attacks, also known as myocardial infarctions, are a critical medical emergency and a significant concern in healthcare. They occur when the blood flow to the heart muscle is blocked, leading to the death of heart tissue. Heart attacks are a leading cause of death worldwide and pose a substantial burden on healthcare systems. The complexity and multifactorial nature of heart disease make it challenging for researchers and healthcare professionals to understand and manage effectively. But amidst this grim reality, a beacon of hope emerges: data. In recent years, the medical field has witnessed an explosion of data-driven approaches to understanding, diagnosing, and preventing heart disease. This data, meticulously collected and analyzed, is fueling groundbreaking research and transforming the way we combat this deadly foe.
In response to the seriousness of heart attacks and their impact on public health, we chose the topic "Heart Attack Analysis & Prediction", which aims to contribute to the prevention, diagnosis, and early intervention of heart disease. By leveraging data-driven approaches, we seek to enhance our understanding of heart attacks, identify risk factors, and develop predictive models to aid in timely detection and intervention. Moreover, a proactive approach to heart disease diagnosis empowers individuals to take charge of their cardiovascular health. With access to accurate and timely diagnostic information, individuals can make informed decisions about their lifestyle choices, such as adopting a heart-healthy diet, engaging in regular physical activity, managing stress levels, and quitting smoking. These proactive measures, combined with effective medical interventions, can significantly reduce the burden of heart disease and improve overall health outcomes.
II) Methodology
1 Data Exploration:
- Utilize Orange's data exploration tools to analyze the dataset's structure, identify missing values, and gain insights into the distribution of variables.
- Visualize data patterns, correlations, and outliers to inform feature selection.
2 Feature Selection:
- Employ Orange's feature selection techniques to identify the most influential parameters impacting heart attack probability.
- Utilize statistical measures and machine learning algorithms to rank and select relevant features.
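The ranking idea in the feature selection step can be illustrated outside Orange with a minimal Python sketch. It assumes absolute Pearson correlation with the target as the statistical measure, which is only one of many possible scores and is not necessarily the one used in this report. The function names and the toy records are invented for illustration and are not taken from the real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_features(rows, target):
    """Return (feature, |r|) pairs sorted from most to least correlated with target."""
    features = [name for name in rows[0] if name != target]
    ys = [row[target] for row in rows]
    scores = []
    for name in features:
        xs = [row[name] for row in rows]
        scores.append((name, abs(pearson(xs, ys))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Invented toy records (assumption): NOT rows from the real dataset.
toy_rows = [
    {"age": 45, "thalachh": 172, "oldpeak": 0.0, "output": 1},
    {"age": 61, "thalachh": 120, "oldpeak": 2.5, "output": 0},
    {"age": 52, "thalachh": 165, "oldpeak": 0.4, "output": 1},
    {"age": 58, "thalachh": 130, "oldpeak": 1.8, "output": 0},
    {"age": 49, "thalachh": 158, "oldpeak": 0.6, "output": 1},
    {"age": 66, "thalachh": 112, "oldpeak": 3.0, "output": 0},
]

ranking = rank_features(toy_rows, "output")
for name, score in ranking:
    print(f"{name}: |r| = {score:.3f}")
```

Correlation only captures linear relationships with a numeric or binary target, so categorical attributes would need a different score in practice.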
III) Data Collection and Preprocessing
1 Sources
It encompasses key features such as age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, and previous peak. Additionally, it includes 'oldpeak' (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, the number of major vessels colored by fluoroscopy, and thalassemia, a blood disorder.
Rahman's dataset is particularly valuable for its detailed compilation of these variables, which are crucial in the field of medical research for predicting heart attack risks.
This dataset not only serves as a rich resource for data analysis and machine learning projects but also plays a significant role in advancing our understanding of cardiovascular diseases.
2 Data Preprocessing (ETL) Process Using Orange Software
● Step 1: Extract the data from the file. You can then view the data's structure, the number of data rows in the file, and the basic data status.
● Step 2: Use Feature Statistics to display the data in each column and assess the data's condition.
[Figure: Feature Statistics output with columns Name, Distribution, Mean, Mode, Median, Dispersion, Min, Max, Missing]
● Step 3: Load the data into the data table.
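The per-column check performed in the Feature Statistics step can be approximated in a few lines of Python. This is a minimal sketch and not Orange's implementation: it assumes each record is a dictionary and that a missing value is stored as `None`, and the sample records are invented for illustration.

```python
from statistics import mean, median

def column_summary(rows, column):
    """Summarise one column: central tendency, range, and missing-value count."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]  # drop missing entries
    return {
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "missing": len(values) - len(present),
    }

# Invented sample records (assumption): None marks a missing measurement.
sample_rows = [
    {"age": 50, "trtbps": 120},
    {"age": 60, "trtbps": None},
    {"age": 40, "trtbps": 140},
]

for column in ("age", "trtbps"):
    print(column, column_summary(sample_rows, column))
```

A non-zero `missing` count flags a column that needs imputation or row removal before it is loaded into the data table.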
IV) Data Structure
- The analysis is carried out using publicly available data for heart disease. The dataset consists of 14 attributes and 303 instances. There are 8 categorical attributes and 6 numeric attributes.
- Data analysis in healthcare assists in predicting diseases, improving diagnosis, analyzing symptoms, providing appropriate medicines, improving the quality of care, minimizing cost, extending the life span, and reducing the death rate of heart patients.
About this dataset
No. | Attribute Name | Description | Range of Values
1 | Age | Age of the person in years | 29 to 79
2 | Sex | Gender of the person | 0 = Female, 1 = Male
3 | Cp | Chest pain type | 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic
4 | Trtbps | Resting Blood Pressure in mm Hg | 94 to 200
…
7 | … | Resting electrocardiographic results | … 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria
8 | Thalachh | Maximum Heart Rate Achieved | 71 to 202
9 | Exng | Exercise Induced Angina | 0 = No, 1 = Yes
10 | OldPeak | ST depression induced by exercise relative to rest | 1 to 3
11 | Slp | Slope of the Peak Exercise ST segment | 1 to 3
12 | Caa | Number of major vessels | 0 to 3
13 | Thal | Thalassemia | 0 = null, 1 = fixed defect, 2 = normal, 3 = reversible defect
14 | output | Class Attribute | 0 = Normal, 1 = Patients diagnosed with heart disease
V) Analysis with Orange Software
1 Overview of Analysis
● Identification of Impactful Parameters:
○ Utilizing the k-means algorithm, we seek to identify clusters among individuals who may be at risk of experiencing heart attacks. Our objective is to unveil hidden patterns and gain valuable insights into the underlying factors associated with the likelihood of heart attacks.
○ The primary objective is to determine the three most impactful parameters affecting the possibility of a heart attack.
○ Explore correlations and relationships between different variables in the dataset to discern key contributors.
● Model Training and Prediction:
○ Train a predictive model using machine learning algorithms to forecast the likelihood of a patient experiencing a heart attack.
○ Evaluate the model's accuracy, precision, and recall to ensure reliable predictions.
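The three evaluation measures named above follow directly from the confusion-matrix counts. The sketch below is a minimal Python illustration, assuming binary labels in which 1 is the positive class (a diagnosed patient, matching the 'output' attribute); the label vectors are invented and are not results from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when no positives are predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Invented labels for illustration only (assumption).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening context, recall is often weighted more heavily than precision, because a missed at-risk patient (false negative) is usually costlier than a false alarm.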
2 Analysis using k-means algorithm
2.1 Number of Clusters
● After employing the k-means clustering algorithm on the dataset and assessing various cluster numbers, the analysis suggests that the optimal number of clusters is 2. This conclusion is based on the highest silhouette score obtained (see the figure), indicating more cohesive clusters at this configuration.
[Figure: Orange k-Means widget showing silhouette scores for each candidate number of clusters]
● With the optimal number of clusters identified, our next step is to explore the distinguishing features that separate these clusters. To achieve this, we employ Box Plots, where the data is subgrouped based on the 'Cluster' feature and the variable 'output'.
[Figure: Orange workflow: File → k-Means → Box Plot]
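The selection of the cluster count by silhouette score can be sketched in plain Python. This is a simplified stand-in, not Orange's k-Means widget: it assumes Euclidean distance, uses the first k points as initial centroids so that the run is deterministic, and works on invented two-dimensional points rather than the heart dataset.

```python
from math import dist

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(dist(p, q) for q in own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Invented points forming two well-separated groups (assumption).
points = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9), (2, 2), (9, 9)]
scores = {k: silhouette_score(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

A score close to 1 means points sit much nearer to their own cluster than to any other, which is the sense in which the highest-scoring configuration is the most cohesive.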
As we delve deeper into these clusters, the first group comprises patients with a lower than 50% probability of experiencing a heart attack (output = 0), indicating a lower risk profile. In contrast, the second cluster represents individuals with the opposite trend (output = 1), suggesting a higher likelihood of heart attacks. Utilizing Box Plots allows us to visually discern the key features contributing to the differentiation between these clusters.
2.2 Analyzing the 3 most impactful factors using box plots
Our analysis focuses on the three most impactful factors influencing the outcome. Employing Box Plots once again, we strategically order the variables based on their significance in differentiating the two clusters. The standout features deemed most relevant to the clusters are thalachh (maximum heart rate achieved during exercise), oldpeak (ST depression induced by exercise relative to rest), and slp (the slope of the ST segment during exercise). This targeted approach allows us to home in on key variables crucial for understanding and distinguishing outcomes within our dataset.
[Figure: Orange Box Plot widget with variables ordered by relevance to subgroups, subgrouped by Cluster]
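What a box plot displays for each subgroup is a five-number summary, which can be computed directly. The following is a minimal Python sketch, assuming the "inclusive" quartile method from the standard library; Orange may compute quartiles differently. The cluster labels and thalachh values are invented for illustration.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, quartiles, and maximum: the numbers a box plot draws."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")  # needs at least 2 values
    return {"min": ordered[0], "q1": q1, "median": q2, "q3": q3, "max": ordered[-1]}

def summarize_by_group(rows, variable, group_key):
    """Group the rows by group_key and summarise `variable` within each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[variable])
    return {group: five_number_summary(vals) for group, vals in groups.items()}

# Invented records (assumption): cluster names and heart rates are illustrative only.
rows = (
    [{"Cluster": "C1", "thalachh": v} for v in (110, 120, 130, 140, 150)]
    + [{"Cluster": "C2", "thalachh": v} for v in (150, 160, 170, 180, 190)]
)

summary = summarize_by_group(rows, "thalachh", "Cluster")
for cluster, stats in summary.items():
    print(cluster, stats)
```

When the interquartile ranges (q1 to q3) of two groups barely overlap, as in this toy example, the variable separates the groups well, which is the visual cue used to pick out the most relevant factors.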
○ Thalachh (Maximum Heart Rate):
Thalachh is the highest heart rate achieved during exercise. It is commonly used in assessing cardiovascular fitness and risk factors for heart-related issues. A higher thalachh during exercise may indicate better cardiovascular fitness, while abnormalities may suggest potential cardiovascular issues.