APPLICATION OF DATA MINING TECHNIQUES IN THE PREDICTION OF CORONARY ARTERY DISEASE: USE OF ANAESTHESIA TIME-SERIES AND PATIENT RISK FACTOR DATA

Ellen Pitt, B.Sc (Hons), M.B.,B.S (UQ), M.IT (QUT)
Dr Richi Nayak

Submitted in fulfilment of the requirements for the degree of Master of Information Technology (Research)

School of Information Systems
Faculty of Science and Technology
Queensland University of Technology
[2009]

Keywords

Anaesthesia, physiological data, time-series, clustering, feature selection, predictors of outcome, anaesthesia complications, cardiac risk factors, data mining

Abstract

The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data.

Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk
factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed.

The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and the Kappa statistic, together with misclassification rate and area under the ROC curve (AUC), are used for evaluation of models generated using different prediction algorithms.

The performance of models derived from feature reduced datasets reveals the filter method, Cfs subset evaluation, to be most consistently effective, although Consistency derived subsets tended towards slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This influence could be eliminated by consideration of the AUC or Kappa statistic as well as by evaluation of subsets with under-sampled majority class.

The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62, with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in a reduction in MR to between 9.8 and 10.16, with the MR for time segmented summary data (dataset F) being 9.8 and for raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-fold cross validation method.

For models based on Cfs selected
time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36, with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_G) perform the least well, with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28, while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant.

The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone.

The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the
use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.

Table of Contents

Keywords
Abstract
Table of Contents
List of Tables
List of Figures
List of Appendices
List of Abbreviations
Statement of Original Authorship
Acknowledgements

CHAPTER 1: INTRODUCTION
  1.1 Background
  1.2 Context
  1.3 Research Objective
  1.4 Research Questions
  1.5 Thesis Outline
  1.6 Significant Results
  1.7 Other Findings

CHAPTER 2: LITERATURE REVIEW
  2.1 Coronary Vascular Disease
    2.1.1 Impact of cardiovascular disease
    2.1.2 Risk factors and associated vascular disease
    2.1.3 Diagnostic methods
    2.1.4 Risk factor modification and revascularisation
  2.2 Anaesthesia
    2.2.1 Anaesthesia risk and complications
    2.2.2 Anaesthesia monitoring
    2.2.3 Choice of anaesthetic agent
    2.2.4 Quality assurance
    2.2.5 Summary
  2.3 Data Mining Process
    2.3.1 Data preparation
    2.3.2 Modelling
    2.3.3 Evaluation methods
  2.4 Related Work
    2.4.1 Issues in data mining medical databases
    2.4.2 Application in medical domain
    2.4.3 Summary
  2.5 Implications

CHAPTER 3: RESEARCH DESIGN
  3.1 Data Acquisition
  3.2 Data Selection
    3.2.1 Target variable selection
    3.2.2 Case selection and segmentation
    3.2.3 Variable selection
  3.3 Data Pre-Processing
    3.3.1 Data exploration
    3.3.2 Pre-processing tasks
  3.4 Data Modelling
    3.4.1 Datasets
    3.4.2 Feature selection / dimension reduction
    3.4.3 Modelling methods
  3.5 Performance Evaluation
  3.6 Post-Processing
  3.7 Ethics and Limitations
    3.7.1 Ethical considerations
    3.7.2 Limitations
  3.8 Conclusion

CHAPTER 4: DATA EXPLORATION
  4.1 Database Description
    4.1.1 Variable groups
  4.2 Time-Series Data
  4.3 Demographic and Clinical Characteristics
    4.3.1 Missing data
    4.3.2 Gender
    4.3.3 Age
    4.3.4 ASA class
    4.3.5 Case duration
    4.3.6 Vascular disease
    4.3.7 Risk factor characteristics
    4.3.8 Weight distribution
    4.3.9 Primary diagnosis group
  4.4 Conclusion

CHAPTER 5: DATA PRE-PROCESSING
  5.1 Data Selection
    5.1.1 Target selection
    5.1.2 Variable selection
    5.1.3 Case selection and risk segmentation
  5.2 Data Preparation
    5.2.1 Outlier and noise removal
    5.2.2 Time-series data reduction
    5.2.3 Imputation of missing values
    5.2.4 Feature selection
    5.2.5 Time-series dimension reduction methods
  5.3 Conclusions

CHAPTER 6: DATA MODELLING AND ANALYSIS
  6.1 Feature Selection Methods
    6.1.1 Model accuracy
    6.1.2 Model complexity
    6.1.3 Effect of feature selection on other measures of model performance
  6.2 Datasets
  6.3 Evaluation Measures
    6.3.1 Misclassification rate for balanced and unbalanced data
    6.3.2 Area under ROC curve in balanced and unbalanced data
    6.3.3 Sensitivity, specificity and predictive values
    6.3.4 Kappa statistic in unbalanced data
  6.4 Effect of ASA and its imputation
  6.5 Prediction Methods
    6.5.1 Comparison of methods from each class of prediction algorithms
    6.5.2 Comparison of decision tree and rule based prediction algorithms
  6.6 Effect of A Priori Risk Stratification
  6.7 Model Complexity
    6.7.1 Effect of dataset
    6.7.2 Effect of pre-processing method and risk factor data
  6.8 Comparison of Models for Prediction of corVD and anyVD
  6.9 Effect of Non Coronary Vascular Disease Status on Prediction of corVD
  6.10 Primary hypotheses and Statistical analyses
  6.11 Summary of Findings
  6.12 Summary

Appendices

Appendix N: Comparison of methods and dataset for risk category subsets (MR)

Comparison of Methods and Datasets for general population and risk categories (MR)

General CorVD High risk RF_A RF_F NB RF only 10.098 10.216 LogR 9.831 9.21 SimpleL R MLP 9.082 Nnge Low risk RF_A RF_F 10.274 RF only 17.618 18.923 9.298
18.434 16.803 9.1501 9.0909 19.413 17.292 10.157 9.121 9.0613 17.618 16.15 16.15 7.598 18.21 10.335 10.098 12.002 20.229 22.023 10.239 Very high risk 8.611 RF only 23.695 RF_A A 24.16 7.525 25.703 20.88 7.453 24.899 23.29 7.019 7.308 24.096 25.72 8.3575 8.1052 32.129 26.1 RF_A 19.576 RF only 8.358 8.466 16.476 7.598 7.091 16.313 7.598 6.983 RF_F
SMO 9.802 9.802 19.902 19.902 19.902 7.96 24.899 24.9
Ensemble 10.364 8.943 9.091 18.108 18.271 16.966 8.213 6.983 7.019 24.096 26.91
J48 10.424 8.854 9.091 18.76 17.781 18.597 8.2851 7.2359 6.8741 24.096 20.88
RF_A RF_F RF_F RF_F 27.569 25.286 11.035 10.636 RF only 28.514 RF_A 13.592 RF only 11.18 RF_A 14.184 RF only 26.264 RF_A NB RF only 15.601 26.91
LogR 14.746 13.977 13.651 28.059 26.917 27.08 11.18 10.492 10.239 27.711 27.31
SimpleLR 14.747 13.947 13.74 26.427 26.101 27.246 11.179 10.383 10.275 27.309 27.31
MLP 15.309 14.332 14.302 26.916 26.754 27.754 11.758 10.89 10.564 32.129 30.92
Nnge 204027 33.605 31.811 34.095 12.156 28.11 14.836 26.917 26.917 26.916 12.084 11.288 33.333
SMO 10.384 11.288 30.121 28.92
Ensemble 14.599 14.51 14.984 25.612 27.569 26.754 11.795 10.999 33.333 30.52
J48 14.806 15.399 13.829 26.264 26.264 24.959 11.541 10.637 33.735 30.52
AnyVD 11.758

Appendix O: Comparison of methods and datasets for risk category subsets (AUC)

Comparison of Methods and Datasets for general and high risk populations (AUC)

CorVD
General High risk Low risk Very high risk
RF only RF_A RF_F RF only RF_A RF_F RF only RF_A RF_F RF only RF_A
NB 0.841 0.879 0.88 0.801 0.802 0.819 0.851 0.877 0.876 0.729 0.755
LogR 0.839 0.873 0.875 0.808 0.82 0.83 0.849 0.878 0.878 0.754 0.808
SimpleLR 0.77 0.867 0.874 0.735 0.816 0.835 0.86 0.878 0.873 0.756 0.795
MLP 0.841 0.867 0.87 0.786 0.815 0.824 0.862 0.873 0.878 0.729 0.797
Nnge 0.573 0.632 0.657 0.607 0.62 0.574 0.66 0.672 0.63 0.635 0.67
SMO 0.5 0.5 0.598 0.598 0.584 0.5 0.667 0.667
Ensemble 0.747 0.854 0.868 0.791 0.78 0.765 0.807 0.824 0.847 0.739 0.776
J48 0.627 0.703 0.702 0.611 0.73 0.674 0.697 0.721 0.718 0.715 0.744
RF_A RF_F RF_F RF_F 0.788 0.79 0.871 0.875 RF only 0.791 RF_A 0.863 RF only 0.828 RF_A 0.854 RF only 0.785 RF_A NB RF only 0.826
LogR 0.824 0.858 0.859 0.781 0.785 0.786 0.827 0.873 0.871 0.789 0.8
SimpleLR 0.803 0.857 0.859 0.777 0.782 0.781 0.804 0.874 0.872 0.789 0.793
MLP 0.825 0.851 0.856 0.772 0.76 0.774 0.829 0.859 0.871 0.765 0.773
Nnge 0.628 0.649 0.661 0.643 0.697 0.874 0.718 0.641 0.684
SMO 0.673 0.724 0.724 0.724 0.697 0.697 0.697 0.684 0.68
Ensemble 0.806 0.852 0.844 0.783 0.781 0.766 0.803 0.862 0.851 0.74 0.738
J48 0.653 0.879 0.783 0.758 0.754 0.744 0.673 0.83 0.794 0.692 0.726
AnyVD 0.791

Appendix P: Comparison of methods and datasets for risk category subsets (Kappa statistic)

Comparison of Methods and Datasets for general and high risk population (Kappa statistic)

CorVD
General High risk Low risk Very high risk
RF only RF_A RF_F RF only RF_A RF_F RF only RF_A RF_F RF only RF_A
NB 0.434 0.43 0.432 0.341 0.366 0.362 0.449 0.443 0.44 0.3769 0.399
LogR 0.386 0.35 0.354 0.243 0.316 0.34 0.421 0.389 0.355 0.297 0.449
SimpleLR 0.316 0.352 0.341 0.229 0.305 0.349 0.421 0.406 0.361 0.3367 0.39
MLP 0.331 0.317 0.336 0.394 0.399 0.342 0.383 0.383 0.3581 0.365
Nnge 0.1245 0.311 0.216 0.2619 0.167 0.314 0.377 0.32 0.2598 0.353
SMO 0 0.226 0.226 0.199 0.3589 0.359
Ensemble 0.1315 0.349 0.346 0.223 0.1919 0.262 0.179 0.381 0.399 0.369 0.343
J48 0.153 0.34 0.341 0.307 0.317 0.259 0.277 0.36 0.357 0.3796 0.512
RF only RF_A RF_F RF only RF_A RF_F RF only RF_A RF_F RF only RF_A
NB 0.459 0.485 0.524 0.455 0.428 0.474 0.45 0.523 0.532 0.3702 0.421
LogR 0.409 0.481 0.488 0.408 0.428 0.43 0.45 0.495 0.501 0.3907 0.398
SimpleLR 0.409 0.478 0.481 0.447 0.449 0.427 0.45 0.497 0.499 0.4008 0.3982
MLP 0.451 0.48 0.489 0.436 0.426 0.439 0.416 0.476 0.476 0.2823 0.285
Nnge 0.277 0.298 0.327 0.287 0.441 0.497 0.467 0.2831 0.381
SMO 0.416 0.443 0.443 0.443 0.459 0.459 0.459 0.3633 0.368
Ensemble 0.416 0.469 0.452 0.457 0.412 0.432 0.45 0.447 0.465 0.2255 0.318
J48 0.416 0.431 0.476 0.446 0.436 0.465 0.422 0.449 0.4839 0.222 0.318 0.355
AnyVD

Appendix Q: Comparison of methods in prediction of anyVD in selected subsets

A: MR for general population and high risk subset, and low and very high risk subsets

[Figure: Use of MR for assessment of model performance for general and high risk subsets, anyVD. Y-axis: Misclassification Rate. Series: General RF only, General RF_A, General RF_F, High risk RF only, High risk RF_A, High risk RF_F.]

[Figure: Use of MR for assessment of model performance for low risk and very high risk subsets, anyVD. Y-axis: Misclassification Rate. Series: Low risk RF only, Low risk RF_A, Low risk RF_F, Very high risk RF only, Very high risk RF_A.]

B: AUC for general population and high risk subset, and low and very high risk subsets

[Figure: Use of AUC for assessment of model performance for general and high risk subsets, anyVD. Y-axis: Area under ROC curve. Series: General RF only, General RF_A, General RF_F, High risk RF only, High risk RF_A, High risk RF_F.]

[Figure: Use of AUC for assessment of model performance for low risk and very high risk subsets, anyVD. Y-axis: Area under ROC curve. Series: Low risk RF only, Low risk RF_A, Low risk RF_F, Very high risk RF only, Very high risk RF_A.]

C: Kappa statistic for general population and high risk subset, and low and very high risk subsets

[Figure: Use of Kappa statistic for assessment of model performance for general and high risk subsets, anyVD. Y-axis: Kappa statistic. Series: General RF only, General RF_A, General RF_F, High risk RF only, High risk RF_A, High risk RF_F.]

[Figure: Use of Kappa statistic for assessment of model performance for low risk and very high risk subsets, anyVD. Y-axis: Kappa statistic. Series: Low risk RF only, Low risk RF_A, Low risk RF_F, Very high risk RF only, Very high risk RF_A.]

Appendix R: Description of ST classes

ST Class
minST stC_x (strict baseline criteria) maxST # ST >1 #ST>2 stC_x chang change e 2 >0.99 >2 stC_xx 4 >5 (relaxed baseline criteria) 2 >5 stClass_X 2 (division of elevation into >2 or >1mm >-1.0 >1.99 >1.0 >0.99 >2 >2 stClass_XX -1.0 >2 >0.99 >2 0.99 >2 2 NonDiag stClass_4X Positive >5 NonDiag and requirement for additional count of significant ST Negative