Data analysis and modeling for engineering and medical applications

MELISSA ANGELINE SETIAWAN
(B.Tech, Bandung Institute of Technology, Bandung, Indonesia)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF CHEMICAL AND BIOMOLECULAR ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2009

ACKNOWLEDGEMENTS

First of all, I want to thank God, who has always been with me during my coursework and research, who gives me the health and ability to do all my work, who equips me with hope so that I can face failures and persist with my research, and who blesses me every single day of my life.

With all respect, I would like to acknowledge my supervisor, Dr Laksh, for his guidance during my research. I learnt a great deal from him, including how to be a good researcher, how to conduct research, how to be creative, how to motivate people and how to be a good teacher. He encouraged me during the difficult times I went through in the course of my research.

I would like to acknowledge my parents, my little sister and Yudi, who always support me in prayer, give advice, cheer me up whenever I feel down, and remind me not to lose hope. Thank you for your love, support, advice, concern, encouragement, and prayers.

I also want to thank NUS and AUN-SEED Net for giving me the scholarship and the opportunity to pursue my M.Eng degree through research.

I want to take this opportunity to acknowledge all my labmates, particularly Raghu, who equipped me with professional skills, and Yelneedi Sreenivas and Sundar Raj Thangavelu, who always came up with jokes and made the atmosphere in our lab so cheerful. Thanks to Kanchi Lakshmi Kiran, May Su Tun and Loganathan for discussions that turned out to be really useful for me. Thank you all for your friendship; I really enjoyed our time together in the IPC group.
Last but not least, I would like to thank all my best friends who are not mentioned explicitly by name. I thank each of you for your encouragement, support, suggestions, attention, and friendship.

CONTENTS

ACKNOWLEDGEMENTS
CONTENTS
SUMMARY
NOMENCLATURE
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
   1.1 INFORMATION BASED SOCIETY – RESEARCH BACKGROUND
   1.2 ANALYSIS TECHNIQUES IN DATA RICH AREA – PROBLEM DEFINITION
   1.3 MOTIVATION AND CONTRIBUTIONS
   1.4 CHALLENGES IN DATA ANALYSIS AND MODELING WORK
   1.5 SCOPE OF PRESENT WORK
   1.6 ORGANIZATION OF THE THESIS
2. SUPERVISED PATTERN RECOGNITION
   2.1 VARIABLE SELECTION
      2.1.1 Fisher criterion
      2.1.2 Entropy method
      2.1.3 Single variable ranking (SVR)
      2.1.4 Partial Correlation Coefficient Metric (PCCM)
   2.2 MACHINE LEARNING METHODS
      2.2.1 Artificial Neural Network (ANN)
      2.2.2 TreeNet
      2.2.3 Classification and Regression Trees (CART)
      2.2.4 Linear/Quadratic Discriminant Analysis (LDA/QDA)
      2.2.5 Variable Predictive Model based Class Discrimination (VPMCD)
      2.2.6 K-nearest neighbour (K-NN)
      2.2.7 Support Vector Machine (SVM)
   2.3 MODEL VALIDATION
      2.3.1 Resubstitution test
      2.3.2 N-fold Cross-validation
      2.3.3 Independent Test
      2.3.4 Leave one out cross-validation (LOOCV) test
3. PARTIAL CORRELATION METRIC BASED CLASSIFIER FOR FOOD PRODUCT CHARACTERIZATION
   3.1 INTRODUCTION
   3.2 METHODS
      3.2.1 Concept of partial correlation coefficients
      3.2.2 Discriminating Partial Correlation Coefficient Metric (DPCCM)
      3.2.3 DPCCM Algorithm
      3.2.4 DPCCM illustration with Iris data
      3.2.5 Other classifiers used for comparison
      3.2.6 Validation methods
         3.2.6.1 Re-Substitution Test
         3.2.6.2 Random Sample Validation Test
   3.3 MATERIAL
      3.3.1 Datasets
      3.3.2 Implementation
   3.4 RESULTS
4. ANALYSIS OF BIOMEDICAL DATA
   4.1 INTRODUCTION
   4.2 METHODS
      4.2.1 Classification Methods
      4.2.2 Variable Selection Methods
   4.3 MATERIALS AND IMPLEMENTATION
      4.3.1 Datasets
         4.3.1.1 Anesthesia Dataset
         4.3.1.2 Wisconsin Breast Cancer (WBC) dataset
         4.3.1.3 Wisconsin Diagnostic Breast Cancer (WDBC) dataset
         4.3.1.4 Heart Disease dataset
      4.3.2 Implementation
      4.3.3 Model Development
      4.3.4 Validation Testing
      4.3.5 Variable Selection
      4.3.6 Software
   4.4 RESULTS
      4.4.1 Parameter Tuning
      4.4.2 Test set Analysis
         4.4.2.1 DOA classification
         4.4.2.2 Classification with WBC dataset
         4.4.2.3 Classification with WDBC dataset
         4.4.2.4 Heart Disease Identification
      4.4.3 Variable Selection
5. EMPIRICAL MODELING OF DIABETIC PATIENT DATA
   5.1 INTRODUCTION
   5.2 FIRST ORDER PLUS TIME DELAY (FOPTD) MODEL
   5.3 MATERIALS AND IMPLEMENTATION
      5.3.1 Dataset and Software
      5.3.2 FOPTD Implementation
   5.4 RESULTS AND DISCUSSION
      5.4.1 Patients with Continuous Insulin Infusion (Group 1)
      5.4.2 Patients with Intermittent Insulin Infusion (Group 2)
      5.4.3 Patients with Blood Glucose Response Affected by Other Factors (Group 3)
      5.4.4 Medication Effect
      5.4.5 Analysis of Home Monitoring Diabetes Data
6. CONCLUSIONS AND RECOMMENDATIONS
   6.1 CONCLUSIONS
   6.2 RECOMMENDATIONS
REFERENCES
APPENDIX A. CV of the Author

SUMMARY

The information revolution has slowly but surely turned us into an information based society. As a result, the collection and interpretation of data (one form or source of information) plays an important role in obtaining good information. In this thesis, several machine learning techniques are described and applied to classification problems in the food industry and the medical field. In addition, the use of a First Order Plus Time Delay (FOPTD) model for the blood glucose of ICU patients is proposed.
In the present study, a newly developed classifier (DPCCM) is used to address both the Cheese and Wine identification problems and disease identification problems (using the WBC and WDBC datasets). Its performance is then compared with that of other well established classification methods. For the Cheese and Wine identification problems, the comparison shows that DPCCM performs better than linear classifiers and comparably to non-linear SVM classifiers. It also provides good visualization for understanding the specific variable interactions that contribute to the nature of each class. DPCCM performs consistently in the disease identification problems as well, where its overall accuracy is better than that of the other classifiers used in this study. To conclude, DPCCM shows potential as an efficient data analysis tool for both clinical diagnosis and food product characterization.

The performance of machine learning techniques in the medical field is also analyzed by applying some of these techniques to depth of anesthesia (DOA) classification and heart disease identification. According to our analysis, in terms of overall accuracy, CART and QDA are the best classifier models for DOA classification using cardiovascular features and AEP features respectively. The superiority of CART and QDA on the cardiovascular and AEP features respectively is confirmed even when the classifiers are built using only a subset of the features. Our analysis of heart disease identification shows that TreeNet gives much better overall accuracy than CART, although its classification performance for class 2 is lower.

The last stage of this study is to model the blood glucose values of ICU patients using FOPTD (First Order Plus Time Delay) as the proposed model. The performance of FOPTD is then compared with that of the Bergman and Chase models.
According to the study, FOPTD successfully fits and predicts the actual patient data for all datasets received from the hospital. In addition, its performance is much better than that of the other two established models, not only for good datasets but also for atypical datasets. Moreover, its simplicity makes the model easy to apply and to modify according to the inputs available in the dataset.

NOMENCLATURE

A, B, C, X, Z – selected variables in a given system
AEP, CV, WBC, WDBC, HEART – subscripts used to identify the name of the dataset
AEP – Auditory Evoked Potential
ANN – Artificial Neural Network
CART – Classification and Regression Trees
CO – Cost Optimization
CoV – Coefficient of Variance
DOA – Depth of Anesthesia
DPCCM – Discriminating Partial Correlation Coefficient Metric
FC – Fisher Criteria
FOPTD – First Order Plus Time Delay
HR – Heart Rate
LDA – Linear Discriminant Analysis
M – correlation coefficient matrix
MAE – Mean Absolute Error
MAP – Mean Arterial Pressure
N – data matrices used in training
P – data matrices
PCCM – Partial Correlation Coefficient Metric
PNN – Probabilistic Neural Network
QDA – Quadratic Discriminant Analysis
SAP – Systolic Arterial Pressure
SVM – Support Vector Machines
SVR – Single Variable Ranking
VPMCD – Variable Predictive Model based Class Discrimination
WBC – Wisconsin Breast Cancer dataset
WDBC – Wisconsin Diagnostic Breast Cancer dataset
d – number of correlations defined in the system
i, j, k – subscripts used to identify the variables
k – number of classes
l – number of samples in a class
n – number of observations
p – number of variables
r – correlation coefficient
r – subscript used to represent the reduced dataset
test – subscript used to represent test data matrices used in model validation
x – order of partial correlation

LIST OF TABLES

Table 3.1 Classification result for case study I (WINE classification)
Table 3.2 Classification result for case study II (CHEESE classification)
Table 4.1 Summary of parameter tuning result using validation dataset for anesthesia
Table 4.2 Summary of parameter tuning result using validation dataset for breast cancer
Table 4.3 Summary of parameter tuning result using validation dataset for heart disease
Table 4.4 Classification result (correct classification) on test set using cardiovascular features as predictors
Table 4.5 Classification results (correct classification) on test set using AEP features as predictors
Table 4.6 Sensitivity and specificity values for each classifier in DOA classification
Table 4.7 Analysis result for WBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD
Table 4.8 Analysis result for WDBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD
Table 4.9 Classification result on heart disease dataset using CART and TreeNet
Table 4.10 Variables selected from 10 AEP features using different selection methods
Table 4.11 Variables selected from 3 variables in cardiovascular dataset using different selection methods
Table 4.12 Model accuracy using selected variables (AEP dataset)
Table 4.13 Model accuracy using selected variables (cardiovascular dataset)
Table 5.1 MAE values for training and test samples using data from patients with continuous insulin infusion
Table 5.2 MAE values for training and test samples using patient data with intermittent insulin infusion
Table 5.3 MAE values for training and test samples using Group 3 patient data
Table 5.4 Range of the parameters for each patient group
Table 5.5 MAE value for training and test samples using home monitoring data
Table 5.6 Range of estimated parameters for home monitoring data

LIST OF FIGURES

Fig. 3.1 PCCM profiles for IRIS data
Fig. 3.2 Variable correlation shade map for each class in CHEESE classification dataset
Fig. 5.1 FOPTD model scheme (MISO system)
Fig. 5.2 Data from Patient 1 who belongs to the first Group
Fig. 5.3 Results for the “best” patient data set using the FOPTD model
Fig. 5.4 Results for the “worst” patient data set using the FOPTD model
Fig. 5.5 Results for the “best” patient data set using the FOPTD model (Intermittent Insulin Infusion)
Fig. 5.6 Model performance on the “best” patient data from Group 3
Fig. 5.7 FOPTD prediction without medication for Patient 27
Fig. 5.8 FOPTD prediction with medication for Patient 27
Fig. 5.9 FOPTD prediction without medication for Patient 34
Fig. 5.10 FOPTD prediction with medication for Patient 34
Fig. 5.11 Results with the FOPTD model for the patient with the highest MAE (home monitoring dataset)
Fig. 5.12 Results with the FOPTD model for the patient with the lowest MAE (home monitoring dataset)
Fig. 5.13 Actual glucose and model fit for all 5 home monitoring patients
Fig. 5.14 Actual glucose and model prediction for all 5 home monitoring patients

Chapter 1
Introduction

As a general rule, the most successful man in life is the man who has the best information
Benjamin Disraeli (1804-1881)
Former British Prime Minister

1.1 Information Based Society – Research Background

Fishing and hunting marked the first stage in human history, when humans were primarily engaged in efforts to fulfill their nutritional needs. The increase in population led to the use of agriculture and the domestication of animals. Later, improvements in human creativity and ways of thinking initiated the advancement of civilization. Together with the invention and use of tools made from stone, wood and their derivatives, this advancement led to the invention and progress of technology. One of the biggest events marking technological advancement, which took place in the late 18th century, was the industrial revolution (Halsall, 1997; Gascoigne, 2008).
In the early stages of the industrial revolution, which began in Great Britain (circa 1730), machines were introduced to industry through the invention of the steam engine. This turning point, a great transition from industry based on manual labor to a machine based manufacturing environment, had both positive and negative impacts on the society of that time. The continuous development and improvement of machines has since transformed the lifestyle of society (Kelly, 2001). Dr. Earl H. Tilford (2000) writes about an unnoticed impact of the industrial revolution that is currently underway: the information revolution.

The information revolution has slowly turned us into an information based society. While ‘information’ was always useful for human development, it is becoming a basic need along with food, clothing and shelter. Two facts that highlight the importance of information in today’s drive towards a knowledge based economy are the ubiquitous cell phone and the exponential increase in the use of the internet. Ten years ago, the cell phone was not that common, and its unaffordable price made it a luxury item. The growing human need for information has encouraged cell phone manufacturers to provide additional features such as radio, internet access (Wi-Fi), Bluetooth, street directories and GPS at low cost. As a result, almost everyone owns a cell phone nowadays, even in developing countries. In addition, the development of the internet has paved the way for quicker and more reliable information exchange through resources and services such as electronic mail, online chatting, file transfer, file sharing, and other World Wide Web (WWW) resources. According to internet world usage statistics, the number of internet users doubled over 8 years (2000-2008). In Africa and the Middle East, the number of internet users even grew by 1000% during the same period (Anonymous, 2001).
These facts highlight the huge “need” for information among people and provide solid evidence that our society is transforming into an “information based society”. As a result of this transformation, data and information have a great effect on decision making in various spheres of human activity. To satisfy this hunger for accurate and quick information, methodologies that can generate accurate information from raw data must be developed.

1.2 Analysis Techniques in Data Rich Area – Problem Definition

High quality information at high speed is sought by people in all walks of life, and even more so by people engaged in business, research, or manufacturing. Before we discuss information, its existence and its importance any further, it is useful to define it. The Oxford English Dictionary defines information as things that are conveyed or represented by a particular sequence of symbols, impulses, etc. (Oxford, 2005). Based on this definition, we can conclude that data is one form or source of information. Consequently, data collection and interpretation play an important role in obtaining good information.

Even 10-20 years ago, data was scarce because analytical instruments were relatively unavailable. Where an instrument existed, its capability was very limited and it took a long time to produce results. For example, to check for the existence of cancer cells, the doctor had to take sample cells from the organ and check them manually for abnormalities using a microscope. This procedure could take one or two days per sample. The complexity of this conventional method made it overwhelming when the physician had to differentiate between two nearly identical cancers in order to give the patient the right treatment. Fortunately, improvements in technology now allow samples to be collected in a short time.
Modern instruments that can analyze several samples simultaneously and provide results within minutes are now available. This has resulted in a deluge of data and hence a new problem: sifting through this mass of data and extracting useful information from it can be a formidable challenge. This is true of data sets arising from life sciences, chemistry, pharmaceutics (drug discovery), process operations and even medicine. Methods that can extract useful information from data are needed and are being actively developed by many research groups.

1.3 Motivation and Contributions

The abundance of data available, especially in the food engineering and medical sectors, has become a significant problem because these data contain precious information. Since this information helps doctors and food engineers make good decisions, which in turn lead to improvements in those areas, it has to be extracted from the datasets. The need for information extraction is the main motivation for this research. The research is intended as a contribution to food engineers and medical practitioners, and through them to society, especially in the areas of food quality and medicine.

Accurate classification for food product characterization using data mining techniques may help quality control in the food industry at a lower cost than human tasters. Production costs could therefore be lower, and selling prices could be reduced for the benefit of the consumer. The fact that machine learning techniques can be used accurately for disease identification and DOA classification is important not only for the doctor but also for the patient. The doctor may apply a machine learning technique and use the result as a basis for deciding whether or not the patient needs further treatment.
In addition, the use of machine learning techniques could benefit patients because they would not have to take so many medical tests, which are time consuming and costly. The ability of the First Order Plus Time Delay (FOPTD) model to describe the blood glucose values of ICU patients as a function of food, glucose and insulin could help the doctor predict the amount of glucose and insulin to be administered to the patient so as to avoid hypoglycemia and hyperglycemia. This could in turn increase the number of patients who survive in the ICU.

1.4 Challenges in Data Analysis and Modeling Work

Data analysis and modeling work faces several challenges. The main one relates to dealing with data complexity. The success of data analysis and modeling efforts depends heavily on the data set itself. Poor quality and/or quantity of data, as well as missing data, can make data analysis even harder. Some biological and medical datasets are very large, so that some computers cannot handle them owing to hardware and software limitations. Unknown noise and disturbances affecting the system can make modeling difficult even if a sufficient number of samples is available. In addition, the complexity of the physical, chemical and biological phenomena occurring inside the system accentuates the modeling difficulties. To keep the model simple, data pretreatment methods such as filtering, sample selection and variable selection may be needed as well.

1.5 Scope of Present Work

The following work related to data analysis and information extraction is addressed in the present study:

• Evaluating the performance of a newly developed method (DPCCM) by implementing it on problems from various domains such as food quality and medicine (cancer identification and depth of anesthesia classification) and comparing its performance with some existing leading machine learning methods.
• Applying and evaluating selected variable selection methods to improve classifier performance on medical data sets.
• Identifying the limitations of existing blood glucose modeling methods for diabetics (surgical ICU patients and patients under home monitoring) and evaluating a new modeling methodology.

Section 1.6 provides more detailed information on this work. The present work focuses mainly on information extraction and data analysis, covering food product characterization problems, early identification of some chronic illnesses, DOA (depth of anesthesia) level maintenance and blood glucose modeling in diabetic patients. Various existing classification, variable selection, and model fitting methods are studied.

1.6 Organization of the Thesis

Chapter 2 of this thesis provides an overview of existing data analysis methods. Both variable selection methods and classification methods are reviewed. For each method, the basic working principle and its limitations and advantages are discussed. A newly proposed classification methodology, DPCCM, is introduced in Chapter 3, where its performance is compared with that of existing and established classification methods such as CART, TreeNet, and LDA. Chapter 4 discusses data mining in the context of medical applications. Several classification methods are applied and evaluated for early detection of cancer, heart disease identification and DOA level maintenance during surgery. The role of variable selection methods in classifier performance is also addressed there. Following the classification and data analysis work, Chapter 5 considers the challenging task of modeling blood glucose data from ICU patients and patients under home monitoring. Chapter 6 contains the conclusions, a summary of the contributions and possible future work.
Chapter 2
Supervised Pattern Recognition

The difficulty of literature is not to write, but to write what you mean; not to affect your reader but to affect him precisely as you wish
Robert Louis Stevenson (1850-1894)
Scottish essayist, poet and book author

Machine learning and data analysis work by learning from historical or past experimental data. With supervised pattern recognition, the outcome can be predicted from the information available on the attributes (inputs). Currently, many problems in the manufacturing, business and medical domains (e.g. process monitoring, disease detection and depth of anesthesia (DOA) estimation) are classification problems. For such problems, supervised pattern recognition uses data from past and existing samples in each class to build discrimination rules or models with which one can distinguish between classes. The aim of constructing a classifier is to predict the class to which new samples belong. With this prediction, the analyst is able to take the best next step (Berrueta et al., 2007). Data analysis is therefore useful for decision making and can help to improve industrial processes, medical treatment and business outcomes.

Some supervised pattern recognition methods exploit the inter-class variations existing in the samples to build the classification model. In this case, the classifier tries to identify the main differences between classes. These discriminating conditions are then applied to a new sample, which is classified accordingly. The Classification and Regression Tree (CART) method applies this approach. On the other hand, methods such as Variable Predictive Model based Class Discrimination (VPMCD) make use of the specific similarities that exist within each class to build the classification model. VPMCD essentially tries to find the similarities that exist between the samples in each class.
When a new sample arrives, it is checked for its class-specific properties and then assigned to the corresponding class.

Berrueta et al. (2007) state that data analysis can be envisioned as four algorithmic steps. The first is data set division. In this step, the complete data set is usually divided into a training set and a validation set (or test set). A typical split is 80% for the training set and 20% for the test set (or 75% and 25%). The training set is then used to build the classification model and the test set is kept aside for validation purposes.

The second step is data pretreatment. This step facilitates the next step, namely classification or information extraction, and helps avoid drawing wrong conclusions from the dataset (Berrueta et al., 2007). Common data pretreatment methods available for multivariate data analysis include scaling, weighting, missing data handling and variable selection. In an experiment, some features or attributes may be measured and characterized using different instruments or machines. Also, the recorded variables may have different orders of magnitude. In such cases, weighting and scaling are usually applied to put the input variables on the same basis. In weighting, different weights can be assigned to each variable such that they have appropriate contributions to the output (weighting is related to scaling). Some examples of scaling methods are mean centering (subtracting the variable's average from each of its values), standardization (dividing the mean-centered value by the variable's standard deviation), normalization (dividing all values of each variable by the square root of its sum of squares), and variable normalization (variables are normalized with respect to a single variable) (Berrueta et al., 2007).

Data received from hospitals and other sources may also contain missing data. Data imputation is one method developed to handle missing data.
It replaces each missing value with an estimated value. Some techniques replace the missing value with the mean value of the variable (Little and Rubin, 1986; Zhang et al., 2008). However, this method assumes there are no dependencies between the variables and may distort other statistical properties of the data. Another well-known imputation method is hot deck imputation. In this method, a missing value is replaced with the value from another row that is similar to the row with the missing value (Rilley, 1993; Dahl, 2007). Regression imputation and decision tree imputation can also be used to predict missing values. In regression imputation, missing data are predicted by a regression equation built using the other variables, which contain no missing values. Similarly, for decision tree imputation, a decision tree is built using the rows that have no missing values, with the variable containing missing values acting as the target variable. The missing value is then predicted by applying this decision tree to the row with the missing value (Jagannathan and Wright, 2008).

Variable selection is needed when dealing with huge datasets so as to minimize the computational time and make model or classifier construction relatively easy. Variable selection is discussed in detail in section 2.1. In this thesis, we focus only on variable selection (Chapter 4) and centering (Chapter 5) because the datasets used are relatively large and contain no missing data.

The third step is classification model building. In this step, all the information contained in the training data set (excluding the test set) is used to build the classification model. Once the classification model is constructed, the data analyst proceeds to the last stage, the crucial validation step. The model obtained from the previous stage is tested using the test data set. The accuracy and other characteristics of the classifier are then noted and reported as “classifier performance”.
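The data division and pretreatment steps described above can be sketched as follows. This is a minimal illustration and not the code used in this thesis: the function names, the toy data and the 80/20 split are assumptions, and standardization is shown only as one of the scaling options listed earlier. The scaling statistics are estimated from the training rows alone and then applied unchanged to the test rows, so that the test set stays untouched during model building.

```python
import random
import statistics

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split the rows into training and test sets."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_train = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in order[:n_train]]
    test = [rows[i] for i in order[n_train:]]
    return train, test

def fit_scaler(train):
    """Column means and sample standard deviations from the training rows only."""
    columns = list(zip(*train))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds

def standardize(rows, means, stds):
    """Mean-center each column and divide by its standard deviation."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: two variables with very different orders of magnitude.
data = [[float(i), 100.0 * (i % 7) + 3.0] for i in range(20)]
train, test = split_rows(data)
means, stds = fit_scaler(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the training statistics
```

After scaling, both training columns have zero mean and unit standard deviation, while the test columns generally do not, which is expected because they were scaled with the training statistics.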
The data analysis steps are explained in more detail in sections 2.1 to 2.3.

2.1 Variable selection

One of the biggest challenges faced by almost all classifiers relates to the size of the data set. To create a good and robust classifier, we need a data set that is rich in both quality and quantity. A data set with few samples gives the classifier insufficient classification information, so its performance will be low. Large data sets with many variables can potentially provide enough information, but the analysis will be time consuming and computationally expensive. Therefore, in problems involving data sets with a large number of variables (e.g. microarray data), the most common data pretreatment method is variable selection. Only important “discriminating variables” are processed by the classification algorithm.

Variable selection is not an absolute requirement for classifier development or, as a matter of fact, for any data analysis activity. However, variable selection can sometimes boost classifier performance, especially if it is applied to a data set containing noise. Through this step, variables that contain noise, carry redundant information or lack discriminating ability are removed from the data set. This reduces the input space so that building the classifier model becomes easier, faster and even more accurate. In addition, identification of important variables may give better information for performing a more accurate classification (Cheng et al., 2006). It is understood that pretreatment must be done in the same manner on both the training and test sets. We now review some variable selection methods.

2.1.1 Fisher criterion

The Fisher criterion is defined as the ratio of “between-class” to “within-class” variance (Wang et al., 2008). This criterion is maximized by Linear Discriminant Analysis (LDA) (Duda et al., 2000) to identify the best separation plane by weighting the predictor variables.
Therefore, after the plane is built, each variable has its own weight factor. These weight factors are then used as the basis for ranking the variables. Since this approach is derived from LDA and Quadratic Discriminant Analysis (QDA) concepts, the chosen variables will be biased towards the LDA and QDA classification methods. Therefore, this variable selection method will generally boost LDA and QDA performance. However, it is not uncommon for the Fisher criterion combined with a classifier other than LDA/QDA to give a classification result that is even better than the Fisher criterion combined with LDA/QDA.

2.1.2 Entropy method

Entropy, as a variable ranking method, is basically a part of the CART algorithm. Since it works in line with the CART classifier, the best variable set chosen will provide enough information for CART to perform a good classification. It is therefore not surprising that entropy is usually a useful method for improving CART performance. As in CART, the first step of this algorithm calculates, for every variable, an entropy value (Ebrahimi et al., 1999) which signifies the randomness in that variable. The variables are then ranked based on their entropy values. The greater the entropy value, the more potential a variable has as a class separator.

2.1.3 Single variable ranking (SVR)

SVR is a univariate approach derived from LDA and QDA. In SVR, a single predictor variable is used to build an LDA model, which is then tested to determine its classification accuracy. This LDA model building and testing is repeated independently for each predictor variable so that a classification accuracy is obtained for every variable. The variables are then ranked based on these prediction accuracy values. The SVR approach provides a good measure of variable influence on classification in line with the principle of LDA classification.
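The single variable ranking idea can be sketched as below. This is a simplified illustration under stated assumptions rather than the procedure used in this thesis: the one-variable LDA model is replaced by a nearest-class-mean rule (which is what one-dimensional LDA reduces to when the classes are assumed to share a common variance and have equal priors), accuracy is measured by resubstitution for brevity, and all function names and data are hypothetical.

```python
def class_means(values, labels):
    """Mean of one variable within each class."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def single_variable_accuracy(values, labels):
    """Resubstitution accuracy of a nearest-class-mean rule on one variable."""
    means = class_means(values, labels)
    correct = 0
    for v, y in zip(values, labels):
        predicted = min(means, key=lambda c: abs(v - means[c]))
        correct += predicted == y
    return correct / len(values)

def rank_variables(X, labels):
    """Return (column index, accuracy) pairs, best variable first."""
    columns = list(zip(*X))
    scores = [(j, single_variable_accuracy(col, labels))
              for j, col in enumerate(columns)]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical two-class data: column 0 separates the classes, column 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 4.0], [3.1, 2.0], [2.9, 5.5], [3.3, 1.5]]
labels = ["a", "a", "a", "b", "b", "b"]
ranking = rank_variables(X, labels)
```

In practice the per-variable accuracy would be estimated on held-out data or by cross-validation (section 2.3) rather than by resubstitution.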
2.1.4 Partial Correlation Coefficient Metric (PCCM)

In the PCCM method, partial correlation coefficients of orders 0, 1 and 2 are calculated between different pairs of variables. The resulting multivariate associations (in the form of edges on a node in the association network) are then used as the basis for variable ranking (Raghuraj Rao and Lakshminarayanan, 2007a). PCCM as a data pretreatment can potentially influence approaches based on variable interactions, such as VPMCD and Artificial Neural Networks (ANN).

After a variable selection method has been applied, the training data are ready to be processed by the chosen machine learning method to build a classification model. Some popular and effective machine learning methods are described next.

2.2 Machine Learning Methods

Once the data set is ready for further analysis, the training data are subjected to a suitable supervised pattern recognition method to build a classification model. As discussed earlier, the test set is kept aside during model building.

2.2.1 Artificial Neural Network (ANN)

The Artificial Neural Network (Razi and Athappilly, 2005; Berrueta et al., 2007) is a widely used black box machine learning method since it is insensitive to noise, has a high tolerance to data complexity and handles non-linearities in the data set quite naturally. An ANN comprises an input layer representing the input variable nodes, a set of hidden layers with computational neurons and an output layer. The performance of a neural network is sensitive to the number of hidden layers used while building the network. A larger number of hidden layers can lead to overfitting, while a smaller number can reduce prediction accuracy.
In this study, we utilize a back-propagation neural network in which the weight values (the coefficients of the connections between nodes) are adjusted during training by propagating the error (the difference between the network output and the true diagnoses available in the training dataset) backward through the network (Statnikov et al., 2005). This learning process identifies the matrix of weights that gives the best fit to the training data (Berrueta et al., 2007).

2.2.2 TreeNet

TreeNet (Friedman, 1999) applies a slow learning process leading to a network of several (possibly hundreds of) small trees (see the description of Classification and Regression Trees below). Each tree makes a small contribution towards the final model (Raj Kiran and Ravi, 2008). The trees usually have fewer than 8 terminal nodes and the final model is similar in spirit to a long series expansion (such as a Fourier or Taylor series expansion): a sum of factors that becomes progressively more accurate as the expansion continues. Therefore, the more trees used in building the network, the better the fit to the data. Since TreeNet is equipped with a self-test ability, it is able to prevent overfitting. Some of TreeNet’s advantages are fast model generation, automatic selection of predictors, simple data pretreatment steps, easy handling of missing values, and robustness to partially accurate data.

TreeNet is also equipped with a cost tab which facilitates model building. The basic idea of the cost tab is to assign a larger cost to misclassification of one particular class than to the other classes. The resulting model then gives good accuracy for that particular class, at the expense of the accuracy of the other classes. The cost tab is useful when dealing with medical data sets which need more accuracy for one class of patients (e.g. patients with a certain disease) than for others (e.g. healthy subjects).
2.2.3 Classification and Regression Trees (CART)

CART (Breiman et al., 1983) is a supervised pattern recognition method which has been used to extract useful information not only from chemical process datasets (Saraiva and Stephanopoulos, 1992) but also from medical record data sets (Kurt et al., 2008). The extracted information is presented as classification rules in the form of a tree. When the target variable is discrete or categorical (such as DOA level), classification trees are developed; when the target variable is continuous, regression trees are constructed (Deconinck et al., 2005). Because its outcome is a set of classification rules, CART is categorized as a white box classifier. An advantage over other classifiers is that the rules can be easily applied to assign a new sample to its corresponding class. It is therefore not surprising that CART is widely used to generate rules for process improvement based on historical plant data (Bevilacqua et al., 2003; Tittonell et al., 2008), safety management (Bevilacqua et al., 2008), product quality prediction (Rousu et al., 2003) or early detection of cancer based on medical record data (Spurgeon et al., 2006; Kojima et al., 2008). Other advantages of CART as a tree building algorithm are its ability to handle missing data and nonlinear relationships between input and output variables.

Given a set of training data, CART chooses the variable in the feature matrix (X) that has the potential to be the best separator by measuring diversity. There are 3 diversity measures available in CART, and each generates its own tree, which differs from the others (Kurt et al., 2008). The tree generated by the Gini index tends to separate the class with the largest population first, followed by the class with the next smaller population and so on, down to the class with the smallest population at the bottom of the tree. The second diversity measure is entropy.
In this method, the entropy value of each variable is calculated and all variables are ranked by entropy value from the highest to the lowest. The tree (with the entropy diversity measure as the basis) is then built by using the variable with the highest entropy value as the best separator, followed by the second best separator and so on. The last diversity measure is the twoing method. This method tends to build a tree that, at each step, separates half of the classes available in the data from the other half.

Using the best variable, a rule is then constructed to separate one class from another. This condition becomes the initial node of the tree and is split further based on the logical outcome of the decision for the condition. This binary splitting process proceeds recursively from the top of the tree to the bottom until the population of each terminal node is nearly homogeneous. The resulting tree is called the maximal tree; it may suffer from overfitting, especially in high dimensional datasets with multivariate interactions between variables. To overcome this problem, the tree must be pruned. Here, we employ the minimal cost pruning method, which prunes branches in a manner that does not significantly affect the prediction accuracy of the tree. To select the optimal pruned tree for classification of new samples, either a cross-validation test or validation with fresh test data can be utilized. Like TreeNet, CART is also equipped with a cost tab for applications where higher prediction accuracies are sought for some specific classes.

2.2.4 Linear/Quadratic Discriminant Analysis (LDA/QDA)

Linear Discriminant Analysis (LDA) (Duda et al., 2000; Roggo et al., 2007) is one of the most common machine learning techniques used for classification.
LDA weighs all variables to identify separating planes between classes by maximizing the ratio of “between-class variance” to “within-class variance”. The main assumption used in LDA is that the class conditionals follow a Gaussian distribution (Wang et al., 2008). Since LDA is a linear classifier, its performance is generally very good for linearly separable datasets. However, the presence of overlapping samples belonging to different classes, which cannot be separated linearly in the descriptor space, degrades LDA’s performance.

Another technique available for classification is Quadratic Discriminant Analysis (QDA). QDA (Duda et al., 2000; Roggo et al., 2007) was developed to handle situations in which the classes are not linearly separable. As a non-linear classifier, QDA constructs a quadratic boundary that maximizes “between-class variance” and minimizes “within-class variance” in the projected scores. The assumption that the class conditionals follow a Gaussian distribution is still used in QDA. However, unlike LDA, it tolerates differences in the covariance matrices of the various classes (Wang et al., 2008). LDA and QDA generally perform well in problems which have more samples than variables (Berrueta et al., 2007).

2.2.5 Variable Predictive Model based Class Discrimination (VPMCD)

VPMCD, proposed recently by Raghuraj Rao and Lakshminarayanan (2007b), is a parametric supervised pattern recognition method. The main assumption used in developing this classifier is that the predictor variables depend on one another and that each class exhibits a unique pattern of variable dependence. VPMCD belongs to the family of classifiers that use mathematical equations to define the classification boundary between classes. For each class, VPMCD develops a model for every variable as a function of the other variables.
As a result, each class has a unique system characterization in terms of specific inter-variable interaction models, which can be exploited to classify new samples.

2.2.6 K-nearest neighbour (K-NN)

The K-nearest neighbour classifier (Cover and Hart, 1967) makes use of Euclidean distance to classify a new object (Bagui et al., 2003; Statnikov et al., 2005). In cases involving strongly correlated variables, correlation based measures are used instead of Euclidean distance. The new object is assigned to the class to which the majority of its K nearest objects belong. K is usually odd (K=3 is frequently preferred). Data preprocessing (variable scaling) is strongly encouraged to avoid the effect of different variable scales. Compared to other classifiers, K-NN is mathematically simpler, free from statistical assumptions and its effectiveness is independent of the spatial distribution of the classes. However, similar to LDA, the performance of K-NN will be poor if the samples are not equally distributed among the existing classes (Berrueta et al., 2007).

2.2.7 Support Vector Machine (SVM)

SVM (Vapnik, 1995) is one of the most powerful established classification algorithms in the supervised pattern recognition literature. Its classification performance is comparable or even superior to that of other existing classifiers. Since it is insensitive to dimensionality, its ability to handle large scale classification problems (many variables and many samples) is acknowledged. Furey et al. (2000) and Guyon et al. (2002) have noted superior SVM performance in biomedical classification problems on data sets involving a large number of variables and very few samples.

In its basic form, SVM can only be applied to binary classification problems. It constructs a hyperplane that maximizes the width of the margin between the classes. A new sample is assigned to a class based on the region it falls into (Statnikov et al., 2005).
Since most real-world problems involve multiple categories, the question of applying such a powerful algorithm to multiclass problems has been considered by many researchers. Several algorithms have been developed over the last several years to enable SVM implementation on multicategory problems. Examples include One versus Rest (OVR) and One versus One (OVO). These approaches are detailed below.

Explained in detail by Kressel (1999), One versus Rest (OVR) is the simplest algorithm proposed for multiclass SVM. In this algorithm, one k-class problem is broken into k binary-class problems. The classification is done by constructing a separation between class 1 and the others, class 2 and the others and so on, up to class k and the other classes. The sample is assigned to the class with the furthest hyperplane. The disadvantages of this approach are that it is computationally expensive and has no theoretical justification (Statnikov et al., 2005).

In the One versus One (OVO) approach, one separation plane which maximizes the margin between two classes is built for every pair of classes. Therefore, for a k-class problem, [k*(k-1)/2] planes need to be constructed. A new sample is subjected to all [k*(k-1)/2] classifiers, which results in [k*(k-1)/2] label predictions. The sample is assigned to the class which receives the largest number of votes (Statnikov et al., 2005).

After the model is created, some tests are applied to check the accuracy and robustness of the classifier. This stage is called the validation step and is explained below.

2.3 Model Validation

The final model obtained from the model building step is then applied to the test dataset. The results of this test provide a realistic estimate of the classifier’s performance in predicting the class to which a new sample belongs. It is a valid metric for deciding which classifier is suitable for the problem at hand.
It is important to know that classifier performance is highly dependent on the data set. For one dataset, method A may turn out to be the best, but for another data set, method B may work better than method A.

As stated above, once the classifier model is developed using any of the techniques described in section 2.2, the validity of the model is gauged using test data. The following classifier testing methods are commonly used to compare the performances of different techniques.

2.3.1 Resubstitution test

The resubstitution test provides a measure of the self-consistency of the model. In this case, all data are used to build a model. After the model is built, it is tested on the same dataset that was used for model building. Most classifiers show very good performance when subjected to the resubstitution test. However, it is not a good testing criterion as it does not provide any indication of the generalizing capability of the classifier.

2.3.2 N-fold Cross-validation

In the N-fold cross-validation test, the dataset is randomly divided into N sets of data. The classification model is built using (N-1) sets and tested on the 1 set that was excluded during model building. This data division, model building and testing procedure is repeated N times, and usually the mean and standard deviation of the accuracy are reported as the outcome of the N-fold test. N-fold cross-validation is also used to choose the optimum classification model in some classification methods. The model obtained from this test is usually robust enough to be applied to new samples because data randomness has been considered during the modeling step.

2.3.3 Independent Test

An independent test is done as the final step of the classifier building effort. After the final model is obtained based on training data, it is tested on a fresh test set. This, in most cases, would be a portion of the original dataset which was excluded during model building.
This type of validation assesses the stability of the algorithm in that the effect of new data points on the performance of the classifier is considered (Duda et al., 2000).

2.3.4 Leave one out cross-validation (LOOCV) test

The LOOCV algorithm is similar to the cross-validation test. In LOOCV, 1 sample is taken out of a dataset of N samples for testing. The classifier model is built using the remaining (N-1) samples and then tested on the 1 excluded sample. This procedure is repeated so that every sample serves once as the test sample. The average accuracy is calculated as the outcome of the LOOCV test and represents the overall performance of the classifier.

The performances on the selected data sets are compared based on the percentage of correct classification, both for individual classes and for all classes put together (overall classification accuracy).

Overall, Chapter 2 has discussed the data analysis workflow, summarized some data pretreatment techniques, and described commonly used variable selection methods, classification algorithms and model validation methods.

Chapter 3
Partial Correlation Metric Based Classifier for Food Product Characterization

Food is the moral right of all who are born into this world
Norman Borlaug (1914), American Scientist

3.1 INTRODUCTION

Identification and classification of products into different categories is an important problem in the food industry. Applications such as spoilage yeast growth modeling (Evans et al., 2004), data analysis in food applications (Berrueta et al., 2007), HACCP implementation in food industries (Bertolini et al., 2007) and food authentication (Toher et al., 2007) have benefited from discriminant analysis research. These classification problems are characterized by special challenges such as a multivariate feature space, the presence of different types of attributes (binary, discrete and continuous) and multiple-class datasets.
Many methods have been proposed to address these issues (Tominaga, 1999; Berrueta et al., 2007). The main objective of these supervised algorithms is to learn the relationship between the measurable variables (observed based on physico-chemical attributes) and the different pre-defined product characteristics of the system (classes based on quality indicators). These relationships, in the form of mathematical models, sets of rules or statistical distributions, are then used to predict the class of a new set of measurements made on the same system.

The performance of any classification method depends largely on the type of dataset. Sample classes that can be linearly separated (Tominaga, 1999) in the descriptor space can be effectively classified using Linear Discriminant Analysis (LDA). Suitable linear decision boundaries can be designed to distinctly group the samples on either side of the boundary. In complex multivariate datasets, characteristic of many chemometrics applications, the class data points show overlapping clusters when projected onto a lower dimensional space. During training, suitable straight lines or hyper-planes cannot be designed to effectively distinguish the observations belonging to different classes. Methods built in an orthogonal feature space (linearly independent variables) fail to capture the inter-variable dependencies that lead to a specific class structure, and hence linear hyper-plane classifiers like LDA cannot always separate groups distinctly. Model-based statistical methods like discriminant partial least squares (DPLS) (Tominaga, 1999; Chiang and Braatz, 2003), decision rule based classification trees, and advanced machine learning techniques like Artificial Neural Networks (ANN) (Razi and Athappilly, 2005) and Support Vector Machines (SVM) (Vapnik, 1995; Granitto et al., 2007) have been successfully employed for non-linear classification problems.
The discriminating ability of these classifiers depends either on variations in variables across different classes (LDA/SVM/decision tree) or on the extent of associations between different features and output variables (ANN/DPLS). For effective classification of linearly inseparable, multivariate data, these two factors, measured in terms of class-to-class dissimilarities and intra-class associations between variables, need to be utilized simultaneously.

The new Partial Correlation Coefficient Metric (PCCM) based classification technique used in this chapter attempts this balanced approach to data classification. The basic idea is to model the possible inter-variable relations (in the form of an inference metric) for each class in the training data based on the higher order partial correlations between the variables. These metrics, defined for each class in the training set, model the intra-class attribute relations of the individual classes. The sample to be tested is then embedded into each class model and the new inter-variable correlation structure is measured. The proximity of the new variable interaction structure to the individual class models is used as the classification criterion. The PCCM methodology and the new classification approach are studied here with respect to classification of food products and quality characterization.

3.2 METHODS

3.2.1 Concept of partial correlation coefficients

The Pearson correlation coefficient (r) defines the linear association between continuous random variables and has been widely employed in the literature (Sokal and Rohlf, 1995; Timm, 2002) for many variable interaction mapping problems. However, the correlation coefficient alone cannot distinguish direct and indirect relationships between variables. Consider, for example, two variables A and B. The association between A and B can occur in different ways, such as a direct relationship (A → B), co-regulation of both by a third variable C (i.e. C → A and C → B) or an indirect relationship (A → C → B).
The regular correlation coefficient r defined on the two variables A and B does not differentiate between these types of relations and simply marks A and B as being related or not related. The partial correlation coefficient brings out this difference by separating out the indirect or path relations. The correlation between two variables is said to be conditioned on a third variable, or on a specific set of other variables, when the effects of those variables are filtered from A and B before calculating the coefficient. Hence, the partial correlation $r_{AB/C}$ highlights the correlation that remains between A and B once the effect of the conditioning variable C is removed. The order of the partial correlation coefficient is zero if the correlation is defined directly between A and B without conditioning on any variable. The order is x when the correlation is calculated after conditioning on x different variables other than A and B (Sokal and Rohlf, 1995). Eqs. (1) through (3) give the general definitions of the first three orders of partial correlation.

zeroth-order correlation:
$$r_{AB} = \frac{\mathrm{cov}(A, B)}{\sqrt{\mathrm{var}(A)\,\mathrm{var}(B)}} \qquad (1)$$

first-order partial correlation:
$$r_{AB/Z} = \frac{r_{AB} - r_{AZ}\, r_{BZ}}{\sqrt{(1 - r_{AZ}^{2})(1 - r_{BZ}^{2})}} \qquad (2)$$

second-order partial correlation:
$$r_{AB/XZ} = \frac{r_{AB/X} - r_{AZ/X}\, r_{BZ/X}}{\sqrt{(1 - r_{AZ/X}^{2})(1 - r_{BZ/X}^{2})}} \qquad (3)$$

The correlation measure $r_{AB}$ and the partial correlation measures $r_{AB/Z}$ and $r_{AB/XZ}$ are symmetric (i.e. $r_{AB} = r_{BA}$, $r_{AB/Z} = r_{BA/Z}$ and so on) and these coefficients are bounded between -1 and 1 (Sokal and Rohlf, 1995). Hence, instead of evaluating a full correlation matrix (with redundant entries), the inter-variable association structure can be represented as a single array of unique correlation coefficient values arranged in a definite order of variable combinations. Such a vector of coefficients is referred to here as the Partial Correlation Coefficient Metric (PCCM).
This PCCM vector stores the definite pattern and strengths of the inter-variable associations for a given system. In a system with p variables, the 0th order PCCM has [p*(p-1)/2] elements, the 1st order PCCM has [p*(p-1)*(p-2)/2] elements and the 2nd order PCCM has [p*(p-1)*(p-2)*(p-3)/4] elements.

Partial correlation coefficients have been used in the literature to infer direct and indirect associations between random measurements (Eisen et al., 1998; Steuer et al., 2003; Baba et al., 2004; de la Fuente et al., 2004). Most recently, Raghuraj Rao and Lakshminarayanan utilized the partial correlation structure to select a set of important features for classification (Raghuraj Rao and Lakshminarayanan, 2007a) and multivariate calibration (Raghuraj Rao and Lakshminarayanan, 2007c) applications. The focus of the present study is to adopt the partial correlation metric as a discriminating model for sample classification. To our understanding, this approach is the first of its kind in data classification, especially for chemometrics applications in food technology.

The general statement of the classification problem can be formulated as follows. Consider a system N [n x p; k] in which n observations belonging to k different classes of the system are obtained by measuring p variables. The objective of the discriminant analysis is to develop a classifier (using the observations in N) by modeling each of the k classes. The adequacy of the classifier is then tested based on its ability to predict the classes of the samples in N (self-consistency or resubstitution test) and to predict the classes of a new set of samples Ntest [m x p; k], which were not used during modeling (independent sample test). The methodology adopted to achieve this objective using PCCM based discriminant analysis is explained in the next section.
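The zeroth-order and first-order definitions of section 3.2.1, and the PCCM vector sizes quoted above, can be checked with a short sketch. It is an illustration only: the function names and the toy data are hypothetical, the second-order case is omitted for brevity, and the pair ordering produced by `itertools.combinations` is just one possible "definite order" of variable combinations.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Zeroth-order (Pearson) correlation between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_first_order(x, y, z):
    """First-order partial correlation of x and y, conditioned on z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def pccm(columns, order=0):
    """PCCM vector of order 0 or 1 for a list of variable columns."""
    if order not in (0, 1):
        raise ValueError("this sketch supports orders 0 and 1 only")
    p = len(columns)
    vector = []
    for i, j in combinations(range(p), 2):
        if order == 0:
            vector.append(pearson(columns[i], columns[j]))
        else:
            # one entry per conditioning variable other than i and j
            for z in range(p):
                if z not in (i, j):
                    vector.append(
                        partial_first_order(columns[i], columns[j], columns[z]))
    return vector

# Hypothetical system with p = 4 variables and n = 6 observations.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    [1.0, 3.0, 2.0, 5.0, 4.0, 7.0],
    [6.0, 4.0, 5.0, 1.0, 3.0, 2.0],
]
pccm_order0 = pccm(columns, order=0)  # p*(p-1)/2 = 6 elements
pccm_order1 = pccm(columns, order=1)  # p*(p-1)*(p-2)/2 = 12 elements
```

Note that the first-order formula is undefined when either variable is perfectly correlated with the conditioning variable, so a practical implementation would need to guard against a zero denominator.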
3.2.2 Discriminating Partial Correlation Coefficient Metric (DPCCM)

The underlying principle of the DPCCM method is to build a distinct variable interaction structure for each class using the training data (N). These individual class models are represented by a characteristic vector of partial correlation coefficients calculated between an identified sequence of all the variable pairs. The intervariable correlation coefficient vectors (Ri, i = 1, 2, 3, …, k) of the classes are stored in a single model structure in the form of the DPCCM, Mmodel [k x d], where d is the number of partial correlations defined between pairs of variables by conditioning on other variables. For example, d = p*(p-1)/2 for 0th order and d = p*(p-1)*(p-2)/2 for 1st order partial correlations between variables. Mmodel [k x d] represents the learnt classifier model for the entire system N, which can then be used to predict the class of a new observation given the values of its p measurements. It must be highlighted here that the basic assumption made during this training step is that all the samples belonging to a specific class in the dataset N consistently represent the characteristics of that class and that the group samples do not contain any outliers. In applications where the training data samples are inconsistent within each class, a suitable outlier detection step can be employed as a precautionary preprocessing step before building the metric, Mmodel.

When a new observation from the sample matrix Ntest is to be classified, it is appended as an additional row to the model data (i.e. in N) for each class, and the above procedure is repeated using the expanded dataset to obtain a new correlation structure, R, for that class (using the same order of partial correlations as used during modeling). This is repeated by embedding the sample observation into the dataset for each class to obtain the sample DPCCM, Msample [k x d].
Each row of Msample represents the vector Ri (i = 1, 2, 3, …, k), each computed after embedding the sample observation in the respective class data. Each row in Msample is then compared for its similarity with the corresponding row in Mmodel, using the standard Pearson’s correlation between the two vectors. Since the inter-variable association structures are captured in terms of scale free qualitative measures (correlation coefficients), we again utilize the correlation coefficient as the similarity index instead of any scale based measure (like Euclidean distance). The sample observation is classified into class i (i = 1, 2, 3, …, k) if the correlation between row ‘i’ of Msample and row ‘i’ of Mmodel is the maximum. Since the PCCM algorithm captures all the inter-variable relations, it is conjectured that the final DPCCM Mmodel obtained on the training data represents a variable interaction discriminatory model to be used for sample testing. The DPCCM classification analysis for new samples is built on the hypothesis that if the sample is embedded with the right class while rebuilding the DPCCM for sample analysis, the corresponding row of Msample will not differ significantly from that of Mmodel. In other words, if the inter-variable correlations are distinct for each class, then a test sample belonging to a particular class will be an outlier for the other classes and hence will break the correlation structure for those classes, while retaining the original structure for the class it belongs to. Since the class specific variable association structure (PCCM) is designed using correlations between all possible pairs of variables, the effect of an outlier increases with the number of system variables, p, while testing new samples. Higher order PCCM, if used with threshold values, identifies and eliminates the indirect relations. This enhances the accuracy when applied to network inference problems, at the expense of computational effort (de la Fuente et al., 2004).
One can start from zeroth order PCCM and gradually improve the network using higher order PCCM. However, DPCCM uses the full PCCM without eliminating entries based on the statistical significance of the correlations. The premise is that even the less significant correlations are necessary components of the inter-variable association structure and can be useful distinguishing factors during the sample prediction step. A new sample observation belonging to a particular class must have both the strong and the weak correlations between variables consistently appearing in the corresponding row of Msample. If the insignificant variable correlations in Mmodel become significant in Msample, this contributes further to the discriminating ability of the model and hence improves the classifier performance.

In the present analysis, the algorithm tries different orders of DPCCM to map the attributes. The order which gives the best discriminating results (during the re-substitution test) is utilized as Mmodel for that particular application. The reason is that, for applications where variables are not strongly correlated, higher order DPCCM may not affect the results positively. On the other hand, for applications where the variables are highly interdependent, an increase in the order of DPCCM will improve the classification results. The following section gives a step by step algorithm for DPCCM classification analysis.

3.2.3 DPCCM Algorithm

DPCCM training:
Step 0: Read the training data matrix N [n x p; k]. Pre-process to detect and remove the outlier samples from each group. Select the order (0, 1, or 2) for calculating PCCM.
Step 1: Split the matrix N [n x p] into k separate group matrices Gi (i = 1, 2, …, k) of sizes l1 x p, l2 x p, …, lk x p respectively, where li is the number of observations for the ith class.
Step 2: For each group matrix Gi, calculate all possible sets of partial correlation coefficients using Eq. (1), (2) or (3), depending on the order selected in Step 0. Store the resulting correlation coefficient array Ri (of length d) as the ith row of the DPCC metric. Mmodel is thus a k x d matrix with the ith row comprising the partial correlation coefficients (of selected order 0, 1 or 2) for class i.

Re-substitution test for optimizing the order (initiate Ntest = N):
Step 3: Select the test dataset, Ntest [m x p; k], for sample prediction. Select a test sample reading Y [1 x p] and append it as a row to each of the group matrices Gi, starting with the first group. With Y embedded in each group matrix, repeat Step 2 to obtain the new rows of the DPCC metric, Msample.
Step 4: Calculate the correlation coefficient between corresponding rows of Mmodel and Msample.
Step 5: Determine the row ‘i’ (i = 1, 2, …, k) for which the correlation is highest and classify Y as belonging to that class. Repeat Steps 3 to 5 for all test samples in Ntest.
Step 6: Calculate the percentage of samples in Ntest that are correctly predicted. Repeat Steps 1 to 6 using PCCM order 0, 1 and 2. Optimize the DPCCM order based on the highest accuracy of prediction.

DPCCM sample testing (read the test set to be predicted, Ntest):
Step 7: Select Mmodel for the order optimized in Step 6. Repeat Steps 3 to 5 with the given test set as Ntest and predict the class of each sample.

3.2.4 DPCCM illustration with Iris data

The concept of the inter-variable correlation metric and the DPCCM algorithm are illustrated with a well studied dataset on Iris flower classification. This flower taxonomy dataset, originally studied by Fisher (1936), is available at http://www.ics.uci.edu/~mlearn/databases/. The dataset consists of 150 Iris flower samples (n = 150) belonging to three different groups (k = 3; labeled Setosa, Virginica and Versicolor) with four measurements on each flower (p = 4; sepal length SL, sepal width SW, petal length PL and petal width PW).
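Steps 1 to 5 of Section 3.2.3 can be sketched in Python for the 0th order case as follows. This is a minimal illustration under stated assumptions and not the MATLAB code used in this work: the function names `pccm0`, `dpccm_train` and `dpccm_predict` are hypothetical, the outlier pre-processing of Step 0 is omitted, and the two-class data are synthetic.

```python
import numpy as np

def pccm0(group):
    """0th order PCCM of one group matrix: the unique pairwise Pearson correlations."""
    r = np.corrcoef(group, rowvar=False)
    return r[np.triu_indices_from(r, k=1)]  # upper triangle, fixed pair order

def dpccm_train(X, y):
    """Steps 1-2: split by class and stack one PCCM row per class into M_model [k x d]."""
    classes = np.unique(y)
    m_model = np.vstack([pccm0(X[y == c]) for c in classes])
    return classes, m_model

def dpccm_predict(X, y, classes, m_model, sample):
    """Steps 3-5: embed the sample in every class, rebuild each PCCM row,
    and pick the class whose row stays most correlated with M_model."""
    scores = []
    for c, model_row in zip(classes, m_model):
        augmented = np.vstack([X[y == c], sample])   # sample appended as extra row
        sample_row = pccm0(augmented)                # row of M_sample for class c
        scores.append(np.corrcoef(model_row, sample_row)[0, 1])
    scores = np.array(scores)
    return classes[int(np.argmax(scores))], scores

# Synthetic two-class data with different correlation structures (assumption).
rng = np.random.default_rng(1)
mix_a = rng.normal(size=(4, 4))
mix_b = rng.normal(size=(4, 4))
X = np.vstack([rng.normal(size=(50, 4)) @ mix_a.T,
               rng.normal(size=(50, 4)) @ mix_b.T])
y = np.repeat([0, 1], 50)

classes, m_model = dpccm_train(X, y)
label, scores = dpccm_predict(X, y, classes, m_model, X[0])  # re-substitution of one sample
print(label, scores)
```

Looping `dpccm_predict` over all rows of a test matrix and counting correct labels would give the accuracy of Step 6; replacing `pccm0` with a first or second order version would cover the other orders.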
For the present analysis, one sample belonging to the Setosa group is separated for testing (Ntest [1 x 4; Setosa]) and the remaining 149 samples are used as the training set N [149 x 4; 3]. Figure 3.1 brings out the concept of class specific inter-variable correlation structures and the working principle of the DPCCM method. We select the 0th order PCCM measure for comparing different groups. The samples (in N) belonging to each class are separated and correlations are defined between each pair of variables (as shown on the x-axis of Fig. 3.1) using Eq. (1). Rows of the PCCM metric, Mmodel (shown using solid lines in Fig. 3.1), represent the six inter-variable correlations for a particular group of flowers (shown with different markers for each group). As observed, each group of flowers shows a distinct PCCM profile. SL and SW are correlated better in the Setosa group compared to the others, whereas SL and PL are highly correlated in Virginica and Versicolor flowers. The correlations SW-PL and SW-PW bring better separation between the three groups. Overall, it is evident that the 0th order PCCM measure can capture the unique inter-variable patterns in each group and hence can be utilized to distinguish samples belonging to different groups. The same set of correlations is re-calculated for all the three groups, by inserting the test sample Ntest into the respective group data in N. The correlation profiles for the new sets with the embedded test sample represent the rows of Msample (shown as dashed lines in Fig. 3.1).

[Figure 3.1: 0th order partial correlation (y-axis, 0 to 1) plotted against the variable pairs SL_SW, SL_PL, SL_PW, SW_PL, SW_PW and PL_PW (x-axis); curves: Mmodel and Msample for Setosa, Virginica and Versicolor.]

Fig. 3.1 PCCM profiles for IRIS data. Rows of Mmodel (solid lines) and Msample (dashed lines) for each group (differentiated with different markers) of flowers are plotted for comparison.
Correlation metric profiles in Mmodel and Msample are similar for the SETOSA flower group, indicating the class of the selected test sample flower. The correlation metric breaks when the test sample is embedded into the other two groups due to class mismatch.

The PCCM profile in Msample corresponding to ‘Setosa’ (dashed line with ‘O’ markers) is very similar to the PCCM profile in Mmodel for ‘Setosa’ (solid line with ‘O’ markers). In contrast, the PCCM profiles for the other two groups in Mmodel differ significantly from the respective profiles in Msample. The correlations between corresponding rows of Mmodel and Msample are computed to be 0.9997, 0.7720 and 0.5570 for the ‘Setosa’, ‘Virginica’ and ‘Versicolor’ groups respectively. Based on this PCC metric similarity score, DPCCM classifies the sample in Ntest as a ‘Setosa’ type flower. It must also be observed that a single sample, when included during PCCM calculation with another group, disturbs the inter-variable correlations significantly even if there are 50 other homogeneous samples in that group. For example, the SW-PL and SW-PW correlations are higher in Mmodel, but show lower values (in Msample) when the non-homogeneous sample is embedded. It is also interesting to observe that the lower PL-PW correlations in Mmodel show higher values in Msample, establishing the importance of retaining all correlations in differentiating the groups. We presume that this variation in PCCM between the Mmodel and Msample profiles is mainly due to the sensitivity of the correlation measure to an outlier. This difference should be more prevalent for data of higher variable dimension, as more inter-variable correlations are defined.

We also tested the effect of partial correlation order on the distinct PCCM patterns. With the same set of N and Ntest data, the 1st order PCCM profiles in Mmodel and Msample are correlated as 1.0000, 0.7804, and 0.5862 for each group respectively.
Similar analysis with 2nd order PCCM gives group wise correlations of 1.0000, 0.9477 and 0.7768. Comparing the inter-group differences in these Mmodel-Msample similarity scores for each PCCM order, we can conclude that 0th order inter-variable correlations provide the highest distinction between groups for the Iris data. With these encouraging observations, we further explore the extension of the DPCCM classification method to different chemometrics problems and compare its classification performance with other established classifiers.

3.2.5 Other classifiers used for comparison

The DPCCM technique is applied to two case studies and the results are compared with those from established classification algorithms such as LDA, CART, Treenet and SVM. These methods are discussed in detail in Chapter 2. However, a brief description of each method is given below for ready reference.

LDA (Duda et al., 2000) is the most commonly used linear classifier, developed by Fisher (1936). The LDA classifier provides a linear boundary separating the two classes. The classification function for this boundary is designed by maximizing the ratio of inter-group variance to intra-group variance of the projected scores. It is quick and accurate for linearly separable classes but performs poorly for data with overlapping class profiles.

CART (Breiman et al., 1983) is a decision tree based classifier, also called the binary recursive partitioning method. A classification tree is built by splitting the data into two branches using the best attribute or variable as the separator variable (node). The best attribute used to define a decision rule at a node is prioritized based on one of the impurity measures such as the Gini index or entropy, or using the ‘twoing’ method (Kurt et al., 2008). The split node is called the “parent node” and the resulting nodes the “child nodes”.
The splitting process is continued from top to bottom and tree construction stops at the ‘terminal nodes’, which contain data samples of nearly homogeneous class. The advantages of the CART algorithm include: (i) easily interpretable and implementable rules, (ii) very little data pretreatment, (iii) ability to handle both numerical and categorical data and (iv) ability to handle missing data. On the other hand, it can overfit the classifier model to the training data, especially for high dimensional datasets with multivariate interactions between variables.

Treenet (Friedman, 1999) is a network of several hundred small decision trees, each of them making a small contribution to the overall model (Raj Kiran and Ravi, 2008). Each minimal tree usually has fewer than 8 terminal nodes. Apart from having the advantages of the CART algorithm, the Treenet approach provides a more generalizable classifier model. The Treenet approach has been successfully used mainly in financial data analysis and recently for soil characterization (Brown, 2007).

SVM (Vapnik, 1995) is established as the most advanced and robust classifier for many applications ranging from character recognition to cancer diagnosis. It provides an effective tool to distinguish non-linearly separated classes (overlapping or embedded classes). It projects the original feature vectors onto a new, linearly separable vector space using variable transformation functions called ‘kernels’. Since it finally uses only the support vector features in the projected space, the SVM model is almost independent of the number of attributes in the original data. Hence its performance is easily scalable, giving it an immediate advantage over other methods, especially for complex classification problems (p >> n). On the other hand, it requires considerable computational effort for datasets with a large number of samples in N.
Such cases require an increased number of support vectors for classification, further complicating the rigorous optimization algorithms employed during model building. As it is basically a binary classifier, its extension to multi-class problems needs additional mathematical formulation.

Compared to the above methods, the DPCCM approach proposed in this chapter provides a classifier which does not seek decision boundaries, analyzes the data in the original variable space without having to employ any iterative optimization algorithm and is able to handle multi-class problems directly. Hence the new approach attempts to eliminate most of the limitations associated with the existing methods discussed above. The classification performance of the DPCCM method is established in comparison with the existing methods based on the validation tests explained below.

3.2.6 Validation methods

Once the classifier model is developed using any of the techniques, the model is validated using test data. Detailed information on validation techniques is given in section 2.3. However, brief information on the validation methods employed in this chapter is given below for ready reference. Two different classifier testing methods are used to compare the performances of the different techniques. The performances on the selected datasets are compared based on the percentage of correct classification, both for individual classes and for overall classification. The concepts and algorithm steps discussed in this chapter can be used for further investigation and evaluation using other performance measures available in the literature (Baldi et al., 2000).

3.2.6.1 Re-Substitution Test

All samples in the training dataset are re-substituted back into the model as validation samples. This test is commonly used to check the self-consistency of the classifier.
However, this test does not give a reliable indication of the classifier’s ability to correctly classify new data samples (i.e. those that are not used during training).

3.2.6.2 Random Sample Validation Test

A fixed percentage of training samples is randomly selected and set aside (to serve as test samples) and the remaining data is used to design the classifier. The smaller pre-selected subset of test samples is then used for classifier verification. The prediction accuracy is evaluated only on the test samples. This “split-train-test” procedure is repeated several times and the average of the accuracies in these runs is reported. This type of validation assesses the stability (Duda et al., 2000) of the algorithm, in the sense that the effect of new data points on the performance of the classifier is considered.

3.3 MATERIAL

3.3.1 Datasets

Though the algorithm explained in section 3.2 can in general be applied to any chemometric classification problem, we demonstrate its specific application to food quality monitoring. Two important food product characterization datasets are presented here as case studies to implement and analyze the performance of the new classifier.

Case study I: Wine classification data (WINE)

Wine product quality recognition data (available at http://www.ics.uci.edu/~mlearn/databases/wine/) (Asuncion and Newman, 2007) provides a significant chemometrics classification problem to benchmark the new method. This problem is also statistically challenging because, in this dataset, the samples are not uniformly distributed among the different classes. Beltrán et al. (2006) used LDA, QDA, PNN and ANN to characterize a similar dataset on Chilean wines with spectral measurements. The samples in the dataset are obtained from chemical analysis of 178 wine samples, produced in the same region in Italy but derived from three different cultivars (3 class problem).
The quantities of 13 constituents (features) found in each of the three types of wines are analytically measured as descriptors. De-noised and well-processed observational data is used for training the classifier model in order to classify a given unknown sample into one of the three classes of wines. 20% of the 178 samples, selected randomly from the original data, are set aside for cross validation. Thus, the system used for analysis is N ~ [n = 143 x p = 13; k = 3] and Ntest ~ [m = 35 x p = 13; k = 3].

Case study II: Cheese classification data (CHEESE)

A food quality characterization dataset studied by Granitto et al. (2007) is used as the second experimental dataset. This dataset, with multiple classes, a higher number of attributes and fewer samples in each group, is a challenging classification problem. It also tests the feasibility of applying the DPCCM approach to difficult chemometrics applications. The dataset consists of 60 samples from 6 classes of Nostrani cheese (10 samples in each class). They are ‘‘Puzzone di Moena”, ‘‘Spressa delle Giudicarie”, ‘‘Vezzena”, ‘‘Nostrano del Primiero”, ‘‘Nostrano della Val di Non” and ‘‘Nostrano della Val di Sole”. There are 35 sensory attributes (based on physical, chemical and visual characteristics of the cheese samples) measured for each sample. Thus, the system considered for classification is N ~ [n = 48 x p = 35; k = 6]. For cross validation analysis, 20% of the given data (60 samples) is separated and used as test data: Ntest ~ [m = 12 x p = 35; k = 6].

3.3.2 Implementation

The DPCCM algorithm discussed in section 3.2.3 was coded and executed in MATLAB (MATLAB, 2005). The order of PCCM to be used during DPCCM analysis is provided as an input parameter. Built-in MATLAB functions are used for the LDA and CART algorithms. A separate MATLAB code provided at http://asi.insarouen.fr/~arakotom/toolbox/index.html by Canu et al. (2005) was used for multiclass SVM analysis.
The Treenet classification result is obtained using TreeNet® software developed by Salford Systems (USA) (Friedman, 1999; Salford Systems, 2007a). Partial correlations of order 0, 1 and 2 are attempted to verify the efficiency of DPCCM. The order which gives the best classification result (during the re-substitution test) is selected for further analysis. No parameters were tuned for LDA except that ‘diagonal’ LDA was adopted whenever the datasets were non-positive definite. Cost criteria were adjusted during model building using CART and Treenet. The cost function with the best re-substitution result was adopted for the cross validation performance test. A simple RBF (Radial Basis Function) kernel was used for SVM, with polynomial coefficient c and γ as tuning parameters during training.

3.4 RESULTS

Results for the above case study problems are presented in Tables 3.1 and 3.2 respectively. Percentage correct predictions for individual classes are shown in the first few columns of each table (with column labels as ‘class’ followed by class number). Overall classification results are indicated in the last column as the percentage of test samples that are correctly classified. For the cross validation test, the results shown are the average prediction accuracy over 100 experiments for each class, along with the standard deviation for the overall prediction accuracy. DPCCM performances for the selected order are indicated as DPCCM(order). Results shown for the comparison methods are obtained using datasets N and Ntest identical to those used for DPCCM during the two tests.
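The cross validation figures reported in this section are averages over repeated random splits, as described in Section 3.2.6.2. A minimal Python sketch of that split-train-test loop is given below. The function names are hypothetical, the nearest-centroid classifier is only a stand-in for whichever classifier is being validated (it is not one of the methods compared in this chapter), and the data are synthetic.

```python
import numpy as np

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (assumption): assign each test row to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

def random_sample_validation(X, y, fit_predict, n_test, runs=100, seed=0):
    """Repeat: set aside n_test random samples, train on the rest, score on the held-out set.
    Returns the mean and standard deviation of the percentage accuracy over all runs."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(runs):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        predicted = fit_predict(X[train], y[train], X[test])
        accuracies.append(100.0 * np.mean(predicted == y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))

# Synthetic, well separated two-class data (assumption): 60 samples, 12 held out per run.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
               rng.normal(10.0, 1.0, size=(30, 3))])
y = np.repeat([0, 1], 30)
mean_acc, std_acc = random_sample_validation(X, y, nearest_centroid, n_test=12)
print(f"{mean_acc:.2f} ± {std_acc:.2f}")
```

With n_test = 35 for the 178 WINE samples or n_test = 12 for the 60 CHEESE samples, the same loop would reproduce the 80/20 split sizes used in the case studies.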
Table 3.1 Classification result for case study I (WINE classification)

Test type         Method      class 1   class 2   class 3   overall
Re-substitution   LDA         100       100       100       100
                  CART        96.61     97.18     97.92     97.19
                  Treenet     100       100       100       100
                  SVM         100       100       100       100
                  DPCCM(0)    91.52     100       97.92     96.63
                  DPCCM(1)    96.61     100       97.92     98.32
                  DPCCM(2)    100       100       100       100
Cross validation  LDA         100       97.07     99.44     98.65 ± 2.02 (a)
                  CART        92        87.29     93.67     90.91 ± 4.93
                  Treenet     99.15     94.3      100       97.44 ± 0.67
                  SVM         99.23     98.00     95.11     97.65 ± 2.4
                  DPCCM(2)    94.55     100       100       98.23 ± 1.52

(a) Overall accuracy is reported as average accuracy over 100 iterations ± standard deviation

Table 3.1 indicates the comparative performance of DPCCM for the WINE data. For the re-substitution test, DPCCM has learnt the variable interactions and modeled the classes distinctly with 2nd order PCCM, predicting the samples completely. The improvement in performance with increasing order of partial correlations indicates the presence of multivariate interactions and indirect relationships between the variables. Hence, second order partial correlation based classification, DPCCM(2), is used during cross-validation tests. LDA, Treenet and SVM also provide complete classification accuracy in the re-substitution test. Decision rules using conditions on numerical values of the variables can lead to classifier over-fitting, as observed in the case of CART: its cross validation result is significantly poorer than its re-substitution result. The re-substitution and cross validation results are not significantly different for DPCCM, indicating the stability of the new method. For this dataset with non-uniform class sample distribution, the DPCCM method has provided performance matching that of well established methods like SVM and Treenet.
Table 3.2 Classification result for case study II (CHEESE classification)

Test type         Method      No     Pr     Pu     So     Sp     Ve     overall
Re-substitution   LDA         100    100    100    100    100    100    100
                  CART        100    80     100    100    100    90     95
                  Treenet     100    100    100    100    100    100    100
                  SVM         100    100    100    100    100    100    100
                  DPCCM(0)    100    90     100    100    100    100    98.33
                  DPCCM(1)    100    100    100    100    100    100    100
                  DPCCM(2)    100    100    100    100    100    100    100
Cross validation  LDA         78.5   81.5   86     53     100    64.5   77.33 ± 10.33 (a)
                  CART        77     57.5   53.5   31     98.5   44.5   61.67 ± 9.91
                  Treenet     87     66     73.5   34.5   94.5   49.5   67.50 ± 4.21
                  SVM         96     76     66     74     100    86     83.00 ± 10.83
                  DPCCM(1)    100    70     90     70     100    70     83.33 ± 7.85

(a) Overall accuracy is reported as average accuracy over 100 iterations ± standard deviation

For the CHEESE dataset, the classification results are outlined in Table 3.2. During the re-substitution test, DPCCM performance improved with 1st and 2nd order partial correlations. This indicates multivariate dependencies between variables which characterize the heterogeneity between different classes of product. To keep the computational effort low, DPCCM(1) was used during cross-validation tests. The DPCCM and SVM methods provide the least error during the cross-validation test. All the classes are learnt and predicted quite accurately during the random sample testing. 12 samples randomly selected from the original set are used as the Ntest set during cross-validation runs and DPCCM on average predicts 10 of them correctly (~83% accuracy). The standard deviation for the method is also smaller than for LDA, CART and SVM, which supports the robustness of the method. The new approach provides an improvement over the original study carried out on the cheese dataset (Granitto et al., 2007) using Random Forest (77.1±11.1) and DPLS (74.3±13) classification approaches. Methods like LDA and CART provide relatively poor performance in the cross validation test, indicating the inability of these methods to effectively discriminate overlapping classes.
Another important advantage of this approach is that the variables are observed in their measured state and are not projected onto a new space as in PCA, DPLS or SVM. Hence, it will be easier to carry out a straightforward investigation based on the meaningful physico-chemical influence of variables on different qualities of products. The DPCCM approach provides a good visualization of intra-class variable associations and inter-class dissimilarities in correlation patterns based on the original variables themselves. Fig. 3.2 shows the variable correlation shade map for each group in the CHEESE dataset. We can observe that each type of cheese sample is characterized by a pattern of variable correlations. For example, type 1 cheese (No) has strong association between variables Aci, Ama and Pic (Granitto et al., 2007), whereas for type 2 (Pr) cheese, good correlation exists between sample variables Ar, Fru and Ade. These plots not only provide class specific important features but also indicate how distinct the classes are and the possibility of class overlap. Cheese type 1 (No) and type 5 (Sp) look similar in their associations, whereas type 2 (Pr) and type 3 (Pu) form similar variable interaction profiles. Such information can be effectively used in sensor selection to select important variables for quality analysis of a particular type of product.

[Figure 3.2: six correlation shade maps, one per cheese class (Type 1 to Type 6), each showing the 35 variables on both axes.]

Fig. 3.2 Variable correlation shade map for each class in the CHEESE classification dataset. Each of the 35 measured variables (as columns) is correlated with all the other variables (as rows). White implies full correlation (r = 1), black indicates no correlation (r = 0) and gray shades lie in between. All the diagonals are white, representing the self correlation of each variable. Each type of cheese sample shows distinct inter-variable association patterns.
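The shade values behind a map such as Fig. 3.2 could be computed per class as sketched below. This is an illustration only: the function name `shade_maps` is hypothetical, random numbers stand in for the cheese measurements, and taking the absolute value of r is an assumption made here so that the values span the 0 (black) to 1 (white) range described in the caption.

```python
import numpy as np

def shade_maps(X, y):
    """Return one |r| matrix per class; 1 would be drawn white and 0 black."""
    # Assumption: negative correlations are shown by magnitude only
    return {int(c): np.abs(np.corrcoef(X[y == c], rowvar=False)) for c in np.unique(y)}

# Placeholder data (assumption): 6 classes, 10 samples each, 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.repeat(np.arange(6), 10)
maps = shade_maps(X, y)
print(maps[0].round(2))
```

Each matrix could then be displayed as a grayscale image, for example with matplotlib's `imshow` using a gray colormap and limits 0 and 1.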
It must be highlighted that DPCCM addresses the multiclass multivariate classification problem with one PCCM model for each class without seeking any decision boundary (unlike LDA), working only with the correlations between variables (independent of the scale of the measurements) and without projecting the variables onto a new descriptor space (unlike the binary SVM classifier). Another important factor in which DPCCM scores over other methods is its simplicity of implementation, without having to tune many parameters (except selecting the optimum order of partial correlation based on three re-substitution runs). DPCCM does not employ rigorous optimization algorithms. Hence, if the system considered has a distinct inter-variable correlation structure for different classes (which is more likely to occur in high dimensional, multivariate chemometrics applications), the DPCCM approach offers an efficient classification tool. It must be pointed out that for high dimensional data with higher order conditional dependencies between variables (for example characterization using spectral measurements), the computational time can increase significantly. In our observation on a desktop computer (with a 2.4 GHz CPU and 2 GB RAM), 0th order DPCCM is as fast as LDA for any application and higher order DPCCM can train and test samples within 20 seconds for systems with 100 variables. For classification problems with p > 100, one can implement DPCCM in conjunction with suitable variable selection algorithms (Raghuraj Rao and Lakshminarayanan, 2007a). The performance of the DPCCM classifier may also be affected if a few classes in the system exhibit similar inter-variable associations or no correlations at all. This singular situation may not arise in chemometrics applications, where different physical, chemical and visual measurements and unique association patterns between them are often the basis of specific characteristics of the system.
With further improvements, such as incorporating nonlinear correlation measures, selecting different order PCCM for different classes and incorporating the significance of correlations during classifier development, DPCCM promises to be a powerful tool for solving complex classification problems.

In this chapter, DPCCM performance is analyzed using the two classification case studies and is compared with well established classifiers. DPCCM performs better than linear classifiers and is comparable to non-linear SVM classifiers. This new method can potentially eliminate some of the limitations of existing methods and also provides good visualization for understanding the specific variable interactions contributing to the nature of each class.

Chapter 4 Analysis of Biomedical Data

Be as smart as you can, but remember that it is always better to be wise than to be smart
Alan Alda (1936)

4.1 INTRODUCTION

Machine learning applications for medical purposes have received considerable attention (Magoulas and Prentza, 2001). Integration of machine learning techniques into the medical environment has enhanced the accuracy and reliability of medical diagnosis, resulting in improved patient care. This is mainly because many medical problems, especially those related to classification of samples into their corresponding class based on measurement of certain attributes, can be handled well using machine learning techniques. Some machine learning applications in medicine include early screening for gastric and oesophageal cancer (Liu et al., 1996), lung cancer cell identification (Zhou et al., 2002; Polat and Günes, 2008), classification of normal and restrictive respiratory conditions (Mahesh and Ramakrishnan, 2007), classification for personalized medicine with high dimensional data (Moon et al., 2007), breast cancer diagnosis (Sahan et al., 2007) and artery disease (Kurt et al., 2008).
Here, other important areas of medical application that can benefit substantially from classification approaches, namely prediction of the depth of anesthesia and identification of heart disease and breast cancer (Linkens and Vefghi, 1997; Mahfouf, 2006; Sharma and Paliwal, 2008), are addressed.

Anesthesia is usually employed as part of surgical procedures to remove all sensations of pain. During surgery, the dose and infusion rate of the anesthetic drug have to be controlled to maintain the depth of anesthesia (DOA) at a level that is safe for the patient as well as deep enough to remove the sensation of pain. Many studies have achieved well-controlled anesthesia with PID or other advanced controllers (e.g. adaptive controllers) (Elkfafi et al., 1998; Jiann Shing et al., 1999). However, these controllers need a good estimate of the patient’s DOA level to decide on the right dosage of anesthetic drug to be administered. Therefore, determining the correct DOA level is a crucial factor in obtaining well-controlled anesthesia. In this study, the DOA level is determined using classification techniques. Data from multiple patients, comprising recorded auditory evoked potential (AEP) features and cardiovascular features together with the known DOA level (awake, Ok/light, Ok, and Ok/deep, as determined by the anesthesiologist) and available from published literature, are used to build the classification models. Once constructed, the classifiers can be used to classify the DOA level into the four classes based only on AEP or cardiovascular measurements (Nayak and Roy, 1998; Nunes et al., 2005). The classification analysis is carried out separately using two different DOA datasets, one using AEP features and the other using cardiovascular features, which include heart rate (HR), systolic arterial pressure (SAP) and mean arterial pressure (MAP). These two independent datasets provide distinct patient samples to train and test the classifier models.
They also facilitate the selection of important features in the dataset for reducing the complexity of the classifiers and/or improving the accuracy of DOA classification.

The second case study considered here concerns breast cancer identification. According to the US Cancer Statistics Working Group (2007), breast cancer is the most common cancer diagnosed in women and is the second leading cause of cancer death among women in the US. However, this cancer has a high chance of being cured. Jerez-Aragonés et al. (2003) noted that 97% of breast cancer patients survive for five years if the cancer is detected and treated early. This fact highlights the importance of early detection of cancer followed by early treatment. Some past studies have shown that machine learning methods can play an important role in these efforts (Bagui et al., 2003; Hong and Cho, 2008; Liu and Huang, 2008). Using the information provided by measurable cell attributes or microarray data from many normal cells and cancer cells, machine learning methods are able to build classification models. When a patient comes to the hospital for diagnosis, the doctor has only to extract some cells and process them with a microarray analyzer. The results obtained from the microarray analysis are then processed by the classification model to determine the existence and severity of cancer in the patient. In medical data analysis, especially in illness identification, patient misclassification may have a fatal impact. For example, when people with cancer are classified as healthy, they will receive no cancer treatment. This may increase the severity of the illness and may even lead to death. Therefore, to make this study reliable, classifier performance is compared based not only on overall accuracy, but also on class-wise performance.
Two datasets available online, namely the Wisconsin Diagnostic Breast Cancer dataset (WDBC) and the Wisconsin Breast Cancer dataset (WBC), are used in the case studies for breast cancer classification/identification in this chapter.

The final case study covered in this chapter is on heart disease identification. American Heart Association records show that heart disease has become the leading cause of death in the United States and indeed in most developed countries. Therefore, it is of interest to check the capability of data analysis techniques in correctly classifying patients with heart disease. Such early detection (based on classification techniques) can help in initiating timely medical treatment and in reducing heart-related deaths. In this study, we process the data collected on some patient attributes using classification techniques to distinguish patients with heart disease from normal people. The results are then compared to obtain the most suitable classifier for heart disease identification. Here too, the false-negative misclassification, wherein a patient with heart disease is classified as “healthy”, has to be kept as low as possible. As a result, classifier performance is compared based on both overall accuracy and class-wise performance.

Many classification techniques available in the machine learning literature can be applied to these problems. These techniques provide different advantages but also have data-specific limitations. The main objective of the present study is to find the best classifier to predict the DOA level and to identify the existence of cancer and heart disease through a performance comparison of some popular classification methods.

4.2 METHODS

4.2.1 Classification Methods

In this study, ANN, TreeNet, CART, LDA, and VPMCD or DPCCM were used to predict the DOA level during surgery and to identify cancer.
Their performances are then compared with each other to decide the best classifier for each case. Detailed information on these techniques is given in section 2.1 - section 2.5 and section 3.2.3.

4.2.2 Variable Selection Methods

Just as in regression models, classifier models can benefit from the selection of important variables. Constructing the classifier on a subset of the original variables reduces the complexity of the model and the computational effort without compromising classifier performance. Excluding certain nuisance variables (characterized by high noise and no discriminating value) can even enhance the performance of the classifier (Flores et al., 2008). In this work, several variable selection methods are used to rank the predictor variables according to their importance in classifying the samples. Once this ranking is available, the final classifier is built using only the most important variables (here, we choose the top 50% of the variables after ranking them using the different variable selection methods), and the improvement in classification accuracy is examined without any parameter re-tuning.

The first method uses the Fisher criterion (FC) to rank the variables. The Fisher criterion is defined as the ratio of the “between-class” variance to the “within-class” variance (Wang et al., 2008). This criterion is maximized by LDA (Duda et al., 2000) to identify the best separation plane by weighting the predictor variables. The Fisher ranking method uses these weights to rank the variables. In order to check the independent effect of each variable on classification, we have also adopted the single variable ranking (SVR) approach. In this univariate approach, only a single selected predictor variable is used to build an LDA model, which is then tested to determine the classification accuracy.
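
As an illustration of the between-class to within-class variance ratio behind the Fisher criterion, the following minimal sketch scores each variable on its own and ranks the variables by that score. It is a common univariate form of the criterion and is not the ranking by LDA weights used in this work; the function names and the toy data are hypothetical.

```python
from statistics import mean


def fisher_score(values, labels):
    """Between-class over within-class variation for a single variable."""
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    # Between-class part: spread of the class means around the overall mean.
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    # Within-class part: spread of the samples around their own class mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return between / within if within > 0 else float("inf")


def rank_variables(columns, labels):
    """Return variable names sorted from most to least discriminating."""
    scores = {name: fisher_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: "x1" separates the two classes, "x2" does not.
toy_labels = [1, 1, 1, 2, 2, 2]
toy_columns = {
    "x1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "x2": [5.0, 1.0, 3.0, 4.0, 2.0, 3.0],
}
ranking = rank_variables(toy_columns, toy_labels)
```

In this toy case both classes have the same mean for "x2", so its score is zero and "x1" is ranked first.
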
In SVR, this single-variable LDA model building and testing is repeated independently for each predictor variable, so that a classification accuracy is obtained for every variable. The variables are then ranked based on these prediction accuracy values.

The FC and SVR approaches provide good measures of variable influence on classification, in line with the principle of LDA classification. To establish a similar advantage for other classifiers working on different principles, we adopt two other variable selection methods: the entropy measure, which is useful for CART, and the partial correlation based variable selection approach (PCCM), which can potentially influence the variable interaction based approaches of VPMCD and ANN. In the entropy method, variables are ranked based on their entropy measures (Ebrahimi et al., 1999), which signify the randomness in the data for that variable. In the PCCM method, the partial correlation coefficients of orders 0, 1 and 2 are calculated between different pairs of variables, and the resulting multivariate associations (in the form of edges on a node in the association network) are used as the basis for variable ranking (Raghuraj Rao and Lakshminarayanan, 2007a). Though these specific techniques can potentially influence the performance of particular classifiers, we analyze the performance of all variable selection methods with all the classifiers. This is mainly to achieve the objective of selecting the best combination of variable selection method and classifier.

4.3 MATERIALS AND IMPLEMENTATION

4.3.1 Datasets

4.3.1.1 Anesthesia Dataset

The problem of classifying and predicting the DOA level (Mahfouf, 2006) can be attempted using either AEP features or cardiovascular features as predictors. The difference in the number of samples for each class makes classification difficult and challenging for this dataset. The analysis is done separately using two different datasets obtained from Prof. Mahfouf (Nunes et al., 2005; Mahfouf, 2006).
The datasets were collected at the Royal Hallamshire Hospital in Sheffield, UK. Readers are directed to Nunes et al. (2005) and Mahfouf (2006) for further information about the datasets. The first dataset uses 10 AEP features, while the second uses 3 cardiovascular features (heart rate (HR), systolic arterial pressure (SAP) and mean arterial pressure (MAP)). The classification problem involves four classes, i.e. DOA levels (awake, Ok/light, Ok, and Ok/deep), and consists of 414 samples, which correspond to the number of patients during the surgery. Thus the classification datasets considered are N1 ~ [n = 414 x p = 10; k = 4] for the AEP features dataset and N2 ~ [n = 414 x p = 3; k = 4] for the cardiovascular features dataset.

4.3.1.2 Wisconsin Breast Cancer (WBC) dataset

The WBC dataset collected by Wolfberg and Mangasarian (1990) is available online in a public domain database (Asuncion and Newman, 2007). Based on 9 cell attributes, such as mitoses and clump thickness, cells are classified into 2 classes (malignant and benign). There are a total of 699 samples in this dataset, with 65.5% of them being benign cells and the remaining 34.5% being malignant cells. Data are missing in 16 patient records, so these 16 samples are excluded from the analysis. The size of this system is M ~ [683 samples x 9 predictors; 2 classes].

4.3.1.3 Wisconsin Diagnostic Breast Cancer (WDBC) dataset

The WDBC dataset collected by Wolfberg and Mangasarian (1990) is available for public use at http://archive.ics.uci.edu. Compared to WBC, this dataset has many more attributes but fewer samples. It contains 30 real-valued attributes for 569 samples (357 taken from benign cells and the remainder from malignant cells) without any missing values. The size of this system is P ~ [569 samples x 30 predictors; 2 classes].

4.3.1.4 Heart Disease dataset

The heart disease dataset relates to a 2 class problem. It consists of 13 attributes from 270 observations (150 patients not having heart disease and 120 patients having heart disease). This dataset is analyzed to classify patients with and without heart disease, and the result is then used to predict the presence of heart disease in new patients. The total system size is O ~ [270 samples x 13 predictors; 2 classes]. The existence of 4 different types of attributes adds some challenges in analyzing this dataset. Class 1 denotes the absence of heart disease and class 2 denotes the presence of heart disease. There are no missing values in this dataset.

4.3.2 Implementation

Since our classification models need to be validated, the data were split into a training set and a test set. The training set is used to build the classifier models, and all classifiers were built on the same training datasets. The test set is kept separate and is only used for (pure) validation of classifier performance. It is not used during modeling or parameter tuning. Prior to analyzing the anesthesia datasets, the AEP features dataset was randomly divided into a training set (2/3 of the data), M ~ [n = 276 x p = 10; k = 4], and a test set (1/3 of the data), S ~ [n = 138 x p = 10; k = 4]. A similar 2/3-1/3 split was performed on the cardiovascular features dataset. The breast cancer (WBC and WDBC) and heart disease datasets are split differently: a training set consisting of 80% of the total samples is used to build the classification model and the other 20% is kept for validation purposes.

4.3.3 Model Development

Every classification method has its own set of user-defined parameters. The performance of any method depends significantly on identifying suitable values for these tuning parameters. To this end, the training set is divided randomly into M1 ~ 80% of the training set and M2 ~ 20% of the training set.
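
The random divisions described above (2/3-1/3 into M and S, then 80%-20% of M into M1 and M2) can be sketched as follows. This is a minimal illustration that assumes a plain, unstratified shuffle of sample indices; the function name and the seeds are hypothetical and are not taken from the original MATLAB implementation.

```python
import random


def random_split(samples, first_fraction, seed):
    """Shuffle a copy of the samples and cut it into two disjoint parts."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(first_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# 414 sample indices, the size of each anesthesia dataset.
all_ids = list(range(414))
train_ids, test_ids = random_split(all_ids, 2 / 3, seed=0)  # M and S
m1_ids, m2_ids = random_split(train_ids, 0.8, seed=1)       # M1 and M2
```

Repeating the second call with a different seed each time gives the repeated M1/M2 splits used during parameter tuning.
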
M1 is used to build a model (with a certain choice of parameters), followed by validation on M2. The data split, model building and validation are repeated 50 times for each classifier. This procedure is executed with different parameter values and the best parameters are chosen based on the optimization of specific criteria, e.g. high classification accuracy. Such parameter tuning has been done for all classifiers used in this work. The mean (µ) and standard deviation (σ) of the classification accuracies (over 50 iterations) are calculated for the best model for each method, and the coefficient of variation (CoV) is calculated using the following equation.

CoV = 100 x σ / µ    (4.1)

If the CoV value is less than 20%, the classifier with the tuned parameters is considered for further analysis. If not, the parameter tuning step is repeated for this classifier until stable model parameters are obtained. All steps in model development are carried out for the cardiovascular features dataset, the AEP features dataset, the WBC dataset, the WDBC dataset and the heart disease dataset.

4.3.4 Validation Testing

After stable parameter values are obtained for each classifier, the final classifier is built using the entire training dataset (M) and the best parameter values. The model is finally tested on the test dataset (S) to get the accuracy of the model. Since the number of samples in each class is not equal, overall accuracy is not the best metric for comparing the performance of the classifiers. Therefore, the analysis is expanded by doing a class-wise comparison and calculating sensitivity and specificity. The formulas used for calculating sensitivity and specificity for each specific class can be found in Podgorelec et al. (2005). After the test samples in S are subjected to validation of the classifier model, the sensitivity and specificity percentages are calculated for all the classes and the average value is reported as the indication of classifier performance.
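
The two summaries used above, the coefficient of variation of Eq. (4.1) and the class-averaged sensitivity and specificity, can be sketched as follows. The sketch assumes the sample standard deviation for σ and a one-vs-rest treatment of each class; neither choice is stated explicitly in the text, and the accuracies and the 3-class confusion matrix are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(accuracies):
    """CoV = 100 * sigma / mu, Eq. (4.1); sigma is the sample std (assumed)."""
    return 100.0 * stdev(accuracies) / mean(accuracies)


def class_sensitivity_specificity(confusion, k):
    """One-vs-rest sensitivity and specificity for class index k.

    confusion[i][j] is the number of class-i samples predicted as class j.
    Every class is assumed to have at least one sample.
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


def averaged_metrics(confusion):
    """Average the per-class values and report them in percent."""
    pairs = [class_sensitivity_specificity(confusion, k)
             for k in range(len(confusion))]
    return (100.0 * mean(p[0] for p in pairs),
            100.0 * mean(p[1] for p in pairs))


# Hypothetical validation accuracies from five repeated splits.
cov = coefficient_of_variation([0.80, 0.82, 0.78, 0.81, 0.79])
is_stable = cov < 20.0  # stability threshold used in the text

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [[8, 2, 0],
                 [1, 6, 3],
                 [0, 2, 18]]
sensitivity, specificity = averaged_metrics(toy_confusion)
```

Averaging over classes gives every class the same weight, so a small class that is never recognized lowers the averaged sensitivity much more than it lowers the overall accuracy.
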
For a given class, sensitivity is the probability that a sample belonging to that class is correctly assigned to it, whereas specificity is the probability that a sample not belonging to that class is correctly rejected (Liu et al., 1996).

4.3.5 Variable Selection

Variables are ranked using the variable ranking methods discussed in section 2.1. Variable selection is applied only to the cardiovascular features dataset and the AEP features dataset. After all the variables are ranked, the two most important variables are retained for the cardiovascular features dataset and the five most important variables for the AEP features dataset, for each method. As a result, the size of the training set is reduced to Mr ~ [n = 276 x p = 2; k = 4] (cardiovascular features dataset) and [n = 276 x p = 5; k = 4] (AEP features dataset), and the size of the test set to Sr ~ [n = 138 x p = 2; k = 4] (cardiovascular features dataset) and [n = 138 x p = 5; k = 4] (AEP features dataset). After performing variable selection, 6 different sets of Mr and Sr are collected (one for each variable selection method) because each of the methods has different variables as its top 5 variables. For the cardiovascular dataset, some variable selection methods give the same top 2 variables. The analysis is then continued by building a model using Mr and the best parameters obtained in the model development step. The model is then tested on the respective Sr set. The model building and testing is done using the 6 different sets of Mr and Sr for every classifier. Therefore, there are 24 combinations of dataset and classifier in this analysis (6 datasets x 4 classifiers).

4.3.6 Software

CART and TreeNet classification is done using software developed by Salford Systems, USA (Salford Systems, 2007a; Salford Systems, 2007b). MATLAB’s (MATLAB, 2005) built-in function “classify” is used for LDA and QDA.
MATLAB implementations of the DPCCM and VPMCD algorithms (Raghuraj Rao and Lakshminarayanan, 2007b) and of a neural network algorithm are used for building the VPMCD, DPCCM and ANN classifiers. The variable selection methods used in the analysis (except entropy) were coded in MATLAB (MATLAB, 2005). The CART software is used to rank the variables with the entropy ranking method. The developed MATLAB codes can be made available to interested readers upon request.

4.4 RESULTS

4.4.1 Parameter Tuning

The parameter tuning results are presented in Table 4.1 for DOA classification, Table 4.2 for breast cancer identification and Table 4.3 for heart disease identification. The tables show the settings of the best parameters obtained from the tuning procedure and also the coefficient of variation for each classifier based on 50 cross validation tests. As can be seen in Tables 4.1 to 4.3, TreeNet and CART are stable with respect to random data sampling. However, the number of parameters that have to be tuned in these classifiers is larger than in the other classifiers, since the misclassification cost for each class in CART and TreeNet has to be optimized as well. For comparison, in the DOA classification case we have to tune 15 parameters for CART, 14 parameters for TreeNet, 2 parameters for VPMCD, 1 parameter for ANN and none for Discriminant Analysis (DA). Therefore, it can be concluded that obtaining a good model using CART and TreeNet is a significantly time-consuming activity.

Cost optimization (CO) is included in the CART and TreeNet analysis (Table 4.1 - Table 4.3) in order to reduce misclassification cases that can lead to undesired effects. For example, when DOA level 2, which should be ‘ok’, is misclassified as DOA level 4 (ok/deep), the controller may reduce the amount of anesthetic drug. As a result, the patient’s DOA could drop to level 1 (awake state) and result in a condition that is harmful for the patient.
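
To illustrate how such misclassification costs can change a decision, the sketch below picks the class with the lowest expected cost given estimated class probabilities. This is a generic textbook rule offered under stated assumptions; it is not a description of how the CART or TreeNet software applies its cost settings internally. The probabilities are hypothetical, and the single non-unit cost (class 4 misclassified as class 3 costs 2) is taken from Table 4.1.

```python
def min_cost_class(posteriors, cost):
    """Return the index of the class with the lowest expected cost.

    posteriors[i]: estimated probability that the true class is i
    cost[i][j]:    cost of predicting class j when the true class is i
    """
    n = len(posteriors)
    expected = [sum(posteriors[i] * cost[i][j] for i in range(n))
                for j in range(n)]
    return min(range(n), key=expected.__getitem__)


# Unit costs: every misclassification costs 1, correct decisions cost 0.
unit_cost = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

# Same costs, except that class 4 misclassified as class 3 costs 2.
weighted_cost = [row[:] for row in unit_cost]
weighted_cost[3][2] = 2

# Hypothetical class probabilities for one sample (DOA classes 1 to 4).
probabilities = [0.0, 0.1, 0.5, 0.4]

choice_unit = min_cost_class(probabilities, unit_cost) + 1          # class no.
choice_weighted = min_cost_class(probabilities, weighted_cost) + 1  # class no.
```

With unit costs the most probable class (class 3) is chosen; once the error of labelling a class 4 sample as class 3 is penalized more heavily, the decision for the same sample shifts to class 4.
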
Cost optimization therefore ensures some level of fault tolerance in the closed loop system that includes the anesthesiologist (or automatic controller), patient, measuring system, classifier and other hardware elements.

4.4.2 Test set Analysis

4.4.2.1 DOA classification

The results of classifier testing on the test dataset are shown in Table 4.4 for the case where all the cardiovascular features are used as predictors and in Table 4.5 for the case where all the AEP features are employed as predictors. The leftmost column shows the type of classifier used and the second column (titled “Class”) shows the number of correctly classified samples for each class. The third column shows the total number of samples which are correctly classified and the last column shows the percent overall accuracy for each classifier.

Table 4.1 Summary of parameter tuning results using the validation dataset for anesthesia

Method | AEP features dataset: best parameters | CoV | Cardiovascular features dataset: best parameters | CoV
VPMCD | Model type: linear + interaction; number of independent variables: 2 | 16.9 % | Model type: quadratic + interaction; number of independent variables: 1 | 15.47 %
LDA/QDA | No tuned parameters; QDA (diagquadratic) | 8.7 % | No tuned parameters (diaglinear) | 5.11 %
TreeNet | Network consists of 700 trees, and minimum number of training observations in terminal nodes = 5; class weight: unit; cost: 4 misclassified as 3 : 2 | 1.2 % | Network consists of 700 trees, and minimum number of training observations in terminal nodes = 5; class weight: balanced; cost: 3 misclassified as 2 : 1.3, 4 misclassified as 3 : 1.5 | 0.3 %
CART | Splitting method: entropy; priors: learn; minimum cases in parent node: 3; cost: 4 misclassified as 3 : 3 | 4.18 % | Splitting method: entropy; priors: equal; minimum cases in parent node: 3; cost: 4 misclassified as 3 : 2, 2 misclassified as 4 : 2 | 0.34 %
ANN | Number of hidden layers: 3 | 5.1 % | Number of hidden layers: 3 | 7.4 %

Table 4.2 Summary of parameter tuning results using the validation dataset for breast cancer

Method | WBC dataset: best parameters | CoV | WDBC dataset: best parameters | CoV
VPMCD | Model type: quadratic; number of independent variables: 4 | 2.30 % | Model type: linear; number of independent variables: 2 | 4.54 %
LDA/QDA | No tuned parameters (linear) | 1.82 % | No tuned parameters (linear) | 2.03 %
TreeNet | Network consists of 700 trees, and minimum number of training observations in terminal nodes = 3; class weight: balanced; cost: 2 misclassified as 1 : 2 | 0.19 % | Network consists of 700 trees, and minimum number of training observations in terminal nodes = 3; class weight: unit; cost: 3 misclassified as 2 : 1.3, 4 misclassified as 3 : 1.5 | 0.18 %
CART | Splitting method: Gini; priors: mix; minimum cases in parent node: 3; cost: 2 misclassified as 1 : 2 | 0.27 % | Splitting method: entropy; priors: mix; minimum cases in parent node: 5; cost: 1 misclassified as 2 : 5 | 3.05 %
DPCCM | Order: 1 | 2.47 % | Order: 1 | 2.55 %

Table 4.3 Summary of parameter tuning results using the validation dataset for heart disease

Method | Heart disease dataset: best parameters | CoV
TreeNet | Network consists of 700 trees, and minimum number of training observations in terminal nodes = 3; class weight: balanced; cost: 2 misclassified as 1 : 3 | 0.79 %
CART | Splitting method: entropy; priors: mix; minimum cases in parent node: 5; cost: 2 misclassified as 1 : 3 | 1.66 %

Table 4.4 Classification results (correct classification) on the test set using cardiovascular features as predictors

Classifier | Class 1# | Class 2# | Class 3# | Class 4# | Total # | Accuracy (%)
VPMCD | 0 | 7 | 12 | 36 | 55 | 39.86
LDA | 2 | 7 | 16 | 60 | 85 | 61.59
TreeNet | 0 | 10 | 9 | 53 | 72 | 52.17
CART | 0 | 11 | 7 | 73 | 91 | 65.94
ANN | 0 | 0 | 8 | 52 | 60 | 43.48
Total samples | 9 | 28 | 22 | 79 | 138 |
# shows the number of samples that are correctly classified

Table 4.5 Classification results (correct classification) on the test set using AEP features as predictors

Classifier | Class 1# | Class 2# | Class 3# | Class 4# | Total # | Accuracy (%)
VPMCD | 0 | 4 | 21 | 34 | 59 | 42.75
QDA | 0 | 1 | 14 | 82 | 97 | 70.29
TreeNet | 0 | 2 | 22 | 38 | 62 | 44.93
CART | 0 | 9 | 11 | 75 | 95 | 68.84
ANN | 0 | 8 | 18 | 63 | 89 | 64.49
Total samples | 3 | 15 | 23 | 97 | 138 |
# shows the number of samples that are correctly classified

As can be seen in Table 4.4, for the dataset which uses cardiovascular features as predictors, CART gives the best overall accuracy in our study by correctly classifying 91 of the 138 samples. CART, which does not consider interactions amongst predictors while constructing the classifier, performs considerably better than the other classifiers. On this dataset, TreeNet gives lower accuracy than CART. VPMCD gives a very low accuracy for this dataset, possibly because interactions between predictor variables are not significant here. Similar to VPMCD, the ANN classifier is based on modeling, and their accuracies are therefore similar. It is interesting that VPMCD can predict class 2 and class 3 samples better than ANN, while ANN makes better predictions on class 4 samples. This may happen with ANN because the number of class 4 samples is much larger than for the other classes.
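
The overall accuracies in Table 4.4 follow directly from the per-class counts, and the same counts show how strongly class 4 dominates the result. The short sketch below recomputes the overall accuracy together with the unweighted mean of the per-class accuracies; the counts are copied from Table 4.4, while the function and variable names are hypothetical.

```python
# Test samples per class (classes 1 to 4) and correctly classified counts,
# copied from Table 4.4 (cardiovascular features).
class_totals = [9, 28, 22, 79]
correct_counts = {
    "VPMCD":   [0, 7, 12, 36],
    "LDA":     [2, 7, 16, 60],
    "TreeNet": [0, 10, 9, 53],
    "CART":    [0, 11, 7, 73],
    "ANN":     [0, 0, 8, 52],
}


def overall_accuracy(counts, totals):
    """Percentage of all test samples that are classified correctly."""
    return 100.0 * sum(counts) / sum(totals)


def mean_class_accuracy(counts, totals):
    """Unweighted mean of the per-class accuracies, in percent."""
    return 100.0 * sum(c / t for c, t in zip(counts, totals)) / len(totals)


summary = {
    name: (round(overall_accuracy(counts, class_totals), 2),
           round(mean_class_accuracy(counts, class_totals), 2))
    for name, counts in correct_counts.items()
}
```

For CART this gives 65.94% overall but only 40.88% when every class is weighted equally, and for LDA 61.59% and 48.97%. The second figure of each pair agrees with the cardiovascular sensitivity values reported later in Table 4.6.
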
This explanation (the dominance of class 4 samples) is also supported by ANN’s poor ability in classifying class 2 samples. On the other hand, VPMCD, which tries to capture all class profiles while building the classifier models, can classify some of the class 2 and class 3 samples, but its performance in classifying class 4 samples is lower than that of the other classifiers. LDA, which takes into account the weightage on variables while constructing the separating plane, provides the second best performance. Three of the classifiers presented here (LDA, CART and TreeNet) perform better than the results reported in Mahfouf (2006) and Nunes et al. (2005). These earlier studies reported an overall accuracy of 46.5% using a fuzzy relation classifier on the same training and test sets. LDA is seen to provide lower overall accuracy than CART. This happens because LDA performs poorly compared to CART on class 4 and class 2 samples, while it does better on classes 1 and 3. Because class 4 contains so many samples, the overall accuracy of LDA turns out to be lower than that of CART. All classifiers perform best in classifying class 4 samples and poorly on class 1 samples.

The results for classifiers built using AEP features are presented in Table 4.5. In Mahfouf (2006), it is reported that the fuzzy relation classifier gives an overall accuracy of 61%. In the present analysis, CART and QDA (which gives higher accuracy than LDA) provide better prediction accuracies for the same training and test sets with AEP features. The results in Table 4.5 indicate a pattern where no classifier is able to correctly classify any of the class 1 samples. This may be because the number of class 1 samples in the training set is too small (10 samples out of 276). A small number of samples in the training set results in inadequate learning by any classifier. Therefore, it is difficult to model the class 1 profile and classify new samples correctly.

QDA proves to be the best predictor for DOA classification in terms of overall accuracy, classifying 97 new samples correctly into their corresponding class. For class 2 samples, however, QDA can only classify 1 out of 15 samples correctly. Thus, QDA cannot classify class 1 and class 2 as well as it classifies classes 3 and 4. In this case, CART has slightly lower overall accuracy than QDA, while its sensitivity and specificity are slightly higher (see the second column of Table 4.6). VPMCD models the class 3 samples well, even though its overall performance is lower: 21 of the 23 class 3 samples (91%) are correctly identified by VPMCD when testing on the AEP features dataset. In Table 4.5, only TreeNet (22 of 23) does better on class 3. Although ANN’s performance in classifying class 2 samples is nearly as good as that of CART, its performance in class 4 classification is lower than that of CART. For DOA classification using AEP features as predictors, TreeNet performance is significantly lower than that of CART. While TreeNet performs better than or as well as CART on classes 1 and 3, it does very poorly on class 2 and class 4 samples compared to CART. From Tables 4.4 and 4.5, it can be observed that all classifiers except TreeNet give higher accuracy using the AEP features dataset than using the cardiovascular dataset. This highlights the fact that AEP features are better predictors of depth of anesthesia than cardiovascular features.

In a surgical setting, DOA level 1 and DOA level 4 are considered the most crucial conditions, which have to be classified correctly. In such situations, LDA is the recommended classifier if DOA classification is done using cardiovascular features, because it is the only classifier which is able to classify any class 1 samples correctly. In addition, its performance in classifying class 4 samples is lower than that of CART only (see Table 4.4). Also, if the emphasis of diagnosis is on achieving the highest class 4 accuracy, QDA with AEP features could be a suitable classifier choice. These observations highlight that there is no single classifier which performs best for all the different objectives of DOA decision making. Given a specific performance objective, the classifiers need to be tuned and chosen accordingly. This conclusion is further supported by the analysis using class sensitivity and specificity measures.

Table 4.6 Sensitivity and specificity values for each classifier in DOA classification

Classifier | AEP features: sensitivity | AEP features: specificity | Cardiovascular: sensitivity | Cardiovascular: specificity
VPMCD | 38.26 | 81.97 | 31.28 | 80.47
LDA/QDA | 38.02 | 84.82 | 48.97 | 88.35
TreeNet | 37.04 | 81.57 | 35.93 | 81.58
CART | 46.29 | 84.85 | 40.88 | 85.23
ANN | 49.14 | 88.29 | 25.55 | 79.02
All results are presented in percentage (%)

Table 4.6 shows the specificity and sensitivity values for all classifiers on both datasets (cardiovascular features and AEP features). As can be seen, LDA has the highest sensitivity and specificity for the cardiovascular features dataset, while ANN has the highest sensitivity and specificity for the AEP features dataset. The class specific performance of the different classifiers is clearly evident from these results. The sensitivity and specificity values of VPMCD, TreeNet and QDA are quite similar for the AEP features dataset, while the overall accuracy of QDA is significantly higher than that of TreeNet and VPMCD. This is an important observation for the present DOA classification problem, as the selection of the best classifier needs to be based on class specific objectives instead of overall classification accuracy.

4.4.2.2 Classification with WBC dataset

Table 4.7 shows the classification results on the WBC dataset, with class 1 representing patients with benign cancer and class 2 representing patients with malignant cancer.
As mentioned earlier, with biomedical data it is very important to also consider class-wise accuracy when comparing classifier performance. For this case, class 2 accuracy is more important than class 1 accuracy because a cancer patient needs medication and treatment as soon as possible. If cancer patients are wrongly classified as “healthy”, they will not receive any medication, at least until the illness becomes obvious and serious, when it may be too late for a cure. Therefore, a classifier should be deemed better if it has higher prediction accuracy for class 2. As can be seen in Table 4.7, TreeNet not only gives the best overall accuracy in our study, correctly predicting 95.56% of the test samples, but also perfectly identifies all cancer patients in the test dataset. This confirms the superiority of TreeNet over the other classifiers in breast cancer identification based on the attributes available in the WBC dataset. However, TreeNet’s performance in classifying class 1 samples is not as good as its performance on class 2 samples. This is mainly because the cost set during model construction (see Chapter 2) makes TreeNet give different weights to each decision tree in the model. In other words, it adjusts the parameters in the TreeNet model in such a way that the resulting model is very good at classifying class 2 samples. As a consequence, the information retained in the model is insufficient to correctly classify all class 1 samples.

Table 4.7 Analysis results for the WBC dataset using LDA, VPMCD, DPCCM, CART and TreeNet

Run | Accuracy | LDA | VPMCD | DPCCM | CART | TreeNet
Resubstitution | Class 1 | 98.20 | 99.77 | 95.05 | 99.55 | 98.65
Resubstitution | Class 2 | 92.89 | 94.14 | 100 | 100 | 100
Resubstitution | Overall | 96.34 | 97.80 | 96.78 | 99.71 | 99.12
Testing | Class 1 | 96.59 | 94.32 | 90.91 | 93.18 | 93.18
Testing | Class 2 | 87.23 | 93.62 | 100 | 97.87 | 100
Testing | Overall | 93.33 | 94.07 | 94.07 | 94.81 | 95.56
All results are presented in percentage (%)

As shown in Table 4.7, CART gives the same performance as TreeNet in classifying healthy subjects into class 1. On the other hand, it fails to classify some of the cancer patients correctly. Therefore its overall performance is slightly lower than that of TreeNet. Interestingly, VPMCD and DPCCM give the same overall accuracy, albeit with different class-wise performance. Similar to TreeNet, DPCCM is able to classify class 2 samples very well. Its accuracy for class 2 samples is much better than that of VPMCD. In contrast, VPMCD has better class 1 performance than DPCCM. According to our analysis, LDA gives the lowest performance in both overall accuracy and class 2 accuracy. This might be because the samples in those classes do not follow a Gaussian distribution (Wang et al., 2008). In addition, overlap between the class profiles is another disadvantage for LDA in building the separation plane. Since the number of class 1 samples is larger than the number of class 2 samples, LDA has the best performance in class 1 classification.

4.4.2.3 Classification with WDBC dataset

Table 4.8 shows the classification results for the WDBC dataset, with class 1 representing patients with malignant cancer and class 2 representing people without cancer. As in the WBC case study, the accuracy for the cancer class (here class 1) must receive more attention than class 2 accuracy, since a cancer patient needs medication and treatment as soon as possible. As presented in Table 4.8, DPCCM achieves the highest (perfect) testing accuracy overall and for both class 1 and class 2.
In other words, DPCCM is able to perfectly classify all test samples into their corresponding classes. This result puts the DPCCM proposed in Chapter 3 in a better light: it may prove to be a good classifier not only in food applications but also in biomedical applications.

Table 4.8 Analysis result for WDBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD

                            LDA     VPMCD   DPCCM   CART    TreeNet
Resubstitution   Class 1    92.45   82.08   96.23   100     100
                 Class 2    99.44   96.36   96.36   99.16   100
                 Overall    96.84   91.04   96.31   99.47   100
Testing          Class 1    95.24   88.10   100.00  100     92.86
                 Class 2    100     97.18   100.00  88.73   100
                 Overall    98.23   93.81   100.00  92.92   97.35

All results are presented in percentage (%)

Unlike in the WBC dataset, LDA performs well in classifying class 2 samples for this dataset. This could be because the data points are linearly separable. The reasonably good performance of VPMCD (Table 4.8), which uses a linear model, in classifying class 2 samples also points to linear separability of the dataset. As observed in Table 4.8, LDA performs better on class 2 than on class 1. During model building, LDA sets weights on all variables in a manner that makes its class 2 predictions perfect. This reduces its ability to predict class 1 samples accurately. The other classifier that performs well in class 2 classification is TreeNet. The ensemble of decision trees confirms its superiority to a single decision tree (CART) by giving better overall and class 2 accuracy. However, its performance on class 1 is still lower than that of CART. This may happen because, during model construction, TreeNet assigns a weight factor to each variable so as to classify class 2 samples well. As a consequence, it fails to classify some class 1 samples correctly. CART, the classifier with the lowest overall accuracy, is able to perfectly classify class 1 samples. On the other hand, it performs poorly in predicting class 2 samples.
Based on these results, it can be concluded that cancer identification using the WDBC dataset is preferably done with DPCCM, which gives the highest random testing accuracy.

4.4.2.4 Heart Disease Identification

Table 4.9 shows the classification results on the heart disease dataset, with class 1 representing the absence of heart disease (healthy) and class 2 representing the presence of heart disease. Since this analysis involves some categorical variables, classifiers which use mathematical equations in building the classification model (e.g. LDA, DPCCM, and VPMCD) cannot be used. Therefore, the classification is done only with CART and TreeNet. As can be seen in Table 4.9, TreeNet gives much better overall testing accuracy but lower class 2 accuracy than CART. CART's performance on class 1 classification is much poorer than that of TreeNet. As with the earlier biomedical case studies, class 2 accuracy must be afforded higher priority when comparing classifier performance. Thus, CART seems to be the better classifier for heart disease identification since it can “recognize” a patient with heart disease better than TreeNet.

Table 4.9 Classification result on heart disease dataset using CART and TreeNet

                            CART    TreeNet
Resubstitution   Class 1    91.33   92
                 Class 2    100     85.83
                 Overall    95.19   89.26
Testing          Class 1    73.33   86.67
                 Class 2    87.5    83.33
                 Overall    79.63   85.19

All results are presented in percentage (%)

4.4.3 Variable Selection

Variable selection is applied only to the AEP features and cardiovascular features datasets for DOA classification. The variables selected by the different methods for the AEP features dataset are shown in Table 4.10 and the variables selected for the cardiovascular parameters dataset are tabulated in Table 4.11. Different selection algorithms select different sets of variables as important even though they all start from the same dataset.
This indicates the differences between the existing variable ranking methods and stresses the importance of selecting a specific technique for a given problem. After variable selection, the analysis is continued by building all classifiers on each set of selected variables. The analysis results are tabulated in Table 4.12. In some cases, the classifiers perform better when developed on a subset of variables. In other cases, the opposite is observed. Poorer performance may occur when the variable subset selection method is not compatible with the classifier. In these cases, the selected variables fail to give the classifier enough information to make a good separation (McCabe, 1984). In addition, by decreasing the number of predictor variables, some information may be lost.

Table 4.10 Variables selected from 10 AEP features using different selection methods

Method                         Variables selected
Ranking of single variable     4, 3, 1, 2, 6
PCCM based ranking, Order 0    2, 9, 5, 6, 7
PCCM based ranking, Order 1    5, 2, 9, 1, 4
PCCM based ranking, Order 2    5, 1, 2, 3, 4
Entropy                        5, 9, 6, 8, 4
Fisher Criteria                4, 3, 1, 9, 5

Table 4.11 Variables selected from 3 variables in cardiovascular dataset using different selection methods

Method                         Variables selected
Ranking of single variable     SAP and MAP
PCCM based ranking, Order 0    HR and SAP
PCCM based ranking, Order 1    HR and SAP
PCCM based ranking, Order 2    HR and SAP
Entropy                        HR and SAP
Fisher Criteria                SAP and MAP

The best result after variable selection is achieved by QDA (on the AEP features dataset) with Single Variable Ranking (SVR) as the variable selection method. After selecting only the five best variables, 102 of the 138 test samples are correctly classified. This result also confirms the consistency of QDA as the best method for DOA classification, in terms of overall accuracy, using AEP features data. This is not surprising because the single variable ranking method applies the LDA concept to rank the variables.
Therefore, the variables selected contain most of the information needed for LDA and QDA classification. As a result, single variable ranking gives the best classification result for QDA. It is observed that none of the variable selection methods improves the performance of CART. CART is a variable-based classifier which needs the information contained in the variables. When the number of variables involved in classification is decreased, CART probably has insufficient information to separate the classes. As a result, its performance is lower with variable subset selection. All variable selection methods are also applied to the cardiovascular dataset. The selected variables, tabulated in Table 4.11, are used to build the classifiers without retuning any of the parameters. The classifiers are then validated on the test dataset and the results are presented in Table 4.13. For this dataset, only 3 classifiers benefit from the variable selection procedure. The number of correctly classified test samples for VPMCD, CART and ANN increases by 36.4%, 4.4% and 31.67% respectively, in relative terms, compared to classification with all the variables. It is noteworthy that the combination of VPMCD and PCCM based variable selection significantly increased the classification accuracy for this dataset.
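The quoted gains can be checked directly from the counts in Table 4.13. The short sketch below is only an arithmetic check: the helper name `relative_gain` is illustrative, and the test-set size of 138 is inferred from the counts and percentages in the table rather than stated for this dataset.

```python
N_TEST = 138  # assumed test-set size, inferred from the counts/percentages in Table 4.13


def relative_gain(before, after):
    """Relative increase (in %) in the number of correctly classified samples."""
    return 100.0 * (after - before) / before


# (no selection, best result after variable selection), counts from Table 4.13
counts = {"VPMCD": (55, 75), "CART": (91, 95), "ANN": (60, 79)}

gains = {name: relative_gain(b, a) for name, (b, a) in counts.items()}
accuracy = {name: (100.0 * b / N_TEST, 100.0 * a / N_TEST)
            for name, (b, a) in counts.items()}

for name in counts:
    print(name, round(gains[name], 2),
          [round(x, 2) for x in accuracy[name]])
```

The figures are therefore relative improvements; in absolute terms, for example, VPMCD moves from 39.86% to 54.35% of the test samples.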
Table 4.12 Model accuracy using selected variables (AEP dataset)

Classifier   No selection    SVR             PCCM (order=0)   PCCM (order=1)   PCCM (order=2)   Entropy         Fisher
VPMCD        59 (42.75 %)    44 (31.88 %)    79 (57.25 %)     53 (38.4 %)      51 (36.96 %)     96 (69.57 %)    58 (42.03 %)
QDA          97 (70.3 %)     102 (73.9 %)    28 (20.3 %)      99 (71.7 %)      100 (72.5 %)     27 (19.6 %)     101 (73.2 %)
TreeNet      62 (44.93 %)    65 (47.1 %)     81 (58.7 %)      70 (50.72 %)     68 (49.28 %)     68 (49.28 %)    63 (45.65 %)
CART         95 (68.84 %)    85 (61.59 %)    91 (65.94 %)     65 (47.1 %)      70 (50.72 %)     91 (65.94 %)    92 (66.67 %)
ANN          89 (64.49 %)    90 (65.22 %)    97 (70.29 %)     89 (64.49 %)     89 (64.49 %)     97 (70.29 %)    89 (64.49 %)

Entries are the number of correctly classified test samples, with the percentage in parentheses.

Table 4.13 Model accuracy using selected variables (cardiovascular dataset)

Classifier   No selection    SVR             PCCM (order=0)   PCCM (order=1)   PCCM (order=2)   Entropy         Fisher
VPMCD        55 (39.86 %)    48 (34.78 %)    75 (54.35 %)     75 (54.35 %)     75 (54.35 %)     75 (54.35 %)    48 (34.78 %)
LDA          85 (61.59 %)    77 (55.80 %)    81 (58.70 %)     81 (58.70 %)     81 (58.70 %)     81 (58.70 %)    77 (55.80 %)
TreeNet      72 (52.17 %)    70 (50.72 %)    70 (50.72 %)     70 (50.72 %)     70 (50.72 %)     70 (50.72 %)    70 (50.72 %)
CART         91 (65.94 %)    95 (68.84 %)    80 (57.97 %)     80 (57.97 %)     80 (57.97 %)     80 (57.97 %)    95 (68.84 %)
ANN          60 (43.48 %)    79 (57.24 %)    77 (55.79 %)     77 (55.79 %)     77 (55.79 %)     77 (55.79 %)    79 (57.24 %)

Our experience with the DOA dataset emphasizes the necessity of employing a case-specific classifier and a suitable preprocessing technique. The performance of the classifiers is data specific and no single method should be overemphasized. It is also important that different methods be tried and that sound procedures be employed to determine the best user-defined parameters for the methods and the best subset of variables. In this study, DOA classification is performed using CART, TreeNet, VPMCD, ANN and LDA/QDA. The comparison study is performed with the objective of determining the best classifier, i.e. the one most capable of correctly classifying new samples into their corresponding classes.
According to our analysis, in terms of overall accuracy, CART and QDA are the best classifier models for DOA classification using cardiovascular features and AEP features respectively. Even when the classifiers are built using a subset of features, the superiority of CART and QDA in DOA classification using the cardiovascular dataset and AEP features respectively is confirmed. The utility of DPCCM and other advanced machine learning tools like CART and TreeNet in handling data from the medical domain and extracting information from them is checked by applying them to the WBC and WDBC datasets. DPCCM as well as TreeNet classify all cancerous samples in the WBC test dataset correctly, and TreeNet also gives the best overall test accuracy. This indicates the promising performance of DPCCM for medical applications. In addition, DPCCM appears to be the most suitable classifier in the WDBC case study since it perfectly classifies all test samples into their corresponding classes. This study supports the view that DPCCM is a strong classifier, since its performance is good not only for food product datasets but also for biomedical datasets.

In this chapter, the performance of the classifiers is also examined using a heart disease dataset. The presence of categorical data in this dataset precludes some classifiers because of their inability to handle categorical data. Therefore, this study was conducted using only TreeNet and CART. Based on our results, CART is the recommended classifier for heart disease identification since patients with heart disease must be identified correctly for medical treatment. On the other hand, if the objective is to identify healthy patients, TreeNet can be applied to the dataset.
Chapter 5 Empirical Modeling of Diabetic Patient Data

All models are wrong, some models are useful
George Box

5.1 INTRODUCTION

One major component of critical care in the Intensive Care Unit (ICU) is the regulation of blood glucose in patients. Patients in the ICU experience psychological trauma and extreme stress. The challenge is to achieve tight glycaemic control and avoid abnormal conditions such as hyperglycemia and hypoglycemia. Hyperglycemia is commonly observed in critically ill patients regardless of their past medical history. The effect of hyperglycemia on the death rate of ICU patients was first observed in the surgical ICU of Leuven University Hospital (Van den Berghe, 2003). Van den Berghe (2003) showed that tight glucose control can reduce the mortality rate of ICU patients by up to 45%. Based on this study, The American Association of Clinical Endocrinologists recommended 80 and 110 mg/dl as the lower and upper limits of blood glucose for intensive care patients (Kelly et al., 2006; Umpierrez et al., 2007; Kitabchi et al., 2008; Tamaki et al., 2008). Studies have shown that poor glycaemic control can lead to vascular complications such as blindness, renal dysfunction, nerve damage, multiple organ failure, myocardial infarction, limb amputation and even death in type 1 and type 2 diabetic patients (Taylor et al., 2006; Vanhorebeek et al., 2006; Chase et al., 2008; Kitabchi et al., 2008). Thus, the regulation of blood glucose is of utmost importance for all ICU patients as well as patients suffering from type 1 or type 2 diabetes.

Two broad approaches are available for blood glucose regulation in diabetic patients. Practitioners generally prefer protocols (i.e. rule-based administration of insulin, oral drugs and/or oral glucose (for treating hypoglycemia)), while researchers and academics have mainly focused on feedback controllers designed using control theory.
The relative ease of implementing protocols and the lack of a proven track record for automatic control in treating diabetics have made hospitals prefer protocol-based methods over automatic feedback-based control for ICU patients. Many established protocols are available in the literature (Taylor et al., 2006; Tamaki et al., 2008) and many ICUs prefer using their own in-house protocols to control blood glucose levels in patients under their care. A drawback of existing protocols is that they fail to explicitly consider variations in insulin levels, effectiveness of insulin utilization, glucose absorption and other patient-specific factors (Chase et al., 2005). Therefore, it would be ideal if the protocol were designed, optimized and personalized considering these patient-specific factors. In this context, patient-specific models, constructed from patient data collected during the early stages of the ICU stay, can be beneficially used by physicians and caregivers for improved ICU care.

The modeling of blood glucose in ICU patients is complicated by noisy measurements, infrequent sampling, lack of reliable insulin or glucose infusion profiles, known/unknown disturbances related to patient condition (stress, sepsis, etc.) and unrecorded events (therapeutic drugs taken for the medical conditions for which the patient is in the ICU).

Many model structures available in the literature have been reviewed by Ramprasad (2004). Here, some of the representative and popular models are reviewed. V. W. Bollie (1961) set a mark in the modeling of diabetes by developing a two-state linear model consisting of one differential equation each for glucose and insulin. Ackerman et al. (1965) proposed a similar model structure for glucose-insulin dynamics in a healthy person. Although these two models oversimplify the physiological glucose and insulin effects, they successfully capture the interaction between glucose and insulin. Bergman et al.
(1981) developed a model with three differential equations that represent insulin production and infusion, insulin storage in a remote compartment, and glucose input and insulin utilization in a second compartment. The model uses the remote compartment concept for insulin storage to account for the time delay between insulin injection and its utilization (Lam et al., 2002). A model consisting of a glucose subsystem, a glucagon subsystem and an insulin subsystem was presented by Cobelli et al. (1982). The glucose and glucagon subsystems are each modeled using a single compartment and the insulin subsystem is represented by a 5-compartment model. This non-linear model uses a threshold function to describe the saturation behavior observed in biological sensing. Cobelli and Mari (1983) validated this model in a glucose regulation case study. Puckett (1992) modeled the human body as a two-blood-pool system representing insulin and glucose concentrations. The model included the nonlinear metabolic behavior of the glucose-insulin system as well as carrier mechanisms and diffusion pathways, which improve the accuracy of glucose and insulin removal from the blood stream. However, high-frequency dynamics are neglected by the steady-state compartments in this model. Puckett and Lightfoot (1995) then improved the model by accounting for intra- and inter-patient variability (Ramprasad, 2004). More recently, a model was presented by Chase et al. (2005). The model was developed by modifying the Bergman model to include insulin utilization, insulin losses and saturation dynamics.

Each of the above models has its own advantages and disadvantages. In this chapter, we propose the use of the simple first order plus time delay (FOPTD) model to fit and predict the blood glucose level of ICU patients. To the best of our knowledge, the FOPTD model has not been employed for modeling the blood glucose dynamics in ICU patients.
This is one novelty in the present work. In addition, the FOPTD model based structure is flexible and extendable to model any additional phenomena that may become important.

5.2 First Order plus Time Delay (FOPTD) Model

The first order plus time delay (FOPTD) model finds application in many problems related to process dynamics and control. It is a commonly used model structure to capture the dynamic behavior of chemical engineering processes. Its simplicity and ability to characterize plant dynamics make FOPTD very useful, especially in designing feedback control systems (Ogunnaike and Mukati, 2006; Fedele, 2008). The FOPTD model for a single input single output system is given by Eq. 5.1, where y(s) represents the process output and u(s) represents the process input. K is the steady state gain, τ is the time constant and θ represents the time delay.

$$ y(s) = \frac{K e^{-\theta s}}{\tau s + 1}\, u(s) \tag{5.1} $$

In this study, the FOPTD model is used to fit and predict the blood glucose level of ICU patients as a function of insulin (intravenous and bolus) and glucose (intravenous and oral) inputs. The FOPTD model is not capable of representing any phenomenological aspects of blood glucose dynamics; rather, it is a correlational model that is able to capture and express the effect of external inputs (exogenous insulin, meal, patient state etc.) on the blood glucose level. Despite its simplicity, the FOPTD structure is capable of capturing patient-specific blood glucose dynamics. Furthermore, the model parameters can be easily interpreted by the physician and readily employed for treating the patients, making it very attractive for practical applications. To accommodate the effect of the different inputs, we employ a multi-input single output (MISO) model structure with each input's dynamics modeled as a FOPTD subsystem (see Fig. 5.1).

[Figure: block diagram. Inputs: intravenous glucose, oral glucose and insulin. Each input u(s) passes through its own FOPTD block, K_i e^{-θ_i s}/(τ_i s + 1) with i = 1, 2, 3; the block outputs and a noise term are summed to give the BSL deviation.]

Fig. 5.1 FOPTD model scheme (MISO System)

5.3 MATERIALS AND IMPLEMENTATION

5.3.1 Dataset and Software

The datasets were collected in the surgical ICU of the National University Hospital (NUH), Singapore between January and July 2008. Blood glucose values (recorded once every 4 hours), amounts of intravenous glucose, orally administered glucose and the insulin infused (via bolus and intravenous route) were recorded. At the ICU in NUH, the physicians endeavor to maintain the blood glucose levels of patients between 6 and 8 mmol/l. This is a standard practice in many ICUs and is considered to be a good compromise between tight glycemic control and the risk of hypoglycaemia. The hypocount is monitored every 4 hours. Based on the hypocount levels, an insulin infusion rate is fixed. This infusion is then supplemented with boluses of insulin based on a sliding scale protocol. As an example, the protocol used in the Nutritional Support Service (Memphis) can be seen in Dickerson et al. (2008).

Data from 19 ICU patients were made available. Based on the continuity of insulin infusion and the patients’ response to insulin, the cohort was classified into three categories. Seven of the patients were given continuous insulin infusion and their blood glucose response to insulin was as expected. Five patients needed only intermittent insulin infusions, and in the remaining 7 patients the blood glucose response was affected by factors other than insulin infusion (unnoted events that result in unreasonably high or low blood glucose values and abnormal response to insulin). Data from a typical patient belonging to the first group are shown in Figure 5.2. The leftmost column indicates the blood sugar level sampling time (the date and the exact time). In columns 2 and 3, the blood sugar level is provided in two different units.
The type and dose of insulin supplied to the patient are noted in the fourth column, while the fifth column contains information on the administered glucose (given intravenously or in meal form depending on the patient’s consciousness and condition at that time).

Figure 5.2 Data from Patient 1 who belongs to the first Group

5.3.2 FOPTD Implementation

All datasets were divided into training samples (first 80% of the data) and test samples (last 20% of the data). The test samples were then kept aside for model validation. The training sets were used to build the multi-input single output FOPTD model with oral glucose, intravenous glucose and insulin as the inputs and the deviation in blood glucose (BG) values as the output. The BG deviation value is then added to the blood glucose equilibrium value (the mean of the two previous BG values (Chase et al., 2008)) to get the model-predicted BG values (ĝ). The mean absolute error (MAE), i.e. the average over the samples of the absolute difference between the measured BG (g) and the model-predicted value (ĝ), is then calculated. The best prediction from the model is obtained with the set of parameter values for which the MAE is minimum. This set of parameters can be found using a genetic algorithm (GA) with the MAE as the objective function. In GA, a population of different sets of parameter values is initialized and updated using genetic operators such as crossover and mutation until the stopping criterion for the optimization is fulfilled. Finally, the GA tool provides the parameter values which give the smallest MAE. Bounds on the parameters are set using physical reasoning: for example, the time constants and time delays are nonnegative, the gains of the insulin inputs to blood glucose are negative and the gains of the glucose inputs to blood glucose are positive. After all model parameters were obtained, the model was applied to the whole dataset (training and test datasets) and both the predicted value (ĝ) and the actual value (g) of blood glucose were plotted versus time.
In such plots, the first 80% of the dataset shows the fitting ability of the model and the last 20% of the data samples shows its predictive ability.

5.4 RESULTS AND DISCUSSION

5.4.1 Patients with Continuous Insulin Infusion (Group 1)

The first pool of the cohort was given continuous insulin infusion and their blood glucose values increase/decrease with a decrease/increase in insulin. Validation results show that the FOPTD model captures the dynamics of all 7 patients with a test mean absolute error (MAE) of less than 2.1 mmol/L (see Table 5.1). Using some pre-selected datasets, Chase et al. (2008) report a maximum MAE of 2.9 mmol/L for patients with continuous insulin infusion, which is larger than the value obtained with the FOPTD model without any data pre-selection. The model fit and prediction results for the patients with the lowest and highest MAE are plotted in Fig 5.3 and Fig 5.4 respectively. In these figures, solid lines and dotted lines represent data fitting and model validation respectively. Table 5.1 shows that the MISO FOPTD model structure gives reasonably low MAE values not only for the training samples but also for the test samples of all patients who receive continuous insulin infusion. The small differences in MAE between the training and test sets indicate the stability and consistency of the model performance in handling both old and new data. As shown in Figures 5.3 and 5.4, the FOPTD model is able to track the patient response to a significant degree. These results compare very favorably with results obtained with first principles based models (Loganathan et al., 2008). However, none of the models was able to capture some of the highs and lows seen in the patient data. Unmeasured variables such as additional medications administered during the trial, stress level, existence of infection etc. may have contributed to such aberrations.
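The fit-and-validate procedure of Section 5.3.2 can be sketched in a few lines. This is a minimal illustration under several assumptions that are not part of the original study: inputs are sampled on a regular 1-minute grid, the delay is rounded to a whole number of samples, the equilibrium (baseline) value is omitted so that only the BG deviation is modeled, the data are synthetic, and a plain random search over bounded parameters is swapped in for the genetic algorithm. All function names are hypothetical.

```python
import math
import random


def foptd_response(u, K, tau, theta, dt=1.0):
    """Output of tau*dy/dt + y = K*u(t - theta), sampled every dt minutes.

    Exact zero-order-hold discretisation; the delay is rounded to a whole
    number of samples (an assumption of this sketch).
    """
    a = math.exp(-dt / tau)
    d = int(round(theta / dt))
    y = [0.0] * len(u)
    for k in range(1, len(u)):
        j = k - 1 - d
        u_delayed = u[j] if j >= 0 else 0.0
        y[k] = a * y[k - 1] + K * (1.0 - a) * u_delayed
    return y


def miso_foptd(inputs, params, dt=1.0):
    """Sum the FOPTD responses of all inputs (MISO structure of Fig. 5.1)."""
    total = [0.0] * len(inputs[0])
    for u, (K, tau, theta) in zip(inputs, params):
        for k, yk in enumerate(foptd_response(u, K, tau, theta, dt)):
            total[k] += yk
    return total


def mae(measured, predicted):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def random_search(inputs, target, bounds, n_iter=300, seed=0):
    """Stand-in for the GA: keep the best of n_iter random parameter sets."""
    rng = random.Random(seed)
    best_params, best_err = None, float("inf")
    for _ in range(n_iter):
        params = [tuple(rng.uniform(lo, hi) for lo, hi in b) for b in bounds]
        err = mae(target, miso_foptd(inputs, params))
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


# Synthetic example: one glucose input (positive gain), one insulin input
# (negative gain); the "measured" deviation comes from known parameters.
n = 300
glucose = [1.0 if 20 <= k < 80 else 0.0 for k in range(n)]
insulin = [1.0 if 120 <= k < 200 else 0.0 for k in range(n)]
true_params = [(2.0, 15.0, 5.0), (-1.5, 30.0, 10.0)]
deviation = miso_foptd([glucose, insulin], true_params)

split = int(0.8 * n)  # first 80% for training, last 20% for testing
bounds = [[(0.0, 5.0), (1.0, 60.0), (0.0, 20.0)],    # glucose: K > 0
          [(-5.0, 0.0), (1.0, 60.0), (0.0, 20.0)]]   # insulin: K < 0
fit, train_mae = random_search([glucose[:split], insulin[:split]],
                               deviation[:split], bounds)
predicted = miso_foptd([glucose, insulin], fit)
test_mae = mae(deviation[split:], predicted[split:])
print(fit, train_mae, test_mae)
```

The sign constraints on the gains are enforced through the search bounds, mirroring the physical reasoning used to bound the parameters; comparing `train_mae` with `test_mae` corresponds to the training versus test comparison reported in Table 5.1.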
Table 5.1 MAE values for training and test samples using data from patients with continuous insulin infusion

Patient   MAE training   MAE test
Pat 1     1.7648         1.8687
Pat 2     2.734          1.8208
Pat 22    1.2148         0.9306
Pat 34    1.43           1.6637
Pat 1B    2.6416         1.8084
Pat 30    1.8401         2.0282
Pat 25    1.1701         0.9754

[Figure: predicted and actual blood glucose (mmol/L) versus time (min) for Pat 22; training, validation and actual series; MAE = 0.9306]

Fig. 5.3. Results for the “best” patient data set using the FOPTD model

[Figure: predicted and actual blood glucose (mmol/L) versus time (min) for Pat 30; training, validation and actual series; MAE = 2.0282]

Fig. 5.4. Results for the “worst” patient data set using the FOPTD model

5.4.2 Patients with Intermittent Insulin Infusion (Group 2)

Patients with a relatively stable blood glucose response are supplied insulin intermittently (i.e. only when needed). The robustness of the modeling procedure can be better tested with such patients since the insulin input (perturbation) is less frequent than for patients in group 1. As stated earlier, five patients fall under this category. The results for the “best” of the 5 patients based on MAE are shown in Fig 5.5. Table 5.2 gives the MAE values for this pool of patients. The low MAE values shown in Table 5.2 support the robustness of the MISO FOPTD model in handling intermittent insulin infusion. In addition, Figure 5.5 portrays the ability of our model to capture the dynamics of patient blood glucose data and shows very good performance on the test samples as well.
Table 5.2 MAE values for training and test samples using patient data with intermittent insulin infusion

Patient   MAE training   MAE test
Pat 6     1.3549         0.9564
Pat 13    0.6401         0.5883
Pat 16    1.1165         0.4997
Pat 27    0.9564         0.9751
Pat 32    1.3771         0.8679

[Figure: predicted and actual blood glucose (mmol/L) versus time (min) for Pat 16; training, validation and actual series; MAE = 0.4997]

Fig 5.5. Results for the “best” patient data set using the FOPTD model (Intermittent Insulin Infusion).

5.4.3 Patients with Blood Glucose Response Affected by Other Factors (Group 3)

The usual practice in building and testing a proposed model (with a given structure) has been to select a consistent cohort from a large pool of patients and examine the data from this group (Chase et al., 2005). However, in real practice, the medical team often comes across extreme and challenging cases in the ICU. A model structure which is robust enough to handle multiple medical interventions and a broader range of patient dynamics is needed. In this study, efforts were made to include a group of 7 patients with complex blood glucose response. This pool includes patients who exhibit severe hyper/hypoglycemic tendencies as well as those who needed frequent medication for conditions such as allergies, stress, and cardiogenic shock. The MAE values for all cases belonging to this group are shown in Table 5.3 and the model performance results for the “best” case (the least MAE) from this pool are shown in Fig 5.6. As shown in Table 5.3, the identified FOPTD models give MAE values that remain reasonably low for most patients (test MAE between 0.91 and 3.21 mmol/L). The accuracy of the FOPTD model is seen not only in the fitting part but also in the validation part. The ability of the proposed MISO-FOPTD model structure to handle such datasets supports its robustness and reliability in modeling blood glucose dynamics, especially for ICU patients.
Table 5.3 MAE values for training and test samples using Group 3 patient data

Patient   MAE training   MAE test
Pat 12    1.4652         0.9606
Pat 14    1.9856         1.543
Pat 19    2.476          3.1174
Pat 21    1.7299         0.9052
Pat 23    1.1382         0.9993
Pat 24    1.7418         1.7784
Pat 26    1.962          3.205

[Figure: predicted and actual blood glucose (mmol/L) versus time (min) for Pat 26; training, validation and actual series; MAE = 3.205]

Fig. 5.6 Model performance on the “best” patient data from Group 3

5.4.4 Medication Effect

One of the key limitations of the existing models in the literature is their inability to account explicitly for the effects of medication and other medical conditions that may occur during trials (Lam et al., 2002; Chase et al., 2008). The general argument put forward is that the parameters in existing models would take care of such dynamics. Such claims have largely been unsubstantiated as yet. The predictions of such models can be poor and may miss a hypo/hyperglycemic episode (the former being more serious for patient health). In the proposed MISO-FOPTD structure, any medication effects or medical conditions can be included in a straightforward manner by treating them as additional inputs with a suitable model structure (e.g. FOPTD) relating them to the blood glucose output. Here, we have included medication data as an additional input to the FOPTD structure considered in Figure 5.1. The effect of medication is studied using the datasets of 2 patients for whom medication data were available. The simulation results are very promising and are shown in Figures 5.7, 5.8, 5.9 and 5.10.

[Figure: predicted and actual blood glucose (mmol/L) versus time (min, ×10^4) for Pat 27 without medication; training, validation and actual series; MAE = 0.9751]

Fig. 5.7 FOPTD prediction without medication for Patient 27

[Figure: predicted and actual blood glucose (mmol/L) versus time (min, ×10^4) for Pat 27 with medication; training, validation and actual series; MAE = 0.7818]

Fig. 5.8 FOPTD prediction with medication for Patient 27

[Figure: predicted and actual blood glucose (mmol/L) versus time (min, ×10^4) for Pat 34; training, validation and actual series; MAE = 1.6637]

Fig. 5.9 FOPTD prediction without medication for Patient 34

[Figure: predicted and actual blood glucose (mmol/L) versus time (min, ×10^4) for Pat 34 with medication; training, validation and actual series; MAE = 1.6529]

Fig. 5.10 FOPTD prediction with medication for Patient 34

From Fig. 5.7, it can be seen that the FOPTD model without the medication data does not capture the hyperglycemic episodes. However, in Fig. 5.8, which shows the results of the FOPTD model with medication, the hyperglycemic data are captured very well. It has to be noted that the model with medication predicts a non-existent hypoglycemia (at time ~7500 min). This would prompt the medical staff to decrease the insulin infusion, which in turn would increase blood glucose. Hence, in this case, the error is harmless to the patient. As can be seen from Fig. 5.9 and Fig. 5.10, the inclusion of the medication effect in the model allows it to capture the lows around time ~1000 min better than the model which does not take medication data into account. The same is observed at time ~8000 min and at time ~18000 min. In addition, the predictive ability of the model which takes medication into consideration is better than that of the model which does not. This is confirmed by the MAE values shown in Figures 5.7 through 5.10 (0.9751 without versus 0.7818 with medication for Patient 27; 1.6637 versus 1.6529 for Patient 34).
In Table 5.4, the parameter ranges obtained for the different patients are summarized. The values are reasonable but the range is rather wide (even taking patient-to-patient variability into account). More work needs to be done to verify this aspect of the problem. What we have shown here is that the FOPTD model produces acceptable results that match those obtained with first principles based models (see Loganathan et al., 2008).

Table 5.4 Range of the parameters for each patient group

Parameter   Group 1               Group 2               Group 3
K1          0.00005 to 190.78     0.00005 to 2.023      0.109 to 12.574
τ1 (min)    0.375 to 119          0.00005 to 2.333      0.00005 to 5.228
θ1 (min)    0.875 to 59           0.00005 to 7.225      0.00005 to 1.094
K2          0.00005 to 1.129      0.332 to 0.981        0.075 to 2.631
τ2 (min)    0.25 to 67.269        0.291 to 4.58         0.5 to 4.949
θ2 (min)    1.123 to 16.874       0.00005 to 0.961      0.00005 to 1.078
K3          -0.046 to -0.004      -0.177 to -0.00005    -0.063 to -0.00005
τ3 (min)    0.5 to 112            0.562 to 36.624       0.235 to 100
θ3 (min)    0.461 to 49           1.116 to 18.98        0.001 to 29.908

5.4.5 Analysis of Home Monitoring Diabetes Data

To check the robustness of the MISO FOPTD structure for blood glucose modeling, the methodology described above was applied to patient data that came from home monitoring. This is non-ICU data provided by Dr Tibor Deutsch (Applied Logic Laboratory, Hungary). The data made available to us covered 5 patients and consisted of three inputs (glucose, short acting insulin and intermediate acting insulin) and one output (blood glucose values recorded 6 times daily around the patients’ meal times over a period of 2 years). The results of model building and validation for the patients with the highest and the lowest MAE are given in Fig. 5.11 and Fig. 5.12 respectively. Table 5.5, Fig. 5.11 and Fig. 5.12 indicate that the MISO-FOPTD structure is a very promising tool for capturing blood glucose dynamics in home monitored diabetic patients as well.
Table 5.5 shows that the FOPTD model yields low MAE values not only on the training set but also on the test set for all datasets studied. The test MAE is close to or below the training MAE for every patient (it exceeds the training MAE only for Pat 10, by about 0.07), which indicates that the FOPTD performance is stable and consistent on both old and new data. However, as can be seen from Figures 5.13 and 5.14, the MISO-FOPTD model predictions have low correlations with the actual measured data. This is a point of concern and must be addressed in future work. The mismatch between model prediction and actual data may be due to other factors (such as illness, stress in daily life, etc.) which are not captured by the model. Table 5.6 summarizes the range of estimated model parameters for the home monitoring datasets. As can be seen from Table 5.6, all parameters obtained from this model lie within reasonable bounds: the time constants for all inputs are less than 90 minutes and the time delays do not exceed 30 minutes. These are considered to be reasonable and realistic values.

Table 5.5 MAE value for training and test samples using home monitoring data

Patient    MAE training    MAE test
Pat 10     1.164           1.2374
Pat 214    0.981           0.331
Pat 913    0.562           0.335
Pat 117    1.162           1.154
Pat 45     1.0436          0.947

[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min) for Pat 10; MAE = 1.2374]
Fig. 5.11 Results with the FOPTD model for the patient with the highest MAE (home monitoring dataset)

[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min) for Pat 214; MAE = 0.331]
Fig.
5.12 Results with the FOPTD model for the patient with the lowest MAE (home monitoring dataset)

Table 5.6 Range of estimated parameters for home monitoring data

Parameter     Home monitoring data
K3a*          -68.874 to -0.00005
τ3a (min)     0.0005 to 18.896
θ3a (min)     0.011 to 6
K2            0.00005
τ2 (min)      0.289 to 82.04
θ2 (min)      0.266 to 28.578
K3b*          -57.5 to -0.00005
τ3b (min)     0.023 to 18.919
θ3b (min)     1.697 to 6.29
* a and b refer to short and intermediate acting insulin

[Figure: scatter plots of predicted versus actual blood glucose (mg/dL), one panel each for Pat 214, Pat 913, Pat 10, Pat 117 and Pat 45; correlation values shown in the panels, in order of appearance: 0.1307, 0.3511, 0.4114, 0.2955, 0.1082]
Figure 5.13 Actual glucose and model fit for all 5 home monitoring patients

[Figure: scatter plots of predicted versus actual blood glucose (mg/dL), one panel each for Pat 10, Pat 913, Pat 214, Pat 45 and Pat 117; correlation values shown in the panels, in order of appearance: 0.136, 0.4674, 0.1435, 0.2719, 0.2452]
Figure 5.14 Actual glucose and model prediction for all 5 home monitoring patients

To summarize, in this chapter, the use of a MISO-FOPTD structure has been proposed and evaluated for modeling the blood glucose level of ICU patients. The FOPTD model was applied to data from 19 ICU patients and gave satisfactory results in fitting and predicting blood glucose values. In addition, its simplicity enables the FOPTD model to be easily extended when additional input variables become available. The FOPTD model was also applied to data collected from home monitored diabetes patients, and promising results were obtained.

Chapter 6
Conclusions and Recommendations

I do the very best I know how - the very best I can; and I mean to keep on doing so until the end
Abraham Lincoln (1809-1865), Former US President

6.1 Conclusions

In Chapter 3, the performance of a new classifier, DPCCM, was tested on two food product classification case studies and compared with that of well established classifiers such as LDA, CART, TreeNet and SVM. In the wine case study, DPCCM performance is comparable to LDA and better than the other classifiers. It is noteworthy that, in this case, performance improves as the order of the partial correlations increases. This indicates the presence of multivariate interactions and indirect relationships between the variables. In the cheese classification problem, DPCCM gives the best classification result, comparable to that of SVM. Also, the use of the original variables without projecting them onto a new dimensional space is a positive aspect of DPCCM.
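The way a partial correlation separates a direct relationship from an indirect one can be illustrated with a short sketch. The code below computes a first-order partial correlation from the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)); higher orders follow by applying the same formula recursively. This is not the DPCCM algorithm itself, only the underlying statistic, and the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1.0 - rxz ** 2) * (1.0 - ryz ** 2))


# Hypothetical data: x and y are both driven by z plus unrelated disturbances,
# so their apparent association is indirect (it runs through z).
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [zi + e for zi, e in zip(z, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]
y = [zi + e for zi, e in zip(z, [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5])]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"zero-order r = {r_xy:.3f}, partial r given z = {r_xy_given_z:.3f}")
```

In this constructed example the zero-order correlation between x and y is high, while the partial correlation given z is close to zero, which is the signature of an indirect relationship.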
The utility of DPCCM and other advanced machine learning tools such as CART and TreeNet in handling data from the medical domain and extracting information from them was checked by applying them to the WBC and WDBC datasets. DPCCM and TreeNet not only give the best overall accuracy on the test data set but are also able to classify all cancerous cells perfectly into their respective classes in the WBC dataset. This indicates the promising performance of DPCCM for medical applications. In addition, DPCCM appears to be the most suitable classifier for the WDBC case study, since it classifies all test samples perfectly into their corresponding classes. This study confirms that DPCCM is a strong classifier, since it performs well not only on food product datasets but also on biomedical datasets.

This thesis also examined the feasibility of DOA classification using DPCCM, CART, TreeNet, VPMCD, ANN and LDA/QDA. The comparison study was performed with the objective of determining the best classifier, i.e. the one most capable of correctly classifying new samples into their corresponding classes. According to our analysis, in terms of overall accuracy, CART and QDA are the best classifier models for DOA classification using cardiovascular features and AEP features respectively. The superiority of CART and QDA on the cardiovascular and AEP datasets respectively is confirmed even when the classifiers are built using a subset of features. Another interesting outcome of this study is the significant performance improvement obtained after applying a variable selection method to both the cardiovascular and the AEP feature datasets. This highlights the importance of variable selection in DOA analysis. Overall, the analysis indicated the lack of generality of the methods and highlighted the necessity of designing a case specific decision support system based on the best performing classifier and variable selection method.
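The distinction drawn above between overall accuracy and the correct classification of one particular class can be made concrete with a small sketch. The function name, class labels and counts below are hypothetical and do not reproduce any result from this thesis.

```python
from collections import Counter


def accuracy_report(true_labels, predicted_labels):
    """Return (overall accuracy, per-class accuracy) for two label sequences."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    overall = sum(correct.values()) / len(true_labels)
    # Per-class accuracy: fraction of each true class that was predicted correctly.
    per_class = {label: correct[label] / count for label, count in totals.items()}
    return overall, per_class


# Hypothetical test set: 4 malignant and 6 benign samples, one benign sample
# wrongly predicted as malignant.
true_labels = ["malignant"] * 4 + ["benign"] * 6
predicted_labels = ["malignant"] * 4 + ["benign"] * 5 + ["malignant"]

overall, per_class = accuracy_report(true_labels, predicted_labels)
print(f"overall = {overall:.3f}")
for label, acc in per_class.items():
    print(f"{label}: {acc:.3f}")
```

Two classifiers with the same overall accuracy can differ markedly in the per-class values, which is why the choice of classifier depends on which class must be identified most reliably.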
In this thesis, the performance of the classifiers was also examined using heart disease data sets. The presence of categorical data in this dataset precludes some classifiers because of their inability to handle categorical data. Therefore, this study was conducted using only TreeNet and CART. Based on our results on heart disease classification, it can be concluded that CART is the most suitable classifier for heart disease prediction using all attributes available in the heart disease dataset. CART is able to predict patients with heart disease more accurately than TreeNet. However, CART tends to perform more poorly than TreeNet in predicting the other class. Based on these results, CART is the recommended classifier for heart disease identification, since patients with heart disease must be identified correctly for medical treatment. On the other hand, if the objective is to identify healthy patients, TreeNet can be applied to the dataset.

A new First Order plus Time Delay (FOPTD) model for capturing the dynamics of blood glucose in ICU patients has also been proposed and evaluated in this thesis. The FOPTD model structure was applied to data sets obtained from ICU patients as well as from diabetes patients under home monitoring. The results show that the FOPTD model gives low MAE values and is able to predict the blood glucose values in the patient data. In addition, it is simple, and the model can be easily applied for controller tuning. It also allows additional phenomena, such as the effect of medication, to be included without any difficulty. When compared with the results reported in the literature, with 1 hour sampling frequency and the pre-processing of a consistent patient cohort from a larger pool, the FOPTD model gives comparably accurate results.

6.2 Recommendations

To date, machine learning has been widely used, especially to solve problems related to classification in medicine and food product quality.
However, to our knowledge, machine learning applications in other areas such as industrial process improvement and business have not been thoroughly explored. Studies by Filipic and Junkar (2000) and Chen and Hsiao (2008) have shown the use of a data mining approach in those two areas. Therefore, in future, one could attempt to apply the newly developed method (DPCCM) and other classifiers in those fields.

Hybrids of existing classifiers may be an interesting field to explore further. The idea is to use the first classifier for variable selection and the second classifier for solving the classification problem. Such a hybrid system will be beneficial in applications characterized by a large number of variables and a small number of samples. Filipic and Junkar (2000) and Sahan et al. (2007) have successfully applied this idea in improving k-nearest neighbor accuracy for WDBC datasets. However, to the best of our knowledge, this hybrid system has not been used in food identification problems.

Confidence interval calculation for classification problems is another aspect that could be studied further. A confidence interval reported alongside the classifier accuracy could give some information about the classifier's reliability. This could be very important when dealing with biomedical data.

Classification in dynamic mode could also be considered as future work. The idea is to update the classification model using new data samples so that the accuracy of the model can be maintained for a longer time period.

For the study of ICU patients' blood glucose data in Chapter 5, the hypocounts were taken every 4 hours. As a result, the dynamics of the patients' blood glucose are hard to capture accurately, since we do not know what happened between measurements. Therefore, more frequent sampling is needed to increase the model accuracy and to make the model suitable for tight glycaemic control.
Further study on a larger pool of patients with more frequent monitoring of blood glucose needs to be done to validate the structure of the model and to determine other inputs which may affect blood glucose values.

The possibility of integrating the FOPTD model with a first principles model is another issue that may be worth exploring. Since first principles models capture some specific phenomena, they are not amenable to expansion (via addition of new differential equations or new terms in existing equations) when new uncharacterized variables/phenomena are encountered. Therefore, one can think of developing hybrid models that use a first principles model to capture the essential phenomena (structural support to the modeling problem), augmented by FOPTD models for the new inputs. This hybrid method could be very promising for blood glucose modeling.

References

Ackerman, E., L. C. Gatewood, J. W. Rosevear and G. D. Molnar (1965). "Model Studies of Blood Glucose Regulation." The Bulletin of Mathematical Biophysics 27: 21.
Anonymous (2001, 28 November 2008). "Internet World Stats: Usage and Population Statistic." from .
Asuncion, A. and D. J. Newman (2007). UCI Machine Learning Repository. University of California, Department of Information and Computer Science, Irvine, CA.
Baba, K., R. Shibata and M. Sibuya (2004). "Partial correlation and conditional correlation as measures of conditional independence." Australian and New Zealand Journal of Statistics 46(4): 657-664.
Bagui, S. C., S. Bagui, K. Pal and N. R. Pal (2003). "Breast cancer detection using rank nearest neighbor classification rules." Pattern Recognition 36(1): 25-34.
Baldi, P., S. Brunak, Y. Chauvin, C. A. F. Andersen and H. Nielsen (2000). "Assessing the accuracy of prediction algorithms for classification: an overview." Bioinformatics 16(5): 412-424.
Beltrán, N. H., M. A. Duarte-Mermoud, M. A. Bustos, S. A. Salah, E. A. Loyola, A. I. Peña-Neira and J. W. Jalocha (2006). "Feature extraction and classification of Chilean wines." Journal of Food Engineering 75(1): 1-10.
Bergman, R. N., L. S. Phillips and C. Cobelli (1981). "Physiologic Evaluation of Factors Controlling Glucose Tolerance in Man." Journal of Clinical Investigation 68: 1456.
Berrueta, L. A., R. M. Alonso-Salces and K. Héberger (2007). "Supervised pattern recognition in food analysis." Journal of Chromatography A 1158(1-2): 196-214.
Bertolini, M., A. Rizzi and M. Bevilacqua (2007). "An alternative approach to HACCP system Implementation." Journal of Food Engineering 79: 1322-1328.
Bevilacqua, M., M. Braglia and R. Montanari (2003). "The classification and regression tree approach to pump failure rate analysis." Reliability Engineering & System Safety 79(1): 59-67.
Bevilacqua, M., F. E. Ciarapica and G. Giacchetta (2008). "Industrial and occupational ergonomics in the petrochemical process industry: A regression trees approach." Accident Analysis & Prevention 40(4): 1468-1479.
Bolie, V. W. (1961). "Coefficients of Normal Blood Glucose Regulation." Journal of Applied Physiology 16: 783.
Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone (1983). Classification and Regression Trees. Monterey, CA, Wadsworth International Group.
Brown, D. J. (2007). "Using a global VNIR soil-spectral library for local soil characterization and landscape modeling in a 2nd-order Uganda watershed." Geoderma 140(4): 444-453.
Canu, S., Y. Grandvalet, V. Guigue and A. Rakotomamonjy (2005). SVM and Kernel Methods Matlab Toolbox. Perception Systèmes et Information.
Chase, J. G., X.-W. Wong, I. Singh-Levett, L. J. Hollingsworth, C. E. Hann, G. M. Shaw, T. Lotz and J. Lin (2008). "Simulation and initial proof-of-concept validation of a glycaemic regulation algorithm in critical care." Control Engineering Practice 16(3): 271-285.
Chase, J. G., G. M. Shaw, J. Lin, D. C. V., C. Hann, T. Lotz, G. C. Wake and B. Broughton (2005). "Targeted Glycemic Reduction in Critical Care Using Closed-Loop Control." Diabetes Technology and Therapeutics 7: 274.
Chen, L.-H. and H.-D. Hsiao (2008). "Feature selection to diagnose a business crisis by using a real GA-based support vector machine: An empirical study." Expert Systems with Applications 35(3): 1145-1155.
Cheng, H. D., X. J. Shi, R. Min, L. M. Hu, X. P. Cai and H. N. Du (2006). "Approaches for automated detection and classification of masses in mammograms." Pattern Recognition 39(4): 646-668.
Chiang, L. H. and R. D. Braatz (2003). "Process monitoring using causal map and multivariate statistics: fault detection and identification." Chemometrics and Intelligent Laboratory Systems 65(2): 159-178.
Cobelli, C., G. Federspil, G. Pacini, A. Salvan and C. Scandellari (1982). "An integrated mathematical model of the dynamics of blood glucose and its hormonal control." Mathematical Biosciences 58(1): 27-60.
Cobelli, C. and A. Mari (1983). "Validation of mathematical models of complex endocrine-metabolic systems. A case study on a model of glucose regulation." Medical and Biological Engineering and Computing 21(4): 390-399.
Cover, T. and P. Hart (1967). "Nearest neighbor pattern classification." Information Theory, IEEE Transactions on 13(1): 21-27.
Dahl, F. A. (2007). "Convergence of random k-nearest-neighbour imputation." Computational Statistics & Data Analysis 51(12): 5913-5917.
de la Fuente, A., N. Bing, I. Hoeschele and P. Mendes (2004). "Discovery of meaningful associations in genomic data using partial correlation coefficients." Bioinformatics 20(18): 3565-3574.
Deconinck, E., T. Hancock, D. Coomans, D. L. Massart and Y. V. Heyden (2005). "Classification of drugs in absorption classes using the classification and regression trees (CART) methodology." Journal of Pharmaceutical and Biomedical Analysis 39(1-2): 91-103.
Dickerson, R. N., C. E. Swiggart, L. M. Morgan, G. O. Maish III, M. A. Croce, G. Minard and R. O. Brown (2008). "Safety and efficacy of a graduated intravenous insulin infusion protocol in critically ill trauma patients receiving specialized nutritional support." Nutrition 24(6): 536-545.
Dr. Earl H. Tilford, J. (2000). The Information Revolution and National Security. T. E. Copeland.
Duda, R. O., P. E. Hart and D. G. Stork (2000). Pattern Classification. New York, John Wiley.
Ebrahimi, N., E. Maasoumi and E. S. Soofi (1999). "Ordering univariate distributions by entropy and variance." Journal of Econometrics 90(2): 317-336.
Eisen, M. B., P. T. Spellman, P. O. Brown and D. Botstein (1998). "Cluster analysis and display of genome-wide expression patterns." Proceedings of the National Academy of Sciences of the United States of America 95(25): 14863-14868.
Elkfafi, M., J. S. Shieh, D. A. Linkens and J. E. Peacock (1998). "Fuzzy logic for auditory evoked response monitoring and control of depth of anaesthesia." Fuzzy Sets and Systems 100(1-3): 29-43.
Evans, D. G., L. K. Everis and G. D. Betts (2004). "Use of survival analysis and Classification and Regression Trees to model the growth/no growth boundary of spoilage yeasts as affected by alcohol, pH, sucrose, sorbate and temperature." International Journal of Food Microbiology 92(1): 55-67.
Fedele, G. (2008). "A new method to estimate a first-order plus time delay model from step response." Journal of the Franklin Institute In Press, Corrected Proof.
Filipic, B. and M. Junkar (2000). "Using inductive machine learning to support decision making in machining processes." Computers in Industry 43(1): 31-41.
Fisher, R. A. (1936). "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7: 179-188.
Flores, M. J., J. A. Gámez and J. L. Mateo (2008). "Mining the ESROM: A study of breeding value classification in Manchego sheep by means of attribute selection and construction." Computers and Electronics in Agriculture 60(2): 167-177.
Friedman, J. H. (1999). "Greedy Function Approximation: A Gradient Boosting Machine; technical report on Treenet."
Furey, T. S., N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer and D. Haussler (2000). "Support vector machine classification and validation of cancer tissue samples using microarray expression data." Bioinformatics 16(10): 906-914.
Gascoigne, B. (2008). "History of The Industrial Revolution." from http://www.historyworld.net/wrldhis/PlainTextHistories.asp?historyid=aa37.
Granitto, P. M., F. Gasperi, F. Biasioli, E. Trainotti and C. Furlanello (2007). "Modern data mining tools in descriptive sensory analysis: A case study with a Random forest approach." Food Quality and Preference 18(4): 681-689.
Guyon, I., J. Weston, S. Barnhill and V. Vapnik (2002). "Gene Selection for Cancer Classification using Support Vector Machines." Machine Learning 46(1): 389-422.
Halsall, P. (1997). Internet Modern History Sourcebook.
Hong, J.-H. and S.-B. Cho (2008). "A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification." Neurocomputing 71(16-18): 3275-3281.
Jagannathan, G. and R. N. Wright (2008). "Privacy-preserving imputation of missing data." Data & Knowledge Engineering 65(1): 40-56.
Jerez-Aragonés, J. M., J. A. Gómez-Ruiz, G. Ramos-Jiménez, J. Muñoz-Pérez and E. Alba-Conejo (2003). "A combined neural network and decision trees model for prognosis of breast cancer relapse." Artificial Intelligence in Medicine 27(1): 45-63.
Jiann Shing, S., D. A. Linkens and J. E. Peacock (1999). "Hierarchical rule-based and self-organizing fuzzy logic control for depth of anaesthesia." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 29(1): 98-109.
Kelly, J. L., I. B. Hirsch and A. P. Furnary (2006). "Implementing an Intravenous Insulin Protocol in Your Practice: Practical Advice to Overcome Clinical, Administrative, and Financial Barriers." Seminars in Thoracic and Cardiovascular Surgery 18(4): 346-358.
Kelly, M.
(2001). "Overview of the Industrial Revolution - Industrial Revolution."
Kitabchi, A. E., A. X. Freire and G. E. Umpierrez (2008). "Evidence for strict inpatient blood glucose control: time to revise glycemic goals in hospitalized patients." Metabolism 57(1): 116-120.
Kojima, T., K. Yoshikawa, S. Saga, T. Yamada, S. Kure, T. Matsui, T. Uemura, Y. Fujimitsu, M. Sakakibara, Y. Kodera and H. Kojima (2008). "Detection of Elevated Proteins in Peritoneal Dissemination of Gastric Cancer by Analyzing Mass Spectra Data of Serum Proteins." Journal of Surgical Research In Press, Uncorrected Proof.
Kressel, U. (1999). Pairwise classification and support vector machines. Advances in Kernel Methods: Support Vector Learning. MA, MIT Press.
Kurt, I., M. Ture and A. T. Kurum (2008). "Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease." Expert Systems with Applications 34(1): 366-374.
Lam, Z. H., K. S. Hwang, J. Y. Lee, J. G. Chase and G. C. Wake (2002). "Active insulin infusion using optimal and derivative-weighted control." Medical Engineering & Physics 24(10): 663-672.
Linkens, D. A. and L. Vefghi (1997). "Recognition of patient anaesthetic levels: neural network systems, principal components analysis, and canonical discriminant variates." Artificial Intelligence in Medicine 11(2): 155-173.
Little, R. J. A. and D. B. Rubin (1986). Statistical Analysis with Missing Data. New York, John Wiley.
Liu, K.-H. and D.-S. Huang (2008). "Cancer classification using Rotation Forest." Computers in Biology and Medicine 38(5): 601-610.
Liu, W. Z., A. P. White, M. T. Hallissey and J. W. L. Fielding (1996). "Machine learning techniques in early screening for gastric and oesophageal cancer." Artificial Intelligence in Medicine 8(4): 327-341.
Loganathan, P., S. Lakshminarayanan and R. G. Pandu (2008). Blood Glucose Patient Modelling using First Principle Model. Singapore, ChBE NUS.
Magoulas, G. D. and A. Prentza (2001). "Machine learning in medical applications." Lecture Notes in Artificial Intelligence 2049: 300-307.
Mahesh, V. and S. Ramakrishnan (2007). "Assessment and classification of normal and restrictive respiratory conditions through pulmonary function test and neural network." Journal of Medical Engineering & Technology 31(4): 300-304.
Mahfouf, M. (2006). Intelligent systems modeling and decision support in bioengineering. Massachusetts, Artech House.
MATLAB (2005). 7.0.4 (Release 14).
McCabe, G. P. (1984). "Principal Variables." Technometrics 26: 137-144.
Moon, H., H. Ahn, R. L. Kodell, S. Baek, C.-J. Lin and J. J. Chen (2007). "Ensemble methods for classification of patients for personalized medicine with high-dimensional data." Artificial Intelligence in Medicine 41(3): 197-207.
Nayak, A. and R. J. Roy (1998). "Anesthesia control using midlatency auditory evoked potentials." IEEE Transactions on Biomedical Engineering 45(4): 409-421.
Nunes, C. S., M. Mahfouf, D. A. Linkens and J. E. Peacock (2005). "Modelling and multivariable control in anaesthesia using neural-fuzzy paradigms: Part I. Classification of depth of anaesthesia and development of a patient model." Artificial Intelligence in Medicine 35(3): 195-206.
Ogunnaike, B. A. and K. Mukati (2006). "An alternative structure for next generation regulatory controllers: Part I: Basic theory for design, development and implementation." Journal of Process Control 16(5): 499-509.
Oxford, U. (2005). Oxford Advanced Learner's Dictionary. S. Wehmeier, Oxford University Press.
Podgorelec, V., P. Kokol, M. M. Stiglic, M. Hericko and I. Rozman (2005). "Knowledge discovery with classification rules in a cardiovascular dataset." Computer Methods and Programs in Biomedicine 80(Supplement 1): S39-S49.
Polat, K. and S. Günes (2008). "Principles component analysis, fuzzy weighting pre-processing and artificial immune recognition system based diagnostic system for diagnosis of lung cancer."
Expert Systems with Applications 34(1): 214-221.
Puckett, W. R. (1992). Dynamic modelling of Diabetes Mellitus. Department of Chemical Engineering, University of Wisconsin-Madison. PhD.
Puckett, W. R. and E. N. Lightfoot (1995). "A model for multiple subcutaneous insulin injections developed from individual diabetic patient data." Am J Physiol Endocrinol Metab 269(6): E1115-1124.
Raghuraj Rao, K. and S. Lakshminarayanan (2007a). "Partial correlation based variable selection approach for multivariate data classification methods." Chemometrics and Intelligent Laboratory Systems 86(1): 68-81.
Raghuraj Rao, K. and S. Lakshminarayanan (2007b). "VPMCD: Variable interaction modeling approach for class discrimination in biological systems." FEBS Letters 581(5): 826-830.
Raghuraj Rao, K. and S. Lakshminarayanan (2007c). "Variable interaction network based variable selection for multivariate calibration." Analytica Chimica Acta 599(1): 24-35.
Raj Kiran, N. and V. Ravi (2008). "Software reliability prediction by soft computing techniques." Journal of Systems and Software 81(4): 576-583.
Ramprasad, Y. (2004). Model Based Controllers for Blood Glucose Regulation in Type 1 Diabetics. Chemical and Biomolecular Engineering. Singapore, National University of Singapore. M.Eng: 92.
Razi, M. A. and K. Athappilly (2005). "A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models." Expert Systems with Applications 29(1): 65-74.
Reilly, M. (1993). "Data-analysis using hot deck multiple imputation." The Statistician 42(3): 307-313.
Roggo, Y., P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond and N. Jent (2007). "A review of near infrared spectroscopy and chemometrics in pharmaceutical technologies." Journal of Pharmaceutical and Biomedical Analysis 44(3): 683-700.
Rousu, J., L. Flander, M. Suutarinen, K. Autio, P. Kontkanen and A. Rantanen (2003). "Novel computational tools in bakery process data analysis: a comparative study." Journal of Food Engineering 57(1): 45-56.
Sahan, S., K. Polat, H. Kodaz and S. Günes (2007). "A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis." Computers in Biology and Medicine 37(3): 415-423.
Salford Systems (2007b). CART. San Diego, California.
Salford Systems (2007a). TreeNet. San Diego, California, Salford Systems.
Saraiva, P. M. and G. Stephanopoulos (1992). "Continuous process improvement through inductive and analogical learning." AIChE Journal 38(2): 161-183.
Sharma, A. and K. K. Paliwal (2008). "Cancer classification by gradient LDA technique using microarray gene expression data." Data & Knowledge Engineering 66(2): 338-347.
Sokal, R. and F. J. Rohlf (1995). Biometry: The principles and practice of statistics in biological research. New York, W. H. Freeman & Co.
Spurgeon, S. E. F., Y.-C. Hsieh, A. Rivadinera, T. M. Beer, M. Mori and M. Garzotto (2006). "Classification and Regression Tree Analysis for the Prediction of Aggressive Prostate Cancer on Biopsy." The Journal of Urology 175(3): 918-922.
Statnikov, A., C. F. Aliferis, I. Tsamardinos, D. Hardin and S. Levy (2005). "A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis." Bioinformatics 21(5): 631-643.
Steuer, R., J. Kurths, O. Fiehn and W. Weckwerth (2003). "Observing and interpreting correlations in metabolomic networks." Bioinformatics 19(8): 1019-1026.
Tamaki, M., T. Shimizu, A. Kanazawa, Y. Tamura, A. Hanzawa, C. Ebato, C. Itou, E. Yasunari, H. Sanke, H. Abe, J. Kawai, K. Okayama, K. Matsumoto, K. Komiya, M. Kawaguchi, N. Inagaki, T. Watanabe, Y. Kanazawa, T. Hirose, R. Kawamori and H. Watada (2008). "Efficacy and safety of modified Yale insulin infusion protocol in Japanese diabetic patients after open-heart surgery." Diabetes Research and Clinical Practice 81(3): 296-302.
Taylor, B. E., M. E.
Schallom, C. S. Sona, T. G. Buchman, W. A. Boyle, J. E. Mazuski, D. E. Schuerer, J. M. Thomas, C. Kaiser, W. Y. Huey, M. R. Ward, J. E. Zack and C. M. Coopersmith (2006). "Efficacy and Safety of an Insulin Infusion Protocol in a Surgical ICU." Journal of the American College of Surgeons 202(1): 1-9.
Timm, N. H. (2002). Applied Multivariate Analysis. New York, Springer.
Tittonell, P., K. D. Shepherd, B. Vanlauwe and K. E. Giller (2008). "Unravelling the effects of soil and crop management on maize productivity in smallholder agricultural systems of western Kenya--An application of classification and regression tree analysis." Agriculture, Ecosystems & Environment 123(1-3): 137-150.
Toher, D., G. Downey and T. B. Murphy (2007). "A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies." Chemometrics and Intelligent Laboratory Systems 89(2): 102-115.
Tominaga, Y. (1999). "Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANNs, and k-NN." Chemometrics and Intelligent Laboratory Systems 49(1): 105-115.
Umpierrez, G. E., A. Palacio and D. Smiley (2007). "Sliding Scale Insulin Use: Myth or Insanity?" The American Journal of Medicine 120(7): 563-567.
Van den Berghe, G. (2003). "Insulin therapy for the critically ill patient." Clinical Cornerstone 5(2): 56-63.
Vanhorebeek, I., C. Ingels and G. Van den Berghe (2006). "Intensive Insulin Therapy in High-Risk Cardiac Surgery Patients: Evidence from the Leuven Randomized Study." Seminars in Thoracic and Cardiovascular Surgery 18(4): 309-316.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, Springer.
Wang, J., K. N. Plataniotis, J. Lu and A. N. Venetsanopoulos (2008). "Kernel quadratic discriminant analysis for small sample size problem." Pattern Recognition 41(5): 1528-1538.
Wolberg, W. H. and O. L. Mangasarian (1990). "Multi surface method of pattern separation for medical diagnosis applied to breast cytology." Proc. Natl. Acad. Sci 87: 9193-9196.
Zhang, X., X. Song, H. Wang and H. Zhang (2008). "Sequential local least squares imputation estimating missing value of microarray data." Computers in Biology and Medicine 38(10): 1112-1120.
Zhou, Z.-H., Y. Jiang, Y.-B. Yang and S.-F. Chen (2002). "Lung cancer cell identification based on artificial neural network ensembles." Artificial Intelligence in Medicine 24(1): 25-36.

APPENDIX A. CV of the Author

MELISSA ANGELINE SETIAWAN
melissa.angeline.84@gmail.com
93372640
Blk 301 Bukit Batok Street 31 #04-05, S 650301
Female, 24 • Christian, Chinese • Indonesian, in process for Singapore PR

QUALIFICATION
• Able to work in a target-oriented, performance-centric environment
• Possesses critical thinking and problem solving skills; quick learner
• Independent as well as a team player, organized and self-motivated

Academic Qualification

Master of Engineering (thesis under examination)
January 2007 - December 2008, National University of Singapore (NUS), Singapore
Department of Chemical and Biomolecular Engineering
Research Topic: Machine Learning and Data Analysis
• Proposed a new data driven model to design an insulin advisory system for ICU patients at NUH
• Data mining methods for food product classification
• Classification technique development to detect the depth of anesthesia (DOA) for patients who undergo surgery
• Microarray data analysis
• Generated rules to differentiate nearly-identical cancer cells
• Development of a MATLAB toolbox for classification problems
• Information extraction from historical data for process optimization
GPA: 4.25 (5-point scale)

Bachelor of Engineering
2002-2006, Bandung Institute of Technology (ITB), Bandung
Industrial Engineering Faculty, Chemical Engineering Department
Major in Chemical Engineering, Food Technology and Bioprocess Engineering
Research Topic: Edible Oil from Indonesian Brassica species
• Extracted oil from Brassica seeds
• Performed oil characterization analysis
Project Design: Refined Bleached Deodorized Palm Kernel Oil (RBD PKO) production from Palm Kernel
• Designed the most efficient process to convert raw material to product
• Designed the equipment involved in the production line
• Designed all utility systems and required waste treatment processes
• Completed the economic analysis of the plant's profitability
GPA: 3.83 (4-point scale)

Working Experience
• Teaching assistant (1 semester: 2007/2008), Dept of Chemical and Biomolecular Engineering, NUS: tutor for one undergraduate module (Process Design 1)
• Teaching assistant (3 semesters: 2003/2004 and 2004/2005), Chemical Engineering Department: tutor for three undergraduate modules (Chemical Engineering Mathematics, Laboratory Technology for Chemical Engineering, and Transport Phenomena)
• Internship for 1.5 months at Wall's Ice Cream Factory, PT. Unilever Tbk, Cikarang, Indonesia: carried out one project on reducing paper wrapping waste in the production line (optimization of raw material used in the production line and waste reduction)

Research Experience
• Data analysis, programming and modeling related to microarray data and medical data sets
• Machine learning applications for industrial process improvement, both batch and continuous
• Characterizing oil extracted from Indonesian Brassica species (edibility). The outcome of this research is that the oil cannot be used as edible oil, but it can be processed further to create a biofuel.

Computational Skills
• Bioinformatics: real time microarray data analysis
• Systems Biology: survival analysis using clinical diagnostic data and Depth of Anesthesia prediction in surgery
• Software packages: worked intensively with MATLAB, CART and TreeNet, SignalMap, Affymetrix, HYSYS, and MS Office
Participation on Seminar and Training • • • • • Microteaching and Tutoring skills, July 2007 Communication and Presentation skill workshop, NUS, 2007 MATLAB and Simulink workshop, BTI, 2007 Industrial Management and Business, PT. Sampoerna Tbk, Surabaya, May 2006 Modern Biotechnology at ATMAJAYA University, Jakarta Awards and Achievement • • • • • Obtained the AUN-SEED Net Scholarship to pursue Master’s degree in Chemical Engineering at NUS. Graduated as Cumlaude from Chemical Engineering Department, ITB: only 20% of chemical engineering students fulfilled Cumlaude criteria. Member of Indonesia Sampoerna Best Student 2006: only 80 students are nominated from all Indonesian universities. First rank at Jakarta Senior High School Chemistry Competition, Dinas Pendidikan Menengah dan Tinggi (equivalent to MOE in Singapore). Semifinalist at Junior High School Mathematics Competition. Organizational experience and leadership • • • • Member of Church Music Ministry at GKI GunSa (period 1999-now) and BBPC (2007-2009) Chairman of Easter Commitee 2001 and vice chairman of easter commitee 2008 Organizing Committee Member of 3rd Regional Symposium on Membrane Science and Technology: ITB Organizing Committee Member of National Plant Design Competition, ITB 2005 114 Publications/Presentations • Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Partial Correlation Metric based Classifier for Food Product Characterization”, Accepted for publication in Journal of Food Engineering (June, 2008) • “Machine Learning in Medicine” presented at the AUN/SEED-Net field-wise seminar in Chemical Engineering, Thailand. • “Decision Rules for Cancer Tumor Identification” presented at the Graduate Student Symposium in Biological and Chemical Engineering – 2007 • Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Performance of data mining tools in classifying depth of anesthesia”, under review for publication in Artificial Intelligence in Medicine. 
• Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Variable Interaction Structure Based Machine Learning Technique for Cancer Tumor Classification”, will be presented on International conference on Biomedical Engineering 2008 • Setiawan Melissa, A., Wulan Sari and Tatang H. Soerawidjaja, “Oil and fat from Indonesian Brassica” in Indonesian language Other • • Excellent in English and Indonesian, both oral and written Excellent in Piano, organ and keyboard Availability: immediate References: Dr. S. Lakshminarayanan (M.Eng Supervisor at NUS) Asst. Prof., Dept. of Chemical and Biomolecular Engg. National University of Singapore, Singapore. Tel: +65-65168484, Email : chels@nus.edu.sg Rao Raghuraj Research fellow Singapore-Delft Water Alliance NUS, Singapore Tel.: +65 6516 8304, email: cverr@nus.edu.sg Dr. Tatang Hernas S. (B. Eng Supervisor at ITB) Dept. of Chemical Engineering Bandung Institute of Technology, Indonesia Tel: +628122349474 115 [...]... dealing with data complexity The success of data analysis and modeling efforts is highly dependent on the data set itself Poor quality and/ or quantity of data as well as missing data can make data analysis even harder Some biological and medical datasets are too huge in size Therefore, it is a bit too hard for some computers to handle this kind of dataset owing to limitations of hardware and software... huge “need” for information among people and provide solid proof that our society is transforming into an “information based society” As a result of this transformation, data and information have a great effect in decision making in various spheres of human activity To satiate this hunger for accurate and quick information, methodologies that can generate accurate information from raw data must be... 
