Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data

Kay I Penny
Centre for Mathematics and Statistics, Napier University, Craiglockhart Campus, Edinburgh, EH14 1DJ
k.penny@napier.ac.uk

Thomas Chesney
Nottingham University Business School, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB
Thomas.Chesney@nottingham.ac.uk

Abstract. Methods for analysing trauma injury data with missing values, collected at a UK hospital, are reported. One measure of injury severity, the Glasgow coma score, which is known to be associated with patient death, is missing for 12% of patients in the dataset. In order to include these 12% of patients in the analysis, three different data imputation techniques are used to estimate the missing values. The imputed data sets are analysed by an artificial neural network and logistic regression, and their results compared in terms of sensitivity, specificity, positive predictive value and negative predictive value.

Keywords. Data mining, missing data imputation, trauma injury.

1. Introduction

Trauma injury is the most common cause of loss of life to those under forty [1]. In 1991 a trauma system was put in place at the North Staffordshire Hospital (NSH) in Stoke-on-Trent in the U.K. It records injury details including Injury Severity Score (ISS) [2], Abbreviated Injury Scores (AIS) [3], the Glasgow Coma Score (GCS) [4], the patient's sex and age, management and interventions, and the outcome of the treatment, including whether the patient lived or died during their hospital stay. North Staffordshire Hospital is a major trauma centre in the area and receives patient referrals from surrounding hospitals. Oakley [5] analysed data for only the most severely injured patients admitted between 1992 and 1998, and found that determinants of mortality for this subset of patients included age, head AIS, chest AIS, abdominal AIS, external injury AIS, mechanism of injury, primary receiving hospital and calendar year of admission.
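The four measures used to compare the models (sensitivity, specificity, positive predictive value and negative predictive value) all derive from the counts of correct and incorrect predictions of death. The following is a minimal sketch of that calculation, not the authors' code: the function name `classification_measures`, the coding of outcomes as 1 = died and 0 = survived, and the example data are assumptions made purely for illustration.

```python
from typing import Dict, Optional, Sequence


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_measures(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Compute the four measures from binary labels.

    Assumed coding (hypothetical): 1 = patient died, 0 = patient survived.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths predicted as deaths
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # deaths predicted as survivals
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survivals predicted as deaths
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # survivals predicted as survivals

    return {
        "sensitivity": _ratio(tp, tp + fn),  # proportion of deaths detected
        "specificity": _ratio(tn, tn + fp),  # proportion of survivals detected
        "ppv": _ratio(tp, tp + fp),          # predicted deaths that were deaths
        "npv": _ratio(tn, tn + fn),          # predicted survivals that were survivals
    }


if __name__ == "__main__":
    # Made-up outcomes for ten patients; these are not data from the study.
    observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    for name, value in classification_measures(observed, predicted).items():
        print(name, value)
```

Returning `None` for an undefined ratio (for example, sensitivity when no patient died) is one possible design choice; it avoids reporting a misleading zero.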
Further analysis includes a comparison of several artificial neural network (ANN) models and logistic regression (LR) to predict death during hospital stay [6]. Factors found to be important in the modelling were age, mechanism of injury, whether the patient was referred from another hospital, and several injury severity scores including GCS motor and GCS verbal scores. Missing data do not always cause concern when using data mining techniques; however, these data have 12% of GCS scores missing. Applying the standard practice of complete-case analysis therefore means that 12% of the dataset is excluded from the modelling, since these patients do not have recorded values for the three GCS scores. Exclusion of this subset of patients may lead to bias in the results, as patients who have not had their GCS scores recorded may not be a representative sample of the population of trauma injury patients. For example, these patients may tend to be more seriously injured than the typical patient, so that the scores were not recorded due to lack of time, or they may have presented with a different type or combination of injuries. The aim of this research is to investigate the accuracy of modelling patient death following trauma injury in conjunction with missing value imputation.

2. Methods

The study involves trauma audit data from patients treated at the North Staffordshire Hospital from 1993

[...]th Int. Conf. Information Technology Interfaces ITI, June, Cavtat, Croatia

Align your goals with your values

By: Joe Tye

“The only natural law I’ve witnessed in three decades of observing successful people’s efforts to become more successful is this: People will do something – including changing their behavior – only if it can be demonstrated that doing so is in their own best interests as defined by their own values.”
Marshall Goldsmith: What Got You Here Won’t Get You There

The reason people so often get on the health improvement roller coaster – lose the weight then gain it back, quit smoking then start again, go to the gym then let your membership lapse – is because they’re doing it for superficial reasons: to impress other people, to get a nagging spouse off their back, to get the employee health insurance discount, etc. But as Marshall Goldsmith points out, people will only sustain these behavior changes if they are in line with their personal values.

Health Solutions is a company based in Cedar Rapids, Iowa, that provides health coaching for employees of corporate clients. Because they understand the “natural laws” described by Goldsmith, they are working with Values Coach to incorporate key elements of our course on The Twelve Core Action Values into their coaching programs.

Core Action Value #1 in the course is Authenticity. If you were to ask overweight smokers if smoking and being obese reflected their authentic best selves, the answer would almost always be a resounding NO. As they strive to become more authentic, quitting smoking and losing weight will happen almost spontaneously, as a by-product of living their core values.

The very best time to think about and commit to your core values is when your world has turned upside down; it is in those desperate times that you are most likely to make the behavioral changes that will stay
with you for the rest of your life. Go to any fitness center during normal working hours and many if not most of the people you see there will be those who are out of work, working off their frustrations. Some of them will maintain the fitness habits they create long after they have found another job.

Open Access
Available online http://ccforum.com/content/12/6/R145
Vol 12 No 6

Research
End-expiratory lung volume during mechanical ventilation: a comparison with reference values and the effect of positive end-expiratory pressure in intensive care unit patients with different lung conditions

Ido G Bikker, Jasper van Bommel, Dinis Reis Miranda, Jan Bakker and Diederik Gommers
Department of Intensive Care Medicine, Erasmus MC, 's Gravendijkwal 230, 3015 CE Rotterdam, The Netherlands
Corresponding author: Diederik Gommers, d.gommers@erasmusmc.nl

Received: 25 Jun 2008 Revisions requested: 31 Jul 2008 Revisions received: 30 Oct 2008 Accepted: 20 Nov 2008 Published: 20 Nov 2008
Critical Care 2008, 12:R145 (doi:10.1186/cc7125)
This article is online at: http://ccforum.com/content/12/6/R145
© 2008 Bikker et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Introduction
Functional residual capacity (FRC) reference values are obtained from spontaneous breathing patients, and are measured in the sitting or standing position. During mechanical ventilation FRC is determined by the level of positive end-expiratory pressure (PEEP), and it is therefore better to speak of end-expiratory lung volume. Application of higher levels of PEEP leads to increased end-expiratory lung volume as a result of recruitment or further distention of already ventilated alveoli.
The aim of this study was to measure end-expiratory lung volume in mechanically ventilated intensive care unit (ICU) patients with different types of lung pathology at different PEEP levels, and to compare them with predicted sitting FRC values, arterial oxygenation, and compliance values.

Methods
End-expiratory lung volume measurements were performed at PEEP levels reduced sequentially (15, 10 and then 5 cmH2O) in 45 mechanically ventilated patients divided into three groups according to pulmonary condition: normal lungs (group N), primary lung disorder (group P), and secondary lung disorder (group S).

Results
In all three groups, end-expiratory lung volume decreased significantly (P < 0.001) while PEEP decreased from 15 to 5 cmH2O, whereas the ratio of arterial oxygen tension to inspired oxygen fraction did not change. At 5 cmH2O PEEP, end-expiratory lung volume was 31, 20, and 17 ml/kg predicted body weight in groups N, P, and S, respectively. These measured values were only 66%, 42%, and 34% of the predicted sitting FRC. A correlation between change in end-expiratory lung volume and change in dynamic compliance was found in group S (P < 0.001; R² = 0.52), but not in the other groups.

Conclusions
End-expiratory lung volume measured at 5 cmH2O PEEP was markedly lower than predicted sitting FRC values in all groups. Only in patients with secondary lung disorders were PEEP-induced changes in end-expiratory lung volume the result of derecruitment. In combination with compliance, end-expiratory lung volume can provide additional information to optimize the ventilator settings.

Introduction

Monitoring end-expiratory lung volume (EELV) might be a valuable tool to optimize respiratory settings in mechanical ventilation [1]. However, determining EELV at the bedside in critically ill patients is not without difficulties. EELV can be measured using computed tomography [2,3], but this technique is not available for routine application at the bedside.
Traditionally, EELV measurement techniques are based on dilution of tracer gases, such as sulfur hexafluoride washout [4], closed circuit helium dilution [5], or open circuit multibreath nitrogen washout [6]. All of these techniques still need expensive and/or complex instrumentation and are in general

Available online http://ccforum.com/content/13/6/430

Correction
Correction: End-expiratory lung volume during mechanical ventilation: a comparison with reference values and the effect of positive end-expiratory pressure in intensive care unit patients with different lung conditions

Ido G Bikker, Jasper van Bommel, Dinis Reis Miranda, Jan Bakker and Diederik Gommers
Department of Intensive Care Medicine, Erasmus MC, 's Gravendijkwal 230, 3015 CE Rotterdam, The Netherlands
Corresponding author: Diederik Gommers, d.gommers@erasmusmc.nl

Published: 15 December 2009
Critical Care 2009, 13:430 (doi:10.1186/cc8196)
This article is online at http://ccforum.com/content/13/6/430
© 2009 BioMed Central Ltd

Following the publication of our article [1] we noticed that three of the figures were incorrectly numbered and positioned with respect to the figure legends. The complete set of correct figures (Figure 1, 2, 3 and 4) follows below. Figures 2, 3 and 4 appeared incorrectly in the original article.

Reference
1. Bikker IG, van Bommel J, Reis Miranda D, Bakker J and Gommers D: End-expiratory lung volume during mechanical ventilation: a comparison with reference values and the effect of positive end-expiratory pressure in intensive care unit patients with different lung conditions. Crit Care 2008, 12:R145.

Figure 1
Progression of EELV in individual patients over three stepwise reductions in PEEP. Mean EELV values at each PEEP level are presented as black dots. Patients are divided according to the type of lung condition.
Patients in group N had normal lungs, those in group P had a primary lung disorder, and those in group S had a secondary lung disorder. EELV, end-expiratory lung volume; PBW, predicted body weight; PEEP, positive end-expiratory pressure.

Figure 2
Measured EELV as percentage of predicted sitting FRC at three PEEP levels. The black dotted line represents predicted sitting FRC (100%). Patients in group N had normal lungs, those in group P had a primary lung disorder, and those in group S had a secondary lung disorder. Values are expressed as mean ± standard deviation. EELV, end-expiratory lung volume; FiO2, inspired oxygen fraction; FRC, functional residual capacity; PaO2, arterial oxygen tension; PEEP, positive end-expiratory pressure.

Figure 4
Correlation between change in EELV and change in dynamic compliance. Data are presented as the difference between the lowest PEEP level (5 cmH2O) and 10 or 15 cmH2O PEEP. Patients in group N had normal lungs, those in group P had a primary lung disorder, and those in group S had a secondary lung disorder. EELV, end-expiratory lung volume; PEEP, positive end-expiratory pressure.

Figure 3
PaO2/FiO2 ratio in different types of lung conditions at three PEEP levels. Patients in group N had normal lungs, those in group P had a primary lung disorder, and those in group S had a secondary lung disorder. Values are expressed as mean ± standard deviation. EELV, end-expiratory lung volume; FiO2, inspired oxygen fraction; PaO2, arterial oxygen tension; PBW, predicted body weight; PEEP, positive end-expiratory pressure.

Give service reps the info they need to deliver great service
for Service Managers

Note: If your organization is brand new to Microsoft Dynamics CRM, you get these customer service features automatically. Existing organizations get these features when they apply product updates.
For details about product updates, take a look at this article.

SLAs let you clearly define various metrics (also known as key performance indicators or KPIs) to measure the performance of your service team. For example, you can set conditions to have your service reps:
• Resolve high priority cases for premium customers within 4 hours
• Resolve cases with normal priority in 2 days

To help service reps monitor how they’re doing as they work on their cases, you can define what actions to take when a deadline for KPIs is nearing (called “warning actions”), or when a service rep doesn’t meet the goal (called “failure actions”).

DEALING WITH MISSING VALUES IN DNA MICROARRAY

CAO YI
(M.Eng. USTC, CHINA)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF INDUSTRIAL AND SYSTEMS ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2008

Acknowledgements

First and foremost, I would like to thank my supervisor Associate Professor Poh Kim Leng, for his untiring support and guidance throughout my entire candidature. His valuable advice and critical comments on various aspects of the thesis have definitely improved the quality of this work. I would also like to express my sincere gratitude to Associate Professor Leong Tze Yun for her helpful suggestions on my research topic.

I gratefully acknowledge the support from the Department of Industrial and Systems Engineering for providing a scholarship, without which it would have been impossible for me to complete my study. Many thanks also go to members of the Biomedical Decision Engineering Group for many insightful discussions with them. Further, I thank my colleagues in the System Modeling and Analysis Lab for the memorable days spent with them.

Family support has been crucial for me in this effort. Thanks to my parents for their constant encouragement and for allowing me to pursue my study far away from home all these years.
Their unconditional love, care, and attention have been showering on me all along the way. I am very grateful for that and am confident that this effort gives them much joy.

Finally, I wish to express my most loving thanks to my dear and understanding wife, Qu Huizhong, whose keen criticism and advice has contributed to every page of this dissertation, and whose constant, loving support has made its completion possible. A special THANK YOU to you.

Contents

1 Introduction
  1.1 The Missing Value Problem in Microarray
  1.2 Background
  1.3 Statement of the Problem
  1.4 Objectives
  1.5 Organization
2 The Missing Value Problem in Microarray
  2.1 Microarray
    2.1.1 Types of microarray
    2.1.2 Basic aspects of microarray
  2.2 Biological Background
    2.2.1 DNA and gene
    2.2.2 The central dogma of molecular biology
  2.3 Standard Form of Microarray
  2.4 Missing Values
  2.5 Statistical Classification of Missing Values
3 Literature Review
  3.1 Classification of Imputation Methods
  3.2 Methods for Dealing with Missing Values in Microarray
    3.2.1 Cluster-based imputation methods
    3.2.2 Regression-based imputation methods
    3.2.3 Bayesian imputation methods
    3.2.4 Iterative imputation methods
    3.2.5 External biological knowledge incorporated methods
    3.2.6 Others
  3.3 A Review on Evaluation Criteria
    3.3.1 Theoretical evaluation
    3.3.2 Experimental evaluation