DATA ANALYSIS AND MODELING FOR ENGINEERING AND
MEDICAL APPLICATIONS
MELISSA ANGELINE SETIAWAN
(B.Tech, Bandung Institute of Technology, Bandung, Indonesia)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF CHEMICAL AND BIOMOLECULAR ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2009
ACKNOWLEDGEMENTS
First of all, I want to thank God who is always with me during my
coursework and research, gives me health and ability for doing all my work, equips
me with hope so I can face failures and keep persisting with my research and
blesses me in every single day of my life.
With all respect, I would like to acknowledge my supervisor, Dr Laksh, for
his guidance during my research. I really learnt a lot from him including how to be
a good researcher, how to conduct research, how to be creative, how to motivate
people and how to be a good teacher. He encouraged me during the difficult times I
went through in the course of my research.
I would like to acknowledge my parents, my little sister and Yudi, who
always support me in prayer, give advice, cheer me up whenever I feel down, and
remind me not to lose hope. Thanks for your love, support, advice, concern,
encouragement, and prayer.
I also want to thank NUS and AUN-SEED Net for giving me the
scholarship and opportunity to pursue my M.Eng degree through research.
I want to take this opportunity to acknowledge all my labmates, particularly
Raghu, who equipped me with professional skills, Yelneedi Sreenivas and Sundar
Raj Thangavelu who always came up with jokes and made the situation in our lab
so cheerful. Thanks to Kanchi Lakshmi Kiran, May Su Tun and Loganathan for
discussions that turned out to be really useful for me. Thank you all for your
friendship; I really enjoyed our time together in the IPC group.
Last but not least, I would like to thank all my best friends who are not
mentioned by name explicitly. Nevertheless, I thank each of you for your
encouragement, support, suggestions, attention, and friendship.
CONTENTS
ACKNOWLEDGEMENTS
CONTENTS
SUMMARY
NOMENCLATURE
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
  1.1 INFORMATION BASED SOCIETY – RESEARCH BACKGROUND
  1.2 ANALYSIS TECHNIQUES IN DATA RICH AREA – PROBLEM DEFINITION
  1.3 MOTIVATION AND CONTRIBUTIONS
  1.4 CHALLENGES IN DATA ANALYSIS AND MODELING WORK
  1.5 SCOPE OF PRESENT WORK
  1.6 ORGANIZATION OF THE THESIS
2. SUPERVISED PATTERN RECOGNITION
  2.1 VARIABLE SELECTION
    2.1.1 Fisher criterion
    2.1.2 Entropy method
    2.1.3 Single variable ranking (SVR)
    2.1.4 Partial Correlation Coefficient Metric (PCCM)
  2.2 MACHINE LEARNING METHODS
    2.2.1 Artificial Neural Network (ANN)
    2.2.2 TreeNet
    2.2.3 Classification and Regression Trees (CART)
    2.2.4 Linear/Quadratic Discriminant Analysis (LDA/QDA)
    2.2.5 Variable Predictive Model based Class Discrimination (VPMCD)
    2.2.6 K-nearest neighbour (K-NN)
    2.2.7 Support Vector Machine (SVM)
  2.3 MODEL VALIDATION
    2.3.1 Resubstitution test
    2.3.2 N-fold Cross-validation
    2.3.3 Independent Test
    2.3.4 Leave one out cross-validation (LOOCV) test
3. PARTIAL CORRELATION METRIC BASED CLASSIFIER FOR FOOD PRODUCT CHARACTERIZATION
  3.1 INTRODUCTION
  3.2 METHODS
    3.2.1 Concept of partial correlation coefficients
    3.2.2 Discriminating Partial Correlation Coefficient Metric (DPCCM)
    3.2.3 DPCCM Algorithm
    3.2.4 DPCCM illustration with Iris data
    3.2.5 Other classifiers used for comparison
    3.2.6 Validation methods
      3.2.6.1 Re-Substitution Test
      3.2.6.2 Random Sample Validation Test
  3.3 MATERIAL
    3.3.1 Datasets
    3.3.2 Implementation
  3.4 RESULTS
4. ANALYSIS OF BIOMEDICAL DATA
  4.1 INTRODUCTION
  4.2 METHODS
    4.2.1 Classification Methods
    4.2.2 Variable Selection Methods
  4.3 MATERIALS AND IMPLEMENTATION
    4.3.1 Datasets
      4.3.1.1 Anesthesia Dataset
      4.3.1.2 Wisconsin Breast Cancer (WBC) dataset
      4.3.1.3 Wisconsin Diagnostic Breast Cancer (WDBC) dataset
      4.3.1.4 Heart Disease dataset
    4.3.2 Implementation
    4.3.3 Model Development
    4.3.4 Validation Testing
    4.3.5 Variable Selection
    4.3.6 Software
  4.4 RESULTS
    4.4.1 Parameter Tuning
    4.4.2 Test set Analysis
      4.4.2.1 DOA classification
      4.4.2.2 Classification with WBC dataset
      4.4.2.3 Classification with WDBC dataset
      4.4.2.4 Heart Disease Identification
    4.4.3 Variable Selection
5. EMPIRICAL MODELING OF DIABETIC PATIENT DATA
  5.1 INTRODUCTION
  5.2 FIRST ORDER PLUS TIME DELAY (FOPTD) MODEL
  5.3 MATERIALS AND IMPLEMENTATION
    5.3.1 Dataset and Software
    5.3.2 FOPTD Implementation
  5.4 RESULTS AND DISCUSSION
    5.4.1 Patients with Continuous Insulin Infusion (Group 1)
    5.4.2 Patients with Intermittent Insulin Infusion (Group 2)
    5.4.3 Patients with Blood Glucose Response Affected by Other Factors (Group 3)
    5.4.4 Medication Effect
    5.4.5 Analysis of Home Monitoring Diabetes Data
6. CONCLUSIONS AND RECOMMENDATIONS
  6.1 CONCLUSIONS
  6.2 RECOMMENDATIONS
REFERENCES
APPENDIX A. CV of the Author
SUMMARY
The information revolution has slowly but surely turned us into an information
based society. As a result, the collection and interpretation of data (as one
form or source of information) play an important role in obtaining good
information. In this thesis, several machine learning techniques are described
and applied to classification problems in the food industry and the medical
field. In addition, the use of a First Order Plus Time Delay (FOPTD) model for
ICU patient blood glucose is proposed.
In the present study, a newly developed classifier (DPCCM) is used to
address both Cheese and Wine identification problems and disease identification
problems (using the WBC and WDBC datasets). Its performance is then compared
with other well established classification methods. The comparison on the Cheese
and Wine identification problems shows that DPCCM performs better than linear
classifiers and comparably to non-linear SVM classifiers. It also provides
good visualization for understanding the specific variable interactions contributing
to the nature of each class. DPCCM performs consistently in the disease
identification problems as well, where it achieves better overall accuracy than
the other classifiers used in this study. To conclude, DPCCM shows strong
potential as an efficient data analysis tool for both clinical diagnosis
and food product characterization.
The performance of machine learning techniques in the medical field is
also analyzed by applying some of them to depth of anesthesia (DOA)
classification and heart disease identification. According to our analysis, in terms of
overall accuracy, CART and QDA are the best classifier models for
DOA classification using cardiovascular features and AEP features respectively.
Even when classifiers are built using a subset of features, the superiority of CART
and QDA in DOA classification using cardiovascular features and AEP features
respectively is confirmed. Our analysis in the heart disease identification study
shows that TreeNet gives much better overall accuracy than CART but lower
class 2 classification performance.
The last stage of this study models ICU patients’ blood glucose values
using the proposed FOPTD (First Order Plus Time Delay) model. The
performance of FOPTD is then compared with the Bergman and Chase models.
According to the study, FOPTD successfully fits and predicts the actual patient data
for all datasets received from the hospital. In addition, its performance is much
better than that of the other two established models, not only for good datasets
but also for atypical datasets. Moreover, its simplicity makes the model easy to
apply and modify according to the input availability of the dataset.
NOMENCLATURE
A, B, C, X, Z – selected variables in a given system
AEP, CV, WBC, WDBC, HEART – subscripts used to identify the name of dataset
AEP – Auditory Evoked Potential
ANN – Artificial Neural Network
CART – Classification and Regression Trees
CO – Cost Optimization
CoV – Coefficient of Variance
DOA – Depth of Anesthesia
DPCCM – Discriminating Partial Correlation Coefficient Metric
FC – Fisher Criteria
FOPTD – First Order Plus Time Delay
HR – Heart Rate
LDA – Linear Discriminant Analysis
M – correlation coefficient matrix
MAE – Mean Absolute Error
MAP – Mean Arterial Pressure
N – data matrices used in training
P – data matrices
PCCM – Partial Correlation Coefficient Metric
PNN – Probabilistic Neural Network
QDA – Quadratic Discriminant Analysis
SAP – Systolic Arterial Pressure
SVM – Support Vector Machines
SVR – Single Variable Ranking
VPMCD – Variable Predictive Model based Class Discrimination
WBC – Wisconsin Breast Cancer dataset
WDBC – Wisconsin Diagnostic Breast Cancer dataset
d – number of correlations defined in the system
i, j, k – subscripts used to identify the variables
k – number of classes
l – number of samples in a class
n – number of observations
p – number of variables
r – correlation coefficient
r – subscript used to represent reduced dataset
test – subscripts used to represent test data matrices used in model validation
x – order of partial correlation
LIST OF TABLES
Table 3.1 Classification result for case study I (WINE classification)
Table 3.2 Classification result for case study II (CHEESE classification)
Table 4.1 Summary of parameter tuning result using validation dataset for anesthesia
Table 4.2 Summary of parameter tuning result using validation dataset for breast cancer
Table 4.3 Summary of parameter tuning result using validation dataset for heart disease
Table 4.4 Classification result (correct classification) on test set using cardiovascular features as predictors
Table 4.5 Classification results (correct classification) on test set using AEP features as predictors
Table 4.6 Sensitivity and specificity values for each classifier in DOA classification
Table 4.7 Analysis result for WBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD
Table 4.8 Analysis result for WDBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD
Table 4.9 Classification result on heart disease dataset using CART and TreeNet
Table 4.10 Variables selected from 10 AEP features using different selection methods
Table 4.11 Variables selected from 3 variables in cardiovascular dataset using different selection methods
Table 4.12 Model accuracy using selected variables (AEP dataset)
Table 4.13 Model accuracy using selected variables (cardiovascular dataset)
Table 5.1 MAE values for training and test samples using data from patients with continuous insulin infusion
Table 5.2 MAE values for training and test samples using patient data with intermittent insulin infusion
Table 5.3 MAE values for training and test samples using Group 3 patient data
Table 5.4 Range of the parameters for each patient group
Table 5.5 MAE value for training and test samples using home monitoring data
Table 5.6 Range of estimated parameters for home monitoring data
LIST OF FIGURES
Fig. 3.1 PCCM profiles for IRIS data
Fig. 3.2 Variable correlation shade map for each class in CHEESE classification dataset
Fig. 5.1 FOPTD model scheme (MISO system)
Fig. 5.2 Data from Patient 1 who belongs to the first Group
Fig. 5.3 Results for the “best” patient data set using the FOPTD model
Fig. 5.4 Results for the “worst” patient data set using the FOPTD model
Fig. 5.5 Results for the “best” patient data set using the FOPTD model (Intermittent Insulin Infusion)
Fig. 5.6 Model performance on the “best” patient data from Group 3
Fig. 5.7 FOPTD prediction without medication for Patient 27
Fig. 5.8 FOPTD prediction with medication for Patient 27
Fig. 5.9 FOPTD prediction without medication for Patient 34
Fig. 5.10 FOPTD prediction with medication for Patient 34
Fig. 5.11 Results with the FOPTD model for the patient with the highest MAE (home monitoring dataset)
Fig. 5.12 Results with the FOPTD model for the patient with the lowest MAE (home monitoring dataset)
Fig. 5.13 Actual glucose and model fit for all 5 home monitoring patients
Fig. 5.14 Actual glucose and model prediction for all 5 home monitoring patients
Chapter 1
Introduction
As a general rule, the most successful man in life is the man who has
the best information
Benjamin Disraeli (1804-1881)
Former British Prime Minister
1.1 Information Based Society – Research Background
Fishing and hunting marked the first stage in human history, when humans
were primarily engaged in efforts to fulfill their nutritional needs. Population
growth led to the adoption of agriculture and the domestication of animals. Later,
improvements in creativity and ways of thinking advanced civilization. Together
with the invention and use of tools made from stone, wood and their derivatives,
this advance led to the invention and development of technology. One of the
biggest events marking technological progress, in the late 18th century, was the
industrial revolution (Halsall, 1997; Gascoigne, 2008). In the early stages of the
industrial revolution, which began in Great Britain (circa 1730), machines were
introduced to the industrial domain through the invention of the steam engine.
This turning point, the transition from industry based on manual labor to a
machine based manufacturing environment, had both positive and negative
impacts on society at that time. Continuous development and improvement of
machines has since transformed lifestyles in society (Kelly, 2001). Dr.
Earl H. Tilford (2000) writes about an unnoticed impact of the industrial revolution
that is currently underway: the information revolution.
The information revolution has slowly turned us into an information based
society. While ‘information’ has always been useful for human development, it is
becoming a basic need along with food, clothing and shelter. Two facts that
highlight the importance of information in today’s drive towards a knowledge based
economy are the ubiquitous cell phone and the exponential increase in the use of
the internet. Ten years ago, cell phones were uncommon; their price made them
luxury items. The escalating need for information has encouraged cell phone
manufacturers to provide additional features, such as radio, internet access
(Wi-Fi), Bluetooth, street directory, GPS, etc., at low cost. As a result, almost
everyone owns a cell phone nowadays, even in developing countries. In addition,
the development of the internet has paved the way for quicker and more reliable
information exchange through various information resources and services such as
electronic mail, online chat, file transfer, file sharing, and other World Wide
Web (WWW) resources. According to internet world usage statistics, the number of
internet users doubled in the 8 years from 2000 to 2008. In Africa and the Middle
East, the number of internet users even grew by 1000% during the same period
(Anonymous, 2001). These facts highlight the huge “need” for information among
people and indicate that our society is transforming into an “information based
society”. As a result of this transformation, data and information have a great
effect on decision making in various spheres of human activity. To satisfy this
hunger for accurate and quick information, methodologies that can generate
accurate information from raw data must be developed.
1.2 Analysis Techniques in Data Rich Area – Problem Definition
High quality information delivered at high speed is sought by people in all
walks of life, and especially by those engaged in business, research, or
manufacturing. Before discussing information, its existence and its
importance any further, it is useful to define it. The Oxford English
Dictionary defines information as things that are conveyed or represented by a
particular sequence of symbols, impulses, etc. (Oxford, 2005). Based on this
definition, we can conclude that data is one form or source of
information. Consequently, data collection and interpretation play an
important role in obtaining good information.
Even 10-20 years ago, data was scarce because analytical instruments were
relatively unavailable. Where an instrument existed, its capability was very
limited and it took a long time to produce results. For example, to check for
cancer cells, the doctor had to take sample cells from the organ and inspect them
manually for abnormalities (using a microscope). This procedure could take one or
two days per sample. The complexity of this conventional method made it
overwhelming when the physician had to differentiate between two nearly
identical cancers in order to give the patient the right treatment. Fortunately,
improvements in technology now allow samples to be collected in a short time.
Modern instruments that can analyze several samples simultaneously and provide
results within minutes are now available. This has resulted in a deluge of data
and a new problem: sifting through this mass of data and extracting useful
information from it can be formidable. This is true of data sets arising from
life sciences, chemistry, pharmaceutics (drug discovery), process operations and
even medicine. Methods that can extract useful information from data are needed
and are in fact being developed actively by many research groups.
1.3 Motivation and Contributions
The abundance of data available, especially in the food engineering and
medical sectors, has become a significant problem because the data contain
valuable information. This information helps doctors and food engineers make
good decisions that lead to improvements in those areas, so it has to be
extracted from the datasets. This need for information extraction is a strong
motivation for this research.
The research was conducted as a contribution to food engineers and medical
practitioners, and ultimately to society in many aspects of life, especially
food quality and medicine. Accurate classification for food product
characterization using data mining techniques may help food industry quality
control at a relatively lower cost than human tasters. Hence production costs
could be lower and selling prices could be reduced for the benefit of the consumer.
The fact that machine learning techniques can be used accurately for
disease identification and DOA classification is important not only for the
doctor but also for the patient. The doctor may apply machine learning techniques
and use the results as a basis for deciding whether or not a patient needs
further treatment. In addition, machine learning techniques could benefit
patients because they would not have to undergo so many medical tests, which are
time-consuming and costly.
The ability of the First Order Plus Time Delay (FOPTD) model to describe ICU
patients’ blood glucose values as a function of food, glucose and insulin could help
the doctor predict the amount of glucose and insulin to be administered to the
patient to avoid hypoglycemia and hyperglycemia. This in turn could increase the
number of patients who survive in the ICU.
1.4 Challenges in Data Analysis and Modeling Work
There are several challenges in data analysis and modeling work. The
main one is dealing with data complexity. The success of data analysis and
modeling efforts depends heavily on the data set itself. Poor quality or
insufficient quantity of data, as well as missing data, makes data analysis
harder. Some biological and medical datasets are very large, which makes them
difficult for some computers to handle owing to hardware and software
limitations. Unknown noise and disturbances affecting the system can make
modeling difficult even if a sufficient number of samples is available. In
addition, the complexity of the physical, chemical and biological phenomena
occurring inside the system accentuates the modeling difficulties. To keep the
model simple, data pretreatment methods such as filtering, sample selection and
variable selection may be needed as well.
1.5 Scope of Present Work
The present study addresses the following tasks related to data analysis and
information extraction:
• Evaluating the performance of a newly developed method (DPCCM) by
implementing it on problems from various domains such as food quality and
medicine (cancer identification and depth of anesthesia classification) and
comparing its performance with some existing leading machine learning
methods.
• Applying and evaluating selected variable selection methods to improve
classifier performance on medical data sets.
• Identifying the limitations of existing blood glucose modeling methods in
diabetics (surgical ICU patients and patients under home monitoring) and
evaluating a new modeling methodology.
Section 1.6 describes the organization of this work in more detail. The present
work focuses mainly on information extraction and data analysis covering food
product characterization problems, early identification of some chronic illnesses,
DOA (depth of anesthesia) level maintenance and blood glucose modeling in
diabetic patients. Various existing classification, variable selection, and model
fitting methods are studied.
1.6 Organization of the Thesis
Chapter 2 of this thesis provides an overview of existing data analysis
methods. Both variable selection methods and classification methods are reviewed.
For each method, the basic working principle and its limitations and advantages
are discussed. A newly proposed classification methodology, DPCCM, is introduced
in Chapter 3, where its performance is compared with existing and established
classification methods such as CART, TreeNet, and LDA. Chapter 4 discusses data
mining in the context of medical applications. Several classification methods are
applied and evaluated for early detection of cancer, heart disease identification
and DOA level maintenance during surgery. The role of variable selection methods
in classifier performance is also addressed there. Following the classification
and data analysis work, Chapter 5 considers the challenging task of modeling
blood glucose data from ICU patients and patients under home monitoring.
Chapter 6 contains the conclusions, a summary of the contributions and possible
future work.
Chapter 2
Supervised Pattern Recognition
The difficulty of literature is not to write, but to write what you mean;
not to affect your reader but to affect him precisely as you wish
Robert Louis Stevenson (1850-1894)
Scottish essayist, poet and book author
Machine learning and data analysis work by learning from historical or past
experimental data. With supervised pattern recognition, the outcome can be
predicted from the information available on the attributes (inputs).
Currently, many problems in the manufacturing, business and medical domains (e.g.
process monitoring, disease detection and depth of anesthesia (DOA) estimation)
are classification problems. For such problems, supervised pattern
recognition uses data from past and existing samples in each class to build
discrimination rules/models that distinguish between classes. The aim of
constructing the classifier is to predict the class to which new samples
belong. With this prediction, the analyst can choose the best next step
(Berrueta et al., 2007). Data analysis is therefore useful for decision making and
can help to improve industrial processes, medical treatment and business outcomes.
Some supervised pattern recognition methods exploit inter-class variations
existing in the samples to build the classification model. In this case, the classifier
tries to identify the main difference between classes. These discriminating
conditions are then applied to a new future sample which is then classified
accordingly. The Classification and Regression Tree (CART) method applies this
approach for classification. On the other hand, methods such as Variable Predictive
Model based Class Discrimination (VPMCD) make use of the specific similarities
that exist in each class to build the classification model. VPMCD essentially
looks for the similarities shared by the samples within each class. When a new
sample arrives, it is checked against these class-specific properties and assigned
to the corresponding class.
Berrueta et al. (2007) state that data analysis can be envisioned as four
algorithmic steps. The first is data set division, in which the complete data
set is divided into a training set and a validation (or test) set. A typical
split is 80% for the training set and 20% for the test set (or 75% for the training
set and 25% for the test set). The training set is then used to build the classification
model and the test set is kept aside for validation purposes.
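As an illustration only, the data set division step might be sketched in Python as follows. The function name train_test_split, the fixed random seed and the use of a plain random shuffle are assumptions made for this sketch; the thesis does not state how its splits were drawn.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = round(len(samples) * test_fraction)  # size of the held-out set
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

data = list(range(100))  # stand-in for 100 labelled samples (hypothetical data)
train_set, test_set = train_test_split(data, test_fraction=0.2)
print(len(train_set), len(test_set))  # prints: 80 20
```

Setting test_fraction=0.25 gives the 75%/25% division mentioned above. The test set is not touched again until the validation step.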
The second step is data pretreatment. This step facilitates the next
step, namely classification or information extraction, and helps to avoid drawing
wrong conclusions from the dataset (Berrueta et al., 2007). Common data pretreatment
methods for multivariate data analysis include scaling, weighting, missing
data handling and variable selection. In an experiment, features or
attributes may be measured with different instruments or machines, and the
recorded variables may differ by orders of magnitude. In such cases, weighting
and scaling are usually applied to put the input variables on the same basis.
In weighting, a different weight can be assigned to each variable so that it
makes an appropriate contribution to the output (weighting is related to
scaling). Examples of scaling methods are mean centering (subtracting the
variable's mean from each of its values), standardization (dividing the
mean-centered value by the variable's standard deviation), normalization
(dividing all values of a variable by the square root of its sum of squares) and
variable normalization (normalizing the variables with respect to a single
variable) (Berrueta et al., 2007).
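The first three scaling operations listed above can be written compactly for a single variable (one column of the data matrix). The following is a minimal sketch; the function names and the example values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not specify which estimator is meant.

```python
import math

def mean_center(column):
    """Subtract the variable's mean from each of its values."""
    mean = sum(column) / len(column)
    return [x - mean for x in column]

def standardize(column):
    """Mean-center, then divide by the standard deviation."""
    centered = mean_center(column)
    # Assumption: sample standard deviation with n - 1 in the denominator.
    std = math.sqrt(sum(c * c for c in centered) / (len(column) - 1))
    return [c / std for c in centered]

def normalize(column):
    """Divide every value by the square root of the column's sum of squares."""
    norm = math.sqrt(sum(x * x for x in column))
    return [x / norm for x in column]

temperature = [300.0, 310.0, 320.0, 330.0]  # hypothetical measurements
centered = mean_center(temperature)         # values now average to zero
scaled = standardize(temperature)           # zero mean, unit standard deviation
unit = normalize(temperature)               # sum of squares equals one
```

Whatever scaling is chosen, the mean and standard deviation should be computed on the training set and then reused unchanged on the test set.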
Data received from hospitals and other sources may also contain missing
values. Data imputation is one approach developed to handle missing data: it
replaces each missing value with an estimate. Some techniques replace the missing
value with the mean value of the variable (Little and Rubin, 1986; Zhang et al., 2008).
However, this method assumes that there are no dependencies between the variables and
may distort other statistical properties of the data. Another well-known
method is hot deck imputation, in which a missing value is replaced with the
value from another row that is similar to the row with the missing value (Rilley, 1993;
Dahl, 2007). Regression imputation and decision tree imputation can also be used to
predict missing values. In regression imputation, the missing value is predicted by
a regression equation built from the other variables, which contain no missing values.
Similarly, in decision tree imputation, a decision tree is built from the rows that
have no missing values, with the variable containing missing values acting as the
target variable. The missing value is then predicted by applying this decision tree to
the row with the missing value (Jagannathan and Wright, 2008). Variable selection is
needed when dealing with huge datasets, so as to minimize the computational time
and make model or classifier construction relatively easy. Variable selection is
discussed in detail in section 2.1. In this thesis, we focus only on variable
selection (Chapter 4) and centering (Chapter 5) because the datasets used are
relatively large and contain no missing data.
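Although imputation is not used in this thesis, the difference between mean imputation and hot deck imputation is easy to see in a small sketch. The code below is illustrative only: the function names are hypothetical, missing entries are assumed to be encoded as None, only one column is assumed to contain missing values, and "similar" rows are assumed to mean the nearest row in Euclidean distance over the remaining columns.

```python
import math

def mean_impute(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [r[:col] + [mean if r[col] is None else r[col]] + r[col + 1:]
            for r in rows]

def hot_deck_impute(rows, col):
    """Replace None in column `col` with the value of the most similar complete row."""
    donors = [r for r in rows if r[col] is not None]

    def distance(a, b):
        # Euclidean distance over every column except the one being imputed.
        return math.sqrt(sum((a[j] - b[j]) ** 2
                             for j in range(len(a)) if j != col))

    filled = []
    for r in rows:
        if r[col] is None:
            donor = min(donors, key=lambda d: distance(r, d))
            r = r[:col] + [donor[col]] + r[col + 1:]
        filled.append(r)
    return filled

# Hypothetical data: the second variable is roughly ten times the first.
rows = [[1.0, 10.0], [2.0, 20.0], [2.1, None], [9.0, 90.0]]
print(mean_impute(rows, 1)[2])      # prints: [2.1, 40.0]
print(hot_deck_impute(rows, 1)[2])  # prints: [2.1, 20.0]
```

Mean imputation ignores the dependence between the two variables and fills in 40.0, whereas the hot deck donor (the row with 2.0 in the first column) supplies 20.0, which is closer to the pattern in the data.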
The third step is classification model building. In this step, all information
contained in the training data set (excluding the test set) is used to build the
classification model. Once the classification model is constructed, the data analyst
proceeds to the last and crucial stage, validation. The model obtained
from the previous stage is tested using the test data set. The accuracy and other
characteristics of the classifier are then noted and reported as “classifier
performance”. These steps are explained in more detail in sections 2.1 to 2.3.
2.1 Variable selection
One of the biggest challenges faced by almost all classifiers relates to the size
of the data set. To create a good and robust classifier, we need a data set that is
rich in both quality and quantity. A data set with few samples gives the classifier
insufficient information, hence its performance will be low. Large
data sets, which have many variables, can potentially provide enough information,
but the analysis will be time consuming and computationally expensive. Therefore,
in problems involving data sets with a large number of variables (e.g. microarray
data), the most common data pretreatment method is variable selection. Only
the important “discriminating variables” are then processed by the classification
algorithm.
Variable selection is not an absolute requirement for classifier development
or, for that matter, for any data analysis activity. However, variable selection can
sometimes boost classifier performance, especially when applied to a data set
containing noise. Through this step, variables that contain noise, carry redundant
information or lack discriminating ability are removed from the data set. This
reduces the input space so that building the classifier model becomes easier,
faster and sometimes more accurate. In addition, identifying the important variables
may provide better information for a more accurate classification (Cheng
et al., 2006). Pretreatment must be done in the same manner on
both the training and test sets.
We now review some variable selection methods:
2.1.1 Fisher criterion
Fisher criterion is defined as the ratio of “between-class” to “within-class”
variance (Wang et al., 2008). This criterion is maximized by Linear Discriminant
Analysis (LDA) (Duda et al., 2000) to identify the best separation plane by
weighting predictor variables. Therefore, after the plane is built, each variable has
its own weight factor. These weight factors are then used as a basis to rank the
variables. Since this approach is derived from LDA and Quadratic Discriminant
Analysis (QDA) concepts, the chosen variables will be biased towards the LDA and
QDA classification methods, so this variable selection method will generally
boost LDA and QDA performance. However, it is not uncommon for the Fisher
criterion combined with a classifier other than LDA/QDA to give a classification
result that is even better than the Fisher criterion combined with LDA/QDA.
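To make the ratio concrete, the sketch below computes a Fisher score separately for each variable and ranks the variables by it. This univariate form is a common simplification and is used here only as an assumption; it is not the LDA weight-based ranking described above, and the function names and data are hypothetical.

```python
def fisher_score(values, labels):
    """Between-class over within-class sum of squares for one variable."""
    overall = sum(values) / len(values)
    between = within = 0.0
    for c in sorted(set(labels)):
        group = [v for v, y in zip(values, labels) if y == c]
        mean_c = sum(group) / len(group)
        between += len(group) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    # Note: a variable that is constant within every class gives within == 0
    # and would need special handling; this sketch does not cover that case.
    return between / within

def rank_variables(X, labels):
    """Return variable indices ordered from highest to lowest Fisher score."""
    n_vars = len(X[0])
    scores = [fisher_score([row[j] for row in X], labels) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

# Hypothetical two-class data: variable 0 separates the classes, variable 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
print(rank_variables(X, y))  # prints: [0, 1]
```

Variable 1 has identical class means, so its between-class term is zero and it is ranked last.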
2.1.2 Entropy method
Entropy, as a variable ranking method, is essentially a part of the CART
algorithm. Since it works in line with the CART classifier, the best variable set
chosen will provide enough information for CART to perform a good classification.
It is therefore not surprising that entropy is usually a useful method for improving
CART performance.
As in CART, the first step of this algorithm is to calculate, for every
variable, an entropy value (Ebrahimi et al., 1999) that signifies the randomness in
the variable. The variables are then ranked based on their entropy values. The
greater the entropy value, the more potential a variable has as a class separator.
2.1.3 Single variable ranking (SVR)
SVR is a univariate approach derived from LDA and QDA. In SVR, a
single selected predictor variable is used to build an LDA model, which is then
tested to determine the classification accuracy. This LDA model building and
testing is repeated independently for every predictor variable so that a
classification accuracy is obtained for each variable. The variables are then ranked
based on these accuracy values. The SVR approach provides a good
measure of variable influence on classification in line with the principle of LDA
classification.
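The ranking loop of SVR can be sketched as follows. For brevity, the one-variable LDA model is replaced here by a nearest-class-mean rule, which coincides with one-dimensional LDA only under the assumptions of equal class priors and a common class variance. Accuracy is measured by resubstitution, which is also an assumption. All names and data are hypothetical.

```python
def nearest_mean_accuracy(values, labels):
    """Resubstitution accuracy of a one-variable nearest-class-mean rule."""
    means = {}
    for c in sorted(set(labels)):
        group = [v for v, y in zip(values, labels) if y == c]
        means[c] = sum(group) / len(group)
    predicted = [min(means, key=lambda c: abs(v - means[c])) for v in values]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)

def single_variable_ranking(X, labels):
    """Build one classifier per variable and rank variables by its accuracy."""
    n_vars = len(X[0])
    acc = [nearest_mean_accuracy([row[j] for row in X], labels)
           for j in range(n_vars)]
    order = sorted(range(n_vars), key=lambda j: acc[j], reverse=True)
    return order, acc

# Hypothetical data: variable 0 separates the classes cleanly, variable 1 only partly.
X_svr = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 5.5]]
y_svr = [0, 0, 0, 1, 1, 1]
order, accuracies = single_variable_ranking(X_svr, y_svr)
print(order)  # prints: [0, 1]
```

Here variable 0 classifies all six samples correctly on its own, while variable 1 classifies four of the six correctly, so variable 0 is ranked first.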
2.1.4 Partial Correlation Coefficient Metric (PCCM)
In the PCCM method, the partial correlation coefficients of orders 0, 1 and 2 are
calculated between different pairs of variables. The resulting multivariate
associations (in the form of edges on a node in the association network) are then
used as a basis for variable ranking (Raghuraj Rao and Lakshminarayanan, 2007a).
PCCM as data pretreatment can potentially influence variable interaction based
approaches such as VPMCD and Artificial Neural Network (ANN).
After a variable selection method has been applied, the training data is ready to
be processed by the chosen machine learning method to build a classification
model. Some popular and effective machine learning methods are described next.
2.2 Machine Learning Methods
Once the data set is ready for further analysis, the training data is subjected
to a suitable supervised pattern recognition method to build a classification model.
As discussed earlier, the test set data is kept aside during model building.
2.2.1 Artificial Neural Network (ANN)
Artificial Neural Network (Razi and Athappilly, 2005; Berrueta et al., 2007)
is a widely used black box machine learning method since it is insensitive to noise,
has a high tolerance to data complexity and is able to handle the non-linearities in
a data set quite naturally. An ANN comprises an input layer representing the input
variable nodes, a set of hidden layers with computational neurons and an output layer.
The performance of a neural network is sensitive to the number of hidden layers used
while building the network. A higher number of hidden layers can lead to
overfitting, while a smaller number of hidden layers can reduce prediction accuracy.
In this study, we utilize a back-propagation neural network in which the weight
values (the coefficients of the connections between nodes) are adjusted during
training by propagating the error (the difference between the network output and the
true diagnoses available in the training dataset) backward through the network
(Statnikov et al., 2005). This learning process identifies the matrix of weights that
gives the best fit to the training data (Berrueta et al., 2007).
2.2.2 TreeNet
TreeNet (Friedman, 1999) applies a slow learning process leading to a
network of several (possibly hundreds of) small trees (see the description of
Classification and Regression Trees below). Each of the trees makes a small
contribution towards the final model (Raj Kiran and Ravi, 2008). The trees usually
have fewer than 8 terminal nodes and the final model is similar in spirit to a long
series expansion (such as a Fourier or Taylor series expansion): a sum of terms that
becomes progressively more accurate as the expansion continues. Therefore, the
more trees used in building the network, the better the fit to the data that can be
obtained. Since TreeNet is equipped with a self-test ability, it is able to prevent
overfitting. Some of TreeNet's advantages are fast model generation, automatic
selection of predictors, simple data pretreatment steps, easy handling of missing
values and robustness to partially accurate data. TreeNet is also equipped with a cost
tab which facilitates model building. The basic idea of the cost tab is to assign a
larger cost to misclassification of one particular class than to that of the other
classes. The model built will then give good accuracy for that particular class, at
the expense of accuracy for the other classes. The cost tab is useful when
dealing with medical data sets which need more accuracy on one class of patients
(e.g. patients with a certain disease) than on others (e.g. healthy subjects).
2.2.3 Classification and Regression Trees (CART)
CART (Breiman et al., 1983) is a supervised pattern recognition method
which has been used to extract useful information from not only chemical process
datasets (Saraiva and Stephanopoulos, 1992) but also medical record data sets
(Kurt et al., 2008). The extracted information is then presented as classification
rules in the form of a tree. For situations where the target variable is discrete or
categorical (such as DOA level), classification trees are developed and if the target
variable is continuous, regression trees are constructed (Deconinck et al., 2005).
Because its outcome is a set of explicit classification rules, CART is categorized
as a white box classifier. Its advantage over many other classifiers is that the
rules can be easily applied to assign a new sample to its corresponding class. It is
therefore not surprising that CART is widely used to generate rules for process
improvement based on historical plant data (Bevilacqua et al., 2003; Tittonell et al.,
2008), safety management (Bevilacqua et al., 2008), product quality prediction
(Rousu et al., 2003) or early detection of cancer based on medical record data
(Spurgeon et al., 2006; Kojima et al., 2008). Other advantages of CART
as a tree building algorithm are its ability to handle missing data and nonlinear
relationships between input and output variables.
Given a set of training data, CART chooses the variable from the feature
matrix (X) that has the potential to be the best separator by measuring
diversity. Three diversity measures are available in CART and each
generates its own tree, which differs from the others (Kurt et al., 2008).
The tree generated with the Gini index tends to separate the class with the largest
population first, followed by the class with the next smaller population and so on,
down to the class with the smallest population at the bottom of the tree. The second
diversity measure is entropy. In this method, the entropy value of each variable is
calculated and all variables are ranked by their entropy values from highest to
lowest. The tree (with the entropy diversity measure as the basis) is then built by
using the variable with the highest entropy value as the best separator, followed by
the second best separator and so on. The last diversity measure is the
twoing method, which tends to build a tree that separates half of the
classes in the data from the other half at each step.
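For reference, the Gini index and entropy of a node can be computed from its class labels as sketched below, using the standard textbook definitions. How a particular CART implementation evaluates and compares candidate splits may differ in detail; the helper split_impurity, which averages the impurity of the two child nodes weighted by their sizes, and the example data are assumptions of this sketch.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class proportions in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(values, labels, threshold, measure):
    """Size-weighted impurity of the two children of a binary split."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * measure(part) for part in (left, right) if part)

node_labels = ["a", "a", "b", "b"]   # hypothetical node with two classes
node_values = [1.0, 2.0, 3.0, 4.0]   # one predictor variable
print(gini(node_labels))             # prints: 0.5
print(entropy(node_labels))          # prints: 1.0
print(split_impurity(node_values, node_labels, 2.5, gini))  # prints: 0.0
```

A split at 2.5 produces two pure child nodes, so the weighted impurity drops from 0.5 to 0.0, which is the kind of improvement the tree-growing procedure searches for.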
Using the best variable, a rule is then constructed to separate one class from
another. This condition becomes the initial node of the tree and is split
further based on the logical outcome of the decision for the condition. This binary
splitting process proceeds recursively from the top of the tree to the bottom
until the population of each terminal node is nearly homogeneous. The resulting tree
is called the maximal tree, which may suffer from overfitting, especially in high
dimensional datasets with multivariate interactions between variables. To
overcome this problem, the tree must be pruned. Here, we
employ the minimal cost pruning method, which prunes the branches in a manner
that does not significantly affect the prediction accuracy of the tree. To select
the optimal pruned tree for classification of new samples, either a cross-validation
test or validation with fresh test data can be utilized. Like TreeNet, CART is also
equipped with a cost tab for applications where higher prediction
accuracies are sought for some specific classes.
2.2.4 Linear/Quadratic Discriminant Analysis (LDA/QDA)
Linear Discriminant Analysis (LDA) (Duda et al., 2000; Roggo et al., 2007) is
one of the most common machine learning techniques used for classification. LDA
weighs all variables to identify separating planes between classes by maximizing the
ratio of “between-class variance” to “within-class variance”. The main assumption used
in LDA is that the class conditionals follow a Gaussian distribution (Wang et al., 2008).
Since LDA is a linear classifier, its performance is generally very good for
linearly separable datasets. However, the presence of overlapping samples
belonging to different classes, which cannot be separated linearly in the descriptor
space, degrades LDA’s performance.
Another technique available for classification is Quadratic Discriminant
Analysis (QDA). QDA (Duda et al., 2000; Roggo et al., 2007) was developed to
handle situations wherein the classes are not linearly separable. As a non-linear
classifier, QDA constructs a parabolic boundary that maximizes “between-class
variance” and minimizes “within-class variance” in the projected scores. The
assumption that the class conditionals follow a Gaussian distribution is still used in
QDA. However, unlike LDA, it tolerates differences in the covariance matrices of the
various classes (Wang et al., 2008). LDA and QDA generally perform well
in problems which have more samples than variables
(Berrueta et al., 2007).
2.2.5 Variable Predictive Model based Class Discrimination (VPMCD)
VPMCD, proposed recently by Raghuraj Rao and Lakshminarayanan
(2007b), is a parametric supervised pattern recognition method. The main
assumption used in developing this classifier is that the predictor
variables are dependent on one another and that each class exhibits a unique pattern
of variable dependence. VPMCD belongs to the family of classifiers that use
mathematical equations to define the classification boundary between classes. For each
class, VPMCD develops a model for every variable as a function of the other
variables. As a result, each class has a unique system characterization in terms of
specific inter-variable interaction models which can be exploited further to classify
new samples.
2.2.6 K-nearest neighbour (K-NN)
The K-nearest neighbour classifier (Cover and Hart, 1967) makes use of
Euclidean distance to classify a new object (Bagui et al., 2003; Statnikov et al.,
2005). In cases involving strongly correlated variables, correlation based
measures are used instead of Euclidean distance. The new object is assigned to
the class to which the majority of its K nearest objects belong. K is
usually odd (K=3 is frequently preferred). Preprocessing the data (variable scaling)
is strongly encouraged to avoid the effect of different variable scales.
Compared to other classifiers, K-NN is mathematically simpler, free from statistical
assumptions and its effectiveness is independent of the spatial distribution of
classes. However, similar to LDA, the performance of K-NN will be poor if the
samples for the existing classes are not equally distributed (Berrueta et al., 2007).
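The K-NN decision rule described above is short enough to write out in full. The following is a minimal sketch with Euclidean distance and majority voting; the function name knn_predict and the example data are hypothetical, and no scaling or tie-breaking strategy is included.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Assign `query` to the majority class among its k nearest training samples."""
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data with two well separated classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, [0.5, 0.5]))  # prints: A
print(knn_predict(train_X, train_y, [5.5, 5.5]))  # prints: B
```

Because the distance is computed on the raw variable values, a variable with a large numerical range would dominate the vote, which is why variable scaling is recommended before applying K-NN.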
2.2.7 Support Vector Machine (SVM)
SVM (Vapnik, 1995) is one of the most powerful established classification
algorithms in the supervised pattern recognition literature. Its classification
performance is comparable or even superior to that of other existing classifiers.
Since it is insensitive to dimensionality, its ability to handle large scale
classification problems (many variables and many samples) is acknowledged. Furey et
al. (2000) and Guyon et al. (2002) have noted superior SVM performance in
biomedical classification problems on data sets involving a large number of
variables and very few samples.
In its basic form, SVM can only be applied to binary classification
problems. It constructs a hyperplane that maximizes the margin between the
classes. A new sample is assigned to a class based on the side of the hyperplane on
which it falls (Statnikov et al., 2005). Since most real-world problems involve
multiple categories, many researchers have considered how to apply such a powerful
algorithm to multiclass problems. Several algorithms
have been developed over the last several years to enable SVM implementation on
multicategory problems. Examples include One versus Rest (OVR) and One versus
One (OVO). These approaches are detailed below.
Explained in detail by Kressel (1999), One versus Rest (OVR) is the
simplest algorithm proposed for multiclass SVM. In this algorithm, one k-class
problem is broken into k binary-class problems. The classification is then done by
constructing a separation between class 1 and the others, class 2 and the others and
so on, up to class k and the other classes. The sample is assigned to the class
with the furthest hyperplane. The disadvantages of this approach are that it is
computationally expensive and has no theoretical justification (Statnikov et al.,
2005).
In the One versus One (OVO) approach, one separation plane which
maximizes the margin between two classes is built for every pair of classes.
Therefore, for a k-class problem, [k*(k-1)/2] planes need to be constructed. A
new sample is subjected to all [k*(k-1)/2] classifiers, which results in [k*(k-1)/2]
label predictions. The sample is assigned to the class which receives the largest
number of votes (Statnikov et al., 2005).
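The voting scheme of the OVO approach is independent of the binary classifier used, so it can be sketched without an SVM implementation. In the code below, each pairwise SVM is replaced by a simple one-dimensional nearest-centre rule; this stand-in, the class names and the centres are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(classes, binary_classifiers, sample):
    """Apply every pairwise classifier and return the class with the most votes."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](sample)] += 1
    return votes.most_common(1)[0][0]

# Stand-in for trained pairwise SVMs: pick whichever class centre is nearer.
centers = {"low": 0.0, "mid": 5.0, "high": 10.0}
classes = list(centers)

def make_rule(a, b):
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

classifiers = {pair: make_rule(*pair) for pair in combinations(classes, 2)}

k = len(classes)
print(len(classifiers) == k * (k - 1) // 2)   # prints: True
print(ovo_predict(classes, classifiers, 4.0))  # prints: mid
print(ovo_predict(classes, classifiers, 9.0))  # prints: high
```

For the sample 4.0, the three pairwise decisions are "mid", "low" and "mid", so "mid" wins with two of the three votes.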
After the model is created, some tests are applied to check the accuracy and
robustness of the classifier. This stage is called the validation step and is
explained below.
2.3 Model Validation
The final model obtained from the model building step is then applied to the test
dataset. The results of this test provide a realistic estimate of the classifier's
performance in predicting the class to which a new sample belongs, and are a valid
basis for deciding which classifier is suitable for the problem at hand. It is
important to note that the performance of a classifier is highly dependent on the data
set. For one dataset, method A may turn out to be the best but for another data set,
method B may work better than method A.
As stated above, once the classifier model is developed using any of the
techniques described in section 2.2, the validity of the model is gauged using test
data. Several classifier testing methods are commonly used to compare the
performances of different techniques.
2.3.1 Resubstitution test
The resubstitution test provides a measure of the self-consistency of the model.
In this case, all data are used to build the model, which is then tested
on the same dataset that was used for model building. Most classifiers will
show very good performance when subjected to the resubstitution test.
However, it is not a good testing criterion as it gives no indication of
the generalizing capability of the classifier.
2.3.2 N-fold Cross-validation
In the N-fold cross-validation test, the dataset is randomly divided into N sets of
data. The classification model is then built using (N-1) sets and tested on
the one set that was excluded during model building. This data division, model
building and testing procedure is repeated N times, and the mean and
standard deviation of the accuracy are usually reported as the outcome of the N-fold
test. N-fold cross-validation is also commonly used to choose the optimum
classification model in some classification methods. The model obtained from this
test is usually robust enough to be applied to new samples because data randomness
has been considered during the modeling step.
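The N-fold procedure can be sketched independently of the classifier by passing the model building and prediction steps in as functions. In the sketch below, the nearest-class-mean classifier, the function names and the data are hypothetical stand-ins; any of the classifiers of section 2.2 could be substituted.

```python
import random
import statistics

def n_fold_indices(n_samples, n_folds, seed=0):
    """Randomly partition the sample indices into n_folds disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(X, y, n_folds, fit, predict, seed=0):
    """Return the mean and standard deviation of the per-fold accuracy."""
    accuracies = []
    for test_idx in n_fold_indices(len(X), n_folds, seed):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)                 # build on N-1 sets
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))    # test on the excluded set
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Stand-in classifier (an assumption): nearest class mean on a single variable.
def fit_nearest_mean(train_X, train_y):
    means = {}
    for label in sorted(set(train_y)):
        values = [x[0] for x, t in zip(train_X, train_y) if t == label]
        means[label] = sum(values) / len(values)
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x[0] - means[label]))

X_cv = [[0.0], [0.2], [0.1], [0.3], [0.4], [5.0], [5.2], [5.1], [5.3], [5.4]]
y_cv = ["a"] * 5 + ["b"] * 5
mean_acc, sd_acc = cross_validate(X_cv, y_cv, 5, fit_nearest_mean,
                                  predict_nearest_mean)
print(mean_acc, sd_acc)
```

Setting n_folds equal to the number of samples makes every fold a single sample, which corresponds to leave-one-out cross-validation.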
2.3.3 Independent Test
An independent test is done as the final step of the classifier building effort.
After the final model is obtained based on training data, it is tested on a fresh test
set. In most cases, this is a portion of the original dataset which was
excluded during model building. This type of validation checks the stability of the
algorithm, in that the effect of new data points on the performance of the classifier
is considered (Duda et al., 2000).
2.3.4 Leave one out cross-validation (LOOCV) test
The LOOCV algorithm is a special case of the cross-validation test in which each
set contains a single sample. In LOOCV, one sample is taken out of a dataset of N
samples for testing. The classifier model is built
using the remaining (N-1) samples and then tested on the one excluded
sample. This procedure is repeated so that every sample serves once as the
test sample. The average accuracy is calculated as the outcome of the LOOCV test and
represents the overall performance of the classifier.
The performances on the selected data sets are compared based on the
percentage of correct classification, both for individual classes and for all classes
put together (overall classification accuracy).
Overall, Chapter 2 has described the data analysis procedure, summarized
some data pretreatment techniques, and reviewed commonly used variable
selection methods, classification algorithms and model validation methods.
Chapter 3
Partial Correlation Metric Based Classifier for
Food Product Characterization
Food is the moral right of all who are born into this world
Norman Borlaug (1914)
American Scientist
3.1 INTRODUCTION
Identification and classification of products into different categories is an
important problem in the food industry. General applications like
spoiling yeast growth modeling (Evans et al., 2004), data analysis in food
applications (Berrueta et al., 2007), HACCP implementation in food industries
(Bertolini et al., 2007) and food authentication (Toher et al., 2007) have benefited
from discriminant analysis research. The classification problems are characterized
by special challenges such as multivariate feature space, presence of different types
of attributes (binary, discrete and continuous) and multiple-class datasets. Many
methods have been attempted to address these issues (Tominaga, 1999; Berrueta et
al., 2007). The main objective of these supervised algorithms is to learn the
relationship between the measurable variables (observed based on physico-chemical
attributes) and different pre-defined product characteristics of the system (classes
based on quality indicators). These relationships, in the form of mathematical
models, set of rules or statistical distributions are then used to predict the class of
the new set of measurements made on the same system.
The performance efficiency of any classification method depends largely on
the type of dataset. Sample classes that can be linearly separated (Tominaga, 1999)
on a descriptor space can be effectively classified using Linear Discriminant
Analysis (LDA). Suitable linear decision boundaries can be designed to distinctly
group the samples on either side of the boundary. In complex multivariate datasets,
characteristic of many chemometrics applications, the class data points show
overlapping clusters when projected on a lower dimensional space. During training,
suitable straight lines or hyper-planes cannot be designed to effectively distinguish
the observations belonging to different classes. Methods built in orthogonal feature
space (linearly independent variables) fail to capture the inter-variable
dependencies leading to specific class structure and hence linear hyper-plane
classifiers, like LDA cannot always separate groups distinctly.
Model-based statistical methods like discriminant partial least squares
(DPLS) (Tominaga, 1999; Chiang and Braatz, 2003), decision rule based
classification trees, advanced machine learning techniques like Artificial Neural
Networks (ANN) (Razi and Athappilly, 2005) and Support Vector Machines
(SVM) (Vapnik, 1995; Granitto et al., 2007) have been successfully employed for
non-linear classification problems. The discriminating ability of these classifiers
depends either on variations in variables across different classes
(LDA/SVM/decision tree) or on the extent of associations between different
features and output variables (ANN/DPLS). For effective classification of linearly
inseparable, multivariate data, these two factors, measured in terms of class to class
dissimilarities and intra-class associations between variables, need to be utilized
simultaneously.
The new Partial Correlation Coefficient Metric (PCCM) based classification
technique, used in this chapter, attempts this balanced approach of data
classification. The basic idea adopted is to model the possible inter-variable
relations (in the form of inference metric) for each class in the training data based
on the higher order partial correlations between them. These metrics, defined for
each class in the training set, model the intra-class attribute relations for individual
classes. The sample to be tested is then embedded into each class model and the new
inter-variable correlation structure is measured. The proximity of the new variable
interaction structure to the individual class models is used as the classification
criterion.
The PCCM methodology and the new classification approach are studied here with
respect to classification of food products and quality characterization.
3.2 METHODS
3.2.1 Concept of partial correlation coefficients
The Pearson correlation coefficient (r) defines the linear association
between continuous random variables and has been widely employed in literature
(Sokal and Rohlf, 1995; Timm, 2002), for many variable interaction mapping
problems. However, the correlation coefficient alone cannot distinguish direct and
indirect relationships between variables. Consider, for example, two variables A and
B. The association between A and B can occur in different ways, such as a direct
relationship A → B, both being co-regulated by a third variable C (i.e. C → B and
C → A), or an indirect relationship A → C → B. The regular correlation coefficient r
defined on the two variables A and B does not differentiate between these types of
relations and simply marks A and B as being related or not related.
The partial correlation coefficient brings out this difference by separating out
the indirect relations or path relations. The correlation between two variables is
said to be conditioned on a third variable, or on a specific set of other variables,
when the effects of those variables are filtered from A and B before calculating the
coefficient. Hence, the partial correlation rAB/C highlights the correlation that
remains between A and B once the effect of the conditioned variable C is removed.
The order of the partial correlation coefficient is zero if the correlation is
directly defined between A and B without conditioning on any variable. The order is x
when the correlation is calculated after conditioning on x different variables other
than A and B (Sokal and Rohlf, 1995). Eqs. (1) through (3) give the general
definition for the first three orders of partial correlations.
zeroth-order correlation:
$$ r_{AB} = \frac{\operatorname{cov}(A, B)}{\sqrt{\operatorname{var}(A)\,\operatorname{var}(B)}} \tag{1} $$
first-order partial correlation:
$$ r_{AB/Z} = \frac{r_{AB} - r_{AZ}\, r_{BZ}}{\sqrt{(1 - r_{AZ}^{2})(1 - r_{BZ}^{2})}} \tag{2} $$
second-order partial correlation:
$$ r_{AB/XZ} = \frac{r_{AB/X} - r_{AZ/X}\, r_{BZ/X}}{\sqrt{(1 - r_{AZ/X}^{2})(1 - r_{BZ/X}^{2})}} \tag{3} $$
The correlation measure rAB and partial correlation measures rAB/Z and rAB/XZ
are symmetric (i.e. rAB = rBA, rAB/Z = rBA/Z and so on) and these
coefficients are bounded between -1 and 1 (Sokal and Rohlf, 1995). Hence,
instead of evaluating a full correlation matrix (with redundant entries), the
inter-variable association structure can be represented as a single array of unique
correlation coefficients arranged in a definite order of variable combinations.
Such a vector of coefficients is referred to here as the Partial Correlation Coefficient
Metric (PCCM). This PCCM vector stores a definite pattern and the strengths of
inter-variable associations for a given system. In a system with p variables, the 0th order
PCCM will have [p*(p-1)/2] elements, the 1st order PCCM will have [p*(p-1)*(p-2)/2]
elements and the 2nd order PCCM will have [p*(p-1)*(p-2)*(p-3)/4] elements in the
vector.
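These element counts follow from choosing an unordered pair of variables and then an unordered set of conditioning variables from the remaining p - 2. A short sketch that reproduces the three formulas is given below; the helper name `pccm_length` is hypothetical.

```python
from math import comb

def pccm_length(p, order):
    """Number of PCCM elements for p variables and a given partial correlation order.

    comb(p, 2) counts the variable pairs; comb(p - 2, order) counts the ways
    to pick the conditioning variables from the remaining ones.
    """
    return comb(p, 2) * comb(p - 2, order)

# Sizes for the four-variable Iris example used later in this chapter.
iris_lengths = [pccm_length(4, order) for order in (0, 1, 2)]
```

For p = 4 this gives 6, 12 and 6 elements for orders 0, 1 and 2, which agrees with p*(p-1)/2, p*(p-1)*(p-2)/2 and p*(p-1)*(p-2)*(p-3)/4.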
Partial correlation coefficient has been used in literature to infer direct and
indirect associations between random measurements (Eisen et al., 1998; Steuer et
al., 2003; Baba et al., 2004; de la Fuente et al., 2004). Most recently, Raghuraj and
Lakshminarayanan utilized partial correlation structure to select a set of important
features for classification (Raghuraj Rao and Lakshminarayanan, 2007a) and
multivariate calibration (Raghuraj Rao and Lakshminarayanan, 2007c) applications.
The focus in the present study is to adopt the concept of partial correlation metric as
a discriminating model for sample classification applications. To our understanding
this approach is the first of its kind in data classification, especially for
chemometrics applications in food technology.
The general statement of the classification problem can be formulated as
follows. Consider a system N [n x p; k] in which n observations belonging to k
different classes of the system are obtained by measuring p variables. The objective
of the discriminant analysis is to develop a classifier (using the observations in N)
by modeling each of the k classes. The adequacy of the classifier is then tested
based on its ability to predict the classes of samples in N (self-consistency or
re-substitution test) and to predict the classes of a new set of samples Ntest [m x p; k],
which were not used during modeling (independent sample test). The methodology
adopted to achieve this objective using PCCM based discriminant analysis is
explained in the next section.
3.2.2 Discriminating Partial Correlation Coefficient Metric (DPCCM)
The underlying principle of the DPCCM method is to build a distinct variable
interaction structure for each class using training data (N). These individual class
models are represented by a characteristic vector of calculated partial correlation
coefficients between an identified sequence of all the variable pairs. The
inter-variable correlation coefficient vectors (Ri, i = 1, 2, 3, …, k) of the classes are stored
in a single model structure in the form of DPCCM, Mmodel [k x d], where d is the
number of partial correlations defined between pairs of variables by conditioning on
other variables. For example, d = p*(p-1)/2 for 0th order and d = p*(p-1)*(p-2)/2
for 1st order partial correlations between variables. Mmodel [k x d] represents the
learnt classifier model for the entire system N, which can be then used to predict the
class of a new observation given the values of its p measurements. It must be
highlighted here that the basic assumption made during this training step is that all
the samples belonging to specific class in the dataset N consistently represent the
characteristics of that class and the group samples do not contain any outliers. In
applications where the training data samples are inconsistent within each
class, a suitable outlier detection step can be employed as a precautionary
pre-processing step before building the metric, Mmodel. When a new observation from
the sample matrix Ntest is to be classified, it is appended as an additional row to
the model data (i.e. in N) for each class and the above procedure is repeated using
the expanded dataset to obtain a new correlation structure, R, for that class (using
the same order of partial correlations as used during modeling). This is repeated by
embedding the sample observation into the data set for each class to obtain the sample
DPCCM, Msample [k x d]. Each row of Msample represents the vector Ri (i = 1, 2, 3,
…, k), each computed after embedding the sample observation in respective class
data. Each row in Msample is then compared for its similarity with corresponding row
in Mmodel, using the standard Pearson’s correlation between the two vectors. Since
the inter-variable association structures are captured in terms of scale-free
correlation coefficients, we again use the correlation
coefficient as the similarity index instead of any scale-based measure (like Euclidean
distance). The sample observation is classified into class i (i = 1, 2, 3, …, k) if the
correlation between row ‘i’ of Msample and row ‘i’ of Mmodel is the maximum. Since the
PCCM algorithm captures all the inter-variable relations, it is conjectured that the
final DPCCM Mmodel obtained on the training data represents a variable interaction
discriminatory model to be used for sample testing. The DPCCM classification
analysis for new samples is built on the hypothesis that if the sample is embedded
with the right class while rebuilding the DPCCM for sample analysis, the rows of
Msample will not differ significantly as compared to Mmodel. In other words, if the
inter-variable correlations are distinct for each class, then a test sample belonging to
a particular class will be an outlier for other classes and hence will break the
correlation structure for those classes, while retaining the original structure for the
class it belongs to. Since the class specific variable association structure (PCCM) is
designed using correlation between all possible pairs of variables, the effect of
outlier increases with increase in the number of system variables, p while testing for
new samples.
Higher order PCCM, if used with threshold values, identify and eliminate
the indirect relations. This enhances the accuracies when applied to network
inference problem at the expense of computational effort (de la Fuente et al., 2004).
One can start from zeroth order PCCM and gradually improve the network using
higher order PCCM. However, DPCCM uses the full PCCM without eliminating
the entries based on statistical significance of the correlations. The premise is, even
the less significant correlations are necessary components of inter-variable
association structure and can be useful distinguishing factors during the sample
prediction step. The new sample observation belonging to a particular class must
have both, the strong and the weak correlations between variables consistently
appearing in the corresponding row of Msample. If the insignificant variable
correlations in Mmodel become significant in Msample, it will contribute further to the
discriminating ability of the model and hence will improve the classifier
performance. In the present analysis, the algorithm uses different order for DPCCM
to map the attributes. The order which gives the best discriminating results (during
re-substitution test) is utilized as Mmodel for that particular application. This is
attributed to the fact that, for applications where variables are not strongly
correlated, higher order DPCCM may not affect the results positively. On the other
hand, for applications where the variables are highly interdependent, increase in the
order of DPCCM will improve the classification results. The following section
gives a step by step algorithm for DPCCM classification analysis.
3.2.3 DPCCM Algorithm
DPCCM training:
Step 0: Read training data matrix N [n x p; k]. Pre-process to detect and remove
the outlier samples from each group. Select the order (0, 1, or 2) for calculating
PCCM.
Step 1: Split the matrix N [n x p] into Gi (i = 1, 2, …, k) separate group matrices
with orders l1 x p, l2 x p, …,lk x p respectively, where li is number of
observations for the ith class.
Step 2: For each group matrix Gi calculate all possible sets of partial correlation
coefficients using Eq. (1), (2) or (3), depending on the order selected in Step 0.
Store the correlation coefficient array Ri (of length d) of group i as the ith row of
the DPCC Metric. Mmodel is thus a k x d matrix with the ith row comprising the partial
correlation coefficients (of selected order 0, 1 or 2) for class i.
Re-substitution test for optimizing the order: Initiate Ntest = N
Step 3: Select the test dataset, Ntest [m x p; k] for sample prediction. Select a test
sample reading Y [1 x p] and append it as a row to each of the group matrices Gi
in turn, starting with the first group. With Y embedded in each group matrix, repeat step 2
to obtain the rows of the sample DPCC Metric, Msample.
Step 4: Calculate the correlation coefficient between corresponding rows of Mmodel
and Msample.
Step 5: Determine the row ‘i’ (i = 1, 2, …, k) for which the correlation is highest
and classify Y as belonging to that class. Repeat steps 3 to 5 for all test samples
in Ntest.
Step 6: Calculate the percentage of samples in Ntest that are correctly predicted.
Repeat steps 1 to 6 using PCCM order 0, 1 and 2. Optimize the DPCCM order
based on the highest accuracy of prediction.
DPCCM sample testing: Read test set to be predicted, Ntest
Step 7: Select Mmodel for the order optimized in step 6. Repeat steps 3 to 5 with
given test set as Ntest and predict the classes for each sample.
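The training and prediction steps above can be sketched in a few lines. The sketch below is a simplified illustration, not the MATLAB implementation used in this work: it covers only the 0th order PCCM, skips the outlier pre-processing of Step 0 and the order optimization of Step 6, and uses hypothetical toy data and function names (`pccm0`, `train`, `classify`).

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) * (x - mean_a) for x in a)
    s_bb = sum((y - mean_b) * (y - mean_b) for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)

def pccm0(rows):
    """0th order PCCM: correlations of all variable pairs in a fixed order."""
    columns = list(zip(*rows))
    return [pearson(columns[i], columns[j])
            for i, j in combinations(range(len(columns)), 2)]

def train(groups):
    """Steps 1-2: one PCCM vector per class (the rows of Mmodel)."""
    return {label: pccm0(rows) for label, rows in groups.items()}

def classify(groups, m_model, y):
    """Steps 3-5: embed y in every class, rebuild each row, compare with Mmodel."""
    scores = {}
    for label, rows in groups.items():
        sample_row = pccm0(list(rows) + [y])           # row of Msample
        scores[label] = pearson(m_model[label], sample_row)
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two classes, three variables, four samples each.
toy_groups = {
    "A": [(3, 4, 6), (7, 6, 4), (4, 6, 5), (6, 4, 5)],
    "B": [(0, 2, 0), (2, 0, 0), (0, 1, 2), (2, 1, 2)],
}
m_model = train(toy_groups)
label_a, scores_a = classify(toy_groups, m_model, (5, 5, 5))
label_b, scores_b = classify(toy_groups, m_model, (1, 1, 1))
```

The two test points are the class means of A and B. Embedding a point at the class mean leaves that class's correlations unchanged, so its similarity score is 1, while the other class's profile is disturbed. With fewer than three variable pairs (p < 3) the row-to-row correlation is not informative, and a class whose PCCM row is constant would need separate handling.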
3.2.4 DPCCM illustration with Iris data
The concept of the inter-variable correlation metric and the DPCCM algorithm
are illustrated with a well studied dataset on Iris flower classification. This flower
taxonomy dataset, originally studied by Fisher (1936), is available at
http://www.ics.uci.edu/~mlearn/databases/. The dataset consists of 150 Iris flower
samples (n = 150) belonging to three different groups (k = 3; labeled Setosa,
Virginica and Versicolor) with four measurements on each flower (p = 4; Sepal
Length - SL, Sepal Width - SW, Petal Length - PL and Petal Width - PW). For the present
analysis, one sample belonging to the Setosa group is separated for testing (Ntest [1 x 4;
Setosa]) and the remaining 149 samples are used as the training set N [149 x 4; 3].
Figure 3.1 brings out the concept of class specific inter-variable correlation
structures and the working principle of the DPCCM method. We select the 0th order PCCM
measure for comparing different groups. The samples (in N) belonging to each class
are separated and correlations are defined between each pair of variables (as shown
in x-axis of Fig. 3.1) using Eq. (1). Rows of the PCCM metric, Mmodel (shown using
solid lines in Fig. 3.1), represent the six inter-variable correlations for a particular
group of flowers (shown with different markers for each group). As observed, each
group of flowers shows distinct PCCM profile. SL and SW are correlated better in
Setosa group compared to others, whereas SL - PL are highly correlated in
Virginica and Versicolor flowers. Correlations between SW-PL and SW-PW bring
better separation between the three groups. Overall, it is evident that the 0th order
PCCM measure can capture the unique inter-variable patterns in each group and
hence can be utilized to distinguish samples belonging to different groups. The
same set of correlations is re-calculated for all the three groups, by inserting the test
sample Ntest into the respective group data in N. The correlation profiles for the new
sets with the embedded test sample represent the rows of Msample (shown as dashed
lines in Fig. 3.1).
[Figure 3.1: line plot of 0th order partial correlation (y-axis, 0 to 1) against the
variable pairs SL_SW, SL_PL, SL_PW, SW_PL, SW_PW and PL_PW (x-axis). Series: Mmodel
and Msample for SETOSA, VIRGINICA and VERSICOLOR.]
Fig. 3.1 PCCM profiles for IRIS data. Rows of Mmodel (solid lines) and Msample
(dashed lines) for each group (differentiated with different markers) of
flowers are plotted for comparison. Correlation metric profiles in Mmodel and
Msample are similar for SETOSA flower group, indicating the class of the
selected test sample flower. Correlation metric breaks when test sample is
embedded into other two groups due to class mismatch.
The PCCM profile in Msample corresponding to ‘Setosa’ (dashed line with
‘O’ markers) is very similar to the PCCM profile in Mmodel for ‘Setosa’ (solid line
with ‘O’ markers). In contrast, the PCCM profiles for the other two groups in
Mmodel differ significantly from the respective profiles in Msample. The correlations
between corresponding rows of Mmodel and Msample are computed to be 0.9997,
0.7720 and 0.5570 for the ‘Setosa’, ‘Virginica’ and ‘Versicolor’ groups respectively.
Based on this PCC metric similarity score, DPCCM classifies the sample in Ntest as a
‘Setosa’ type flower. It must also be observed that a single sample, when included
during PCCM calculation with another group, disturbs the inter-variable correlations
significantly even if there are 50 other homogeneous samples in that group. For
example, SW-PL and SW-PW correlations are higher in Mmodel, but show lower
correlation values (in Msample) when a non-homogeneous sample is embedded. It is also
interesting to observe that the lower PL-PW correlations in Mmodel become
higher in Msample, establishing the importance of retaining all
correlations in differentiating the groups. We presume that this variation in PCCM
between the Mmodel and Msample profiles is mainly due to the sensitivity of the correlation
measure to an outlier. This difference should be more prevalent for data of higher variable
dimension, as we define more inter-variable correlations. We also tested the
effect of partial correlation order on the distinct PCCM patterns. With the same set
of N and Ntest data, the 1st order PCCM profiles in Mmodel and Msample are correlated
as 1.0000, 0.7804, and 0.5862 for each group respectively. Similar analysis with
2nd order PCCM gives group wise correlations as 1.0000, 0.9477 and 0.7768.
Comparing the inter-group differences in these Mmodel-Msample similarity scores for
each PCCM order, we can conclude that 0th order inter-variable correlations provide
highest distinction between groups for Iris data. With these encouraging
observations, we further explore the extension of DPCCM classification method to
different chemometrics problems and compare its classification performance with
other established classifiers.
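The comparison of orders can be made concrete with the similarity scores reported above. The sketch below uses one possible reading of "distinction between groups", namely the gap between the highest and the second highest similarity score; this margin is an assumption for illustration and is not a measure defined in the thesis.

```python
# Mmodel-Msample similarity scores reported in the text, per PCCM order.
similarity = {
    0: {"Setosa": 0.9997, "Virginica": 0.7720, "Versicolor": 0.5570},
    1: {"Setosa": 1.0000, "Virginica": 0.7804, "Versicolor": 0.5862},
    2: {"Setosa": 1.0000, "Virginica": 0.9477, "Versicolor": 0.7768},
}

def margin(class_scores):
    """Gap between the best and the second best similarity score (assumed measure)."""
    ranked = sorted(class_scores.values(), reverse=True)
    return ranked[0] - ranked[1]

predicted = {order: max(s, key=s.get) for order, s in similarity.items()}
margins = {order: margin(s) for order, s in similarity.items()}
most_distinct_order = max(margins, key=margins.get)
```

All three orders assign the test sample to Setosa. The margins are roughly 0.228, 0.220 and 0.052 for orders 0, 1 and 2, so under this reading the 0th order separates the groups most clearly.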
3.2.5 Other classifiers used for comparison
The DPCCM technique is applied to two case studies and the results are
compared with those from established classification algorithms like LDA, CART,
Treenet and SVM. These methods are discussed in detail in Chapter 2. However, a
brief description of each method is given below to provide the reader with ready
information.
LDA (Duda et al., 2000) is the most commonly used linear classifier,
developed by Fisher (1936). The LDA classifier provides a linear boundary
separating the two classes. The classification function for this boundary is designed
by maximizing the ratio of inter-group variance to the intra-group variance of the
projected scores. It has advantages such as being quick and accurate for linearly
separable classes but performs poorly for data with overlapping class profiles.
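The ratio that LDA maximizes can be written compactly. The notation below (projection vector w, between-group scatter matrix S_B and within-group scatter matrix S_W) is the common textbook form and is introduced here only for illustration; the description above states the criterion in words.

$$ J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_{B}\, \mathbf{w}}{\mathbf{w}^{T} S_{W}\, \mathbf{w}} $$

The numerator is the inter-group variance of the projected scores and the denominator is their intra-group variance.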
CART (Breiman et al., 1983) is a decision tree based classifier, also
called the binary recursive partitioning method. A classification tree is built by splitting
the data into two branches using the best attribute or variable as the separator variable
(node). The best attribute used to define a decision rule on a node is prioritized
based on one of the impurity measures such as the Gini index or entropy, or using the
‘twoing’ method (Kurt et al., 2008). The split node is called the “parent node” and
the resulting nodes are called “child nodes”. The splitting process is continued from top to
bottom and tree construction stops at the ‘terminal nodes’, which contain data
samples of a nearly homogeneous class. The advantages of the CART algorithm
include: (i) easily interpretable and implementable rules, (ii) very little data
pretreatment, (iii) ability to handle both numerical and categorical data and (iv)
ability to handle missing data. On the other hand, it can overfit a classifier model to the
training data, especially for high dimensional datasets with multivariate interactions
between variables.
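As a small illustration of the impurity measures mentioned above, the sketch below computes the Gini index and the entropy of a node from its class counts. The function names and the example counts are hypothetical; the formulas are the standard definitions.

```python
import math

def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a node in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

mixed_node = [5, 5]    # two classes equally represented
pure_node = [10, 0]    # homogeneous node
```

Both measures are zero for the pure node and largest for the evenly mixed node, which is why a split that lowers them produces more homogeneous child nodes.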
Treenet (Freidman, 1999) is a network of several hundred small decision
trees with each of them having a small contribution in building the overall model
(Raj Kiran and Ravi, 2008). Each minimal tree usually has fewer than 8 terminal
nodes. Apart from having the advantages of the CART algorithm, the Treenet approach
provides a more generalizable classifier model. Treenet approach has been
successfully used mainly in financial data analysis and recently for soil
characterization (Brown, 2007).
SVM (Vapnik, 1995) is established as the most advanced and robust
classifier for many applications ranging from character recognition to cancer
diagnosis. It provides an effective tool to distinguish non-linearly separated classes
(overlapping or embedded classes). It projects the original feature vectors onto a
new, linearly separable vector space using variable transformation functions called
‘kernels’. Since it finally uses only the support vector features in the projected
space the SVM model is almost independent of the number of attributes in the
original data. Hence its performance is easily scalable, giving it immediate
advantage compared to other methods especially for complex classification
problems (p >> n). On the other hand, it demands considerable computational effort for datasets
with a large number of samples in N. Such cases require an increased number of
support vectors for classification, further complicating the rigorous optimization
algorithms employed during model building. As it is basically a binary classifier, its
extension to multi-class problems needs additional mathematical formulation.
Compared to the above methods, the DPCCM approach proposed in
this chapter provides a new classifier which does not seek decision boundaries,
analyzes the data in the original variable space without having to employ any iterative
optimization algorithms and is able to handle multi-class problems directly.
Hence the new approach attempts to eliminate most of the limitations associated
with existing methods as discussed above. The classification performance of the
DPCCM method is established in comparison with the existing methods based on
the validation tests explained below.
3.2.6 Validation methods
Once the classifier model is developed using any of the techniques, the
model is validated using test data. Detailed information on the validation
techniques is given in section 2.3. However, brief information on the
validation methods employed in this chapter is given below to give the reader
ready access to the concept.
Two different classifier testing methods are used to compare the
performances of different techniques. The performances on the selected datasets are
compared based on the percentage of correct classification, both for individual
classes and for overall classification. The concepts and algorithm steps discussed in
this chapter can be used for further investigations and evaluation using other
performance measures available in literature (Baldi et al., 2000).
3.2.6.1 Re-Substitution Test
All samples in the training dataset are re-substituted back into the model as
validation samples. This test is commonly used to check the self-consistency of the
classifier. However, it is a test that does not provide the right indication of the
classifier’s ability in correctly classifying new data samples (i.e. those that are not
used during training).
3.2.6.2 Random Sample Validation Test
A fixed percentage of training samples is randomly selected and set aside (to
serve as test samples) and the remaining data is used to design the classifier. Then
the smaller pre-selected subset of test sample is used for classifier verification. The
prediction accuracy is evaluated only on the sample test data. This “split-train-test”
procedure is repeated several times and the average of accuracies in these runs is
reported. This type of validation assesses the stability (Duda et al., 2000) of the
algorithm, in the sense that the effect of new data points on the performance of the
classifier is considered.
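The split-train-test procedure can be sketched as follows. This is an illustrative outline only: the function names, the `fit_predict` callback interface, the fixed random seed and the use of the population standard deviation are assumptions, since the text does not specify these details.

```python
import random
import statistics

def holdout_size(n, test_percent):
    """Number of samples set aside for testing (rounded down)."""
    return n * test_percent // 100

def repeated_holdout(samples, labels, fit_predict, test_percent=20, runs=100, seed=0):
    """Average accuracy (in %) and its spread over repeated random splits.

    fit_predict(train_x, train_y, test_x) must return one predicted label
    per test sample; any classifier can be wrapped in this form.
    """
    rng = random.Random(seed)
    n = len(samples)
    n_test = holdout_size(n, test_percent)
    accuracies = []
    for _ in range(runs):
        order = list(range(n))
        rng.shuffle(order)
        test_idx, train_idx = order[:n_test], order[n_test:]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [samples[i] for i in test_idx]
        predicted = fit_predict(train_x, train_y, test_x)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(100.0 * correct / n_test)
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

With a 20% test fraction, `holdout_size` gives 35 test samples for a 178-sample dataset and 12 for a 60-sample dataset.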
3.3 MATERIAL
3.3.1 Datasets
Though the algorithm explained in section 3.2 can be in general applied to
any chemometric classification problem, we demonstrate its specific application to
food quality monitoring. Two important food product characterization datasets are
presented here as case studies to implement and analyze the performance of new
classifier.
Case study I: Wine classification data (WINE)
The wine product quality recognition data (available at
http://www.ics.uci.edu/~mlearn/databases/wine/) (Asuncion and Newman, 2007)
provides a significant chemometrics classification problem to benchmark the new
method. This problem is also statistically challenging as, in this dataset, the samples
are not uniformly distributed among the different classes. Beltrán et al. (2006) used
LDA, QDA, PNN and ANN to characterize a similar dataset on Chilean wines with
spectral measurements. The samples in the dataset are obtained from chemical
analysis of 178 wine samples, produced in the same region in Italy but derived from
three different cultivars (3 class problem). The quantities of 13 constituents
(features) found in each of the three types of wines are analytically measured as
descriptors. De-noised and well-processed observational data is used for training
the classifier model in order to classify the given unknown sample into one of the
three classes of wines. 20% of the 178 samples, selected randomly from the original
data, are set aside for cross validation. Thus, the system used for analysis is N ~ [n
= 143 x p = 13; k = 3] and Ntest ~ [m = 35 x p = 13; k = 3].
Case study II: Cheese classification data (CHEESE)
A food quality characterization dataset studied by Granitto et al. (2007) is
used as the second experimental dataset. This dataset with multiple classes, higher
number of attributes and fewer samples in each group is a challenging classification
problem. It also tests the feasibility of using DPCCM approach to difficult
chemometrics applications.
The dataset consists of 60 samples from 6 classes of
Nostrani cheese (10 samples each class). They are ‘‘Puzzone di Moena”, ‘‘Spressa
delle Giudicarie”, ‘‘Vezzena”, ‘‘Nostrano del Primiero”, ‘‘Nostrano della Val di
Non” and ‘‘Nostrano della Val di Sole”. There are 35 sensory attributes (based on
physical, chemical and visual characteristics of cheese samples) measured for each
sample. Thus, the system considered for classification is N ~ [n = 48 x p = 35; k =
6]. For cross validation analysis, 20% of the given data (60 samples) is separated
and used as test data: Ntest ~ [m = 12 x p = 35; k = 6].
3.3.2 Implementation
The DPCCM algorithm discussed in section 3.2.3 was coded and executed
in MATLAB (MATLAB, 2005). The order of PCCM to be used during DPCCM
analysis is provided as input parameter. Built-in MATLAB functions are used for
LDA and CART algorithms. A separate MATLAB code provided at http://asi.insarouen.fr/~arakotom/toolbox/index.html by Canu et al. (2005) was used for multiclass SVM analysis. Treenet classification result is obtained using TreeNet®
software developed by Salford Systems (USA) (Freidman, 1999; Salford Systems,
2007a).
Partial correlations of order 0, 1 and 2 are attempted to verify the efficiency
of DPCCM. The order which gives best classification result (during re-substitution
test) is selected for further analysis. No parameters were tuned for LDA except that
‘diagonal’ LDA was adopted whenever the datasets were non-positive definite.
Cost criteria were adjusted during model building using CART and Treenet. The
cost function with best re-substitution result was adopted for cross validation
performance test. Simple RBF (Radial Basis Function) was used for SVM kernel
with polynomial coefficient c and γ as tuning parameters during training.
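For reference, a common form of the RBF kernel is sketched below. The exact parameterization used by the SVM toolbox is not given in the text, so the form exp(-γ‖x - y‖²) and the function name `rbf_kernel` are assumptions for illustration only.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity between two feature vectors: exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far_point = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

The kernel equals 1 for identical samples and decays towards 0 as samples move apart; a larger γ makes the decay faster, which is why γ needs tuning during training.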
3.4 RESULTS
Results for the above case study problems are presented in Tables 3.1 and
3.2 respectively. Percentage correct predictions for individual classes are shown in
the first few columns of the Table (with column labels as ‘class’ followed by class
number). Overall classification results are indicated in the last column with the
percentage of test samples that are correctly classified. For cross validation test, the
results shown are average prediction accuracy over 100 experiments for each class
along with standard deviation for the overall prediction accuracy. DPCCM
performances for selected order are indicated as DPCCM(order). Results shown for
comparison methods are obtained using the datasets, N and Ntest, identical to that
used for DPCCM during the two tests.
Table 3.1
Classification result for case study I (WINE classification)

Test type          Method      class 1   class 2   class 3   overall
Re-substitution    LDA         100       100       100       100
                   CART        96.61     97.18     97.92     97.19
                   Treenet     100       100       100       100
                   SVM         100       100       100       100
                   DPCCM(0)    91.52     100       97.92     96.63
                   DPCCM(1)    96.61     100       97.92     98.32
                   DPCCM(2)    100       100       100       100
Cross validation   LDA         100       97.07     99.44     98.65 ± 2.02 a
                   CART        92        87.29     93.67     90.91 ± 4.93
                   Treenet     99.15     94.3      100       97.44 ± 0.67
                   SVM         99.23     98.00     95.11     97.65 ± 2.4
                   DPCCM(2)    94.55     100       100       98.23 ± 1.52

a Overall accuracy is reported as average accuracy over 100 iterations ± standard deviation
Table 3.1 indicates the comparative performance of DPCCM for WINE
data. For the re-substitution test, DPCCM has learnt the variable interactions and
modeled the classes distinctly with 2nd order PCCM, predicting all the samples
correctly. The improvement in performance with increase in the order of partial
correlations indicates the presence of multivariate interactions and indirect
relationships between the variables. Hence, second order partial correlation based
classification, DPCCM(2), is used during cross-validation tests. LDA, Treenet and
SVM also provide complete classification accuracy in the re-substitution test.
Decision rules using conditions on
numerical values of the variables can lead to classifier over-fitting, as observed in
the case of CART, whose cross validation result is considerably poorer than its
re-substitution result. The re-substitution and cross
validation results do not differ significantly for DPCCM, indicating the
stability of the new method. For this dataset with non-uniform class sample
distribution, the DPCCM method has provided performance matching that of well
established methods like SVM and Treenet.
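Because the classes are not equally sized, the overall re-substitution accuracy in Table 3.1 is a sample-weighted figure rather than the plain average of the class columns. The sketch below reproduces two of the overall values under the assumption that the three classes contain 59, 71 and 48 samples, as in the publicly distributed WINE data; these class sizes are not stated in this chapter.

```python
CLASS_SIZES = (59, 71, 48)  # assumed class sizes of the 178-sample WINE data

def overall_accuracy(per_class_percent, class_sizes=CLASS_SIZES):
    """Overall accuracy (%) from per-class accuracies and class sizes."""
    correct = [round(pct / 100.0 * size)
               for pct, size in zip(per_class_percent, class_sizes)]
    return round(100.0 * sum(correct) / sum(class_sizes), 2)

dpccm0_overall = overall_accuracy((91.52, 100, 97.92))
cart_overall = overall_accuracy((96.61, 97.18, 97.92))
```

Under this assumption DPCCM(0) corresponds to 172 of 178 samples correctly classified (96.63%) and CART to 173 of 178 (97.19%), matching the overall column of Table 3.1.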
Table 3.2
Classification result for case study II (CHEESE classification)

Test type          Method      No      Pr      Pu      So      Sp      Ve      overall
Re-substitution    LDA         100     100     100     100     100     100     100
                   CART        100     80      100     100     100     90      95
                   Treenet     100     100     100     100     100     100     100
                   SVM         100     100     100     100     100     100     100
                   DPCCM(0)    100     90      100     100     100     100     98.33
                   DPCCM(1)    100     100     100     100     100     100     100
                   DPCCM(2)    100     100     100     100     100     100     100
Cross validation   LDA         78.5    81.5    86      53      100     64.5    77.33 ± 10.33 a
                   CART        77      57.5    53.5    31      98.5    44.5    61.67 ± 9.91
                   Treenet     87      66      73.5    34.5    94.5    49.5    67.50 ± 4.21
                   SVM         96      76      66      74      100     86      83.00 ± 10.83
                   DPCCM(1)    100     70      90      70      100     70      83.33 ± 7.85

a Overall accuracy is reported as average accuracy over 100 iterations ± standard deviation
For CHEESE dataset, the classification results are outlined in Table 3.2.
During re-substitution test, DPCCM performance improved with 1st and 2nd order
partial correlation. This indicates multivariate dependencies between variables
which characterize the heterogeneity between different classes of product. To keep
the computational effort low, DPCCM(1) was used during cross-validation tests.
DPCCM and SVM methods provide the least error during cross-validation test. All
the classes are learnt and predicted quite accurately during the random sample
testing. 12 samples randomly selected from original set are used as Ntest set during
cross-validation runs and DPCCM on an average always predicts 10 of them
correctly (~83% accuracy). The standard deviation for the method is also smaller
compared to LDA, CART and SVM which establishes the robustness of the
method. The new approach provides improvement over the original study carried
out on cheese dataset (Granitto et al., 2007) using Random Forest (77.1±11.1) and
DPLS (74.3±13) classification approaches. Methods like LDA and CART provide
relatively poor performance for cross validation test indicating the inability of these
methods to effectively discriminate overlapping classes.
Another important advantage of this approach is that the variables are
observed in their measured state and are not projected on the new space as in PCA,
DPLS or SVM. Hence, it will be easier to achieve a straightforward investigation
based on meaningful physico-chemical influence of variables on different quality of
products. DPCCM approach provides a good visualization of intra-class variable
associations and inter-class dissimilarities in correlation patterns based on original
variables themselves. Fig. 3.2 shows variable correlation shade map for each group
in CHEESE dataset. We can observe that each type of cheese sample is
characterized by a pattern of variable correlations. For example, type 1 cheese (No)
has strong association between variables Aci, Ama and Pic (Granitto et al., 2007),
whereas for type 2 (Pr) cheese, good correlation exists between sample variables
Ar, Fru and Ade. These plots not only provide class specific important features but
also indicate how distinct the classes are and possibility of class overlapping.
Cheese type 1 (No) and type 5 (Sp) look similar in their association whereas type 2
(Pr) and type 3 (Pu) form similar variable interaction profiles. Such information can
be effectively used in sensor selection to select important variables for quality
analysis of particular type of product.
[Figure 3.2: six gray-scale correlation maps, one panel per cheese class (Type 1 to
Type 6); both axes of each panel index the 35 measured variables.]
Fig. 3.2 Variable correlation shade map for each class in CHEESE classification
dataset. Each of the 35 measured variables (as columns) is correlated with
all the other variables (as rows). The white shade implies full correlation (r
= 1) and black color indicates no correlation (r = 0) and other gray shades in
between. All the diagonals are white representing the self correlation for
each variable. Each type of cheese sample shows distinct inter-variable
association patterns.
It must be highlighted that DPCCM addresses the multiclass multivariate
classification problem with one PCCM model for each class without seeking any
decision boundary (unlike LDA), working only with the correlations between
variables (independent of scale of the measurements) and without projecting the
variables on new descriptor space (unlike binary SVM classifier). Another
important factor in which DPCCM scores over other methods is its simplicity in
implementation without having to tune many parameters (except selecting the
optimum order of partial correlation based on three re-substitution runs). DPCCM
does not employ rigorous optimization algorithms. Hence, if the system considered
has distinct inter-variable correlation structure for different classes (which are more
likely to occur in high dimensional, multivariate chemometrics applications) the
DPCCM approach offers an efficient classification tool.
It must be pointed out that for high dimensional data with higher order
conditional dependencies between variables (for example characterization using
spectral measurements), the computational time can increase significantly. In our
observation on a desktop computer (with 2.4GHz CPU and 2 GB RAM), 0th order
DPCCM is as fast as LDA for any application and higher order DPCCM can train
and test samples within 20 seconds for systems with 100 variables. For
classification problems with p > 100, one can implement DPCCM in conjunction
with suitable variable selection algorithms (Raghuraj Rao and Lakshminarayanan,
2007a). The performance of the DPCCM classifier may also be affected if a few
classes in the system exhibit similar inter-variable associations or no correlations at
all. This singular situation may not arise in chemometrics applications where
different physical, chemical and visual measurements and unique association
patterns between them are often the basis of specific characteristics of the system.
With further improvements like incorporating nonlinear correlation measures,
selecting different order PCCM for different classes and incorporating significance
of correlations during classifier development, DPCCM promises to be a powerful
tool for solving complex classification problems.
In this chapter, DPCCM performance is analyzed using the two
classification case studies and is compared with well established classifiers.
DPCCM performs better than the linear classifiers and comparably to non-linear SVM
classifiers. This new method can potentially eliminate some of the limitations of
existing methods and also provides good visualization for understanding the
specific variable interactions contributing to the nature of each class.
Chapter 4
Analysis of Biomedical Data
Be as smart as you can, but remember that it is always better to be
wise than to be smart
Alan Alda (1936)
4.1 INTRODUCTION
Machine learning applications for medical purposes have received
considerable attention (Magoulas and Prentza, 2001). Integration of machine
learning techniques into medical environment has enhanced the accuracy and
reliability of medical diagnosis resulting in improved patient care. This is mainly
because many medical problems, especially those which are related to classification
of samples into their corresponding class based on measurement of certain
attributes, can be handled well by machine learning techniques. Some
machine learning applications in medicine include early screening for gastric and
oesophageal cancer (Liu et al., 1996), lung cancer cell identification (Zhou et al.,
2002; Polat and Günes, 2008), classification of normal and restrictive respiratory
conditions (Mahesh and Ramakrishnan, 2007), classification for personalized
medicine with high dimensional data (Moon et al., 2007), breast cancer diagnosis
(Sahan et al., 2007) and artery disease (Kurt et al., 2008). Here, other important
areas of medical application, such as prediction of the depth of anesthesia, heart
disease and breast cancer identification that can largely benefit by classification
approaches (Linkens and Vefghi, 1997; Mahfouf, 2006; Sharma and Paliwal, 2008)
are addressed.
Anesthesia is usually employed as part of surgical procedures to remove all
sensations of pain. During surgery, the dose and infusion rate of the anesthetic drug
have to be controlled to maintain the depth of anesthesia (DOA) at a level that is safe for
the patient as well as deep enough to remove the sensation of pain. Many studies
have established a well-controlled anesthesia with PID or other advanced
controllers (e.g. adaptive controllers) (Elkfafi et al., 1998; Jiann Shing et al., 1999).
However, these controllers need a good estimate of the patient’s DOA level to
decide on the right dosage of anesthetic drug to be administered. Therefore, the
determination of the correct DOA level is a crucial factor in obtaining a well
controlled anesthesia.
In this study, DOA level is determined using classification techniques.
Patient data such as recorded auditory evoked potential (AEP) features and
cardiovascular features, together with the known DOA level (awake, Ok/light,
Ok, and Ok/deep as determined by the anesthesiologist), available from published
literature, are used to build the classification models. Once constructed, the
classifiers can be used to classify the DOA level reliably into the four classes based
only on AEP or cardiovascular measurements (Nayak and Roy, 1998; Nunes et al.,
2005). The classification analysis is separately carried out using two different DOA
datasets, one using AEP features and the other using cardiovascular features which
include heart rate (HR), systolic arterial pressure (SAP) and mean arterial pressure
(MAP). These two independent datasets provide distinct patient samples to train
and test the classifier models. They also facilitate the selection of important features
in the data set for reducing the complexity of the classifiers and/or improving the
accuracy of DOA classification.
The second case study considered here concerns breast cancer identification.
According to the US Cancer Statistics Working Group (2007), breast cancer is the most
common cancer diagnosed in women and is the second leading cause of cancer
death among women in the US. However, this cancer has a high chance of being cured.
Jerez-Aragonés et al. (2003) noted that 97% of breast cancer patients survive for
five years if the cancer is detected and treated early. This fact highlights the
importance of early detection of cancer followed by early treatment. Some past
studies have shown that machine learning methods can play an important role in
these efforts (Bagui et al., 2003; Hong and Cho, 2008; Liu and Huang, 2008). Using
information provided by some measurable cell attributes or microarray data
information from many normal cells and cancer cells, machine learning methods are
able to build classification models. When a patient comes to the hospital for
diagnosis, the doctor only has to extract some cells and process them with a microarray
analyzer. The results obtained from microarray analysis are then processed by the
classification model to determine the existence and severity of cancer in the patient.
In medical data analysis, especially those related to illness identification,
patient misclassification may have a fatal impact. For example, when people with
cancer are classified as being healthy, they will receive no cancer treatment.
This may then increase illness severity and may even lead to death. Therefore, to
make this study reliable, classifier performance is compared based not only on
overall accuracy, but also based on class-wise performance. Two available online
datasets, namely the Wisconsin Diagnostic Breast Cancer dataset (WDBC) and
Wisconsin Breast Cancer dataset (WBC), are used in the case studies for breast
cancer classification/identification in this chapter.
The final case study that will be covered in this chapter is on heart disease
identification. American Heart Association records show that heart disease has
become the leading cause of death in the United States and indeed in most of the
developed countries. Therefore, it will be of interest to check the capability of data
analysis techniques in correctly classifying patients with heart disease. Such early
detection (based on classification techniques) can help in initiating timely medical
treatment and in reducing heart-related deaths.
In this study, we process the data collected on some patient attributes using
classification techniques to distinguish patients with heart disease from normal
people. The results are then compared to obtain the most suitable classifier for heart
disease identification. Even here, type 1 misclassification case, wherein a patient
with heart disease is classified as “healthy”, has to be kept as low as possible. As a
result, comparison of classifier performance is done based on both overall accuracy
and class-wise performance.
There are many classification techniques available in machine learning
literature that can be attempted to solve all these problems. These techniques
provide different advantages but also have data-specific limitations. The main
objective of the present study is to find out the best classifier to predict DOA level
and identify the existence of cancer and heart disease through a performance
comparison of some popular classification methods.
4.2 METHODS
4.2.1 Classification Methods
In this study, ANN, TreeNet, CART, LDA, and VPMCD or DPCCM were
used to predict the DOA level during surgery and for cancer identification. Their
performances are then compared with each other to decide the best classifier for each
case. Detailed information on these techniques is given in
section 2.1 - section 2.5 and section 3.2.3.
4.2.2 Variable Selection Methods
Just as in regression models, the classifier models can benefit from the
selection of important variables. When the classifier is constructed based on a
subset of original variables, it helps to reduce the complexity of the model and
computational effort without compromising on classifier performance. Non-inclusion of certain nuisance variables (characterized by high noise and without
having any discriminating value) can even enhance the performance of the classifier
(Flores et al., 2008). In this work, several variable selection methods are used to
rank the predictor variables according to their importance in classifying the
samples. Once this ranking is available, the final classifier is built using only the
most important variables (here, we choose the top 50% of the variables after
ranking them using different variable selection methods) and the
improvement in classification accuracy is examined without any parameter re-tuning.
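
The rank-then-truncate step described above can be sketched as follows. This is only an illustration in Python (the codes used in this work were written in MATLAB). The scoring function `variance_ratio` is a univariate between-class to within-class variance ratio chosen here purely as a stand-in for whichever ranking method is applied, and all function names and the toy data are hypothetical.

```python
from statistics import mean


def variance_ratio(values, labels):
    """Univariate score: between-class over within-class sum of squares.

    Illustrative stand-in for a variable ranking criterion (an assumption,
    not the ranking code used in the thesis).
    """
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        centre = mean(group)
        between += len(group) * (centre - overall) ** 2
        within += sum((v - centre) ** 2 for v in group)
    return between / within if within > 0 else float("inf")


def top_half(columns, labels):
    """Return the names of the top 50% of variables, best first."""
    scores = {name: variance_ratio(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, len(ranked) // 2)
    return ranked[:keep]


# Hypothetical toy data: four variables, two classes.
labels = [1, 1, 1, 2, 2, 2]
columns = {
    "x1": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # separates the classes well
    "x2": [5.0, 4.0, 6.0, 5.1, 4.1, 6.1],  # mostly noise
    "x3": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],  # moderate separation
    "x4": [2.0, 2.0, 2.1, 2.0, 2.1, 2.0],  # nearly constant across classes
}
selected = top_half(columns, labels)
print(selected)
```

The classifier would then be rebuilt on the selected columns only, keeping the previously tuned parameters unchanged.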
The first method uses the Fisher criterion (FC) to rank the variables. The Fisher
criterion is defined as the ratio of the “between-class” variance to the “within-class” variance
(Wang et al., 2008). This criterion is maximized by LDA (Duda et al., 2000) to
identify the best separation plane by weighting predictor variables. The Fisher ranking
method basically uses these weights to rank the variables. In order to check the
independent effect of each variable on classification, we have also adopted a single
variable ranking (SVR) approach. In this univariate approach, only a selected (single)
predictor variable is used to build an LDA model which is then tested to
determine the classification accuracy. This LDA model building and testing is
independently repeated for all the predictor variables so that the classification
accuracy for each variable is obtained. The variables are then ranked based on the
prediction accuracy values. FC and SVR approaches provide good measures of
variable influence on classification in line with the principle of LDA classification.
To establish similar advantage for other classifiers working on different principles,
we adopt two other variable selection methods: the entropy measure, which is useful for
CART, and the partial correlation based variable selection approach (PCCM), which can
potentially influence the variable interaction based approaches of VPMCD and ANN.
For entropy method, variables are ranked based on their entropy measures
(Ebrahimi et al., 1999) signifying the randomness in data for that variable. In
PCCM method, the partial correlation coefficients of orders 0, 1 and 2 are
calculated between different pairs of variables and the resulting multivariate
associations (in the form of edges on a node in the association network) is used as a
basis for variable ranking (Raghuraj Rao and Lakshminarayanan, 2007a). Though
these specific techniques can potentially influence particular classifier
performances, we analyze the performance of all variable selection methods with all
the classifiers. This is mainly to achieve the objective of selecting the best
combination of variable selection method and the classifier.
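
The partial correlation coefficients that the PCCM ranking relies on can be illustrated with the standard first-order formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below is only an illustration in Python (the codes used in this work were written in MATLAB). The function names and the toy data are hypothetical, and the step that converts the resulting associations into a variable ranking (Raghuraj Rao and Lakshminarayanan, 2007a) is not reproduced here.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Zeroth-order (ordinary Pearson) correlation between two variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr_order1(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical toy data: x and y are both driven by z plus unrelated noise,
# so they look strongly correlated until z is accounted for.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
e1 = [1, -1, -1, 1, 1, -1, -1, 1]
e2 = [1, 1, -1, -1, -1, -1, 1, 1]
x = [zi + 0.5 * a for zi, a in zip(z, e1)]
y = [zi + 0.5 * b for zi, b in zip(z, e2)]

r0 = pearson(x, y)
r1 = partial_corr_order1(x, y, z)
print(f"order 0: {r0:.3f}, order 1 given z: {r1:.3f}")
```

In this toy case the zeroth-order correlation is high while the first-order coefficient is essentially zero, which is the kind of indirect association that higher order partial correlations are meant to filter out.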
4.3 MATERIALS AND IMPLEMENTATION
4.3.1 Datasets
4.3.1.1 Anesthesia Dataset
The problem of classifying and predicting DOA level (Mahfouf, 2006) can
be attempted using either AEP features or cardiovascular features as predictors. The
difference in the number of samples for each class makes classification difficult and
challenging for this dataset. The analysis is done separately using two different
datasets obtained from Prof. Mahfouf (Nunes et al., 2005; Mahfouf, 2006). The
datasets were collected at the Royal Hallamshire Hospital in Sheffield, UK. Readers
are directed to Nunes et al. (2005) and Mahfouf (2006) for further information about the
datasets. The first one uses 10 AEP features while the second one uses 3
cardiovascular features (heart rate (HR), systolic arterial pressure (SAP) and mean
arterial pressure (MAP)). The classification problem involves four classes i.e. DOA
levels (awake, Ok/light, Ok, and Ok/deep) and consists of 414 samples which
correspond to the number of patients during the surgery. Thus the classification
datasets considered are N1 ~ [n = 414 x p = 10; k = 4] for AEP features dataset and
N2 ~ [n = 414 x p = 3; k = 4] for cardiovascular features dataset.
4.3.1.2 Wisconsin Breast Cancer (WBC) dataset
The WBC dataset collected by Wolfberg and Mangasarian (1990) is available
online in a public domain database (Asuncion and Newman, 2007). Based on
9 cell attributes such as mitoses, clump thickness and so on, cells are
classified into 2 classes (malignant cancer cells and benign cancer cells). There
are a total of 699 samples in this dataset, with 65.5% of them being benign cell samples and
the remaining 34.5% malignant cell samples. Data are missing in 16
patient records; these 16 samples are therefore excluded from the analysis. The size
of this system is M ~ [683 samples x 9 predictors; 2 classes].
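
The exclusion of incomplete records can be sketched as below. This is a minimal illustration only: it assumes that a missing entry is marked by "?" in the raw file, and the function name and the four toy records are hypothetical.

```python
def drop_incomplete(records, missing_marker="?"):
    """Keep only the records that contain no missing attribute value."""
    return [row for row in records if missing_marker not in row]


# Hypothetical toy records (attribute values as read from a text file).
records = [
    ["5", "1", "1", "2"],
    ["8", "?", "4", "4"],
    ["3", "1", "?", "2"],
    ["4", "2", "1", "2"],
]
complete = drop_incomplete(records)
print(len(records), "->", len(complete))
```

Applied to the WBC data, this kind of filter corresponds to the reduction from 699 to 683 samples stated above.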
4.3.1.3 Wisconsin Diagnostic Breast Cancer (WDBC) dataset
The WDBC dataset collected by Wolfberg and Mangasarian (1990) is available
for public use at http://archive.ics.uci.edu. Compared to WBC, this dataset has
more attributes but fewer samples. Thirty real-valued attributes from 569
samples (357 samples taken from benign cells and the remaining samples taken from
malignant cells), without any missing values, are archived on the website. The size of
this system is P ~ [569 samples x 30 predictors; 2 classes].
4.3.1.4 Heart Disease dataset
Heart disease dataset relates to a 2 class problem. It consists of 13 attributes
from 270 observations (150 patients not having heart disease and 120 patients
having heart disease). This dataset is analyzed to classify patients with heart disease
and without heart disease and the result is then used to predict the presence of heart
disease in new patients. The total system size is O ~ [270 samples x 13 predictors; 2
classes]. The existence of 4 different types of attributes adds some challenges in
analyzing this dataset. Class 1 denotes the absence of heart disease and class 2 denotes
the presence of heart disease. No missing values exist in this dataset.
4.3.2 Implementation
Since our classification models need to be validated, the data were split
into a training set and a test set. The training set is used to build the classifier
models, and all classifiers were built on the same training datasets. The test set is
kept separately and is used only for (pure) validation of classifier performance. It is
not used during modeling or parameter tuning. Prior to analyzing the
anesthesia data, the cardiovascular features dataset was randomly divided into a
training set (2/3 of data), M ~ [n = 276 x p = 3; k = 4], and a test set (1/3 of data), S ~
[n = 138 x p = 3; k = 4]. A similar 2/3-1/3 split was performed on the AEP
features dataset (p = 10). However, the breast cancer (WBC and WDBC) and heart disease
datasets are split differently. A training set which consists of 80% of total samples
is used to build the classification model and the other 20% is kept for validation
purpose.
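
A random split of this kind can be sketched as follows. This is an illustration in Python rather than the MATLAB code used in this work; the function name, the seed and the use of index lists are assumptions made for the example.

```python
import random


def random_split(n_samples, train_fraction, seed=0):
    """Randomly partition sample indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


# Anesthesia datasets: 414 samples, 2/3 for training and 1/3 for testing.
train_idx, test_idx = random_split(414, 2 / 3)
print(len(train_idx), len(test_idx))  # 276 138
```

Calling the same function with `train_fraction=0.8` gives an 80%-20% split of the kind used for the breast cancer and heart disease datasets.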
4.3.3 Model Development
Every classification method has its own set of user-defined parameters. The
performance of any method depends significantly on identifying suitable values for
these tuning parameters. To this end, the training set is divided randomly into M1 ~
80% of training set and M2 ~ 20% of training set. M1 is used to build a model (with
a certain choice for parameters) followed by validation on M2. The data split, model
building and validation are repeated 50 times for each classifier. This procedure is
executed with different parameter values and the best parameters are chosen based
on the optimization of specific criteria e.g. high classification accuracy. Such
parameter tuning has been done for all classifiers used in this work. The mean (µ)
and standard deviation (σ) of classification accuracies (over 50 iterations) are
calculated for the best model for each method and the coefficient of variation (CoV)
is calculated using the following equation.
CoV = 100 × σ / µ    (4.1)
If the CoV value is less than 20%, the classifier with the tuned parameters is
considered for further analysis. If not, the parameter tuning step is repeated for this
classifier until stable model parameters are obtained. All steps in model
development are done for cardiovascular parameters dataset, AEP features dataset,
WBC dataset, WDBC dataset and heart disease dataset.
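
The repeated M1/M2 validation and the stability check based on Equation 4.1 can be sketched as follows. This is a hedged illustration: the callback `fit_and_score`, the majority-class rule used as a placeholder classifier and the toy labels are all hypothetical, and the use of the population standard deviation for σ is an assumption, since the text does not state which estimator was used.

```python
import random
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CoV = 100 * sigma / mu (Equation 4.1), with the population sigma."""
    return 100.0 * pstdev(values) / mean(values)


def repeated_holdout(n_samples, fit_and_score, n_repeats=50,
                     train_fraction=0.8, seed=0):
    """Repeat a random M1/M2 split and collect the accuracy of each repeat."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        cut = round(n_samples * train_fraction)
        accuracies.append(fit_and_score(indices[:cut], indices[cut:]))
    return accuracies


# Placeholder for "build on M1, validate on M2": a majority-class rule
# applied to hypothetical toy labels.
labels = [4] * 60 + [3] * 25 + [2] * 15


def majority_rule(m1, m2):
    train_labels = [labels[i] for i in m1]
    guess = max(set(train_labels), key=train_labels.count)
    return 100.0 * sum(labels[i] == guess for i in m2) / len(m2)


accs = repeated_holdout(len(labels), majority_rule)
cov = coefficient_of_variation(accs)
print(f"mean = {mean(accs):.1f} %, CoV = {cov:.1f} %, stable = {cov < 20}")
```

In a tuning run, this loop would be executed once per candidate parameter setting, and only settings whose CoV falls below 20% would be carried forward.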
4.3.4 Validation Testing
After the stable parameter values are obtained for each classifier, the final
classifier is built using the entire training dataset (M) and the best parameter values.
The model is finally tested on the test dataset (S) to get the accuracy of the model.
Since the number of samples for each class is not equal, overall accuracy is not the
best metric to compare performance of the classifiers. Therefore, the analysis is
expanded by doing class-wise comparison and calculating sensitivity and
specificity. The formula used for calculating sensitivity and specificity for each
specific class can be seen in Podgorelec et al. (2005).
After the test samples in S are subjected to validation of the classifier model,
the sensitivity and specificity percentages are calculated for all the classes and the
average value is reported as the indication of classifier performance. For a given
class, sensitivity is the probability that a sample belonging to that class (the
positive case) is correctly assigned to it. Specificity, on the other hand, is the
probability that a sample not belonging to that class (the negative case) is correctly
kept out of it (Liu et al., 1996).
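
A class-wise calculation of this kind can be sketched as follows. The exact formulas used in this work are those of Podgorelec et al. (2005) and are not reproduced here; the sketch assumes the usual one-versus-rest definitions, and the function names and the 3-class confusion matrix are hypothetical.

```python
def class_wise_rates(confusion):
    """Per-class (sensitivity, specificity) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as
    class j. Each class is treated in turn as "positive" and all the other
    classes together as "negative". Every class must contain at least one
    sample, otherwise a division by zero occurs.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        rates.append((tp / (tp + fn), tn / (tn + fp)))
    return rates


def macro_average(rates):
    """Unweighted mean of the per-class sensitivities and specificities."""
    sens = sum(r[0] for r in rates) / len(rates)
    spec = sum(r[1] for r in rates) / len(rates)
    return sens, spec


# Hypothetical confusion matrix for a 3-class problem.
confusion = [
    [8, 2, 0],
    [1, 15, 4],
    [0, 5, 25],
]
rates = class_wise_rates(confusion)
mean_sens, mean_spec = macro_average(rates)
print(f"sensitivity = {100 * mean_sens:.2f} %, specificity = {100 * mean_spec:.2f} %")
```

Because every class contributes equally to the average, a small class that is never recognised pulls the reported sensitivity down even when the overall accuracy is high.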
4.3.5 Variable Selection
Variables are ranked using the variable ranking methods discussed in section
4.2.2. Variable selection is applied only to the cardiovascular features dataset and the
AEP features dataset. After all the variables are ranked, the two and five most
important variables are retained for each method for cardiovascular features dataset
and AEP features dataset respectively. As a result, the training set is reduced
to Mr ~ [n = 276 x p = 2; k = 4] (for cardiovascular features dataset) and [n = 276 x
p = 5; k = 4] (for AEP features dataset), and the test set to Sr ~ [n = 138 x p = 2; k = 4]
(for cardiovascular features dataset) and [n = 138 x p = 5; k = 4] (for AEP features
dataset). After performing variable selection, 6 different sets of Mr and
Sr are collected (one for each variable selection method) because each of the
methods has different variables as their respective top 5 variables. For
cardiovascular dataset, some variable selection methods give the same top 2
variables.
The analysis is then continued by building a model using Mr and the best
parameters obtained at the model development step. The model is then tested on
the respective Sr set. The model building and testing is done using 6 different sets of
Mr and Sr for every classifier. Therefore, there are 24 combinations of dataset and
classifier in this analysis (6 datasets x 4 classifiers).
4.3.6 Software
CART and TreeNet classification is done using software developed by
Salford Systems, USA (Salford Systems, 2007a; Salford Systems, 2007b).
MATLAB’s (MATLAB, 2005) built-in function “classify” is used for LDA and
QDA. MATLAB implementations of the DPCCM and VPMCD algorithms (Raghuraj
Rao and Lakshminarayanan, 2007b) and of a neural network algorithm are used for
building the VPMCD, DPCCM and ANN classifiers. The variable selection methods
used in the analysis (except entropy) were coded in MATLAB (MATLAB, 2005).
The CART software is used to rank the variables using the entropy ranking method. The
developed MATLAB codes can be made available to interested readers upon
request.
4.4 RESULTS
4.4.1 Parameter Tuning
The parameter tuning results are presented in Table 4.1 for DOA
classification, Table 4.2 for breast cancer identification and Table 4.3 for heart
disease identification. The tables show the settings of the best parameters obtained
from the tuning procedure and also the coefficient of variation for each classifier
based on 50 cross validation tests. As can be seen in Tables 4.1 to 4.3, TreeNet and
CART show their stability in the context of random data sampling. However, the
number of parameters that have to be tuned in these classifiers is more than that in
other classifiers since the misclassification costs for each class in CART and TreeNet
have to be optimized as well. For comparison, in the DOA classification case, we have
to tune 15 parameters for CART, 14 parameters for TreeNet, 2 parameters for
VPMCD, 1 parameter for ANN and none for Discriminant Analysis (DA).
Therefore, it can be concluded that obtaining a good model using CART and
TreeNet is a significantly time-consuming activity.
Cost optimization (CO) is included in the CART and TreeNet analysis (Table
4.1-Table 4.3) in order to reduce misclassification cases that can lead to undesired
effects. For example, when DOA level 2 which should be ‘ok’ is misclassified as
DOA level 4 (ok/deep), the controller may reduce the amount of anesthetic drug. As
a result, patient’s DOA could drop to level 1 (awake state) and result in a condition
that is harmful for the patient. The CO ensures some level of fault tolerance in the
closed loop system that includes the anesthesiologist (or automatic controller),
patient, measuring system, classifier and other hardware elements.
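
The effect of an unequal misclassification cost on the final class decision can be illustrated with the generic minimum expected cost rule. This sketch does not describe how the CART or TreeNet software uses costs internally; the function name, the class probabilities and the reading of a table entry such as "4 misclassified as 3 : 2" as "predicting level 3 when the true level is 4 costs 2" are assumptions made for the example.

```python
def min_expected_cost_class(probabilities, cost):
    """Pick the class index with the lowest expected misclassification cost.

    probabilities[i]: estimated probability that the sample belongs to class i.
    cost[i][j]: cost of predicting class j when the true class is i
                (zero on the diagonal).
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[i] * cost[i][j] for i in range(n)) for j in range(n)
    ]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Indices 0..3 stand for DOA levels 1..4 (hypothetical class probabilities).
probabilities = [0.05, 0.10, 0.45, 0.40]

# Unit costs: every error costs 1.
unit_cost = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

# Same costs, except that calling a true level 4 sample "level 3" costs 2.
weighted_cost = [row[:] for row in unit_cost]
weighted_cost[3][2] = 2

unit_choice, _ = min_expected_cost_class(probabilities, unit_cost)
weighted_choice, _ = min_expected_cost_class(probabilities, weighted_cost)
print("DOA level with unit costs:    ", unit_choice + 1)      # 3
print("DOA level with weighted costs:", weighted_choice + 1)  # 4
```

With unit costs the rule simply returns the most probable class, whereas the heavier penalty shifts a borderline sample towards the class whose misclassification is considered more harmful.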
4.4.2 Test set Analysis
4.4.2.1 DOA classification
The results of classifier testing on the test dataset are shown in Table 4.4 for
the case where all the cardiovascular features are used as predictors and in Table
4.5 when all the AEP features are employed as predictors. The leftmost column
shows the type of classifier used and the second column (titled “class”) shows the
number of correctly classified samples for each class. The third column shows the
total number of samples which are correctly classified and the last column shows
percent overall accuracy for each classifier.
Table 4.1 Summary of parameter tuning result using validation dataset for
anesthesia
AEP features dataset
Methods
Best Parameters
VPMCD
Model type: linear +
Coef. of
Best Parameters
Variation
CoV
CoV
16.9 %
Model type: quadratic +
interaction
Number of independent
Number of independent
variables: 2
variables: 1
No tuned parameters
QDA
(diagquadratic)
Network consists of 700
Coef. of
Variation
interaction
LDA/
TreeNet
Cardiovascular features dataset
15.47 %
No tuned parameters
8.7 %
5.11 %
(diaglinear)
1.2 %
Network consists of 700
trees, and minimum number
trees, and minimum
of training observations in
number of training
terminal nodes = 5
observations in terminal
Class weight: unit
nodes = 5
Cost:
Class weight: balanced
4 misclassified as 3 : 2
Cost:
0.3 %
3 misclassified as 2 : 1.3
4 misclassified as 3 : 1.5
CART
Splitting method: entropy
4.18 %
Splitting method: entropy
Priors : learn
Priors : equal
Minimum cases in parent
Minimum cases in parent
node: 3
node: 3
Cost:
Cost:
4 misclassified as 3 : 3
4 misclassified as 3 : 2
0.34 %
2 misclassified as 4 : 2
ANN
Number of hidden layers: 3
5.1 %
Number of hidden layers:
7.4 %
3
Table 4.2 Summary of parameter tuning result using validation dataset for breast
cancer
Methods
WBC dataset
Best Parameters
VPMCD
LDA/QDA
Model type: Quadratic
WDBC dataset
Coef. of
Variation
CoV
CoV
2.30 %
Model type: Linear
Number of independent
Number of independent
variables: 4
variables: 2
No tuned parameters
Network consists of 700
Coef. of
Variation
1.82 %
(Linear)
TreeNet
Best Parameters
No tuned parameters
4.54 %
2.03 %
(Linear)
0.19 %
Network consists of 700
trees, and minimum
trees, and minimum
number of training
number of training
observations in terminal
observations in terminal
nodes = 3
nodes = 3
Class weight: balanced
Class weight: unit
Cost:
Cost:
2 misclassified as 1 : 2
3 misclassified as 2 : 1.3
0.18 %
4 misclassified as 3 : 1.5
CART
DPCCM
Splitting method: Gini
0.27 %
Splitting method: entropy
Priors : mix
Priors : mix
Minimum cases in parent
Minimum cases in parent
node: 3
node: 5
Cost:
Cost:
2 misclassified as 1 : 2
1 misclassified as 2 : 5
Order: 1
2.47 %
Order: 1
3.05 %
2.55 %
Table 4.3 Summary of parameter tuning result using validation dataset for heart disease

Methods   Best Parameters (heart disease dataset)                Coef. of Variation (CoV)
TreeNet   Network consists of 700 trees, and minimum number      0.79 %
          of training observations in terminal nodes = 3
          Class weight: balanced
          Cost: 2 misclassified as 1 : 3
CART      Splitting method: Entropy                              1.66 %
          Priors: mix
          Minimum cases in parent node: 5
          Cost: 2 misclassified as 1 : 3

Table 4.4 Classification result (correct classification) on test set using cardiovascular features as predictors

Classifier      Class 1#  Class 2#  Class 3#  Class 4#  Total#  % accuracy
VPMCD           0         7         12        36        55      39.86
LDA             2         7         16        60        85      61.59
TreeNet         0         10        9         53        72      52.17
CART            0         11        7         73        91      65.94
ANN             0         0         8         52        60      43.48
Total samples   9         28        22        79        138
# shows the number of samples that are correctly classified

Table 4.5 Classification results (correct classification) on test set using AEP features as predictors

Classifier      Class 1#  Class 2#  Class 3#  Class 4#  Total#  accuracy (%)
VPMCD           0         4         21        34        59      42.75
QDA             0         1         14        82        97      70.29
TreeNet         0         2         22        38        62      44.93
CART            0         9         11        75        95      68.84
ANN             0         8         18        63        89      64.49
Total samples   3         15        23        97        138
# shows the number of samples that are correctly classified

As can be seen in Table 4.4, according to our study, for the dataset which used
cardiovascular features as predictors, CART gives the best overall accuracy by
correctly classifying 91 samples out of 138. CART, which does not
consider interactions amongst predictors while constructing the classifier, has
considerably better performance than the other classifiers. On this dataset, TreeNet
gives lower accuracy than CART.
VPMCD gives a very low accuracy for this dataset, possibly because
interaction between predictor variables might not be significant here. Like
VPMCD, the ANN classifier is based on modeling, and their accuracies are
therefore similar. It is interesting that VPMCD can predict class 2 and class 3 samples
better than ANN, while ANN can make a better prediction on class 4 samples. This
may happen with ANN because the number of class 4 samples is much larger than
for the other classes. This reasoning is also supported by ANN’s poor ability in classifying
class 2 samples. On the other hand, VPMCD, which tries to capture all class profiles
while building the classifier models, can classify some of class 2 and 3 samples, but
its performance in classifying class 4 samples is lower than other classifiers.
LDA, which takes into account the weightage on variables while
constructing the separating plane, provides the second best performance. Three of
the classifiers presented here (LDA, CART and TreeNet) perform better than the
results reported in Mahfouf (2006) and Nunes et al. (2005). These
earlier studies on the same datasets reported an overall accuracy of 46.5% using a
fuzzy relation classifier for the same training and test sets.
LDA is seen to provide lower overall accuracy than CART. This
happens because LDA performs poorly compared to CART on class 4 samples and
class 2 samples while it does better on classes 1 and 3. Because class 4 contains
by far the largest number of samples, the overall accuracy of LDA turns out to be lower than
that of CART. All classifiers perform their best in classifying class 4 samples and poorly
on class 1 samples.
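
The weight of class 4 in the overall accuracy can be checked directly from the counts in Table 4.4. The short calculation below is only a worked illustration of that arithmetic; the "always class 4" reference rule is introduced here for comparison and is not one of the classifiers studied.

```python
# Per-class test-set sizes and correct counts, copied from Table 4.4.
class_sizes = [9, 28, 22, 79]
correct = {
    "VPMCD": [0, 7, 12, 36],
    "LDA": [2, 7, 16, 60],
    "TreeNet": [0, 10, 9, 53],
    "CART": [0, 11, 7, 73],
    "ANN": [0, 0, 8, 52],
}
total = sum(class_sizes)

for name, counts in correct.items():
    overall = 100.0 * sum(counts) / total
    class4_share = 100.0 * counts[3] / sum(counts)
    print(f"{name:8s} overall = {overall:5.2f} %, "
          f"class 4 share of correct samples = {class4_share:5.1f} %")

# Accuracy of a rule that always predicts class 4, for reference only.
baseline = 100.0 * class_sizes[3] / total
print(f"always-class-4 reference = {baseline:.2f} %")
```

Since class 4 accounts for 79 of the 138 test samples, any difference in class 4 performance dominates the overall accuracy, which is why the class-wise measures are reported alongside it.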
The results for classifiers built using AEP features are presented in Table
4.5. In Mahfouf (2006), it is reported that fuzzy relation classifier gives an overall
accuracy of 61%. In the present analysis, CART and QDA (which gives higher
accuracy than LDA) provide better prediction accuracies for the same training and
test sets with AEP features. The results in Table 4.5 show that no
classifier is able to correctly classify any of the class 1 samples. This may be
because the number of class 1 samples in the training set is too small (10 samples
out of 276). A small number of samples in the training set results in inadequate
learning by any classifier. Therefore, it is difficult to model the class 1 profile and
classify new samples correctly.
QDA shows its capability as the best predictor, in terms of overall accuracy,
for DOA classification by classifying 97 new samples correctly to their
corresponding class. For class 2 samples, QDA can only classify 1 out of 15
samples correctly. Thus, QDA cannot classify class 1 and class 2 as well as it is
able to correctly classify classes 3 and 4. In this case, CART has slightly lower
overall accuracy than QDA while its sensitivity and specificity are slightly higher
(see the AEP features columns of Table 4.6). VPMCD models the class 3 samples well
even though its overall performance is lower: 21 of the 23
samples of class 3 (91%) are correctly identified by VPMCD while testing on the AEP
features dataset. Among the other classifiers, only TreeNet (22 of 23 in Table 4.5)
does better on class 3 samples. ANN’s performance in classifying class 2 samples is
close to that of CART (8 versus 9 samples) and it is better on class 3 (18 versus 11
samples), but its performance on class 4 is lower than that of CART (63 versus 75
samples).
For DOA classification using AEP features as predictors, TreeNet
performance is significantly lower than that of CART. While TreeNet performs better than
or as well as CART on classes 1 and 3, it does very poorly on class 2 and 4 samples
compared to CART.
From Tables 4.4 and 4.5, it can be observed that almost all classifiers,
excluding TreeNet, give higher accuracy using the AEP features dataset compared to the
cardiovascular dataset. This also highlights that AEP features are better
predictors in classifying depth of anesthesia than cardiovascular features. In a
surgical setting, DOA level 1 and DOA level 4 are considered the most crucial
conditions which have to be classified correctly. In such situations, LDA is the
recommended classifier if DOA classification is done using cardiovascular features
because it is the only classifier in Table 4.4 which is able to classify any class 1 samples correctly. In
addition, its performance in classifying class 4 samples is only slightly lower than
CART (see Table 4.4). Also, if the emphasis of diagnosis is on achieving the
highest class 4 accuracy, QDA with AEP features could be a suitable classifier
choice. These observations highlight that there is no single classifier which has best
performance satisfying different objectives of DOA decision making. Given a
specific performance objective, the classifiers need to be tuned and chosen
accordingly. This conclusion is further supported by the analysis using class
sensitivity and specificity measures.
Table 4.6 Sensitivity and specificity values for each classifier in DOA classification

             AEP Features               Cardiovascular
Classifier   Sensitivity  Specificity   Sensitivity  Specificity
VPMCD        38.26        81.97         31.28        80.47
LDA/QDA      38.02        84.82         48.97        88.35
TreeNet      37.04        81.57         35.93        81.58
CART         46.29        84.85         40.88        85.23
ANN          49.14        88.29         25.55        79.02
All results are presented in percentage (%)

Table 4.6 shows the specificity and sensitivity values for all classifiers on both
datasets (cardiovascular parameters and AEP features). As can be seen, LDA has
the highest sensitivity and specificity for the cardiovascular parameters dataset
while ANN has the highest sensitivity and specificity for the AEP features
dataset. The class specific performance of the different classifiers is clearly evident
from these results. The sensitivity and specificity values for VPMCD, TreeNet and
QDA are quite similar for the AEP features dataset while the overall accuracy of QDA
is significantly higher than that of TreeNet and VPMCD. This is an important observation
for the present DOA classification problem, as the selection of the best classifier needs to
be based on class specific objectives instead of overall classification accuracy.
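
The sensitivity values in Table 4.6 appear to be consistent with the unweighted mean of the per-class fractions of correctly classified samples in Tables 4.4 and 4.5. The short check below is an observation drawn from the tabulated numbers, not a statement of the formula used in this work, and the function name is hypothetical.

```python
def macro_sensitivity(correct_counts, class_sizes):
    """Unweighted mean of per-class correct fractions, in percent."""
    fractions = [c / n for c, n in zip(correct_counts, class_sizes)]
    return 100.0 * sum(fractions) / len(fractions)


cardio_sizes = [9, 28, 22, 79]  # "Total samples" row of Table 4.4
aep_sizes = [3, 15, 23, 97]     # "Total samples" row of Table 4.5

lda_cardio = macro_sensitivity([2, 7, 16, 60], cardio_sizes)
cart_cardio = macro_sensitivity([0, 11, 7, 73], cardio_sizes)
qda_aep = macro_sensitivity([0, 1, 14, 82], aep_sizes)

print(f"LDA, cardiovascular:  {lda_cardio:.2f} %")   # 48.97
print(f"CART, cardiovascular: {cart_cardio:.2f} %")  # 40.88
print(f"QDA, AEP features:    {qda_aep:.2f} %")      # 38.02
```

This makes the gap between the two measures concrete: QDA reaches 70.29% overall accuracy on the AEP features largely through class 4, while its averaged sensitivity stays near 38% because classes 1 and 2 are rarely recognised.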
4.4.2.2 Classification with WBC dataset
Table 4.7 shows the classification result on WBC dataset with class 1
representing patients with benign cancer and class 2 representing patients with
malignant cancer. As mentioned earlier, with biomedical data, it is very important
to also consider class-wise accuracy in comparing classifier performance. For this
case, class 2 accuracy is more important than class 1 accuracy because cancer
patients need medication and treatment as soon as possible. If cancer patients are
wrongly classified as “healthy”, they will not receive any medication at least until
the illness becomes quite obvious and serious when it may be too late to be cured.
Therefore, classifier performance should be deemed better if it has higher prediction
accuracy for class 2.
As can be seen in Table 4.7, TreeNet not only gives the best overall accuracy,
correctly predicting 95.56% of the test samples, but also identifies all cancer
patients in the test dataset. This confirms the superiority of TreeNet over the other
classifiers in breast cancer identification based on the attributes available in the
WBC dataset. However, TreeNet’s performance in classifying class 1 is not as good
as it is with class 2 samples. This is mainly because the cost set during model
construction (see Chapter 2) makes TreeNet give different weights to each decision
tree in the model. In other words, it adjusts the parameters of the TreeNet model in
such a way that the resulting model is very good at classifying class 2 samples. As a
consequence, the information retained in the model is insufficient to correctly
classify all class 1 samples.
Table 4.7 Analysis result for WBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD

Data set          Accuracy    LDA      VPMCD    DPCCM    CART     TreeNet
Resubstitution    Class 1     98.20    99.77    95.05    99.55    98.65
Resubstitution    Class 2     92.89    94.14    100      100      100
Resubstitution    Overall     96.34    97.80    96.78    99.71    99.12
Testing           Class 1     96.59    94.32    90.91    93.18    93.18
Testing           Class 2     87.23    93.62    100      97.87    100
Testing           Overall     93.33    94.07    94.07    94.81    95.56

All results are presented in percentage (%)
As shown in Table 4.7, CART gives the same performance as TreeNet in
classifying healthy subjects into class 1. On the other hand, it fails to classify some
of the cancer patients correctly, so its overall performance is slightly lower than
that of TreeNet. Interestingly, VPMCD and DPCCM give the same overall accuracy,
albeit with different class-wise performance. Like TreeNet, DPCCM classifies all
class 2 test samples correctly, and its class 2 accuracy is considerably better than
that of VPMCD. In contrast, VPMCD has better class 1 performance than DPCCM.
According to our analysis, LDA gives the lowest performance in both
overall accuracy and class 2 accuracy. This might be because the samples in those
classes do not follow a Gaussian distribution (Wang et al., 2008). In addition,
overlap between the classes is a further disadvantage for LDA in building the
separating plane. Since the number of class 1 samples is larger than the number of
class 2 samples, LDA has the best performance in class 1 classification.
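The relation between the class-wise and overall figures in Table 4.7 can be checked with a short calculation. The sketch below assumes a test set of 88 class 1 and 47 class 2 samples. These counts are not stated in this section; they are inferred here from the reported percentages (for example, 85/88 = 96.59% and 41/47 = 87.23% for LDA) and should be read as an assumption.

```python
def accuracies(correct_per_class, size_per_class):
    """Class-wise and overall accuracy in percent."""
    class_acc = [100.0 * c / n for c, n in zip(correct_per_class, size_per_class)]
    overall = 100.0 * sum(correct_per_class) / sum(size_per_class)
    return class_acc, overall


# Assumed test-set sizes (class 1, class 2), inferred from Table 4.7.
sizes = [88, 47]

# Assumed numbers of correctly classified test samples per class.
lda_classes, lda_overall = accuracies([85, 41], sizes)
treenet_classes, treenet_overall = accuracies([82, 47], sizes)
```

Because class 1 is the larger class, the overall accuracy is pulled towards the class 1 figure. This is why a classifier with the best class 1 accuracy can still miss a sizeable share of class 2 samples without a large drop in its overall accuracy.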
4.4.2.3 Classification with WDBC dataset
Table 4.8 shows the classification results for the WDBC dataset, with class 1
representing patients with malignant cancer and class 2 representing people without
cancer. As in the WBC case study, class 1 accuracy must receive more attention
than class 2 accuracy since a cancer patient needs medication and treatment as soon
as possible. As presented in Table 4.8, DPCCM achieves 100% test accuracy overall
and for both class 1 and class 2. In other words, DPCCM classifies all test samples
into their corresponding classes. This result puts the DPCCM proposed in Chapter 3
in a better light: it may prove to be a good classifier not only in food applications
but also in biomedical applications.
Table 4.8 Analysis result for WDBC dataset using LDA, CART, TreeNet, DPCCM and VPMCD

Data set          Accuracy    LDA      VPMCD    DPCCM     CART     TreeNet
Resubstitution    Class 1     92.45    82.08    96.23     100      100
Resubstitution    Class 2     99.44    96.36    96.36     99.16    100
Resubstitution    Overall     96.84    91.04    96.31     99.47    100
Testing           Class 1     95.24    88.10    100.00    100      92.86
Testing           Class 2     100      97.18    100.00    88.73    100
Testing           Overall     98.23    93.81    100.00    92.92    97.35

All results are presented in percentage (%)
Unlike in the WBC dataset, LDA performs well in classifying class 2 samples
for this dataset. This could be because the data points are linearly separable. The
reasonably good performance of VPMCD (Table 4.8), which uses a linear model, in
classifying class 2 samples also indicates linear separability of the dataset. As
observed in Table 4.8, LDA performance on class 2 classification is better than its
performance on class 1 classification. During model building, LDA sets weights for
all variables in a manner that makes its class 2 predictions perfect. This reduces its
ability to predict class 1 samples accurately. The other classifier which performs
well in class 2 classification is TreeNet. The ensemble of decision trees confirms its
superiority over a single decision tree (CART) by giving better overall and class 2
accuracy. However, its performance on class 1 is still lower than that of CART. This
may happen because, during model construction, TreeNet assigned a weight factor to
each variable so as to classify class 2 samples well. As a consequence, it fails to
classify some class 1 samples correctly. CART, the classifier with the lowest overall
accuracy, classifies all class 1 samples correctly. On the other hand, it performs
poorly in predicting class 2 samples. Based on these results, it can be concluded that
cancer identification using the WDBC dataset is preferably done with DPCCM,
which gives the highest accuracy on the test set.
4.4.2.4 Heart Disease Identification
Table 4.9 shows the classification result on heart disease dataset with class 1
representing the absence of heart disease (healthy) and class 2 representing the
presence of heart disease. Since this analysis involves some categorical variables,
classifiers which use mathematical equations in building the classification model
(e.g. LDA, DPCCM, and VPMCD) cannot be used. Therefore, the classification is
only done by CART and TreeNet.
As can be seen in Table 4.9, TreeNet gives a much better overall test accuracy
than CART but a lower class 2 accuracy. CART’s performance on class 1
classification is much poorer than TreeNet’s. As with the earlier biomedical case
studies, the accuracy for the diseased class (here class 2) must be afforded higher
priority when comparing classifier performance. Thus, CART seems to be a better
classifier for heart disease identification since it can “recognize” a patient with
heart disease better than TreeNet.
Table 4.9 Classification result on heart disease dataset using CART and TreeNet

Data set          Accuracy    CART     TreeNet
Resubstitution    Class 1     91.33    92
Resubstitution    Class 2     100      85.83
Resubstitution    Overall     95.19    89.26
Testing           Class 1     73.33    86.67
Testing           Class 2     87.5     83.33
Testing           Overall     79.63    85.19

All results are presented in percentage (%)
4.4.3 Variable Selection
Variable selection is applied only to the AEP features and cardiovascular
parameters datasets for DOA classification. The variables selected by the different
methods for the AEP features dataset are shown in Table 4.10, and the variables
selected for the cardiovascular parameters dataset are tabulated in Table 4.11.
Different selection algorithms select different sets of variables as important even
though they all start from the same dataset. This indicates the differences between
the existing variable ranking methods and stresses the importance of selecting a
specific technique for a given problem.
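To make the idea of a ranking-based selection concrete, the sketch below ranks variables by a Fisher-type score, taken here as the ratio of between-class to within-class variance of a single variable. This is one common form of the criterion and is an assumption: the exact definitions used for the methods in Tables 4.10 and 4.11 are given elsewhere in the thesis and are not reproduced in this section. The function names and the small data set are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Between-class over within-class variance for each variable (column)."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        grand_mean = mean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [row[j] for row, y in zip(samples, labels) if y == c]
            between += len(values) * (mean(values) - grand_mean) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def top_k(scores, k):
    """1-based numbers of the k highest-scoring variables, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [j + 1 for j in order[:k]]


# Hypothetical data: variable 1 separates the two classes, variables 2 and 3 do not.
X = [
    [1.0, 5.0, 0.2], [1.2, 4.0, 0.1], [0.9, 6.0, 0.3],
    [3.0, 5.5, 0.2], [3.1, 4.5, 0.1], [2.9, 5.0, 0.3],
]
y = [1, 1, 1, 2, 2, 2]
scores = fisher_scores(X, y)
selected = top_k(scores, 1)
```

A different scoring function in place of `fisher_scores` would in general produce a different ordering, which is the behaviour seen across the selection methods compared here.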
After variable selection, the analysis is continued by building all classifiers
on each set of selected variables. The results are tabulated in Table 4.12. In some
cases, the classifiers perform better when developed on a subset of variables; in
other cases, the opposite is observed. Poorer performance may occur when the
variable subset selection method is not compatible with the classifier. In these
cases, the selected variables fail to give the classifier enough information to make a
good separation (McCabe, 1984). In addition, some information may be lost when
the number of predictor variables is decreased.
Table 4.10 Variables selected from 10 AEP features using different selection methods

Methods                         Variables selected
Ranking of single variable      4, 3, 1, 2, 6
PCCM based ranking, Order 0     2, 9, 5, 6, 7
PCCM based ranking, Order 1     5, 2, 9, 1, 4
PCCM based ranking, Order 2     5, 1, 2, 3, 4
Entropy                         5, 9, 6, 8, 4
Fisher Criteria                 4, 3, 1, 9, 5
Table 4.11 Variables selected from 3 variables in cardiovascular dataset using different selection methods

Methods                         Variables selected
Ranking of single variable      SAP and MAP
PCCM based ranking, Order 0     HR and SAP
PCCM based ranking, Order 1     HR and SAP
PCCM based ranking, Order 2     HR and SAP
Entropy                         HR and SAP
Fisher Criteria                 SAP and MAP
The best result after variable selection is achieved by QDA (on the AEP
features dataset) with Single Variable Ranking (SVR) as the variable selection
method. After selecting only the five best variables, 102 of the 138 test samples are
correctly classified. This result also confirms the consistency of QDA as the best
method for DOA classification, in terms of overall accuracy, using AEP features
data. This is not surprising because the single variable ranking method applies the
LDA concept to rank the variables. Therefore, the selected variables contain most of
the information needed for LDA and QDA classification, and single variable
ranking gives the best classification result for QDA. It is observed that none of the
variable selection methods improves the performance of CART. CART is a
variable-based classifier which needs the information contained in the variables.
With fewer variables involved in classification, CART probably has insufficient
information to separate the classes. As a result, its performance is lower with
variable subset selection.
All variable selection methods are also applied to the cardiovascular dataset.
The selected variables, tabulated in Table 4.11, are used to build the classifiers
without retuning any of the parameters. The classifiers are then validated on the
test dataset and the results are presented in Table 4.13. For this dataset, only three
classifiers benefit from the variable selection procedure. The numbers of test
samples correctly classified by VPMCD, CART and ANN increase by 36.4%, 4.4%
and 31.67% respectively, relative to classification with all the variables. It is
noteworthy that the combination of VPMCD and PCCM based variable selection
significantly increased the classification accuracy for this dataset.
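The quoted improvements appear to be relative increases in the number of correctly classified test samples, which can be checked against the counts in Table 4.13. The sketch below uses the "no selection" count and the best count after variable selection for each of the three classifiers; the reading of the percentages as relative increases is an inference from those counts.

```python
def relative_increase(before, after):
    """Increase from `before` to `after`, in percent of `before`."""
    return 100.0 * (after - before) / before


# Correctly classified test samples taken from Table 4.13.
no_selection = {"VPMCD": 55, "CART": 91, "ANN": 60}
best_subset = {"VPMCD": 75, "CART": 95, "ANN": 79}

gains = {
    name: relative_increase(no_selection[name], best_subset[name])
    for name in no_selection
}
```

In absolute terms, the same changes correspond to 20, 4 and 19 additional test samples for VPMCD, CART and ANN respectively.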
Table 4.12 Model accuracy using selected variables (AEP dataset)

Classifier  No selection   SVR            PCCM (order=0)  PCCM (order=1)  PCCM (order=2)  Entropy        Fisher
VPMCD       59 (42.75 %)   44 (31.88 %)   79 (57.25 %)    53 (38.4 %)     51 (36.96 %)    96 (69.57 %)   58 (42.03 %)
QDA         97 (70.3 %)    102 (73.9 %)   28 (20.3 %)     99 (71.7 %)     100 (72.5 %)    27 (19.6 %)    101 (73.2 %)
Treenet     62 (44.93 %)   65 (47.1 %)    81 (58.7 %)     70 (50.72 %)    68 (49.28 %)    68 (49.28 %)   63 (45.65 %)
CART        95 (68.84 %)   85 (61.59 %)   91 (65.94 %)    65 (47.1 %)     70 (50.72 %)    91 (65.94 %)   92 (66.67 %)
ANN         89 (64.49 %)   90 (65.22 %)   97 (70.29 %)    89 (64.49 %)    89 (64.49 %)    97 (70.29 %)   89 (64.49 %)

Entries give the number of correctly classified test samples (out of 138), with the percentage in parentheses.
Table 4.13 Model accuracy using selected variables (cardiovascular dataset)

Classifier  No selection   SVR            PCCM (order=0)  PCCM (order=1)  PCCM (order=2)  Entropy        Fisher
VPMCD       55 (39.86 %)   48 (34.78 %)   75 (54.35 %)    75 (54.35 %)    75 (54.35 %)    75 (54.35 %)   48 (34.78 %)
LDA         85 (61.59 %)   77 (55.80 %)   81 (58.70 %)    81 (58.70 %)    81 (58.70 %)    81 (58.70 %)   77 (55.80 %)
Treenet     72 (52.17 %)   70 (50.72 %)   70 (50.72 %)    70 (50.72 %)    70 (50.72 %)    70 (50.72 %)   70 (50.72 %)
CART        91 (65.94 %)   95 (68.84 %)   80 (57.97 %)    80 (57.97 %)    80 (57.97 %)    80 (57.97 %)   95 (68.84 %)
ANN         60 (43.48 %)   79 (57.24 %)   77 (55.79 %)    77 (55.79 %)    77 (55.79 %)    77 (55.79 %)   79 (57.24 %)

Entries give the number of correctly classified test samples (out of 138), with the percentage in parentheses.
Our experience with the DOA dataset emphasizes the necessity of
employing a case-specific classifier and a suitable preprocessing technique. The
performance of the classifiers is clearly data specific and no single method
should be overemphasized. It is also important that different methods be tried and
that sound procedures be employed to determine the best user-defined parameters
for the methods as well as the best subset of variables.
In this study, DOA classification is performed using CART, TreeNet,
VPMCD, ANN and LDA/QDA. The comparison is performed with the objective of
determining the best classifier, i.e. the one most capable of correctly classifying
new samples into their corresponding classes. According to our analysis, in terms of
overall accuracy, CART and QDA are the best classifier models for DOA
classification using cardiovascular features and AEP features respectively. This
superiority of CART on the cardiovascular dataset and of QDA on the AEP features
dataset is confirmed even when the classifiers are built using a subset of features.
The utility of DPCCM and other advanced machine learning tools such as
CART and TreeNet in handling data from the medical domain and extracting
information from them is checked by applying them to the WBC and WDBC
datasets. On the WBC dataset, TreeNet gives the best overall accuracy on the test
set, and both DPCCM and TreeNet classify all cancerous samples in the test set
correctly. This indicates the promising performance of DPCCM for medical
applications. In addition, DPCCM appears to be the most suitable classifier for the
WDBC case study since it classifies all test samples into their corresponding
classes. This study supports the view of DPCCM as a strong classifier, since it
performs well not only on food product datasets but also on biomedical datasets.
In this chapter, the performance of the classifiers is also examined using
heart disease data sets. The existence of categorical data in this dataset precludes
some classifiers because of their inability to handle categorical data. Therefore, this
study was conducted using only TreeNet and CART. Based on our results on heart
disease classification, CART is the recommended classifier for heart disease
identification since patients with heart disease must be identified correctly for
medical treatment. On the other hand, if the objective is to identify healthy patients,
TreeNet can be applied to the dataset.
Chapter 5
Empirical Modeling of Diabetic Patient Data
All models are wrong, some models are useful
George Box
5.1 INTRODUCTION
One major component of critical care in the Intensive Care Unit (ICU) is the
regulation of blood glucose in patients. Patients in the ICU experience psychological
trauma and extreme stress. The challenge is to achieve tight glycaemic control and
avoid abnormal conditions such as hyperglycemia and hypoglycemia.
Hyperglycemia is commonly observed in critically ill patients regardless of their
past medical history. The effect of hyperglycemia on the death rate of ICU patients
was first observed in the surgical ICU of Leuven University Hospital (Van den
Berghe, 2003). Van den Berghe (2003) showed that tight glucose control can reduce
the mortality rate of ICU patients by up to 45%. Based on this study, the American
Association of Clinical Endocrinologists recommended 80 and 110 mg/dl as the
lower and upper limits of blood glucose for intensive care patients (Kelly et al.,
2006; Umpierrez et al., 2007; Kitabchi et al., 2008; Tamaki et al., 2008). Studies
have shown that poor glycaemic control can lead to vascular complications such as
blindness, renal dysfunction, nerve damage, multiple organ failure, myocardial
infarction, limb amputation and even death in the case of type 1 and type 2 diabetic
patients (Taylor et al., 2006; Vanhorebeek et al., 2006; Chase et al., 2008; Kitabchi
et al., 2008). Thus, the regulation of blood glucose is of utmost importance for all
ICU patients as well as for patients suffering from type 1 or type 2 diabetes.
Two broad approaches are available for blood glucose regulation in diabetic
patients. Practitioners generally prefer protocols (i.e. rule based administration of
insulin, oral drugs and/or oral glucose (for treating hypoglycemia)) while
researchers and academics have mainly focused on feedback controllers designed
based on control theory. The relative ease of implementation and the lack of proven
track record in treating diabetics with automatic control have made hospitals prefer
protocol based methods over automatic feedback-based control in ICU patients.
Many established protocols are available in the literature (Taylor et al., 2006;
Tamaki et al., 2008) and many ICUs prefer using their own in-house developed
protocols to control blood glucose levels in patients under their care. A drawback
with existing protocols is that they fail to explicitly consider variations in insulin
levels, effectiveness of insulin utilization, glucose absorption and other
patient-specific factors (Chase et al., 2005). Therefore, it would be ideal if the
protocol is designed, optimized and personalized considering these patient-specific
factors. In
this context, patient-specific models, constructed from patient data collected during
the early stages of ICU stay, can be beneficially used by physicians and caregivers
for improved ICU care. The modeling of blood glucose in ICU patients is
complicated owing to noisy measurements, infrequent sampling, lack of reliable
insulin or glucose infusion profiles, known/unknown disturbances related to patient
condition (stress, sepsis, etc.) and unrecorded events (therapeutic drugs taken for
the medical conditions for which the patient is in ICU).
Many model structures available in the literature have been reviewed by
Ramprasad (2004). Here, some of the representative and popular models are
reviewed. V. W. Bollie (1961) set a mark in the modeling of diabetes by
developing a two-state linear model consisting of one differential equation each for
glucose and insulin. Ackerman et al. (1965) proposed a similar model structure for
glucose-insulin dynamics in a healthy person. Although these two models
oversimplify the physiological glucose and insulin effects, they successfully
capture the interaction between glucose and insulin.
Bergman et al. (1981) developed a model with three differential equations
that represent insulin production and infusion, insulin storage in a remote
compartment, and glucose input and insulin utilization in a second compartment.
The model takes a remote compartment concept for insulin storage to account for
the time delay between insulin injection and its utilization (Lam et al., 2002).
A model consisting of glucose subsystem, glucagon subsystem and insulin
subsystem was presented by Cobelli et al. (1982). Glucose and glucagon
subsystems are both modeled using single-compartment and the insulin subsystem
is represented by a 5-compartment model. This non-linear model utilized the
threshold function to describe the saturation behavior observed in biological
sensing. Cobelli and Mari (1983) validated this model in a glucose regulation case
study.
Puckett (1992) modeled the human body as a two blood-pool system
representing insulin and glucose concentrations. The model included the nonlinear
metabolic behavior of the glucose-insulin system as well as carrier mechanisms and
diffusion pathways, which improve the accuracy with which glucose and insulin
removal from the blood stream is described. However, high-frequency dynamics are
neglected by the steady-state compartments in this model. Puckett and Lightfoot
(1995) then improved the model by accounting for intra- and inter-patient
variability (Ramprasad, 2004).
More recently, a model was presented by Chase et al. (2005). It was
developed by modifying the Bergman model to include insulin utilization, insulin
losses and saturation dynamics. Each of the above models has its own advantages
and disadvantages. In this chapter, we propose the use of the simple first order plus
time delay (FOPTD) model to fit and predict the blood glucose level of ICU
patients. To the best of our knowledge, the FOPTD model has not previously been
employed for modeling blood glucose dynamics in ICU patients; this is one novelty
of the present work. In addition, the FOPTD-based model structure is flexible and
can be extended to model any additional phenomena that may become important.
5.2 First Order plus Time Delay (FOPTD) Model
First order plus time delay (FOPTD) finds application in many problems
related to process dynamics and control. It is a commonly used model structure to
capture the dynamic behavior of chemical engineering processes. Its simplicity and
ability to characterize plant dynamics makes FOPTD very useful, especially in
designing feedback control systems (Ogunnaike and Mukati, 2006; Fedele, 2008).
The FOPTD model for a single input single output system is given by Eq. 5.1 where
y(s) represents process output and u(s) represents the process input. K is the steady
state gain, τ is the time constant and θ represents the time delay.
$$ y(s) = \frac{K}{\tau s + 1}\, e^{-\theta s}\, u(s) \qquad (5.1) $$
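For a step change in the input, Eq. 5.1 has a simple time-domain solution: the output stays at zero deviation until the delay θ has elapsed and then approaches K times the step size exponentially with time constant τ. The sketch below evaluates this response. The function name and the numerical values are illustrative assumptions and are not parameters identified in this work.

```python
import math


def foptd_step_response(t, gain, tau, delay, step_size=1.0):
    """Output deviation at time t for a step of size step_size applied at t = 0.

    Before the delay has elapsed the output does not move; afterwards it
    follows gain * step_size * (1 - exp(-(t - delay) / tau)). Requires tau > 0.
    """
    if t < delay:
        return 0.0
    return gain * step_size * (1.0 - math.exp(-(t - delay) / tau))


# Hypothetical insulin-like channel: negative gain, tau = 60 min, delay = 30 min.
before_delay = foptd_step_response(10.0, -0.5, 60.0, 30.0)
one_time_constant = foptd_step_response(90.0, -0.5, 60.0, 30.0)
long_time = foptd_step_response(1.0e6, -0.5, 60.0, 30.0)
```

One time constant after the delay, the output has covered about 63% of its final change, which is what makes τ and θ easy to read off a measured response.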
In this study, the FOPTD model is used to fit and predict blood glucose
level of ICU patients as functions of insulin (intravenous & bolus) and glucose
(intravenous and oral) inputs. The FOPTD model is not capable of representing any
phenomenological aspects of blood glucose dynamics; rather, it is a correlational
model that is able to capture and express the effect of external inputs (exogenous
insulin, meal, patient state etc.) on the blood glucose level. Despite its simplicity,
the FOPTD structure is capable of capturing patient-specific blood glucose
dynamics. Furthermore, the model parameters can be easily interpreted by the
physician and readily employed for treating the patients making it very attractive
for practical applications. To accommodate the effect of the different inputs, we
employ a multi-input single-output (MISO) model structure in which the effect of
each input is modeled as a FOPTD subsystem (see Fig. 5.1).
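[Figure: block diagram in which each of the three inputs (intravenous glucose, oral glucose and insulin) passes through its own FOPTD block K_i e^{-θ_i s} / (τ_i s + 1), i = 1, 2, 3; the block outputs and a noise term are summed to give the BSL deviation.]
Fig. 5.1 FOPTD model scheme (MISO System)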
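Written out in the notation of Eq. 5.1, the MISO structure of Fig. 5.1 sums three FOPTD channels. In the expression below, the index i runs over the three inputs and d(s) stands for the noise term; the symbol d(s) is introduced here for illustration and is not used in the original figure.

```latex
y(s) = \sum_{i=1}^{3} \frac{K_i}{\tau_i s + 1}\, e^{-\theta_i s}\, u_i(s) + d(s)
```

Each channel has its own gain, time constant and delay, so the structure has nine parameters for three inputs.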
5.3 MATERIALS AND IMPLEMENTATION
5.3.1 Dataset and Software
The datasets were collected in the surgical ICU of the National University
Hospital (NUH), Singapore between January and July 2008. Blood glucose values
(recorded once every 4 hours), amounts of intravenous glucose, orally administered
glucose and the insulin infused (via bolus and intravenous route) were recorded. At
the ICU in NUH, the physicians endeavor to maintain the blood glucose levels of
patients between 6 and 8 mmol/l. This is standard practice in many ICUs and is
considered to be a good compromise between tight glycemic control and the risk of
hypoglycaemia. The hypocount is monitored every 4 hours. Based on the
hypocount levels, an insulin infusion rate is fixed. This infusion is then
supplemented with boluses of insulin based on a sliding scale protocol. As an
example, the protocol used in Nutritional Support Service (Memphis) can be seen in
Dickerson et al. (2008).
Data from 19 ICU patients were made available. Based on the continuity of
insulin infusion and patients’ response to insulin, the cohort was classified into
three categories. Seven of the patients were given continuous insulin infusion and
their blood glucose response to insulin was as expected. Five patients needed only
intermittent insulin infusions and in the remaining 7 patients, the blood glucose
response was affected by factors other than insulin infusion (unnoted events that
result in unreasonably high or low blood glucose values and an abnormal response
to insulin). Data from a typical patient belonging to the first group are shown in Figure
5.2. The leftmost column indicates the blood sugar level sampling time (the date
and the exact time). In columns 2 and 3, the blood sugar level is provided in two
different units. The type and dose of insulin supplied to the patient is noted in the
fourth column while the fifth column contains administered glucose information
(given intravenously or in the meal form depending on the patient’s consciousness
and condition at that time).
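Blood glucose appears in this chapter in both mg/dl and mmol/l, which are presumably the two units recorded in the data sheet. The sketch below converts between them using the molar mass of glucose (about 180.16 g/mol), so that 1 mmol/l corresponds to about 18.016 mg/dl. The function names are illustrative.

```python
# 1 mmol/l of glucose corresponds to about 18.016 mg/dl (molar mass 180.16 g/mol).
MG_DL_PER_MMOL_L = 18.016


def mg_dl_to_mmol_l(value):
    """Convert a blood glucose value from mg/dl to mmol/l."""
    return value / MG_DL_PER_MMOL_L


def mmol_l_to_mg_dl(value):
    """Convert a blood glucose value from mmol/l to mg/dl."""
    return value * MG_DL_PER_MMOL_L


# The 80 to 110 mg/dl range quoted in Section 5.1, expressed in mmol/l.
tight_range_mmol_l = (mg_dl_to_mmol_l(80.0), mg_dl_to_mmol_l(110.0))

# The 6 to 8 mmol/l target used at the ICU, expressed in mg/dl.
icu_target_mg_dl = (mmol_l_to_mg_dl(6.0), mmol_l_to_mg_dl(8.0))
```

On this conversion, the 6 to 8 mmol/l target corresponds to roughly 108 to 144 mg/dl, i.e. it starts near the upper end of the 80 to 110 mg/dl range.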
Figure 5.2 Data from Patient 1 who belongs to the first Group
5.3.2 FOPTD Implementation
All datasets were divided into training samples (first 80% of the data) and
test samples (last 20% of the data). The test samples were kept aside for model
validation. The training sets were used to build the multi-input single-output
FOPTD model with oral glucose, intravenous glucose and insulin as the inputs and
the deviation in blood glucose (BG) as the output. The BG deviation is then added
to the blood glucose equilibrium value (the mean of the two previous BG values
(Chase et al., 2008)) to obtain the model-predicted BG values (ĝ). The mean
absolute error (MAE), i.e. the mean of the absolute differences between the
measured BG and the model-predicted values, is then calculated. The best
prediction from the model is obtained with the set of parameter values for which
the MAE is minimum. This set of parameters can be found using a genetic
algorithm (GA) with the MAE as the objective function. In GA, a population of
different sets of parameter values is initialized and updated using genetic operators
such as crossover and mutation until the stopping criterion for the optimization is
fulfilled. Finally, the GA tool provides the parameter values which give the smallest
MAE. Bounds on the parameters are set using physical reasoning: for example, the
time constants and time delays are non-negative, the gains of the insulin inputs to
blood glucose are negative and the gains of the glucose inputs to blood glucose are
positive.
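The fitting procedure described above can be sketched as follows. This is a simplified stand-in and not the implementation used in this work: the GA tool and its settings are not specified in this section, so a minimal GA with truncation selection, uniform crossover and Gaussian mutation is assumed; only one input channel is modeled; the data are synthetic and noise-free; and the equilibrium-value step is omitted. All function names and numbers are hypothetical.

```python
import math
import random


def simulate_foptd(u, gain, tau, delay_steps, dt):
    """Sampled FOPTD response to a piecewise-constant input sequence u.

    Uses the exact zero-order-hold discretization of K / (tau*s + 1) with a
    delay of delay_steps sampling intervals; the output starts at zero.
    """
    a = math.exp(-dt / tau)
    y = [0.0]
    for k in range(1, len(u)):
        j = k - 1 - delay_steps
        delayed_input = u[j] if j >= 0 else 0.0
        y.append(a * y[-1] + gain * (1.0 - a) * delayed_input)
    return y


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_by_ga(u, y_measured, bounds, dt, pop_size=30, generations=40, seed=0):
    """Minimize the MAE over (gain, tau, delay) within the given bounds."""
    rng = random.Random(seed)

    def cost(params):
        gain, tau, delay = params
        y_model = simulate_foptd(u, gain, tau, int(round(delay / dt)), dt)
        return mean_absolute_error(y_measured, y_model)

    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []  # best MAE found in each generation
    for _ in range(generations):
        population.sort(key=cost)
        history.append(cost(population[0]))
        parents = population[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(mother, father)]  # uniform crossover
            i = rng.randrange(len(child))  # Gaussian mutation of one parameter
            lo, hi = bounds[i]
            child[i] = min(hi, max(lo, child[i] + rng.gauss(0.0, 0.1 * (hi - lo))))
            children.append(child)
        population = parents + children
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history


# Synthetic single-input example; all numbers below are illustrative assumptions.
dt = 240.0                                    # sampling interval in minutes (4 h)
u = [1.0 if (k // 5) % 2 == 0 else 0.0 for k in range(40)]
y = simulate_foptd(u, -0.8, 300.0, 1, dt)     # noise-free "measurements"
cut = int(0.8 * len(u))                       # first 80% of the record for training
bounds = [(-5.0, 0.0), (1.0, 2000.0), (0.0, 720.0)]  # gain <= 0, tau > 0, delay >= 0
best, history = fit_by_ga(u[:cut], y[:cut], bounds, dt)
y_hat = simulate_foptd(u, best[0], best[1], int(round(best[2] / dt)), dt)
test_mae = mean_absolute_error(y[cut:], y_hat[cut:])
```

The bounds encode the physical reasoning given above: the gain of an insulin-like input is restricted to non-positive values, and the time constant and delay to non-negative values. The lower bound on the time constant is set slightly above zero here only to keep the discretization well defined.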
After all model parameters were obtained, the model was applied to the whole
dataset (training and test sets) and both the predicted value (ĝ) and the actual value
(g) of blood glucose were plotted versus time. In such plots, the first 80% of the
dataset shows the fitting ability of the model and the last 20% shows its
predictive ability.
5.4 RESULTS AND DISCUSSION
5.4.1 Patients with Continuous Insulin Infusion (Group 1)
The first pool of the cohort was given continuous insulin infusion, and their
blood glucose values increase/decrease with a decrease/increase in insulin.
Validation results show that the FOPTD model captures the dynamics of all 7
patients with a mean absolute error (MAE) of less than 2.1 mmol/L on the test
samples (see Table 5.1). Using some pre-selected datasets, Chase et al. (2008)
report a maximum MAE of 2.9 mmol/L for patients with continuous insulin
infusion, which is larger than that obtained with the FOPTD model without any
data pre-selection. The model fit and prediction results for the patients with the
lowest and highest MAE are plotted in Fig. 5.3 and Fig. 5.4 respectively. In these
figures, solid lines and dotted lines represent data fitting and model validation
respectively.
Table 5.1 shows that the MISO FOPTD model structure gives low MAE
values not only on the training samples but also on the test samples for all patients
who received continuous insulin infusion. The small differences in MAE between
the training and test sets indicate the stability and consistency of the model in
handling both old and new data. As shown in Figures 5.3 and 5.4, the FOPTD
model is able to track the patient response to a significant degree. These results
compare very favorably with results obtained using first-principles-based models
(Loganathan et al., 2008). However, none of the models was able to capture some
of the highs and lows seen in the patient data. Unmeasured variables such as
additional medications administered during the trial, stress level and the existence
of infection may have contributed to such aberrations.
Table 5.1 MAE values for training and test samples using data from patients with continuous insulin infusion

Patient    MAE training (mmol/L)    MAE test (mmol/L)
Pat 1      1.7648                   1.8687
Pat 2      2.734                    1.8208
Pat 22     1.2148                   0.9306
Pat 34     1.43                     1.6637
Pat 1B     2.6416                   1.8084
Pat 30     1.8401                   2.0282
Pat 25     1.1701                   0.9754
[Figure: predicted and actual blood glucose values for Pat 22. Axes: blood glucose (mmol/L) versus time (min). Series: training, validation, actual. Annotated MAE = 0.9306.]
Fig. 5.3. Results for the “best” patient data set using the FOPTD model
[Figure: predicted and actual blood glucose values for Pat 30. Axes: blood glucose (mmol/L) versus time (min). Series: training, validation, actual. Annotated MAE = 2.0282.]
Fig. 5.4. Results for the “worst” patient data set using the FOPTD model
5.4.2 Patients with Intermittent Insulin Infusion (Group 2)
Patients whose blood glucose response is relatively stable are supplied with
insulin intermittently (i.e. only when needed). The robustness of the modeling
procedure can be better tested with such patients since the insulin input
(perturbation) is relatively small compared with that for patients in Group 1. As
stated earlier, five patients fall under this category. The results for the “best” of the
5 patients based on MAE are shown in Fig. 5.5. Table 5.2 gives the MAE values for
this pool of patients. The low MAE values in Table 5.2 confirm the robustness of
the MISO FOPTD model in handling intermittent insulin infusion. In addition,
Figure 5.5 shows the ability of the model to capture the dynamics of patient blood
glucose data, with very good performance on the test samples as well.
Table 5.2 MAE values for training and test samples using patient data with intermittent insulin infusion

Patient    MAE training (mmol/L)    MAE test (mmol/L)
Pat 6      1.3549                   0.9564
Pat 13     0.6401                   0.5883
Pat 16     1.1165                   0.4997
Pat 27     0.9564                   0.9751
Pat 32     1.3771                   0.8679
[Figure: predicted and actual blood glucose values for Pat 16. Axes: blood glucose (mmol/L) versus time (min). Series: training, validation, actual. Annotated MAE = 0.4997.]
Fig 5.5. Results for the “best” patient data set using the FOPTD model (Intermittent Insulin Infusion).
5.4.3 Patients with Blood Glucose Response Affected by Other Factors (Group 3)
The usual practice in building and testing a proposed model (with a given
structure) has been to select a consistent cohort from a large pool of patients and
examine the data from this group (Chase et al., 2005). However, in real practice, the
medical team often comes across extreme and challenging cases in the ICU. A
model structure which is robust enough to handle multiple medical interventions
and a broader range of patient dynamics is needed.
In this study, efforts were made to include a group of 7 patients with
complex blood glucose response. In this pool of patients, we have patients that
exhibit severe hyper/hypoglycemic tendencies as well as those who needed frequent
medication for treating conditions such as allergies, stress, and cardiogenic shock
treatment. The MAE values for all cases belonging to this group are shown in Table
5.3 and the model performance results for the “best” case (the least MAE) from this
pool are shown in Fig 5.6.
As shown in Table 5.3, the identified FOPTD models are associated with
low MAE values for each patient. The accuracy of the FOPTD model is evident not
only in the fitting part but also in the validation part. The ability of the proposed
MISO-FOPTD model structure to handle such datasets confirms its robustness and
reliability in modeling blood glucose dynamics, especially for ICU patients.
Table 5.3 MAE values for training and test samples using Group 3 patient data

Patient    MAE training (mmol/L)    MAE test (mmol/L)
Pat 12     1.4652                   0.9606
Pat 14     1.9856                   1.543
Pat 19     2.476                    3.1174
Pat 21     1.7299                   0.9052
Pat 23     1.1382                   0.9993
Pat 24     1.7418                   1.7784
Pat 26     1.962                    3.205
[Figure: predicted and actual blood glucose values for Pat 26. Axes: blood glucose (mmol/L) versus time (min). Series: training, validation, actual. Annotated MAE = 3.205.]
Fig. 5.6 Model performance on the “best” patient data from Group 3
5.4.4 Medication Effect
One of the key limitations of the existing models in the literature is their
inability to account explicitly for the effects of medication and other medical
conditions that may occur during trials (Lam et al., 2002; Chase et al., 2008). The
general argument put forward is that the parameters in existing models would take
care of such dynamics. Such claims remain largely unsubstantiated. The predictions
of such models can be poor and may miss a hypo/hyperglycemic episode (the
former being more serious for patient health).
In the proposed MISO-FOPTD structure, any medication effects or medical
conditions can be included in a straightforward manner by treating them as
additional inputs, each with a suitable model structure (e.g. FOPTD) relating it to
the blood glucose output. Here, we have included medication data as an additional
input to the FOPTD structure considered in Figure 5.1. The effect of medication is
studied using 2 patient datasets for which medication data were available. The
simulation results are very promising and are shown in Figures 5.7, 5.8, 5.9 and
5.10.
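To illustrate how an extra input such as medication enters a MISO-FOPTD structure, the following Python sketch simulates each input through its own first-order-plus-time-delay channel, K exp(-θs)/(τs + 1), and sums the channel outputs onto a baseline. This is only a sketch under stated assumptions: the zero-order-hold discretization, the additive baseline, the function names and all parameter values are illustrative choices and are not taken from the thesis.

```python
import math


def simulate_foptd(u, gain, tau, delay, dt):
    """Zero-order-hold simulation of one channel K*exp(-delay*s)/(tau*s + 1)."""
    if tau <= 0 or dt <= 0:
        raise ValueError("tau and dt must be positive")
    a = math.exp(-dt / tau)        # discrete pole of the first-order lag
    d = int(round(delay / dt))     # dead time expressed in whole samples
    y = [0.0] * len(u)
    for k in range(1, len(u)):
        idx = k - 1 - d            # index of the delayed input sample
        u_delayed = u[idx] if idx >= 0 else 0.0
        y[k] = a * y[k - 1] + gain * (1.0 - a) * u_delayed
    return y


def simulate_miso_foptd(inputs, params, baseline, dt):
    """Baseline plus the sum of the individual FOPTD channel responses."""
    total = [baseline] * len(inputs[0])
    for u, (gain, tau, delay) in zip(inputs, params):
        channel = simulate_foptd(u, gain, tau, delay, dt)
        total = [t + c for t, c in zip(total, channel)]
    return total


# Hypothetical inputs sampled every minute; none of these are patient data.
n = 600
inputs = [
    [1.0] * n,                  # feed (glucose) input
    [2.0] * n,                  # insulin infusion
    [0.0] * 200 + [1.0] * 400,  # medication switched on at t = 200 min
]
# Hypothetical (gain, time constant in min, delay in min) for each channel.
params = [(2.0, 30.0, 10.0), (-0.5, 40.0, 15.0), (1.5, 20.0, 5.0)]
glucose = simulate_miso_foptd(inputs, params, baseline=6.0, dt=1.0)
print(f"glucose at t = 190 min: {glucose[190]:.2f} mmol/L")
print(f"glucose at t = 599 min: {glucose[-1]:.2f} mmol/L")
```

Because each input has its own channel, adding medication amounts to appending one more input series and one more (gain, time constant, delay) triple; the other channels are left untouched.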
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^4) for Pat 27 without medication; MAE = 0.9751]
Fig. 5.7 FOPTD prediction without medication for Patient 27
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^4) for Pat 27 with medication; MAE = 0.7818]
Fig. 5.8 FOPTD prediction with medication for Patient 27
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^4) for Pat 34 without medication; MAE = 1.6637]
Fig. 5.9 FOPTD prediction without medication for Patient 34
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^4) for Pat 34 with medication; MAE = 1.6529]
Fig. 5.10 FOPTD prediction with medication for Patient 34
From Fig. 5.7, it can be seen that the FOPTD model without the medication
data does not capture the hyperglycemic episodes. In Fig. 5.8, where the results
correspond to the FOPTD model with medication, the hyperglycemic data are
captured very well. It should be noted that the model with medication predicts a
non-existent hypoglycemia (at time ~7500 min). Such a prediction would prompt the
medical staff to decrease the insulin infusion, which in turn would increase blood
glucose. Hence, in this case, the false prediction is harmless to the patient.
As can be seen from Fig. 5.9 and Fig. 5.10, including the medication effect
in the model allows it to capture the lows around time ~1000 min better than the
model which does not take medication data into account. The same behavior is
observed at time ~8000 min and at time ~18000 min. In addition, the predictive
ability of the model which takes medication into consideration is better than that
of the model which does not. This is confirmed by the MAE values shown in Figures
5.7 through 5.10. In Table 5.4, the parameter ranges obtained for the different
patients are summarized. The values are reasonable but the range is rather wide
(even taking patient-to-patient variability into account). More work needs to be
done to verify this aspect of the problem. What we have succeeded in showing here
is that the FOPTD model produces acceptable and adequate results that match those
obtained with first principles based models (see Loganathan et al., 2008).
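Returning to the MAE values printed on Figures 5.7 to 5.10, the size of the improvement from adding the medication input can be made explicit as a relative reduction in MAE. The short Python sketch below uses only those four reported values; the helper name is illustrative.

```python
def relative_reduction(mae_without, mae_with):
    """Fractional reduction in MAE obtained by adding the medication input."""
    return (mae_without - mae_with) / mae_without


# MAE values as printed on Figures 5.7 to 5.10.
mae = {
    "Pat 27": {"without": 0.9751, "with": 0.7818},
    "Pat 34": {"without": 1.6637, "with": 1.6529},
}

reductions = {
    patient: relative_reduction(m["without"], m["with"])
    for patient, m in mae.items()
}
for patient, r in reductions.items():
    print(f"{patient}: MAE reduced by {100.0 * r:.1f}%")
```

With these values the reduction is roughly 20% for Pat 27 but under 1% for Pat 34, so the benefit of the medication input in terms of MAE appears to be patient dependent.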
Table 5.4 Range of the parameters for each patient group
Parameter   Group 1             Group 2              Group 3
K1          0.00005 to 190.78   0.00005 to 2.023     0.109 to 12.574
τ1 (min)    0.375 to 119        0.00005 to 2.333     0.00005 to 5.228
θ1 (min)    0.875 to 59         0.00005 to 7.225     0.00005 to 1.094
K2          0.00005 to 1.129    0.332 to 0.981       0.075 to 2.631
τ2 (min)    0.25 to 67.269      0.291 to 4.58        0.5 to 4.949
θ2 (min)    1.123 to 16.874     0.00005 to 0.961     0.00005 to 1.078
K3          -0.046 to -0.004    -0.177 to -0.00005   -0.00005 to -0.063
τ3 (min)    0.5 to 112          0.562 to 36.624      0.235 to 100
θ3 (min)    0.461 to 49         1.116 to 18.98       0.001 to 29.908
5.4.5 Analysis of Home Monitoring Diabetes Data
To check the robustness of the MISO-FOPTD structure for blood glucose
modeling, the methodology described above was applied to patient data from home
monitoring. This is non-ICU data provided by Dr Tibor Deutsch (Applied Logic
Laboratory, Hungary). The data made available to us covered 5 patients and
consisted of three inputs (glucose, short acting insulin and intermediate acting
insulin) and one output (blood glucose values recorded 6 times daily, around the
patients’ meal times, over a period of 2 years). The results of model building and
validation for the patients with the highest and the lowest MAE are given in Fig.
5.11 and Fig. 5.12 respectively.
Table 5.5, Fig. 5.11 and Fig. 5.12 indicate that the MISO-FOPTD structure
is a very promising tool for capturing blood glucose dynamics in home monitored
diabetic patients as well. Table 5.5 shows that the FOPTD model gives low MAE
values not only on the training set but also on the test set for all datasets
studied. The small MAE differences between the training and test sets show the
stability and consistency of FOPTD performance in handling both old and new data.
However, as can be seen from Figures 5.13 and 5.14, the MISO-FOPTD model
predictions have low correlations with the actual measured data. This is a point
of concern and must be addressed in future work. The mismatch between model
prediction and actual data may be due to other factors (such as illness, stress in
daily life, etc.) which are not captured by the model.
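A low MAE and a low correlation are not contradictory: when the measurements vary only moderately around a typical level, a nearly flat prediction stays close to every reading (low MAE) without following the ups and downs (low correlation). The Python sketch below illustrates this with made-up numbers; the readings, the predictions and the function names are hypothetical, and the Pearson coefficient is assumed to be the correlation measure of interest.

```python
import math


def mean_absolute_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings (mmol/L) scattered around 7; not patient data.
actual = [6.5, 7.5, 6.8, 7.2, 6.6, 7.4]
# A nearly flat prediction that ignores the direction of each fluctuation.
predicted = [7.1, 7.1, 6.9, 6.9, 7.0, 7.0]

mae = mean_absolute_error(actual, predicted)
corr = pearson_correlation(actual, predicted)
print(f"MAE = {mae:.3f} mmol/L, correlation = {corr:.3f}")
```

This is why MAE alone may overstate how well the dynamics are captured, and why reporting the correlation alongside it, as done in Figures 5.13 and 5.14, is informative.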
Table 5.6 summarizes the range of estimated model parameters for the home
monitoring datasets. In addition to the other advantages of FOPTD, as can be seen
from Table 5.6, all parameters obtained from this model lie within reasonable
bounds. The time constants for all inputs are less than 90 minutes and the time
delays do not exceed 30 minutes. These are considered to be reasonable and
realistic values.
Table 5.5 MAE values for training and test samples using home monitoring data
Patient   MAE training   MAE test
Pat 10    1.164          1.2374
Pat 214   0.981          0.331
Pat 913   0.562          0.335
Pat 117   1.162          1.154
Pat 45    1.0436         0.947
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^5) for Pat 10; MAE = 1.2374]
Fig. 5.11 Results with the FOPTD model for the patient with the highest MAE
(home monitoring dataset)
[Figure: predicted (training and validation) and actual blood glucose (mmol/L) versus time (min, x 10^5) for Pat 214; MAE = 0.331]
Fig. 5.12 Results with the FOPTD model for the patient with the lowest MAE
(home monitoring dataset)
Table 5.6 Range of estimated parameters for home monitoring data
Parameter    Home monitoring data
K3a*         -68.874 to -0.00005
τ3a (min)    0.0005 to 18.896
θ3a (min)    0.011 to 6
K2           0.00005
τ2 (min)     0.289 to 82.04
θ2 (min)     0.266 to 28.578
K3b*         -57.5 to -0.00005
τ3b (min)    0.023 to 18.919
θ3b (min)    1.697 to 6.29
* a and b refer to short and intermediate acting insulin, respectively
[Figure: five scatter panels of Predicted Blood Glucose (mg/dL) versus Actual Blood Glucose (mg/dL), each titled "Predicted and Actual Blood Glucose Value", for Pat 214, Pat 913, Pat 10, Pat 117 and Pat 45. Correlation values printed on the panels, in order of appearance: Corr = 0.1307, 0.3511, 0.4114, 0.2955 and 0.1082]
Figure 5.13 Actual glucose and model fit for all 5 home monitoring patients
[Figure: five scatter panels of Predicted Blood Glucose (mg/dL) versus Actual Blood Glucose (mg/dL), each titled "Predicted and Actual Blood Glucose Value", for Pat 10, Pat 913, Pat 214, Pat 45 and Pat 117. Correlation values printed on the panels, in order of appearance: Corr = 0.136, 0.4674, 0.1435, 0.2719 and 0.2452]
Figure 5.14 Actual glucose and model prediction for all 5 home monitoring patients
To summarize, in this chapter, the use of a MISO-FOPTD structure has been
proposed and evaluated for modeling ICU patients’ blood glucose levels. The FOPTD
model was applied to data from 19 ICU patients and was seen to give satisfactory
results in fitting and predicting blood glucose values. In addition, its
simplicity enables the FOPTD model to be easily extended when additional input
variables become available. The FOPTD model was also applied to data collected
from home monitored diabetes patients and promising results were obtained.
Chapter 6
Conclusions and Recommendations
I do the very best I know how- the very best I can; and I mean to keep
on doing so until the end
Abraham Lincoln (1809-1865)
Former US President
6.1 Conclusions
In Chapter 3, the performance of a new classifier, DPCCM, was tested on two
food product classification case studies and compared with well established
classifiers such as LDA, CART, TreeNet and SVM. In the wine case study, DPCCM
performance is comparable to LDA and better than the other classifiers. It is
noteworthy that, in this case, performance improves as the order of the partial
correlations increases. This indicates the presence of multivariate interactions
and indirect relationships between the variables. In the cheese classification
problem, DPCCM gives the best classification result, comparable to SVM. Also, the
use of the original variables without projecting them to a new dimensional space
is a positive aspect of DPCCM.
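The link between higher-order partial correlations and indirect relationships can be seen from the standard first-order partial correlation, which measures the association between two variables after removing the linear effect of a third. The Python sketch below uses the textbook formula with hypothetical correlation values; it is not a description of how DPCCM itself computes or uses these quantities.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Hypothetical case 1: x and y are related only through z.
indirect = partial_correlation(r_xy=0.49, r_xz=0.7, r_yz=0.7)
# Hypothetical case 2: x and y remain related after accounting for z.
direct = partial_correlation(r_xy=0.80, r_xz=0.3, r_yz=0.3)
print(f"indirect case: {indirect:.3f}, direct case: {direct:.3f}")
```

In the first case the apparent correlation of 0.49 vanishes once z is controlled for, which is the kind of indirect relationship that zero-order correlations cannot distinguish from a direct one.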
The utility of DPCCM and other advanced machine learning tools like CART
and TreeNet in handling data from the medical domain and extracting information
from them was checked by applying these classifiers to the WBC and WDBC datasets.
DPCCM as well as TreeNet not only give the best overall accuracy on the test data
set but are also able to classify all cancerous cells perfectly to their
respective classes in the WBC dataset. This indicates the promising performance of
DPCCM for medical applications. In addition, DPCCM appears to be the most suitable
classifier for the WDBC case study since it perfectly classifies all test samples
into their corresponding classes. This study confirms the ability of DPCCM as a
strong classifier, since its performance is good not only for food product
datasets but also for biomedical datasets.
This thesis also examined the feasibility of DOA classification using
DPCCM, CART, TreeNet, VPMCD, ANN and LDA/QDA. The comparison study was performed
with the objective of determining the best classifier, i.e. the one most capable
of correctly classifying new samples into their corresponding classes. According
to our analysis, in terms of overall accuracy, CART and QDA are the best
classifier models for DOA classification using cardiovascular features and AEP
features respectively. Even when the classifiers are built using a subset of
features, the superiority of CART and QDA for the cardiovascular and AEP features
respectively is confirmed. Another interesting finding of this study is the
significant performance improvement after applying a variable selection method to
both the cardiovascular and AEP feature datasets. This highlights the importance
of variable selection in DOA analysis. Overall, the analysis indicated the lack
of generality of the methods and highlighted the necessity of designing a case
specific decision support system based on the best performing classifier and
variable selection method.
In this thesis, the performance of the classifiers was also examined using
heart disease data sets. The presence of categorical data in these datasets
precludes some classifiers because of their inability to handle categorical data.
Therefore, this study was conducted using only TreeNet and CART. Based on our
results on heart disease classification, it can be concluded that CART is the most
suitable classifier for heart disease prediction using all attributes available in
the heart disease dataset. CART is able to predict patients with heart disease
more accurately than TreeNet. However, CART performance in predicting the other
class tends to be poorer than that of TreeNet. Based on these results, CART is the
recommended classifier for heart disease identification, since patients with heart
disease must be identified correctly for medical treatment. On the other hand, if
the objective is to identify healthy patients, TreeNet can be applied to the
dataset.
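The trade-off described here, one classifier being better on the diseased class and the other on the healthy class, is the usual distinction between sensitivity and specificity. The Python sketch below computes both from confusion-matrix counts; the counts and the labels "classifier A" and "classifier B" are hypothetical and do not reproduce the CART or TreeNet results of this thesis.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of diseased patients detected
    specificity = tn / (tn + fp)  # fraction of healthy patients recognised
    return sensitivity, specificity


# Hypothetical counts for two classifiers on the same 100 test cases
# (50 with heart disease, 50 without).
classifier_a = {"tp": 45, "fn": 5, "tn": 35, "fp": 15}
classifier_b = {"tp": 38, "fn": 12, "tn": 44, "fp": 6}

sens_a, spec_a = sensitivity_specificity(**classifier_a)
sens_b, spec_b = sensitivity_specificity(**classifier_b)
print(f"A: sensitivity = {sens_a:.2f}, specificity = {spec_a:.2f}")
print(f"B: sensitivity = {sens_b:.2f}, specificity = {spec_b:.2f}")
```

In this made-up example, classifier A would be preferred when missing a diseased patient is the costlier error, even though classifier B has the higher overall accuracy (82 against 80 correct out of 100).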
A new First Order plus Time Delay (FOPTD) model for capturing the
dynamics of blood glucose in ICU patients has also been proposed and evaluated in
this thesis. The FOPTD model structure was applied to data sets obtained from ICU
patients as well as from diabetes patients under home monitoring. The results show
that the FOPTD model gives low MAE values and is able to predict the blood glucose
values in the patient data. In addition, it is simple and can be easily applied
for controller tuning. It also allows additional phenomena, such as the effect of
medication, to be included without difficulty. When compared with results
reported in the literature, which were obtained with a 1 hour sampling frequency
and a consistent patient cohort pre-selected from a larger pool, the FOPTD model
gives comparably accurate results.
6.2 Recommendations
To date, machine learning has been widely used, especially to solve
classification problems in medicine and food product quality. However, to our
knowledge, machine learning applications in other areas, such as industrial
process improvement and business applications, have not been thoroughly explored.
Studies by Filipic and Junkar (2000) and Chen and Hsiao (2008) have shown the use
of data mining approaches in those two areas. Therefore, in future, one could
attempt to apply the newly developed method (DPCCM) and other classifiers in those
fields.
Hybrids of existing classifiers may be an interesting field to explore
further. The idea is to use the first classifier for variable selection and the
second classifier for solving the classification problem. Such a hybrid system
would be beneficial in applications characterized by a large number of variables
and a small number of samples. Filipic and Junkar (2000) and Sahan et al. (2007)
have successfully applied this idea to improve k-nearest neighbor accuracy for
WDBC datasets. However, to the best of our knowledge, this hybrid system has not
been used in food identification problems.
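The two-stage idea can be sketched as follows in Python. As a stand-in for the first classifier, each variable is scored by the training accuracy of a one-variable threshold rule (a decision stump), and the best-scoring variables are passed to a k-nearest-neighbour classifier as the second stage. The choice of stump and k-NN, the function names and the toy data are all assumptions made for illustration; this is not the method of the cited authors, and a real study would score variables on held-out data.

```python
def stump_accuracy(values, labels):
    """Best training accuracy of a one-variable threshold rule (decision stump)."""
    best = 0.0
    for threshold in sorted(set(values)):
        pred = [1 if v >= threshold else 0 for v in values]
        acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the flipped rule
    return best


def select_variables(X, y, n_keep):
    """Stage 1: rank variables by stump accuracy and keep the n_keep best."""
    n_vars = len(X[0])
    scores = [stump_accuracy([row[j] for row in X], y) for j in range(n_vars)]
    ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def knn_predict(X_train, y_train, x_new, columns, k=3):
    """Stage 2: k-nearest-neighbour vote using only the selected columns."""
    dists = sorted(
        (sum((row[j] - x_new[j]) ** 2 for j in columns), label)
        for row, label in zip(X_train, y_train)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: variable 0 separates the classes, variables 1 and 2 do not.
X = [[1.0, 5.0, 0.3], [1.2, 1.0, 0.9], [0.9, 4.0, 0.1], [1.1, 2.0, 0.8],
     [3.0, 4.5, 0.2], [3.2, 1.5, 0.7], [2.9, 5.5, 0.4], [3.1, 2.5, 0.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected = select_variables(X, y, n_keep=1)
prediction = knn_predict(X, y, [3.05, 1.0, 0.9], selected, k=3)
print(f"selected variables: {selected}, predicted class: {prediction}")
```

Restricting the second stage to the selected variables keeps the distance calculation from being dominated by uninformative variables, which is the motivation for such hybrids when variables are many and samples are few.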
Confidence interval calculation for classification problems is another aspect
that could be studied further. A confidence interval accompanying the classifier
accuracy could give some information about the classifier’s reliability. This
could be very important when dealing with biomedical data. Classification in
dynamic mode could also be considered as future work. The idea is to update the
classification model using new data samples so that the accuracy of the model can
be maintained for a longer time period.
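One simple way to attach a confidence interval to a test-set accuracy is to treat the number of correctly classified samples as a binomial count and compute a Wilson score interval. The Python sketch below shows this under that assumption; the choice of the Wilson interval, the function name and the counts are illustrative, and other interval constructions could be used instead.

```python
import math


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95% coverage)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1.0 + z * z / n_total
    centre = (p + z * z / (2.0 * n_total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n_total + z * z / (4.0 * n_total ** 2)) / denom
    return centre - half, centre + half


# Hypothetical result: 45 of 50 test samples classified correctly.
low, high = wilson_interval(n_correct=45, n_total=50)
print(f"accuracy = {45 / 50:.2f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

For small test sets the interval is wide, which makes explicit how much of an apparent difference between two classifiers could be due to the limited number of test samples.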
For the study of ICU patients’ blood glucose data in Chapter 5, the
hypocounts were taken every 4 hours. As a result, the dynamics of the patients’
blood glucose are hard to capture accurately, since we do not know what happened
in between. Therefore, more frequent sampling is needed to increase the model
accuracy and to make the model suitable for tight glycaemic control. Further study
on a larger pool of patients with more frequent monitoring of blood glucose needs
to be done to validate the structure of the model and to determine other inputs
which may affect blood glucose values. The possibility of integrating the FOPTD
model with a first principles model is another issue that may be worth exploring.
Since first principles models capture some specific phenomena, they are not
amenable to expansion (via addition of new differential equations or new terms in
existing equations) when new uncharacterized variables/phenomena are encountered.
Therefore, one can think of developing hybrid models, using a first principles
model to capture the essential phenomena (structural support to the modeling
problem) augmented by FOPTD models for the new inputs. This hybrid method could be
very promising for blood glucose modeling.
References
Ackerman, E., L. C. Gatewood, J. W. Rosevear and G. D. Molnar (1965).
"Model Studies of Blood Glucose Regulation." The Bulletin of Mathematical
Biophysics 27: 21.
Anonymous (2001, 28 November 2008). "Internet World Stats: Usage and
Population Statistic." from .
Asuncion, A. and D. J. Newman (2007). UCI Machine Learning Repository.
University of California, Department of Information and Computer Science,
Irvine, CA.
Baba, K., R. Shibata and M. Sibuya (2004). "Partial correlation and
conditional correlation as measures of conditional independence." Australian
and New Zealand Journal of Statistics 46 (4): 657-664.
Bagui, S. C., S. Bagui, K. Pal and N. R. Pal (2003). "Breast cancer detection
using rank nearest neighbor classification rules." Pattern Recognition 36(1):
25-34.
Baldi, P., S. Brunak, Y. Chauvin, C. A. F. Andersen and H. Nielsen (2000).
"Assessing the accuracy of prediction algorithms for classification: an
overview." Bioinformatics 16(5): 412-424.
Beltrán, N. H., M. A. Duarte-Mermoud, M. A. Bustos, S. A. Salah, E. A.
Loyola, A. I. Peña-Neira and J. W. Jalocha (2006). "Feature extraction and
classification of Chilean wines." Journal of Food Engineering 75(1): 1-10.
Bergman, R. N., L. S. Phillips and C. Cobelli (1981). "Physiologic Evaluation
of Factors Controlling Glucose Tolerance in Man." Journal of clinical
investigation 68: 1456.
Berrueta, L. A., R. M. Alonso-Salces and K. Héberger (2007). "Supervised
pattern recognition in food analysis." Journal of Chromatography A 1158(1-2):
196-214.
Bertolini, M., A. Rizzi and M. Bevilacqua (2007). "An alternative approach to
HACCP system Implementation." Journal of Food Engineering 79: 1322-1328.
Bevilacqua, M., M. Braglia and R. Montanari (2003). "The classification and
regression tree approach to pump failure rate analysis." Reliability
Engineering & System Safety 79(1): 59-67.
Bevilacqua, M., F. E. Ciarapica and G. Giacchetta (2008). "Industrial and
occupational ergonomics in the petrochemical process industry: A regression
trees approach." Accident Analysis & Prevention 40(4): 1468-1479.
Bollie, V. W. (1961). "Coefficients of Normal Blood Glucose Regulation."
Journal of Applied Physiology 16: 783.
Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone (1983). Classification
and Regression Trees. Monterey, CA, Wadsworth International Group.
Brown, D. J. (2007). "Using a global VNIR soil-spectral library for local soil
characterization and landscape modeling in a 2nd-order Uganda watershed."
Geoderma 140(4): 444-453.
Canu, S., Y. Grandvalet, V. Guigue and A. Rakotomamonjy (2005). SVM and
Kernel Methods Matlab Toolbox. Perception Systèmes et Information.
Chase, G. J., X.-W. Wong, I. Singh-Levett, L. J. Hollingsworth, C. E. Hann, G.
M. Shaw, T. Lotz and J. Lin (2008). "Simulation and initial proof-of-concept
validation of a glycaemic regulation algorithm in critical care." Control
Engineering Practice 16(3): 271-285.
Chase, J. G., G. M. Shaw, J. Lin, D. C. V., C. Hann, T. Lotz, G. C. Wake and
B. Broughton (2005). "Targeted Glycemic Reduction in Critical Care Using
Closed-Loop Control." Diabetes Technology and Therapeutics 7: 274.
Chen, L.-H. and H.-D. Hsiao (2008). "Feature selection to diagnose a business
crisis by using a real GA-based support vector machine: An empirical study."
Expert Systems with Applications 35(3): 1145-1155.
Cheng, H. D., X. J. Shi, R. Min, L. M. Hu, X. P. Cai and H. N. Du (2006).
"Approaches for automated detection and classification of masses in
mammograms." Pattern Recognition 39(4): 646-668.
Chiang, L. H. and R. D. Braatz (2003). "Process monitoring using causal map
and multivariate statistics: fault detection and identification." Chemometrics
and Intelligent Laboratory Systems 65(2): 159-178.
Cobelli, C., G. Federspil, G. Pacini, A. Salvan and C. Scandellari (1982). "An
integrated mathematical model of the dynamics of blood glucose and its
hormonal control." Mathematical Biosciences 58(1): 27-60.
Cobelli, C. and A. Mari (1983). "Validation of mathematical models of
complex endocrine-metabolic systems. A case study on a model of glucose
regulation." Medical and Biological Engineering and Computing 21(4): 390-399.
Cover, T. and P. Hart (1967). "Nearest neighbor pattern classification."
Information Theory, IEEE Transactions on 13(1): 21-27.
Dahl, F. A. (2007). "Convergence of random k-nearest-neighbour imputation."
Computational Statistics & Data Analysis 51(12): 5913-5917.
de la Fuente, A., N. Bing, I. Hoeschele and P. Mendes (2004). "Discovery of
meaningful associations in genomic data using partial correlation coefficients."
Bioinformatics 20(18): 3565-3574.
Deconinck, E., T. Hancock, D. Coomans, D. L. Massart and Y. V. Heyden
(2005). "Classification of drugs in absorption classes using the classification
and regression trees (CART) methodology." Journal of Pharmaceutical and
Biomedical Analysis 39(1-2): 91-103.
Dickerson, R. N., C. E. Swiggart, L. M. Morgan, G. O. Maish Iii, M. A. Croce,
G. Minard and R. O. Brown (2008). "Safety and efficacy of a graduated
intravenous insulin infusion protocol in critically ill trauma patients receiving
specialized nutritional support." Nutrition 24(6): 536-545.
Dr. Earl H. Tilford, J. (2000). THE INFORMATION REVOLUTION AND
NATIONAL SECURITY T. E. Copeland.
Duda, R. O., P. E. Hart and D. G. Stork (2000). Pattern Classification. New
York, John Wiley.
Ebrahimi, N., E. Maasoumi and E. S. Soofi (1999). "Ordering univariate
distributions by entropy and variance." Journal of Econometrics 90(2): 317-336.
Eisen, M. B., P. T. Spellman, P. O. Brown and D. Botstein (1998). "Cluster
analysis and display of genome-wide expression patterns." Proceedings of the
National Academy of Sciences of the United States of America 95(25):
14863-14868.
Elkfafi, M., J. S. Shieh, D. A. Linkens and J. E. Peacock (1998). "Fuzzy logic
for auditory evoked response monitoring and control of depth of anaesthesia."
Fuzzy Sets and Systems 100(1-3): 29-43.
Evans, D. G., L. K. Everis and G. D. Betts (2004). "Use of survival analysis and
Classification and Regression Trees to model the growth/no growth boundary
of spoilage yeasts as affected by alcohol, pH, sucrose, sorbate and
temperature." International Journal of Food Microbiology 92(1): 55-67.
Fedele, G. (2008). "A new method to estimate a first-order plus time delay
model from step response." Journal of the Franklin Institute In Press,
Corrected Proof.
Filipic, B. and M. Junkar (2000). "Using inductive machine learning to
support decision making in machining processes." Computers in Industry
43(1): 31-41.
Fisher, R. A. (1936). "The use of multiple measurements in taxonomic
problems." Annals of Eugenics 7: 179-188.
Flores, M. J., J. A. Gámez and J. L. Mateo (2008). "Mining the ESROM: A
study of breeding value classification in Manchego sheep by means of attribute
selection and construction." Computers and Electronics in Agriculture 60(2):
167-177.
Friedman, J. H. (1999). "Greedy Function Approximation: A Gradient
Boosting Machine; technical report on Treenet."
Furey, T. S., N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer and D.
Haussler (2000). "Support vector machine classification and validation of
cancer tissue samples using microarray expression data." Bioinformatics
16(10): 906-914.
Gascoigne, B. (2008). "History of The Industrial Revolution." from
http://www.historyworld.net/wrldhis/PlainTextHistories.asp?historyid=aa37.
Granitto, P. M., F. Gasperi, F. Biasioli, E. Trainotti and C. Furlanello (2007).
"Modern data mining tools in descriptive sensory analysis: A case study with a
Random forest approach." Food Quality and Preference 18(4): 681-689.
Guyon, I., J. Weston, S. Barnhill and V. Vapnik (2002). "Gene Selection for
Cancer Classification using Support Vector Machines." Machine Learning
46(1): 389-422.
Halsall, P. (1997). Internet Modern History Sourcebook
Hong, J.-H. and S.-B. Cho (2008). "A probabilistic multi-class strategy of
one-vs.-rest support vector machines for cancer classification." Neurocomputing
71(16-18): 3275-3281.
Jagannathan, G. and R. N. Wright (2008). "Privacy-preserving imputation of
missing data." Data & Knowledge Engineering 65(1): 40-56.
Jerez-Aragonés, J. M., J. A. Gómez-Ruiz, G. Ramos-Jiménez, J. Muñoz-Pérez
and E. Alba-Conejo (2003). "A combined neural network and decision trees
model for prognosis of breast cancer relapse." Artificial Intelligence in
Medicine 27(1): 45-63.
Jiann Shing, S., D. A. Linkens and J. E. Peacock (1999). "Hierarchical
rule-based and self-organizing fuzzy logic control for depth of anaesthesia."
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE
Transactions on 29(1): 98-109.
Kelly, J. L., I. B. Hirsch and A. P. Furnary (2006). "Implementing an
Intravenous Insulin Protocol in Your Practice: Practical Advice to Overcome
Clinical, Administrative, and Financial Barriers." Seminars in Thoracic and
Cardiovascular Surgery 18(4): 346-358.
Kelly, M. (2001). "Overview of the Industrial Revolution - Industrial
Revolution."
Kitabchi, A. E., A. X. Freire and G. E. Umpierrez (2008). "Evidence for strict
inpatient blood glucose control: time to revise glycemic goals in hospitalized
patients." Metabolism 57(1): 116-120.
Kojima, T., K. Yoshikawa, S. Saga, T. Yamada, S. Kure, T. Matsui, T.
Uemura, Y. Fujimitsu, M. Sakakibara, Y. Kodera and H. Kojima (2008).
"Detection of Elevated Proteins in Peritoneal Dissemination of Gastric Cancer
by Analyzing Mass Spectra Data of Serum Proteins." Journal of Surgical
Research In Press, Uncorrected Proof.
Kressel, U. (1999). Pairwise classification and support vector machines.
Advances in Kernel Methods: Support Vector Learning. MA, MIT Press.
Kurt, I., M. Ture and A. T. Kurum (2008). "Comparing performances of
logistic regression, classification and regression tree, and neural networks for
predicting coronary artery disease." Expert Systems with Applications 34(1):
366-374.
Lam, Z. H., K. S. Hwang, J. Y. Lee, J. G. Chase and G. C. Wake (2002).
"Active insulin infusion using optimal and derivative-weighted control."
Medical Engineering & Physics 24(10): 663-672.
Linkens, D. A. and L. Vefghi (1997). "Recognition of patient anaesthetic levels:
neural network systems, principal components analysis, and canonical
discriminant variates." Artificial Intelligence in Medicine 11(2): 155-173.
Little, R. J. A. and D. B. Rubin (1986). Statistical Analysis with Missing Data.
New York, John Wiley.
Liu, K.-H. and D.-S. Huang (2008). "Cancer classification using Rotation
Forest." Computers in Biology and Medicine 38(5): 601-610.
Liu, W. Z., A. P. White, M. T. Hallissey and J. W. L. Fielding (1996).
"Machine learning techniques in early screening for gastric and oesophageal
cancer." Artificial Intelligence in Medicine 8(4): 327-341.
Loganathan, P., S. Lakshminarayanan and R. G. Pandu. (2008). Blood Glucose
Patient Modelling using First Principle Model. Singapore, ChBE NUS.
Magoulas, G. D. and A. Prentza (2001). "Machine learning in medical
applications." Lecture notes in artificial intelligence 2049: 300-307.
Mahesh, V. and S. Ramakrishnan (2007). "Assessment and classification of
normal and restrictive respiratory conditions through pulmonary function test
and neural network." Journal of Medical Engineering & Technology 31(4):
300 - 304.
Mahfouf, M. (2006). Intelligent systems modeling and decision support in
bioengineering Massachusetts, Artech house.
MATLAB (2005). 7.0.4 (Release 14).
McCabe, G. P. (1984). "Principal Variables." Technometrics 26: 137-144.
Moon, H., H. Ahn, R. L. Kodell, S. Baek, C.-J. Lin and J. J. Chen (2007).
"Ensemble methods for classification of patients for personalized medicine
with high-dimensional data." Artificial Intelligence in Medicine 41(3): 197-207.
Nayak, A. and R. J. Roy (1998). "Anesthesia control using midlatency auditory
evoked potentials." IEEE Transactions on Biomedical Engineering 45(4): 409-421.
Nunes, C. S., M. Mahfouf, D. A. Linkens and J. E. Peacock (2005). "Modelling
and multivariable control in anaesthesia using neural-fuzzy paradigms: Part I.
Classification of depth of anaesthesia and development of a patient model."
Artificial Intelligence in Medicine 35(3): 195-206.
Ogunnaike, B. A. and K. Mukati (2006). "An alternative structure for next
generation regulatory controllers: Part I: Basic theory for design, development
and implementation." Journal of Process Control 16(5): 499-509.
Oxford, U. (2005). Oxford Advanced Learner’s Dictionary. S. Wehmeier,
Oxford University Press.
Podgorelec, V., P. Kokol, M. M. Stiglic, M. Hericko and I. Rozman (2005).
"Knowledge discovery with classification rules in a cardiovascular dataset."
Computer Methods and Programs in Biomedicine 80(Supplement 1): S39-S49.
Polat, K. and S. Günes (2008). "Principles component analysis, fuzzy weighting
pre-processing and artificial immune recognition system based diagnostic
system for diagnosis of lung cancer." Expert Systems with Applications 34(1):
214-221.
Puckett, W. R. (1992). Dynamic modelling of Diabetes Mellitus. Department of
Chemical Engineering, University of Wisconsin-Madison. PhD.
Puckett, W. R. and E. N. Lightfoot (1995). "A model for multiple subcutaneous
insulin injections developed from individual diabetic patient data." Am J
Physiol Endocrinol Metab 269(6): E1115-1124.
Raghuraj Rao, K. and S. Lakshminarayanan (2007a). "Partial correlation
based variable selection approach for multivariate data classification
methods." Chemometrics and Intelligent Laboratory Systems 86(1): 68-81.
Raghuraj Rao, K. and S. Lakshminarayanan (2007b). "VPMCD: Variable
interaction modeling approach for class discrimination in biological systems."
FEBS Letters 581(5): 826-830.
Raghuraj Rao, K. and S. Lakshminarayanan (2007c). "Variable interaction
network based variable selection for multivariate calibration." Analytica
Chimica Acta 599(1): 24-35.
Raj Kiran, N. and V. Ravi (2008). "Software reliability prediction by soft
computing techniques." Journal of Systems and Software 81(4): 576-583.
Ramprasad, Y. (2004). Model Based Controllers for Blood Glucose Regulation
in Type 1 Diabetics. Chemical and Biomolecular Engineering. Singapore,
National University of Singapore. M.Eng: 92.
Razi, M. A. and K. Athappilly (2005). "A comparative predictive analysis of
neural networks (NNs), nonlinear regression and classification and regression
tree (CART) models." Expert Systems with Applications 29(1): 65-74.
Rilley, M. (1993). "Data-analysis using hot deck multiple imputation." The
Statistician 42(3): 307-313.
Roggo, Y., P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond and N. Jent
(2007). "A review of near infrared spectroscopy and chemometrics in
pharmaceutical technologies." Journal of Pharmaceutical and Biomedical
Analysis 44(3): 683-700.
Rousu, J., L. Flander, M. Suutarinen, K. Autio, P. Kontkanen and A.
Rantanen (2003). "Novel computational tools in bakery process data analysis:
a comparative study." Journal of Food Engineering 57(1): 45-56.
Sahan, S., K. Polat, H. Kodaz and S. Günes (2007). "A new hybrid method
based on fuzzy-artificial immune system and k-nn algorithm for breast cancer
diagnosis." Computers in Biology and Medicine 37(3): 415-423.
Salford Systems (2007a). TreeNet. San Diego, California, Salford Systems.
Salford Systems (2007b). CART. San Diego, California, Salford Systems.
Saraiva, P. M. and G. Stephanopoulos (1992). "Continuous process
improvement through inductive and analogical learning." AIChE Journal
38(2): 161-183.
Sharma, A. and K. K. Paliwal (2008). "Cancer classification by gradient LDA
technique using microarray gene expression data." Data & Knowledge
Engineering 66(2): 338-347.
Sokal, R. and F. J. Rohlf (1995). Biometry: The principles and practice of
statistics in biological research. New York, W. H. Freeman & Co.
Spurgeon, S. E. F., Y.-C. Hsieh, A. Rivadinera, T. M. Beer, M. Mori and M.
Garzotto (2006). "Classification and Regression Tree Analysis for the
Prediction of Aggressive Prostate Cancer on Biopsy." The Journal of Urology
175(3): 918-922.
Statnikov, A., C. F. Aliferis, I. Tsamardinos, D. Hardin and S. Levy (2005). "A
comprehensive evaluation of multicategory classification methods for
microarray gene expression cancer diagnosis." Bioinformatics 21(5): 631-643.
Steuer, R., J. Kurths, O. Fiehn and W. Weckwerth (2003). "Observing and
interpreting correlations in metabolomic networks." Bioinformatics 19(8):
1019-1026.
Tamaki, M., T. Shimizu, A. Kanazawa, Y. Tamura, A. Hanzawa, C. Ebato, C.
Itou, E. Yasunari, H. Sanke, H. Abe, J. Kawai, K. Okayama, K. Matsumoto, K.
Komiya, M. Kawaguchi, N. Inagaki, T. Watanabe, Y. Kanazawa, T. Hirose, R.
Kawamori and H. Watada (2008). "Efficacy and safety of modified Yale
insulin infusion protocol in Japanese diabetic patients after open-heart
surgery." Diabetes Research and Clinical Practice 81(3): 296-302.
Taylor, B. E., M. E. Schallom, C. S. Sona, T. G. Buchman, W. A. Boyle, J. E.
Mazuski, D. E. Schuerer, J. M. Thomas, C. Kaiser, W. Y. Huey, M. R. Ward,
J. E. Zack and C. M. Coopersmith (2006). "Efficacy and Safety of an Insulin
Infusion Protocol in a Surgical ICU." Journal of the American College of
Surgeons 202(1): 1-9.
Timm, N. H. (2002). Applied Multivariate Analysis. New York, Springer.
Tittonell, P., K. D. Shepherd, B. Vanlauwe and K. E. Giller (2008).
"Unravelling the effects of soil and crop management on maize productivity in
smallholder agricultural systems of western Kenya--An application of
classification and regression tree analysis." Agriculture, Ecosystems &
Environment 123(1-3): 137-150.
Toher, D., G. Downey and T. B. Murphy (2007). "A comparison of model-based and regression classification techniques applied to near infrared
spectroscopic data in food authentication studies." Chemometrics and
Intelligent Laboratory Systems 89(2): 102-115.
Tominaga, Y. (1999). "Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANNs, and k-NN." Chemometrics and Intelligent
Laboratory Systems 49(1): 105-115.
Umpierrez, G. E., A. Palacio and D. Smiley (2007). "Sliding Scale Insulin Use:
Myth or Insanity?" The American Journal of Medicine 120(7): 563-567.
Van den Berghe, G. (2003). "Insulin therapy for the critically ill patient."
Clinical Cornerstone 5(2): 56-63.
Vanhorebeek, I., C. Ingels and G. Van den Berghe (2006). "Intensive Insulin
Therapy in High-Risk Cardiac Surgery Patients: Evidence from the Leuven
Randomized Study." Seminars in Thoracic and Cardiovascular Surgery 18(4):
309-316.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, Springer.
Wang, J., K. N. Plataniotis, J. Lu and A. N. Venetsanopoulos (2008). "Kernel
quadratic discriminant analysis for small sample size problem." Pattern
Recognition 41(5): 1528-1538.
Wolberg, W. H. and O. L. Mangasarian (1990). "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology." Proc.
Natl. Acad. Sci. 87: 9193-9196.
Zhang, X., X. Song, H. Wang and H. Zhang (2008). "Sequential local least
squares imputation estimating missing value of microarray data." Computers
in Biology and Medicine 38(10): 1112-1120.
Zhou, Z.-H., Y. Jiang, Y.-B. Yang and S.-F. Chen (2002). "Lung cancer cell
identification based on artificial neural network ensembles." Artificial
Intelligence in Medicine 24(1): 25-36.
APPENDIX A. CV of the Author
MELISSA ANGELINE SETIAWAN
melissa.angeline.84@gmail.com
93372640
Blk 301 Bukit Batok Street 31 #04-05, S 650301
Female 24 • Christian Chinese • Indonesian, In process for Singapore PR
QUALIFICATION
• Able to work in a target-oriented, performance-centric environment
• Possess critical thinking and problem solving skills; quick learner
• Independent as well as team player, organized and self-motivated
Academic Qualification
Master of Engineering (thesis under examination)
January 2007- December 2008, National University of Singapore (NUS), Singapore
Department of Chemical and Biomolecular Engineering
Research Topic: Machine Learning and Data Analysis
• Proposed a new data driven model to design insulin advisory system for ICU
patients at NUH
• Data mining methods for food product classification
• Classification technique development to detect the depth of anesthesia (DOA) for
patients who undergo surgery
• Microarray data analysis
• Generated rules to differentiate nearly-identical cancer cells
• Development of a MATLAB toolbox for classification problems
• Information extraction from historical data for process optimization
GPA: 4.25 (5-point scale)
Bachelor of Engineering
2002-2006, Bandung Institute of Technology (ITB), Bandung
Industrial Engineering Faculty, Chemical Engineering Department
Major in Chemical Engineering, Food Technology and Bioprocess Engineering
Research Topic: Edible Oil from Indonesian Brassica species
• Extracted oil from Brassica seeds
• Performed oil characterization analysis
Project Design: Refined Bleached Deodorized Palm Kernel Oil (RBD PKO) production
from Palm Kernel
• Designed the most efficient process to convert raw material to product
• Designed the equipment involved in production line
• Designed all utility systems and required waste treatment processes
• Completed an economic analysis of the plant’s profitability
GPA: 3.83 (4-point scale)
Working Experience
• Teaching assistant (1 semester: 2007/2008) – Dept of Chemical and Biomolecular
Engineering, NUS: Tutor for one undergraduate module (Process Design 1)
• Teaching assistant (3 semesters: 2003/2004 and 2004/2005) – Chemical Engineering
Department: Tutor for three undergraduate modules (Chemical Engineering
Mathematics, Laboratory Technology for Chemical Engineering, and Transport
Phenomena)
• Internship for 1.5 months at Wall’s Ice Cream Factory, PT. Unilever Tbk,
Cikarang, Indonesia: Carried out one project on reducing paper wrapping waste in
the production line (optimization of raw material use and waste reduction)
Research Experience
• Data analysis, programming and modeling related to microarray data and medical
data sets
• Machine learning applications for industrial process improvement, both batch and
continuous
• Characterizing oil extracted from Indonesian Brassica species (edibility). The outcome
of this research is that the oil cannot be used as edible oil, but it can be processed
further to create a biofuel.
Computational Skills
• Bioinformatics: Real time Microarray data analysis
• Systems Biology: Survival analysis using clinical diagnostic data and Depth of
Anesthesia prediction in surgery.
• Software package: Worked intensively with MATLAB, CART and Treenet,
SignalMap, Affymetrix, HYSYS, and MS Office.
Participation in Seminars and Training
• Microteaching and Tutoring skills, July 2007
• Communication and Presentation skill workshop, NUS, 2007
• MATLAB and Simulink workshop, BTI, 2007
• Industrial Management and Business, PT. Sampoerna Tbk, Surabaya, May 2006
• Modern Biotechnology at ATMAJAYA University, Jakarta
Awards and Achievement
• Obtained the AUN-SEED Net Scholarship to pursue a Master’s degree in Chemical
Engineering at NUS.
• Graduated cum laude from Chemical Engineering Department, ITB: only 20% of
chemical engineering students fulfilled the cum laude criteria.
• Member of Indonesia Sampoerna Best Student 2006: only 80 students are nominated
from all Indonesian universities.
• First rank at Jakarta Senior High School Chemistry Competition, Dinas Pendidikan
Menengah dan Tinggi (equivalent to MOE in Singapore).
• Semifinalist at Junior High School Mathematics Competition.
Organizational experience and leadership
• Member of Church Music Ministry at GKI GunSa (period 1999-now) and BBPC
(2007-2009)
• Chairman of Easter Committee 2001 and vice chairman of Easter Committee 2008
• Organizing Committee Member of 3rd Regional Symposium on Membrane Science
and Technology: ITB
• Organizing Committee Member of National Plant Design Competition, ITB 2005
Publications/Presentations
• Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Partial
Correlation Metric based Classifier for Food Product Characterization”, accepted for
publication in Journal of Food Engineering (June, 2008)
• “Machine Learning in Medicine”, presented at the AUN/SEED-Net field-wise seminar
in Chemical Engineering, Thailand.
• “Decision Rules for Cancer Tumor Identification”, presented at the Graduate Student
Symposium in Biological and Chemical Engineering – 2007
• Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Performance of
data mining tools in classifying depth of anesthesia”, under review for publication in
Artificial Intelligence in Medicine.
• Setiawan Melissa, A., Rao Raghuraj, K. and S. Lakshminarayanan, “Variable
Interaction Structure Based Machine Learning Technique for Cancer Tumor
Classification”, to be presented at the International Conference on Biomedical
Engineering 2008
• Setiawan Melissa, A., Wulan Sari and Tatang H. Soerawidjaja, “Oil and fat from
Indonesian Brassica” (in Indonesian)
Other
• Excellent in English and Indonesian, both oral and written
• Excellent in piano, organ and keyboard
Availability: immediate
References:
Dr. S. Lakshminarayanan (M.Eng Supervisor at NUS)
Asst. Prof., Dept. of Chemical and Biomolecular Engg.
National University of Singapore, Singapore.
Tel: +65-65168484, Email : chels@nus.edu.sg
Rao Raghuraj
Research fellow Singapore-Delft Water Alliance
NUS, Singapore
Tel.: +65 6516 8304, email: cverr@nus.edu.sg
Dr. Tatang Hernas S. (B. Eng Supervisor at ITB)
Dept. of Chemical Engineering
Bandung Institute of Technology, Indonesia
Tel: +628122349474