Imputation Methods to Deal with Missing Values when Data Mining
Trauma Injury Data
Kay I Penny
Centre for Mathematics and Statistics, Napier University, Craiglockhart Campus,
Edinburgh, EH14 1DJ
k.penny@napier.ac.uk
Thomas Chesney
Nottingham University Business School, Jubilee Campus, Wollaton Road, Nottingham,
NG8 1BB
Thomas.Chesney@nottingham.ac.uk
Abstract.
Methods for analysing trauma injury
data with missing values, collected at a UK
hospital, are reported. One measure of injury
severity, the Glasgow coma score, which is
known to be associated with patient death, is
missing for 12% of patients in the dataset. In
order to include these 12% of patients in the
analysis, three different data imputation
techniques are used to estimate the missing
values. The imputed data sets are analysed by an
artificial neural network and logistic regression,
and their results compared in terms of sensitivity,
specificity, positive predictive value and negative
predictive value.
Keywords. Data mining, missing data
imputation, trauma injury.
1. Introduction
Trauma injury is the most common cause of
loss of life to those under forty [1]. In 1991 a
trauma system was put in place at the North
Staffordshire Hospital (NSH) in Stoke-on-Trent
in the U.K. It records injury details including
Injury Severity Score (ISS) [2], Abbreviated
Injury Scores (AIS) [3], the Glasgow Coma
Score (GCS) [4], the patient's sex and age,
management and interventions, and the outcome
of the treatment, including whether the patient
lived or died during their hospital stay.
North Staffordshire Hospital is a major
trauma centre in the area and receives patient
referrals from surrounding hospitals. Oakley [5]
analysed data for only the most severely injured
patients admitted between 1992 and 1998, and
found determinants of mortality for this subset of
patients included age, head AIS, chest AIS,
abdominal AIS, external injury AIS, mechanism
of injury, primary receiving hospital and
calendar year of admission. Further analysis
includes a comparison of several artificial neural
network (ANN) models and logistic regression
(LR) to predict death during hospital stay [6].
Factors found to be important in the modelling
were age, mechanism of injury, whether the
patient was referred from another hospital, and
several injury severity scores including GCS
motor and GCS verbal scores.
Missing data do not always cause concern
when using data mining techniques; however,
these data have 12% of GCS scores missing.
Applying the standard practice of complete-case
analysis therefore means that 12% of the dataset
has been excluded from the modelling since
these patients do not have recorded values for the
three GCS scores. Exclusion of this subset of
patients may lead to bias in the results, as
patients who have not had their GCS scores
recorded may not be a representative sample of
the population of trauma injury patients. For
example, these patients may tend to be more
seriously injured than the typical patient, so
that the scores were not recorded due to lack of
time, or they may have presented with a
different type or combination of injuries.
The aim of this research is to investigate the
accuracy of modelling patient death following
trauma injury in conjunction with missing value
imputation.
2. Methods
The study involves trauma audit data from
patients treated at the North Staffordshire
Hospital from 1993 to 1999 and from 2001 to
2004. The gap was due to lack of resources
which affected data collection during this period.
Only the most severely injured patients, i.e.
patients with an ISS greater than 15, are included
in this study, resulting in a total of 1658 patients
in the dataset. Hence these results are
generalisable to severely injured patients only.
Int. Conf. Information Technology Interfaces ITI, June, Cavtat, Croatia
Table 1. Factors considered for inclusion in the
analyses
Sex (male or female)
Age group (years): 0-15; 16-25; 26-35; 36-50; 51-70; over 70
Year of admission (1992-8, 2001-5)
Month of admission (Jan - Dec)
Day of admission (Mon - Sun)
Time of admission (0000-0359; 0400-0759; 0800-1159; 1200-1559; 1600-1959; 2000-2359)
Referred from another hospital (yes or no)
Mechanism of injury group: Motor vehicle crash; Fall greater than 2m; Fall less than 2m; Assault; Other
Type of trauma: blunt (yes or no); penetrating (yes or no)
Abbreviated injury scores (AIS): Head; Face; Neck; Chest; Abdomen; Upper limb; Lower limb; External; Spine; Cervical-spine; Thoracic-spine; Lumbar-spine
Glasgow coma scores (GCS): Eye response; Motor response; Verbal response
Factors considered for inclusion in the
analysis are summarised in Table 1. Two
different approaches to the statistical analysis of
these data were carried out: data mining using an
artificial neural network (ANN) and logistic
regression modelling (LR). All analysis was
carried out using the statistical packages SPSS
12, Clementine 7.0, and Solas 3.0.
2.1. Data Mining Methods
ANNs attempt to mimic the biological
structure and the connectivity of a natural neural
network, using the human brain as an analogy.
Inputs are fed through the neurons in the network,
which transform them to output a probability, in
this case the probability that a patient will die.
An exhaustive prune was used to create the
ANN. The network is a fully connected,
feed-forward multi-layer perceptron
which uses the sigmoid transfer function [7]. The
learning technique used is back propagation.
Under exhaustive pruning, starting with the given
topology, the network is trained, then a
sensitivity analysis is performed on the hidden
units and the weakest are removed. This
training/removing is repeated for a set length of
time. The ANN used in this study has 3 hidden
layers with 30, 20 and 10 neurons respectively
and the following learning parameters: alpha=0.9,
eta=0.3, as previous analysis found that this
architecture works well for trauma injury data
[6].
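The way inputs pass through such a network can be illustrated with a minimal sketch. This is not Clementine's implementation: it shows only the forward pass of a fully connected sigmoid network with hidden layers of 30, 20 and 10 neurons, using random untrained weights. The function names, the weight range and the choice of eight inputs are hypothetical, and training and pruning are omitted.

```python
import math
import random

def sigmoid(t):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-t))

def init_network(n_inputs, hidden=(30, 20, 10), seed=0):
    """Random, untrained weights for a fully connected network with one output."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    # One weight row per neuron; the final weight in each row is the bias.
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def predict_death_probability(network, inputs):
    """Feed the inputs forward layer by layer and return the single output."""
    activations = list(inputs)
    for layer in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron, activations)) + neuron[-1])
            for neuron in layer
        ]
    return activations[0]

# Hypothetical example: eight scaled input factors for one patient.
network = init_network(n_inputs=8)
p = predict_death_probability(network, [0.2] * 8)
print(round(p, 3))
```

Because the weights are untrained, the printed value has no clinical meaning; it only shows that the output lies between 0 and 1 and can be read as a probability.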
As well as data mining using an ANN, LR
modelling is included for comparison. The LR
models were developed to obtain a
parsimonious model, i.e. one with good predictive
ability that is as simple as possible. Hence this
approach is more subjective than the ANN.
In medical applications it is often the case that
a logistic regression model is developed using
the complete data set, and the model is then
tested on the same set of data used to build it.
However, it is not ideal to test a model with the
same data used to build it. To avoid this, and to
allow comparison with the data mining methods
presented in this paper, a k-fold cross-validation
technique was used to test all of the models, with
k set to five. This technique is good practice
when building neural networks with medical data
[8]. Using this technique the data were split into
five subsets. Four data subsets are used to train
each model, and the fifth is used to test it. This is
then repeated another four times so that each data
subset is used to test the models once.
When splitting the dataset, those patients who
lived were selected independently of those
patients who died, in order to keep the same
proportions of patients who died in each of the k
data subsets. This is necessary since the data
outcome variable, patient death, is very
imbalanced; 79% of patients lived and 21% died
during their hospital stay.
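The stratified splitting described above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the outcome list is made-up data with the same 79%/21% split, and dealing shuffled records round-robin into folds is one assumed way of keeping the proportions equal.

```python
import random

def stratified_folds(outcomes, k=5, seed=0):
    """Assign each record to one of k folds, splitting each outcome class separately."""
    rng = random.Random(seed)
    folds = [None] * len(outcomes)
    for label in sorted(set(outcomes)):
        idx = [i for i, y in enumerate(outcomes) if y == label]
        rng.shuffle(idx)
        # Deal indices round-robin so each fold gets an (almost) equal share of this class.
        for pos, i in enumerate(idx):
            folds[i] = pos % k
    return folds

# Made-up outcomes: 79% of patients lived and 21% died.
outcomes = ["lived"] * 790 + ["died"] * 210
folds = stratified_folds(outcomes, k=5)

for f in range(5):
    test_idx = [i for i, g in enumerate(folds) if g == f]
    train_idx = [i for i, g in enumerate(folds) if g != f]
    died = sum(outcomes[i] == "died" for i in test_idx)
    # Each fold is used once for testing; the other four folds train the model.
    print(f, len(train_idx), len(test_idx), died / len(test_idx))
```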
2.2. Missing value imputations
Previous work [6] compared the results of
four different ANN models as well as LR to
predict death during hospital stay following
injury. Both GCS motor and GCS verbal were
found to have high importance in two of the
ANNs, and GCS motor was statistically
significant in the LR model. In order for these
variables to be included in the models, 12% of
the sample, i.e. patients whose GCS scores were
not recorded, were excluded from the analysis.
Hence missing value imputation is considered
here in order that all patients can be included in
the modelling process. The GCS is a
measurement of severity of head injury and
comprises three components, each measured on
an ordinal scale: eye response (1-4), verbal
response (1-5) and motor response (1-6).
Three methods of data imputation are
considered in this study:
1. Hot-deck imputation
2. Predictive model-based imputation
3. Propensity score imputation
Hot-deck imputation [9] involves substituting
individual values drawn from patients with
observed data who are “similar” to the patient
with the missing value. In terms of the GCS
scores, this would involve imputing a GCS score
drawn from a subset of patients who are
“similar” to the patient with the missing GCS
score. In order to impute a particular GCS score,
this method sorts patients both with observed
values and those with missing values for this
score into a number of subsets according to a set
of covariates which are associated with the GCS
scores. In this application, the imputation subsets
comprise patients with the same values of the
injury severity scores: AIS head, AIS chest, AIS
lumbar spine and AIS cervical spine. Patients
with missing GCS scores will then have their
missing values replaced with observed values
selected at random, with replacement, from
patients in the same subset i.e. patients who are
similar with respect to these covariates. If there
are no observed values in the corresponding
subset of patients, then the subset is collapsed by
one level, and this process is repeated until an
observed value can be found.
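A minimal sketch of this hot-deck procedure is given below. It is not the Solas implementation: the function and field names are hypothetical, the records are made-up, and "collapsing by one level" is assumed here to mean dropping the last covariate from the matching key.

```python
import random

def hot_deck_impute(records, target, covariates, seed=0):
    """Fill missing `target` values (None) with values drawn from similar donors."""
    rng = random.Random(seed)
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no observed values to draw from")
    completed = []
    for r in records:
        r = dict(r)  # copy, so the original records are left unchanged
        if r[target] is None:
            keys = list(covariates)
            while True:
                pool = [d[target] for d in donors
                        if all(d[c] == r[c] for c in keys)]
                if pool:
                    break
                # Assumed meaning of "collapse by one level": drop the last covariate.
                keys.pop()
            # Random draw with replacement from the observed values in the subset.
            r[target] = rng.choice(pool)
        completed.append(r)
    return completed

# Made-up records: two AIS covariates and one GCS component.
records = [
    {"ais_head": 4, "ais_chest": 0, "gcs_motor": 5},
    {"ais_head": 4, "ais_chest": 0, "gcs_motor": 6},
    {"ais_head": 5, "ais_chest": 3, "gcs_motor": 1},
    {"ais_head": 4, "ais_chest": 0, "gcs_motor": None},  # exact donors exist
    {"ais_head": 5, "ais_chest": 2, "gcs_motor": None},  # needs one collapse
]
completed = hot_deck_impute(records, "gcs_motor", ["ais_head", "ais_chest"])
print([r["gcs_motor"] for r in completed])
```

Only originally observed values act as donors, so an imputed value is never reused to fill another gap.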
Predictive model-based imputation involves
imputing a missing value by using an ordinary
least-squares regression method to estimate a
missing GCS score. Firstly, a predictive model is
estimated from the observed data, which contains
no missing values for the GCS score of interest.
Let $Y$ be the GCS variable to be imputed, and let
$X$ be the same set of covariates used in the
hot-deck imputation listed above. Let $Y_{obs}$ be the
observed values in $Y$, $Y_{mis}$ be the missing values
in $Y$, and let $X_{obs}$ and $X_{mis}$ be the covariates
corresponding to $Y_{obs}$ and $Y_{mis}$ respectively. By
regressing $Y_{obs}$ on $X_{obs}$, predictions for
the missing values are obtained from the
equation:

$$\hat{Y}_{mis} = a + b X_{mis} \qquad (1)$$

Let $a$ represent the constant in the model, and
$b$ represent the vector of regression coefficients.
Using this estimated model, a random element is
incorporated in the estimate of the missing
values. Parameter values from the regression
model are drawn from their posterior distribution
given the data, using non-informative priors [10]
[11]. In this way, the extra uncertainty due to the
fact that the regression parameters can be
estimated, but not determined, from the observed
data is reflected.
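The idea of drawing the regression parameters from their posterior before predicting can be sketched for the simplest case of a single covariate. This is an assumption-laden illustration rather than the procedure used by Solas: the function name and data are hypothetical, only one covariate is used, and rounding the prediction into the valid ordinal range is an added assumption.

```python
import math
import random

def regression_impute(x_obs, y_obs, x_mis, lo, hi, seed=0):
    """Impute via equation (1) with (a, b) drawn from their posterior."""
    rng = random.Random(seed)
    n = len(x_obs)
    x_bar = sum(x_obs) / n
    y_bar = sum(y_obs) / n
    sxx = sum((x - x_bar) ** 2 for x in x_obs)
    b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_obs, y_obs)) / sxx
    sse = sum((y - y_bar - b_hat * (x - x_bar)) ** 2 for x, y in zip(x_obs, y_obs))
    # Non-informative prior: sigma^2 is SSE divided by a chi-square(n - 2) draw.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    # With x centred, intercept and slope are independent normals given sigma^2.
    centred_a = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    b = rng.gauss(b_hat, math.sqrt(sigma2 / sxx))
    a = centred_a - b * x_bar
    imputed = []
    for x in x_mis:
        y_hat = a + b * x  # equation (1) with the drawn parameters
        # Assumption: round and clamp to the valid ordinal range of the score.
        imputed.append(min(hi, max(lo, round(y_hat))))
    return imputed

# Made-up data: head AIS as the covariate, GCS motor (1-6) as the target.
x_obs = [0, 1, 2, 3, 4, 5, 5, 4, 3, 2]
y_obs = [6, 6, 5, 5, 3, 1, 2, 4, 4, 6]
imputed = regression_impute(x_obs, y_obs, x_mis=[1, 5], lo=1, hi=6)
print(imputed)
```

Running the function with different seeds gives slightly different imputations for the same patient, which is how the parameter uncertainty shows up in the imputed values.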
Propensity score imputation [12] is based on
the underlying assumption that the “missingness”
of an imputation variable can be explained by a
set of covariates using a logistic regression
model. A binary indicator variable is created to
represent whether the variable to be imputed is
missing or observed for each individual. This
indicator variable is the dependent variable in the
logistic regression modelling, and the
independent variables are a set of covariates
which is thought to be related to the variable to
be imputed. Using the regression coefficients
from the logistic regression model, the
propensity that a patient would have a missing
value can be calculated. The propensity score for
a patient is the conditional probability of
“missingness”, given the observed covariates.
Missing values of the imputation variable y are
imputed by values randomly drawn from a subset
of observed values of y, that is, its donor pool. In
this study, five donor pool subgroups have been
created. The patients in the dataset are sorted in
ascending order according to their assigned
propensity scores, and then divided into five
equal-sized subgroups according to their
propensity scores. For each missing value, an
observed value is selected for imputation, at
random with replacement, from the
corresponding donor pool.
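The propensity score steps can be sketched as below. This is a simplified illustration, not the Solas procedure: the logistic regression is fitted with a plain gradient-ascent loop, the function names and data are hypothetical, and falling back to all observed values when a subgroup has no donors is an added assumption.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def linear(w, xi):
    return w[0] + sum(c * v for c, v in zip(w[1:], xi))

def fit_logistic(X, z, steps=2000, lr=0.1):
    """Plain gradient-ascent logistic regression; returns [intercept, coef_1, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, zi in zip(X, z):
            err = zi - sigmoid(linear(w, xi))
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def propensity_impute(X, y, groups=5, seed=0):
    """Impute None entries of y from donors with similar propensity to be missing."""
    rng = random.Random(seed)
    z = [1 if v is None else 0 for v in y]      # missingness indicator
    w = fit_logistic(X, z)
    score = [sigmoid(linear(w, xi)) for xi in X]  # propensity scores
    order = sorted(range(len(y)), key=lambda i: score[i])
    observed = [v for v in y if v is not None]
    completed = list(y)
    n = len(y)
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        # Donor pool: observed values in this subgroup (assumed fallback: all observed).
        pool = [y[i] for i in members if y[i] is not None] or observed
        for i in members:
            if y[i] is None:
                completed[i] = rng.choice(pool)
    return completed

# Made-up data: head AIS as the only covariate, GCS motor with some values missing.
head_ais = [[i % 6] for i in range(60)]
gcs_motor = [None if (i % 6 >= 4 and i % 4 == 0) else 6 - (i % 6) for i in range(60)]
completed = propensity_impute(head_ais, gcs_motor)
print(sum(v is None for v in gcs_motor), sum(v is None for v in completed))
```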
2.3. Evaluation methods
The five-fold cross-validation design results
in five training datasets and five corresponding
validation datasets. Each of the three imputation
methods described above is applied to each of
these ten datasets and results are compared for
the ANN and the LR models. The overall
performance of a model under a particular
imputation method is then the mean performance
of the five validation data sets. In many data
mining efforts the evaluation criterion is the
overall accuracy, i.e. the percentage of correct
classifications made by an algorithm. However,
in medical data mining consideration must be
given to the percentage of false positives and
false negatives made. The evaluation criteria
included for testing the classification algorithms
are sensitivity (sens), specificity (spec), positive
predictive value (PPV) and negative predictive
value (NPV).
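For reference, the four criteria can be computed from the counts of true and false positives and negatives, as in the sketch below. The function name and the example predictions are hypothetical; death is treated as the positive class.

```python
def classification_metrics(died, predicted_died):
    """Sensitivity, specificity, PPV and NPV with death as the positive class."""
    pairs = list(zip(died, predicted_died))
    tp = sum(1 for d, p in pairs if d and p)          # died, predicted to die
    fn = sum(1 for d, p in pairs if d and not p)      # died, predicted to live
    fp = sum(1 for d, p in pairs if not d and p)      # lived, predicted to die
    tn = sum(1 for d, p in pairs if not d and not p)  # lived, predicted to live
    return {
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Made-up validation set: 10 deaths and 30 survivors.
died = [1] * 10 + [0] * 30
predicted = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 28
metrics = classification_metrics(died, predicted)
print(metrics)
```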
A cut-point of 0.5 is used in the logistic
regression modelling to allow comparability
between the three imputation methods. A
receiver operating characteristic (ROC) curve
analysis is carried out to compare the logistic
regression results.
3. Results
The results for the k-fold cross-validations for
each data-mining method applied to each of the
three sets of imputed data subsets are presented
in Table 2 along with the results when no
imputation (complete-case) was performed. The
mean accuracy measures of the five validation
datasets are given along with the between-
validation standard errors. The performance of
the complete-case analysis is included for
comparison.
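A summary of this kind can be computed as in the following sketch. The paper does not state the formula for the between-validation standard error; the sketch assumes it is the sample standard deviation of the five fold values divided by the square root of five. The fold values are made-up.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(fold_values):
    """Mean over folds and an assumed standard error: sample sd / sqrt(k)."""
    k = len(fold_values)
    return mean(fold_values), stdev(fold_values) / sqrt(k)

# Hypothetical sensitivities from five validation datasets.
fold_sens = [0.44, 0.48, 0.50, 0.42, 0.46]
m, se = mean_and_se(fold_sens)
print(round(m, 2), round(se, 3))
```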
For the LR modelling, there is very little
difference in performance between the three
missing data imputation methods, and all three
perform almost as well as the complete-case
model. Although the specificity for all three LR
results is high, the sensitivity measures are all
fairly low, with just over half of those who die
being predicted correctly. However, the cut-point
of 0.5 could be lowered to increase the sensitivity
of the models, at the cost of decreased specificity.
The results of the ROC analysis gave areas under
the curve and between-validation standard errors
of 0.86 (0.012) for both the hot-deck and the
model-based results, and 0.85 (0.013) for the
propensity scoring method, whereas the area
under the ROC curve for the complete-case
analysis was 0.89.
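The trade-off made by moving the cut-point, and the area under the ROC curve, can be illustrated with the sketch below. The predicted probabilities and outcomes are made-up, the function names are hypothetical, and the area is computed with the rank-based (Mann-Whitney) formulation rather than by any particular package.

```python
def sens_spec(probs, died, cut):
    """Sensitivity and specificity when predicting death for probability >= cut."""
    pred = [p >= cut for p in probs]
    tp = sum(1 for pr, d in zip(pred, died) if pr and d)
    fn = sum(1 for pr, d in zip(pred, died) if not pr and d)
    tn = sum(1 for pr, d in zip(pred, died) if not pr and not d)
    fp = sum(1 for pr, d in zip(pred, died) if pr and not d)
    return tp / (tp + fn), tn / (tn + fp)

def auc(probs, died):
    """Area under the ROC curve: the chance a death is ranked above a survivor."""
    pos = [p for p, d in zip(probs, died) if d]
    neg = [p for p, d in zip(probs, died) if not d]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities of death for four deaths and six survivors.
probs = [0.9, 0.7, 0.45, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05, 0.4]
died = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(sens_spec(probs, died, 0.5))  # higher specificity, lower sensitivity
print(sens_spec(probs, died, 0.3))  # lower cut-point: sensitivity up, specificity down
print(auc(probs, died))
```

Unlike sensitivity and specificity, the area under the curve does not depend on the chosen cut-point, which is why it is a convenient single summary for comparing the models.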
Similarly there is little difference between the
three imputation methods when modelling the
data with an ANN. However, all imputation
methods slightly improve the positive predictive
value of the ANN models compared with
complete-case analysis.
Table 2. Evaluations of Methods
Data mining /        Evaluation criteria
imputation method    Sens (SE)    Spec (SE)    PPV (SE)        NPV (SE)
ANN:
  hot-deck           46% (1.8)    92% (0.7)    0.61 (0.017)    0.86 (0.003)
  model-based        45% (2.2)    92% (0.5)    0.62 (0.014)    0.86 (0.004)
  propensity         41% (5.4)    93% (0.9)    0.61 (0.026)    0.85 (0.011)
  complete-case      58%          86%          0.53            0.88
LR:
  hot-deck           51% (1.8)    93% (0.7)    0.66 (0.017)    0.88 (0.003)
  model-based        51% (2.2)    93% (0.4)    0.67 (0.007)    0.88 (0.004)
  propensity         50% (1.1)    94% (0.6)    0.69 (0.020)    0.88 (0.002)
  complete-case      56%          94%          0.71            0.89
Table 3 contains a listing of the factors
included in the training models. Many of the
factors considered for inclusion in the models
(Table 1) are correlated with each other, hence
the models do not all identify the same subsets of
factors as having high importance (ANNs) or
statistical significance (LRs). A typical LR
model shows increased odds of death for patients
involved in a motor vehicle crash, with a blunt or
penetrating injury, of older age, not referred
from another hospital, and with a more severe
injury according to several AIS scores and the
three GCS scores. The three GCS scores were
often found to be statistically significant in the
training models, and all training models included
at least two of the GCS scores.
Ten factors included in a typical ANN
training model are listed in order of importance
(Table 3). Two GCS scores are important in this
model.
Table 3. Factors included in the training models
LR models              ANN models
Age group              AIS cervical spine
Patient referred       AIS thoracic spine
Mechanism of injury    AIS external
Blunt injury           GCS eye
Penetrating injury     GCS motor
GCS eye                AIS head
GCS motor              AIS spine
GCS verbal             AIS legs
AIS head               AIS face
AIS abdomen            Year of admission
AIS external
4. Conclusions
There is little distinction between the three
imputation methods in terms of results observed,
for both the LR and the ANN models. According
to the sensitivity and specificity measures, the
results from the imputations are almost as good
as the complete-case results, for both the LR and
ANN models. This is also confirmed by the ROC
analysis, which shows that the model from the
complete-case analysis (0.89) is slightly more
accurate than those based on the imputed data
(0.86, 0.86 and 0.85).
In this study, single imputation is used, i.e.
each missing value is replaced with a single
imputed value, and then the data are analysed as
for a complete-case analysis. The authors did
consider using multiple imputation techniques
[9], where each missing value is replaced with
$M \geq 2$ imputed values, resulting in M
completed datasets. The M complete-data
inferences can be combined to form one
inference that reflects the uncertainty due to
“missingness” under that model. Although
multiple imputation has not been used in this
application, the same missing values are
effectively estimated five times under the k-fold
cross-validation design, since a patient is
included in a validation dataset once and in a
training dataset four times. Since different
imputations are created for a particular missing
value for each of the different data subsets, an
element of between-imputation variability has
been incorporated into the results.
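For context, the standard way of combining M complete-data inferences is Rubin's rules [10], sketched below. Multiple imputation was not used in this study, so the estimates and variances shown are made-up and the function name is hypothetical.

```python
from statistics import mean, variance

def pool_estimates(estimates, variances):
    """Combine M complete-data estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Made-up estimates of one quantity from M = 3 completed datasets.
q_bar, total = pool_estimates([0.50, 0.52, 0.48], [0.01, 0.01, 0.01])
print(round(q_bar, 2), round(total, 5))
```

The between-imputation term is what inflates the total variance to reflect uncertainty due to the missing values.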
Although these results do not lead to more
accurate classification of patient death or
survival following trauma injury than the
complete-case analysis, they do allow
classification of patients whose Glasgow coma
scores are missing. These patients would not
have been included in either building or testing
the models in the complete-case analysis. In
other words, it would not have been possible to
make a prediction for a patient with missing GCS
values, whereas using imputation allows a
prediction to be made.
Further work to investigate how well the
different imputation methods correctly estimate
the missing GCS scores would be useful. One
approach would be to carry out a simulation
study using the complete-case data only, where a
subset of GCS scores is deleted to mimic the
pattern of missingness in the observed data. This
would allow an assessment of how well the
different imputation techniques estimate the
deleted GCS scores. Also, similar techniques
could then be applied to the whole trauma injury
dataset which includes patients with all levels of
injury severity, not only those most severely
injured with ISS > 15.
5. References
[1] The Trauma Audit and Research Network;
2006.
https://www.tarn.ac.uk/content/downloads/36/FirstDecade.pdf
[23/01/06].
[2] Baker SP, O'Neill B, Haddon Jr W, Long
WB. The injury severity score: a method for
describing patients with multiple injuries
and evaluating patient care. Journal of
Trauma 1974; 14: 187-96.
[3] Association for the Advancement of
Automotive Medicine. The abbreviated
injury scale, 1990 revision. Des Plaines, IL:
Association for the Advancement of
Automotive Medicine; 1990.
[4] Teasdale G, Jennett B. Assessment of coma
and impaired consciousness. A practical
scale. Lancet 1974; (ii): 81-3.
[5] Oakley PA, Mackenzie G, Templeton J, Cook
AL, Kirby RM. Longitudinal trends in
trauma mortality and survival in Stoke-on-Trent
1992-1998. Injury 2004; 35: 379-85.
[6] Chesney T, Penny K, Oakley P, Davies S,
Chesney D, Maffulli N, Templeton J. Data
mining medical information: Should
artificial neural networks be used to analyse
trauma audit data? Int J of Healthcare
Information Systems and Informatics 2006;
1(2): 51-64.
[7] Watkins D. Clementine's Neural Networks
Technical Overview; 1997.
http://www.cs.bris.ac.uk/~cgc/METAL/Consortium/secure/neural_overview.doc
[12/01/06].
[8] Cunningham P, Carney J, Jacob S. Stability
problems with artificial neural networks and
the ensemble solution. Artificial Intelligence
in Medicine 2000; 20(3): 217-25.
[9] Little RJA, Rubin DB. Statistical Analysis
with Missing Data. New Jersey: John Wiley
& Sons; 2002.
[10] Rubin DB. Multiple Imputation for
Nonresponse in Surveys. New York: John
Wiley; 1987.
[11] Gelman A, Carlin J, Stern H, Rubin DB.
Bayesian Data Analysis. New York:
Chapman and Hall; 1995.
[12] Rosenbaum PR, Rubin DB. The central role
of the propensity score in observational
studies for causal effects. Biometrika 1983;
70: 41-55.