12 Evaluation Strategies for Medical-Image Analysis and Processing Methodologies

Maria Kallergi

CONTENTS

12.1 Introduction
12.2 Validation Models and Clinical Study Designs
12.3 Clinical Performance Indices
12.4 Nonobserver Evaluation Methodologies
    12.4.1 Computer ROC Test
    12.4.2 Computer FROC Test
    12.4.3 Segmentation Validation Tests
12.5 Observer Evaluation Methodologies
    12.5.1 ROC Test
    12.5.2 LROC Test
    12.5.3 FROC Test
    12.5.4 AFC and MAFC Tests
    12.5.5 Preference Tests
12.6 Study Power and Biases
    12.6.1 Database Generation
        12.6.1.1 Database Contents and Case/Control Selection
        12.6.1.2 Database Size and Study Power
        12.6.1.3 Ground Truth or Gold Standard
        12.6.1.4 Quality Control
    12.6.2 Algorithm Training and Testing and Database Effects
    12.6.3 Estimation of Performance Parameters and Rates
    12.6.4 Presentation Setup
    12.6.5 Statistical Analysis
12.7 Discussion and Conclusions
Acknowledgments
References

12.1 INTRODUCTION

Image-processing and pattern-recognition methodologies have found a variety of applications in medical imaging and diagnostic radiology. Medical-image processing has been an area of intensive research in the last two decades with remarkable results. A variety of classical methodologies from the signal-processing and pattern-recognition domains, as well as new ones, have been implemented and tested for diverse applications. Based on the output, the various approaches can be categorized in one of the three groups shown in the block diagram in Figure 12.1. These groups involve one of the following processes:

Image analysis can be defined as the process where the input to an operator is an image and the output is a measurement. This group includes such processes as automated detection and diagnosis of disease, organ area and volume segmentation, size measurements, and risk estimates [1–6].

Image processing can be defined as the process where the input to an operator is an image and the output is another image with similar contents to the original but different in appearance. This group includes such processes as image enhancement, restoration, compression, registration, and reconstruction [7–10].

Image understanding can be defined as the process where the input to an operator is an image and the output is a different level of description, such as transforms and pixel mappings [11].

Depending on the goal of the application, the operator in Figure 12.1 could be a signal-processing algorithm, a pattern-recognition algorithm, a contrast-enhancement or noise-reduction function, a transformation, a mathematical measurement, or combinations of these.

FIGURE 12.1 Block diagram of the various medical-image processes (image in, operator, then image out, measurement out, or data transform). Depending on the operator type, the output may be an image, a measurement, or a transformation.

The most extensive and successful development so far has occurred in the fields of computer-aided detection (CAD detection) and computer-aided diagnosis (CAD diagnosis), i.e., in the image-analysis field, with image enhancement following closely behind. CAD detection is now a clinical reality for breast and lung cancer imaging.
Several commercial systems are now available for breast cancer imaging with screen/film (SFM) or full-field direct digital mammography (FFDM) [2]. Similar systems are currently in beta testing stages for lung cancer imaging using computed radiography, standard chest radiography, or computed tomography (CT).

CAD detection usually refers to the process where areas suspicious for disease are automatically detected on medical images, and their locations are pointed out to the observer for further review [1, 2]. In addition to pointing out the location of a potential abnormality, CAD detection algorithms may include a segmentation step, namely a process where the margins of the detected lesion, such as lung nodules in lung cancer images or calcifications and masses in mammograms, are outlined, and the outline may be presented to the reader instead of merely a pointer to the lesion's location [12]. CAD diagnosis differs from CAD detection in that the detected lesions (found either by the observer or by the computer) are differentiated (classified) into groups of disease and nondisease lesions [13, 14]. In this chapter, following historical precedence, the plain CAD term will be used to refer to both technologies, i.e., both detection and diagnosis algorithms, but we will differentiate by adding a detection or diagnosis extension to the term where a specific and unique reference is required.

As new medical-image analysis and processing tools become available and new versions of existing algorithms appear in the market, the validation of the new and updated methodologies remains a critical issue with ever-increasing complications and needs. The general goal of validation is twofold: (a) ensure the best possible performance (efficacy) of each step of the process outlined in Figure 12.1 that would yield optimum output results, and (b) determine the real-world impact of the entire process (effectiveness) [15]. The first goal is usually achieved in the laboratory with retrospective patient data of proven pathology and disease status and various statistical analysis tools that do not involve human observers or experts. The second goal usually requires the execution of clinical trials that involve experts and usually prospective patient data. Clinical studies are, in most medical applications, inevitable and are the gold standard in medical technology validation. However, the laboratory or nonobserver studies that precede them are critical in establishing the optimum technique that will be tested by the observers so that no funds, time, or effort are wasted [15, 16]. Furthermore, laboratory tests are sufficient when validating updated versions of algorithms once the original versions have demonstrated their clinical significance.

This chapter will not elaborate on the aspects of clinical trials or theoretical validation issues. Rather, it focuses on the major and practical aspects of the preclinical and clinical evaluation of diagnostic medical-image analysis and processing methodologies and computer algorithms. We will further narrow down our discussion to selected tests and performance measures that are currently recognized as the standard in the evaluation of computer algorithms that are designed to assist physicians in the interpretation of medical images. We will discuss observer vs. nonobserver tests and ROC vs. non-ROC tests and related interpretation and analysis aspects. Our goal is to provide a basic and practical guide to the methods commonly used in the validation of computer methodologies for medical imaging in an effort to improve the evaluation of these techniques, advance development, and facilitate communication within the scientific community.
Section 12.2 provides a brief overview of the current validation models and designs of clinical trials. Section 12.3 introduces the standard performance measurements and tests applicable in medical imaging. Section 12.4 summarizes the most important nonobserver validation methodologies that usually precede the observer-based validation techniques described in Section 12.5. Section 12.6 discusses practical issues in the implementation of the various validation strategies. Conclusions and new directions in validation are summarized in Section 12.7.

12.2 VALIDATION MODELS AND CLINICAL STUDY DESIGNS

Entire industry conferences are dedicated to issues of validation and clinical study design, including the annual meetings of the Medical Image Perception Society (MIPS) and the Medical Imaging Symposium of the Society of the Photo-optical Instrumentation Engineers (SPIE). At least two workshops have also been organized in the U.S. since 1998 on clinical trial issues for radiology, sponsored by the U.S. Public Health Service's Office on Women's Health, the National Cancer Institute, and the American College of Radiology. One workshop, entitled Methodological Issues in Diagnostic Clinical Trials: Health Services and Outcome Research in Radiology, was held on March 15, 1998, in Washington, DC, and participating papers were published in a dedicated supplement issue of Academic Radiology [17]. A second workshop, entitled Joint Working Group on Methodological Issues in Clinical Trials in Radiological Screening and Related Computer Modeling, was held on January 25, 1999, and yielded recommendations on various aspects of clinical trials, a summary of which can be found at http://www3.cancer.gov/bip/method_issues.pdf.

Validation models usually start with tests of the diagnostic performance of the imaging modality or computer methodology, followed by measurements of the clinical impact or efficacy of the diagnostic test on patient management and follow-up, and ending with broader clinical studies on patient health effects (morbidity and mortality) and societal impact, including cost analysis. Clinical study types are usually differentiated by the nature of the patient data used and can be categorized as: (a) observational vs. experimental, (b) cohort vs. case control, and (c) prospective vs. retrospective. There is an extensive, in-depth bibliography on the various aspects of clinical studies, the various types, and their advantages and disadvantages [18–20]. An excellent glossary summary of the various terms encountered in clinical epidemiology and evidence-based medicine is given by Gay [21].

Fryback and Thornbury proposed a six-tiered hierarchical model of efficacy that is now embraced by the medical-imaging community involved in outcomes research and technology assessment [15, 17, 22]. Different measures of analysis are applied at the various levels of the model. Level 1 is called "technical efficacy" and corresponds to the "preclinical evaluation" stage. In this level, the technical parameters of a new system are defined and measured, including resolution and image noise measurements, pixel distribution characteristics, probability density functions, and error and standard deviation estimates [15, 22].
Clinical efficacy is measured in the next three levels of the model, with tests to determine the "diagnostic accuracy efficacy" (Level 2), the "diagnostic thinking efficacy" (Level 3), and the "therapeutic efficacy" (Level 4) [15, 22]. Levels 2 and 3 correspond to what imaging scientists often term "clinical evaluation" and include measurements of performance parameters and observer experiments that are the focus of this chapter and will be further discussed in the following subsections. Level 4 is more specific to therapy-related systems and is not within the scope of this discussion, which deals with diagnostic systems. Level 5 deals with "patient outcome efficacy" and Level 6 with "societal efficacy" [15], both beyond the scope of this review.

This six-tiered model provides an excellent guide for pharmaceutical and therapy trials. Its extension to screening and diagnostic medical-imaging technologies is less straightforward due to the unique characteristics of the target population, the diversity of the applications, the observer variability, and the issues of low prevalence for several disease types, including cancer. In some cases the model appears to be noninclusive; in other cases it is not entirely applicable or is not linearly applicable. Hendee [23] suggested the expansion of the model to include a factor related to the development stage or phase of evolution of the validated technology. This may lead to a model more applicable to imaging. Another approach recommended for medical-imaging technology validation was developed by Phelps and Mushlin [23, 24]. This approach is recommended as a way to define "challenge regions" and as a preliminary step guiding the design of the more expensive and time-consuming clinical trials to test the efficacy of the technology as proposed by Fryback and Thornbury [15]. The Phelps and Mushlin model, however, seems to be limited in scope and applicability, and an expansion is necessary to accommodate a broader spectrum of imaging technologies [23].

Different clinical study designs may be applicable to Levels 2 and 3 of the Fryback and Thornbury model. The most commonly used design is the observational, case-control, retrospective study that could use a variety of performance measures. The current standard for these studies in medical imaging is the receiver operating characteristic (ROC) experiment, with the corresponding measure being the ROC curve [25, 26]. ROC experiments are time consuming and expensive. Hence, non-ROC approaches are explored and applied either as less-expensive precursors or as replacements to the more extensive and expensive ROC studies. Non-ROC studies may or may not involve observers. The selection of one method over the other depends on the application and the question to be answered.

There is a vast literature on CAD development. Numerous algorithms have been reported, and the majority of reports include some type of validation that depends on the investigators' knowledge of the field but mostly on the medical and statistical resources available at the time. The lack of agreement on "appropriate" methodologies leads to a lack of standard criteria and of a "how-to" guide that could significantly improve scientific communications and comparisons. Only recently do we find publications that present broader methodological issues of validation and offer some guidelines. Nishikawa [27] discusses the differences in the validation of CAD detection and CAD diagnosis methodologies and offers a good summary of the ways ROC and free-response ROC (FROC) analyses, computer- or observer-based, can be used in algorithm validation.
Houn et al. [28] and Wagner et al. [29] discuss issues of ROC study design and analysis in the evaluation of breast cancer imaging technologies particular to the U.S. Food & Drug Administration (FDA) concerns but also applicable to the broader scientific community. King et al. [30] present alternative validation approaches through observer-based non-ROC studies. This chapter follows the spirit of these latest efforts. It attempts to provide a short, practical guide through the maze of problems and methodologies associated with the validation of medical-image analysis and processing methodologies, in the form of a summary of the most critical elements of validation and the most "popular" and "recognized" methodologies in the field. The prerequisite for this chapter is that the reader be familiar with the basic theoretical concepts of ROC analysis, which plays a major role in medical-image validation studies. There is a vast literature in the field, and there are several Web sites with free ROC software and lists of related articles that the novice reader could use to become familiar with the topic [31, 32].

12.3 CLINICAL PERFORMANCE INDICES

The clinical performance of a medical test, including imaging, is usually determined by estimating indices for the true positive (TP), true negative (TN), false positive (FP), false negative (FN), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and accuracy. In medical imaging, the response to the question, "Is there a signal in the image or not?" or "Is there disease present in the image or not?" is given by a human observer or by a computer. The answer to these questions is often depicted in the terms presented in Table 12.1, borrowed from signal-detection theory [33].

TABLE 12.1 Clinical Performance Indices

                                            Signal or Disease Present    Signal or Disease Absent
Observer or computer response positive      Hit (TP)                     False alarm (FP)
Observer or computer response negative      Miss (FN)                    Correct rejection (TN)

Source: Beytas, E.M., Debatin, J.F., and Blinder, R.A., Invest. Radiol., 27, 374, 1992. (With permission.)

A TP is a case that is both test positive and disease positive. Test here represents the outcome of the observer or the computer process. A TN is a case that is both test negative and disease negative. A FP is a case that is test positive but disease negative. Such case misclassification is undesirable because it has a major impact on health-care costs and health-care delivery. These cases are equivalent to a statistical Type I error (α). A FN is a case that is test negative but disease positive. Such case misclassification is undesirable because it leads to improper patient follow-up and missed cases with disease. These cases are equivalent to a statistical Type II error (β).

Sensitivity is the probability of a positive response for the cases with presence of signal or disease, and it is defined as

    SENS = TP / (TP + FN)

Specificity is the probability of a negative response for the cases with absence of signal or disease, and it is defined as

    SPEC = TN / (TN + FP)

Positive and negative predictive values of radiological tests are then defined as

    PPV = TP / (TP + FP);    NPV = TN / (TN + FN)
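To make these definitions concrete, the short sketch below computes the four indices from raw TP, FP, TN, and FN counts. It is an illustration only; the function name and the example counts are hypothetical and are not taken from the chapter.

```python
def performance_indices(tp, fp, tn, fn):
    """Compute standard clinical performance indices from 2x2 counts."""
    sens = tp / (tp + fn)          # sensitivity (true positive fraction)
    spec = tn / (tn + fp)          # specificity (true negative fraction)
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return {"SENS": sens, "SPEC": spec, "PPV": ppv, "NPV": npv}

# Hypothetical counts from a reading study: 85 hits, 15 misses,
# 50 false alarms, 950 correct rejections.
print(performance_indices(tp=85, fp=50, tn=950, fn=15))
```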
PPV and NPV depend on sensitivity and specificity but are also directly related to prevalence, namely the proportion of cases in the test population with signal or disease, which is defined as

    PR = (TP + FN) / (TP + FP + TN + FN)

The higher the prevalence, the higher the predictive value. Accuracy depends linearly on prevalence, and it is defined as

    ACCURACY = PR × (SENS − SPEC) + SPEC

Accuracy is equal to specificity at 0% prevalence and is equal to sensitivity at 100% prevalence.

Note that for oncology applications, one needs to be a little more explicit on what can be considered a positive response, because a positive interpretation may be an interpretation that leads to the recommendation for biopsy or an interpretation where a suspicious finding is identified and further work-up is requested before biopsy is recommended. These two definitions lead to different estimates of the sensitivity, specificity, and predictive values and need to be carefully reviewed prior to the design of a validation experiment in this field.

A condition that is often considered in medical studies and causes some confusion in their design is incidence, and this is worthy of a brief discussion here. Incidence is the proportion of new cases in the test population with the signal or disease of interest. The incidence rate is a smaller number than the prevalence rate because the latter includes old and new cases having the disease within a certain period of time (usually one year). The use of the incidence or prevalence rate to configure a study population depends on the study aims, the imaging modality, and the tested parameters. In CAD validation experiments, the incidence-vs.-prevalence dilemma may be bypassed altogether by focusing on sensitivity and specificity estimates and avoiding PPV and accuracy measurements that depend on prevalence.

Validation of medical-image-processing schemes aims at relative or absolute estimates of one or more of the above indices of performance before and after the process is applied; sensitivity and specificity are usually the parameters most often targeted. Theoretically, one should be able to estimate these parameters accurately for any diagnostic procedure with a sufficiently large sample size. But the latter was and continues to be the biggest, and often unsurpassable, obstacle in medical-imaging research. For example, a prohibitively large sample size is required to evaluate the impact of a CAD detection algorithm on mammography's sensitivity using standard statistical methods. Specifically, approximately 10,000 screening mammograms are needed to detect a change in sensitivity of 0.05 caused by the use of a CAD system, from 0.85 to 0.90, with a standard error of 5%, assuming that breast cancer incidence is 0.5% (i.e., 5 out of 1000 screened women will have breast cancer) [16]. Similar estimates are obtained for other imaging modalities and processes. Consequently, statistical methodologies such as the ROC type of tests are highly desirable because they require significantly fewer resources than classical statistical approaches, and their results can be used to determine the above performance indices.

ROC curves, for example, combine SENS and (1 − SPEC) data in the same plot for different test cutoff values. Hence, the curves can be used to establish the best cutoff for a test with variable parameters. The optimum cutoff depends on the relative costs of FP and FN cases.
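The dependence of the cutoff on error costs can be made explicit with a small sketch. Assuming a set of (FPF, TPF) operating points is already available, together with rough relative costs for FP and FN decisions and the disease prevalence (all values below are hypothetical), the expected cost per case can be minimized directly:

```python
# Pick the ROC operating point with the lowest expected cost per case,
# given disease prevalence and the relative costs of FP and FN errors.
def best_operating_point(points, prevalence, cost_fp=1.0, cost_fn=10.0):
    """points: list of (fpf, tpf) pairs along the ROC curve."""
    def expected_cost(fpf, tpf):
        fn_rate = 1.0 - tpf
        return (prevalence * fn_rate * cost_fn
                + (1.0 - prevalence) * fpf * cost_fp)
    return min(points, key=lambda p: expected_cost(*p))

# Hypothetical operating points (FPF, TPF) and a 0.5% prevalence:
curve = [(0.05, 0.70), (0.10, 0.85), (0.20, 0.90), (0.40, 0.95)]
print(best_operating_point(curve, prevalence=0.005, cost_fp=1.0, cost_fn=50.0))
```

With a low prevalence, even a heavily weighted FN cost can favor a conservative cutoff, which is one reason the relative-cost assumption deserves explicit justification in a study design.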
Accuracy could also be determined by a single point on an ROC curve. However, accuracy is a composite index (it depends on prevalence) and could generate confusion, as mentioned earlier, so it is better avoided and replaced by the sensitivity and specificity indices instead, which are prevalence independent.

In addition to the sample size, the availability of expert observers to participate in a study is often another major obstacle in the validation process. Hence, there is a need for nonobserver validation strategies that could still measure performance indices without the experts and without large sample sizes. Computer ROC and FROC are two such methods that will be discussed in more detail in the following sections.

12.4 NONOBSERVER EVALUATION METHODOLOGIES

Nonobserver evaluation methodologies are primarily used for the optimization and validation of a computer algorithm before testing its clinical efficacy. They are the first step toward final development and provide valuable information to the researcher on the direction of the work and the likelihood of its success. These approaches are usually low-cost, easy, and fast to implement. They may not yield the higher power of the observer-based studies, but they provide sufficient information to optimize the methodology and ensure that the best technique will be tested clinically. The list of techniques presented in this section is by no means comprehensive. It includes, however, the most commonly used nonobserver methodologies and those that are accepted for the validation of medical-image analysis and processing schemes.

It should be noted that measurements of the physical image quality parameters, as in the case of image display or restoration techniques [34], and mathematical error analysis, as in the case of compression techniques [8], might also be considered nonobserver validation techniques. However, these measurements usually precede the nonobserver experiments described in this section. Physical and mathematical error analysis is specific to the algorithm and application, and these will not be discussed in this chapter, the only exception being error-analysis issues pertaining to the validation of image-segmentation techniques. Image segmentation holds a major role in medical-image analysis and processing and poses unique challenges in validation. In this chapter, we will give an overview of these challenges and of the options and metrics available and commonly used for segmentation validation.

12.4.1 COMPUTER ROC TEST

Computer ROC analysis is an adaptation of the standard observer ROC analysis that will be discussed in more detail in the following section [26, 35]. In this form, ROC principles are implemented for the laboratory testing of pattern-recognition and classification algorithms [27]. Classification schemes usually differentiate between two conditions, such as benign and malignant lesions, diseased and nondiseased cases, or disease type 1 and disease type 2 cases. Pairs of sensitivity and specificity indices can thus be generated by adjusting an algorithm's parameters and setting conventions on how the numbers of correctly and incorrectly classified cases are to be determined. The results are plotted as a true positive fraction (TPF) vs. false positive fraction (FPF) curve using standard ROC analysis software [32].
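Assuming a classifier that outputs a continuous score for each case, a minimal sketch of this threshold sweep is shown below. The scores, labels, and the trapezoidal area estimate are illustrative only; they are not a substitute for the maximum-likelihood curve fitting performed by the standard ROC software cited above.

```python
import numpy as np

def computer_roc(scores, labels):
    """Sweep the decision threshold over classifier output scores and return
    the (FPF, TPF) operating points plus a trapezoidal area estimate.
    labels: 1 = signal/disease present, 0 = absent."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    fpf, tpf = [0.0], [0.0]
    for t in np.unique(scores)[::-1]:             # thresholds, high to low
        called_pos = scores >= t
        tpf.append((called_pos & (labels == 1)).sum() / pos)
        fpf.append((called_pos & (labels == 0)).sum() / neg)
    fpf, tpf = np.array(fpf), np.array(tpf)
    area = np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2)   # trapezoid rule
    return fpf, tpf, area

# Hypothetical malignancy scores for six malignant (1) and six benign (0) clusters:
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.50, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
fpf, tpf, auc = computer_roc(scores, labels)
print("Empirical area under the computer ROC curve:", round(float(auc), 2))
```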
Figure 12.2 shows typical computer ROC curves obtained from the preclinical computer ROC evaluation of four CAD diagnosis systems that differentiate between benign and malignant mammographic microcalcification clusters [13, 36]. The global, regional, and local metrics of the standard observer ROC analysis can also be used to quantify absolute and relative performance in computer ROC experiments. These metrics include:

1. The area under the curve (global performance index), which ranges from 0.5 to 1, where 0.5 corresponds to random responses (guessing) and 1 to the ideal observer [26, 27]. The curves of Figure 12.2 all have areas greater than 0.9.

2. The partial area under the curve (regional performance index), which is estimated at selected sensitivity or specificity thresholds, e.g., 0.9 TPF or 0.1 FPF, and provides more meaningful results in clinical applications where high sensitivity is desirable and needs to be maintained [37]. The partial sections of the curves in Figure 12.2 at a 0.9 TPF threshold are shown in Figure 12.3. There is no publicly available software today for estimating the area under these curves. However, a polygon method [25] or the method described by Jiang et al. [37] can be implemented for this purpose.

3. Operating points (local performance indices), i.e., selected (TPF, FPF) pairs that provide insight on the potential clinical impact and benefits of the method.

FIGURE 12.2 Computer ROC curves (true positive fraction vs. false positive fraction for Systems #1 to #4) obtained from the laboratory evaluation of four CAD diagnosis schemes designed to differentiate between benign and malignant microcalcification clusters in digitized screen/film mammography.

FIGURE 12.3 Partial curves used to estimate the partial area indices of the computer ROC data shown in Figure 12.2.

12.4.2 COMPUTER FROC TEST

Computer FROC is the laboratory adaptation of the observer FROC analysis, which will also be discussed in more detail below. Computer FROC is the method of choice

12.6.1.2 Database Size and Study Power

Sample size estimation requires five parameters:

1. Statistical significance level α, or Type I error, or FP rate
2. Power (1 − β), where β is the Type II error or FN rate
3. Treatment 1 performance or effect
4. Estimate of treatment 2 performance or effect
5. Estimate of the standard deviation, if dealing with means and treatment differences

For most studies, α = 0.05 (or 5% significance level) and β = 0.2 (or 80% power). Treatment 1 is the standard of practice, and treatment 2 is the new methodology that will be tested against the standard. For a study in lung cancer imaging, for example, treatment 1 might be chest radiography and treatment 2 helical CT imaging. For breast cancer imaging, treatment 1 might be mammography and treatment 2 mammography with CAD. The effect of treatment 1 is usually found in the clinical literature. The effect of the new treatment is estimated either from pilot studies or by defining a clinically important effect. The latter can be estimated by considering the effect required to change the current clinical practice. Remember that justification is necessary. Simply stating a desired effect is not only insufficient, but it also risks being unrealistically high and could lead a study to failure. Based on the five parameters above, tables or standard statistical equations or software can be used for sample size estimates [97].
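As an illustration of how these parameters drive sample size, the sketch below uses the common normal-approximation formula for comparing two independent proportions (e.g., the sensitivities of treatment 1 and treatment 2). The formula and the numbers are a generic textbook approximation and are not the specific calculation cited elsewhere in this chapter.

```python
from statistics import NormalDist
from math import sqrt, ceil

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for detecting a difference between
    two proportions p1 and p2 (two-sided test, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Detecting a sensitivity change from 0.85 (treatment 1) to 0.90 (treatment 2):
print(n_per_group(0.85, 0.90))   # roughly 700 diseased cases per arm
```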
12.6.1.3 Ground Truth or Gold Standard

We have already discussed this issue with respect to the requirements of image-segmentation validation. Detection and classification algorithms, however, have slightly different requirements, and they may not always need an outline of the area of interest, as does segmentation. Generally, ground truth in medical imaging is established by:

1. Clinical proof that includes image information from radiology (may be single or multimodality imaging), clinical information from laboratory and clinical examinations, and pathology information from biopsy reports
2. Opinion of the expert(s) participating in the study. If a panel of experts is used, ground truth may be established by relative decision rate or majority rule or consensus among the experts
3. Opinion of expert(s) not participating in the study. This can be done before the study as a review or after as feedback to the overall process

12.6.1.4 Quality Control

The implementation of a quality control program is necessary to ensure that database generation conforms with generally accepted standards, that digitized or digital images are of the highest quality, that artifacts during image acquisition or during film digitization are avoided, and that the same image quality is achieved over time. Film digitizers pose the greatest challenge in database generation. Test films and phantom images can be used to monitor image quality and ensure high-quality data for further processing [98].

Finally, we should mention here the consistent and diligent effort over the past several years by academic and federal institutions to develop publicly available, centralized, large, and well-documented data sets for validating various medical-imaging applications. Efforts have been initiated in human anatomy [48], breast cancer [99, 100], and lung cancer research [101, 102]. It is anticipated that these databases will provide valuable resources to researchers that do not have immediate access to data and will advance development and relative evaluation. They may also provide a common reference that will allow comparison of different algorithms or processes. In addition, metrics of performance widely used in other fields may now attract our attention. One might consider, for example, the automatic target recognition (ATR) analysis method applied to the evaluation of detection and classification algorithms for military imaging applications [103]. In ATR, algorithm performance is measured through a set of probabilities that resemble the true and false rate definitions above. Although not used in medical imaging right now, due primarily to the small sample sizes that are traditionally available for medical studies, ATR principles may be of increasing interest and applicability now that larger data sets are planned and will soon be available to the scientific community.

12.6.2 ALGORITHM TRAINING AND TESTING AND DATABASE EFFECTS

The database(s) used for algorithm training and testing, and the way they are used, may be another source of bias in development. The bias usually comes from the small sample size, from inadequate representation of cases and controls in the set, from poor criteria applied to the learning process, and from using learning techniques that are likely to overestimate performance.
This is a large area of research, and we will not discuss it here in detail. We will only review the generally accepted procedures for training and testing algorithms on small data sets, as in the case of medical applications.

Given that most algorithms are developed, trained, and tested on small data sets, mathematical methods are required in the learning process to reduce the small-sample estimation bias and variance contributions, to stop the algorithms' training at the right point, and to construct an unbiased rule for future predictions. The major methodologies recommended and often applied to the statistical and nonstatistical pattern-recognition algorithms in medical imaging and CAD are summarized in Table 12.4. This table is by no means comprehensive and only aims at pointing out the major differences between the various terms and methodologies that are often used and confused in the medical-imaging literature. The reader is prompted to consult the excellent publications in this field for more in-depth theoretical analysis and review of applications [104–107].

TABLE 12.4 Methods Commonly Used and Recommended for Estimating the Error Rate of a Prediction or Decision Rule

Split-sample or hold-out validation
  Principle: data are divided in two subsets, one set for training and the other set for error estimation.
  Estimated parameter: generalization error function (prediction error).
  Comments: no crossing of samples; used for early stopping; not robust for small sets.

k-fold cross validation
  Principle: data are divided in k subsets of equal size; k − 1 subsets are used for training, with the one left out used for error estimation (k is usually 10).
  Estimated parameter: generalization error function (prediction error).
  Comments: superior for small sets.

Leave-one-out cross validation or round robin
  Principle: cross-validation with k equal to the sample size; k − 1 cases are used for training, with the one left out used for error estimation.
  Estimated parameter: generalization error function (prediction error).
  Comments: better than k-fold cross-validation for continuous error functions but may perform poorly for noncontinuous ones.

Jackknifing
  Principle: same as leave-one-out.
  Estimated parameter: bias of a statistic.
  Comments: complex version could estimate the generalization error.

Bootstrapping
  Principle: N subsamples of the data are used for learning; each subsample has size k and is randomly selected with replacement from the full data set.
  Estimated parameter: generalization error function (prediction error), confidence intervals.
  Comments: less variability than cross-validation in many cases; several versions; best using the 632+ rule.

Note: Methods are used with both statistical and nonstatistical pattern-recognition algorithms.

Sources: Tourassi, G.D. and Floyd, C.E., Medical Decision Making, 17, 186, 1997; Efron, B., The Jackknife, the Bootstrap, and Other Resampling Plans, Society for Industrial and Applied Mathematics, Philadelphia, 1982; Efron, B. and Tibshirani, R., J. Am. Stat. Assoc., 92, 548, 1997; Efron, B. and Tibshirani, R., Science, 253, 390, 1991.

The method missing from Table 12.4 is the one where the same set of cases is used for training and testing an algorithm. This is not an accepted approach, although it is often used by investigators, because it significantly overestimates an algorithm's performance and yields unrealistic results.
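To make the second entry of Table 12.4 concrete, here is a minimal sketch of k-fold cross-validation for estimating the prediction error of a small classifier. The nearest-class-mean classifier, the features, and the labels are placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kfold_error(features, labels, k=10):
    """Estimate the prediction error of a simple nearest-class-mean classifier
    with k-fold cross-validation (each case is left out of training exactly once)."""
    n = len(labels)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errors = 0
    for f in folds:
        train = np.setdiff1d(idx, f)
        means = {c: features[train][labels[train] == c].mean(axis=0)
                 for c in np.unique(labels[train])}
        for i in f:
            pred = min(means, key=lambda c: np.linalg.norm(features[i] - means[c]))
            errors += pred != labels[i]
    return errors / n

# Placeholder data: 40 "benign" (0) and 40 "malignant" (1) feature vectors.
x = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(1.0, 1.0, (40, 3))])
y = np.array([0] * 40 + [1] * 40)
print("10-fold cross-validated error rate:", kfold_error(x, y, k=10))
```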
A few more interesting remarks are worth making before we leave this subject.

If you have ever asked the question, "How many cases are needed for training an artificial neural network or any classification algorithm?" you would know that the answers vary from "as many as possible" to "it all depends on how representative the set is," but an answer never comes with a specific number. The reason is that a specific number depends on several factors, and there is not just one good answer for all applications. Most pattern-recognition algorithms are trained on sets of features or feature vectors extracted from the medical image. The size of the input feature set and the sample size have a direct relationship, particularly when the latter is small. It is known that "the size of the training data grows exponentially with the dimensionality of the input space," a phenomenon referred to as the "curse of dimensionality" [108]. If we are forced to work with limited data sets (as in the case of medical imaging), we cannot afford to ever increase the dimensionality of the input space. To enhance the accuracy of our algorithm, the number of variables must be reduced or the model must be simplified [109]. As a rule of thumb, the number of predictors or features f in a classification scheme should be f < n/10, where n is the size of the training sample [109].

Zheng et al. [110] have proposed to use the trend in a system's performance as a function of training-set size to assess the adequacy of the training data set in the development of a CAD scheme. The authors studied the impact that the number of training regions has on the performance of a CAD system developed for the differentiation between signal- vs. no-signal-containing regions (presence or absence of a mass in regions obtained from mammograms). They found that as the number of regions in the training set increased, training CAD performance decreased and plateaued, but performance improved on the testing set.

Another question that arises when using a statistical or nonstatistical classifier is how to present the algorithm's output to the clinician. For example, when using a neural network to classify benign and malignant masses in mammograms, the output can be presented in a binary form (benign or malignant) or as a likelihood of malignancy. The former is a straightforward mapping of the network's rank-ordered output. The latter represents the probability that a mass is cancerous. This probability can be determined from the network's output through some type of transformation. Jiang et al. [111] proposed the use of the maximum-likelihood estimated binormal model in ROC analysis; the required data are provided by the LABROC4 program [70, 112].
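One simple transformation of this kind, shown below only as an illustration, is a logistic calibration fitted to a held-out set of scored cases with known pathology. This is not the binormal maximum-likelihood approach of Jiang et al., and all data in the sketch are hypothetical.

```python
import numpy as np

def fit_logistic_calibration(scores, is_malignant, iters=2000, lr=0.1):
    """Fit p(malignant | score) = 1 / (1 + exp(-(a * score + b)))
    by simple gradient descent on the log-loss."""
    s = np.asarray(scores, float)
    y = np.asarray(is_malignant, float)
    a, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - y) * s)      # gradient of the log-loss w.r.t. a
        b -= lr * np.mean(p - y)            # gradient of the log-loss w.r.t. b
    return lambda new_score: 1.0 / (1.0 + np.exp(-(a * new_score + b)))

# Hypothetical network outputs and biopsy outcomes used for calibration:
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
truth = [0, 0, 0, 1, 0, 1, 1, 1, 1]
likelihood_of_malignancy = fit_logistic_calibration(scores, truth)
print(round(float(likelihood_of_malignancy(0.75)), 2))
```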
12.6.3 ESTIMATION OF PERFORMANCE PARAMETERS AND RATES

The conventions and methods used to measure performance parameters can be another source of bias. This effect is more critical for the detection and segmentation algorithms than for the classification algorithms, because the former need good ground-truth information that is not always available, while the latter usually depend on pathology outcomes that are usually less variable and more reliable. In detection and segmentation algorithms, the estimations of the TP, TN, FP, and FN rates that are necessary for the ROC curves or the sensitivity and specificity indices depend on the conventions and criteria followed by the investigators and the quality of the ground-truth information. We have studied these issues for CAD detection algorithms and proposed several conventions that could provide a standard and allow relative evaluations [38]. Estimating the same performance rates for CAD diagnosis algorithms is less complicated because, in this case, usually two states are considered (benign vs. malignant, normal vs. abnormal, disease vs. nondisease, etc.), which are usually defined in pathology or clinical reports.

Another related aspect in the estimation of the performance rates that could significantly change the outcome is whether rates are determined on a per-image or per-case basis; often a patient's examination may involve more than one image, e.g., a mammogram involves four views (two of each breast). Setting clear conventions and criteria ahead of time and maintaining them during the evaluation process is critical in reporting results and obtaining consistent and unbiased performances.
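The difference between per-image and per-case scoring can be made explicit with a small sketch. Here a case is called positive if any of its images is called positive, which is one common convention; the data and the rule below are illustrative assumptions, not the chapter's prescription.

```python
from collections import defaultdict

# Hypothetical detection calls: (case_id, image_id, detected, truth)
calls = [
    ("case1", "LCC", 1, 1), ("case1", "LMLO", 0, 1),
    ("case1", "RCC", 0, 0), ("case1", "RMLO", 0, 0),
    ("case2", "LCC", 0, 0), ("case2", "LMLO", 1, 0),
    ("case2", "RCC", 0, 0), ("case2", "RMLO", 0, 0),
]

# Per-image sensitivity: fraction of truly abnormal images that were detected.
abnormal = [c for c in calls if c[3] == 1]
per_image_sens = sum(c[2] for c in abnormal) / len(abnormal)

# Per-case sensitivity: a case counts as detected if any of its images is flagged.
by_case = defaultdict(lambda: {"det": 0, "truth": 0})
for case_id, _, det, truth in calls:
    by_case[case_id]["det"] |= det
    by_case[case_id]["truth"] |= truth
abnormal_cases = [v for v in by_case.values() if v["truth"]]
per_case_sens = sum(v["det"] for v in abnormal_cases) / len(abnormal_cases)

print(per_image_sens, per_case_sens)   # 0.5 per image vs. 1.0 per case here
```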
12.6.4 PRESENTATION SETUP

The way an algorithm's output is presented to the observer in observer-based validation studies can influence the validation outcome [113]. Current commercial CAD systems for mammography and lung radiography point out suspicious areas using a specific symbol assigned to a type of abnormality, e.g., a triangle for a calcification cluster or a circle for a site of a potential mass. These CAD outputs can be presented in hard-copy form, i.e., a printout of the original image marked with the CAD output (if any), or in soft-copy form, i.e., the image displayed on a low-resolution computer monitor marked with the CAD output. These displays are presented side-by-side with the regular film or digital display, i.e., next to the multiviewer if films are reviewed or next to the monitors used for primary diagnosis if digital images are reviewed, as in the case of CTs and MRIs. Key elements in the presentation of an algorithm's output are:

Hanging protocol and work flow: The sequence of image presentation with and without CAD should be designed to have minimum impact on the standard work flow. The addition of processed images should not significantly delay the interpretation process. A reasonable reading load should be maintained throughout the study to avoid fatigue and unbalanced case interpretation.

Type of computer monitor used for soft-copy display: The spatial resolution and luminance of the selected monitor should match the imaging task and application and ensure the highest possible image quality. Pairs of CRTs are usually recommended for all medical-imaging applications; 1-Mpixel CRTs are used for general radiography, and 5-Mpixel systems are used for mammography. Recently, significant technological advances have been achieved in LCD flat-panel displays that are currently being evaluated for medical applications but are not yet clinically acceptable [114]. A quality control program that meets established standards should be set up for the display systems [115].

Use of color vs. black-and-white image display: This is an ambiguous issue because the bulk of medical images today involves only gray-scale information, and the readers are trained in gray-scale interpretation. Color has certain advantages, however, and some segmentation and three-dimensional reconstruction algorithms have used color effectively, showing a positive impact on physician performance.

Workstation-user interface: This interface should be user friendly and intuitive. It should offer both automatic and manual adjustments for interactive processes and, if made in-house, be validated independently before being used in an algorithm-validation study [116].

Observer training: This is critical to the success of a validation study. Observers should be thoroughly trained on a separate but representative data set on how to interpret processed data, how to report, what criteria to use during the study, how the algorithm operates, and what its output means. Knowledge of the laboratory performance of the tested algorithm is useful in extrapolating to its potential clinical significance. Readers should also become familiar with the rating approach and apply consistent rating criteria for all cases throughout the study. Algorithms designed to assist in the detection of disease (CAD detection) are usually easier to evaluate than classification algorithms (CAD diagnosis schemes) that present a pathology prediction, particularly when the latter outperform the human reader. In this case, there is substantial risk that the reader will be biased by the algorithm's performance and accept its recommendation without conducting a critical review [20].

Environment and ambient light conditions: Reading conditions are critical for both hard-copy (film) and soft-copy (computer monitor) reading. Ambient light should be controlled and conform to specifications for radiology environments. Cross-talk between monitors or display devices and light sources should be eliminated. Readers should be positioned at recommended height and distance levels from the display. Ergonomics should be fully considered to avoid fatigue and viewing distortions [117].

Reporting mechanisms: Dictation, hard-copy forms, or computer interfaces are all options for reporting, and choosing one over another is a practical issue. A computer interface offers the greatest flexibility for the investigator because it allows exporting information directly to the analysis tools and minimizes error.

12.6.5 STATISTICAL ANALYSIS

Earlier, we discussed the statistical analysis associated with ROC types of studies; this is usually part of the publicly available software packages. Non-ROC studies also require statistical analysis to test the differences between groups of data, the differences between variables, or the relationship between measured variables. The data or variables may be, for example, any of the performance indices described in Section 12.3. There are numerous statistical tests to choose from, depending on the data type and experimental conditions. A biostatistician's guidance in selecting the right test cannot be overemphasized. Table 12.5 summarizes tests that could be used for the statistical analysis of data from non-ROC validation studies of computer algorithms in medical imaging. We observe that some of these tests are common among studies; namely, they are used in the analysis of both ROC and non-ROC experiments or both observer and nonobserver experiments.

Generally, it is the characteristics of the data set(s) and the input and output variables that determine the type of test to be used. The first step in selecting the right statistical test is to determine whether the data follow a Gaussian or normal distribution (parametric) or not (nonparametric). Note that for every parametric test, there is a nonparametric equivalent; nonparametric tests apply when the sample size is too small for any distribution assumptions to be made. The next step looks at the data type, e.g., continuous, nominal, categorical [40].
Finally, one should determine whether the data or variables to be tested are independent (unpaired, unmatched, uncorrelated) or dependent (paired, matched, correlated). Dependent groups of data are common in medical-imaging validation studies, where images from the same patient are used multiple times in a data set or where the same cases are reviewed by the same observers [40].

TABLE 12.5 Statistical Tests Commonly Used for Analysis of Single-Measurement Data Obtained from Medical-Image Analysis and Processing Algorithms

Parametric approaches:
  Continuous data, to compare means of two dependent groups: paired t-test.
  Continuous data, to compare means of two independent groups: unpaired t-test.
  Continuous data, to compare means of two or more independent groups: analysis of variance (ANOVA).
  Continuous data, to measure association of two variables: Pearson's correlation coefficients.

Nonparametric approaches:
  Binomial data, to compare two dependent samples: McNemar's test.
  Continuous data, to compare two dependent groups: sign test; Wilcoxon signed rank test.
  Binomial, nominal, or categorical data, to compare two independent groups: Pearson's χ2 test; Fisher's exact test.
  Continuous data, to compare two independent groups: Wilcoxon (Mann-Whitney) rank sum test.
  Binomial, nominal, or categorical data, to compare two or more independent groups: Pearson's χ2 test.
  Continuous data, to compare two or more independent groups: Kruskal-Wallis test; log-rank test (for survival data).
  Continuous data, to measure association: Spearman's correlation coefficient.
  Binomial, nominal, or categorical data, to measure agreement: Cohen's (weighted) kappa.

Note: Tests are grouped in parametric and nonparametric approaches. Depending on the type of data and the goal of the application, one or more tests may be applicable. Details and examples of these and other tests can be found in the extensive biostatistics literature.

Source: Dr. Ji-Hyun Lee of the Biostatistics Core at the H. Lee Moffitt Cancer Center & Research Institute has contributed valuable comments on the role and use of the various statistical tests. Her assistance in the generation of this table is greatly appreciated.

An example of a paired t-test application is the analysis of image-segmentation data that include breast density measurements from mammograms pre- and posttreatment for breast cancer patients. The unpaired t-test can be applied when breast density measurements are compared between two treatment groups. Analysis of variance (ANOVA) is applicable when breast density measurements are compared among three patient groups that receive different drug treatments. Finally, Pearson's correlation coefficients are appropriate for correlating lung nodule detection rates of CAD detection schemes for chest radiography and CT. Nonparametric tests are applicable to the same examples when the sample size is small.
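As a concrete illustration of the paired case, the sketch below runs a paired t-test and its nonparametric counterpart on hypothetical pre- and post-treatment breast density percentages; the numbers are invented, and the scipy package is assumed to be available.

```python
from scipy import stats

# Hypothetical percent breast density for the same eight patients,
# measured before and after treatment (paired by patient).
pre = [52.1, 47.8, 60.3, 55.0, 49.5, 63.2, 58.7, 51.0]
post = [48.3, 46.9, 55.1, 53.2, 47.0, 60.8, 54.5, 50.2]

t_stat, p_paired = stats.ttest_rel(pre, post)     # parametric, paired
w_stat, p_wilcox = stats.wilcoxon(pre, post)      # nonparametric counterpart

print(f"paired t-test: t = {t_stat:.2f}, p = {p_paired:.4f}")
print(f"Wilcoxon signed rank: W = {w_stat:.1f}, p = {p_wilcox:.4f}")
```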
Note that Table 12.5 only lists methods applied to data obtained from single measurements, as opposed to data acquired from repeated measurements. The latter should be treated with longitudinal analysis methods that are applicable to data from the same experimental parameter collected over time [118]. Repeated data collection is commonly done in medical imaging when monitoring a biological process or a patient's response to treatment or another type of long-term intervention. Computer algorithms such as segmentation methods that are applied to repeated images of the same patient should be analyzed with appropriate longitudinal analysis methodologies to avoid biased p-values.

12.7 DISCUSSION AND CONCLUSIONS

This chapter summarizes the most popular and accepted methodologies applicable to the evaluation of image analysis and processing techniques for medical-imaging applications. The approaches described here can be used to: (a) discriminate early on in the development the methods that are most likely to succeed and rank performances, and (b) assess the clinical value of top candidates and test specific hypotheses. The development of new medical-image-processing methodologies should be thoroughly scrutinized through robust but low-cost and fast validation approaches. Feasibility studies that test new image-processing ideas or new medical-imaging applications could avoid the observer-based ROC-type tests. Preference studies, computer-based ROC-type experiments, or mathematical error analysis could provide the information necessary to discriminate, compare, and optimize methodologies at the early stages of development. Proven concepts could then be tested with observer-based, retrospective ROC experiments.

New developments in the field of validation methodologies address the limitations of existing techniques. For example, a differential ROC (DROC) methodology was proposed by Chakraborty et al. [119, 120] for measuring small differences in the performance of two medical-imaging modalities in a timely and cost-effective way. In DROC, the observer sees pairs of images of the same patient (one from each modality), selects the one that is preferred for the assigned task, and rates it using a five-point rating scale similar to the one used in ROC. The method seems promising in that it may require fewer cases and fewer observers than an ROC study while yielding equivalent power. The method is likely to be more applicable to the evaluation of different imaging modalities, e.g., MRI and mammography, or digital and screen/film mammography, or of image-processing techniques, e.g., image enhancement and compression algorithms. It may be less applicable to the evaluation of CAD detection or CAD diagnosis schemes. Another new development could lead to a new ROC tool that can handle more than two classes. This is required, for example, for the analysis of three-class data obtained from classifiers that differentiate between benign, malignant, and false-positive computer detections on medical images [121].

But does validation stop here? And if not, what comes next?
According to Fryback and Thornbury's model [15], these evaluation steps only take us half way to the final goal of total efficacy assessment of a diagnostic tool. Following these experiments, the computer algorithms that showed significant potential to have a positive clinical impact should be further tested for therapeutic efficacy, patient outcome efficacy, and societal efficacy. So, prospective clinical trials are what should come next. Unfortunately, we do not have a historical precedent to demonstrate the way such a trial should be conducted in the field we are discussing. Commercial CAD systems for mammography were the first ones to enter the clinic. But they did so without going through traditional clinical trials and based only on the positive outcome of ROC retrospective studies. This is probably the reason for the controversial findings that followed regarding their clinical value, which will continue to be questioned until a large prospective clinical trial is performed [122].

Finally, this chapter attempted to provide practical, albeit limited, solutions to admittedly complicated issues of validation, such as the problem of the ground-truth definition required for the validation of image-segmentation techniques, or the sample size. We should strive to find absolute and robust solutions to the validation problems, but the lack thereof should not hinder algorithm development, considering the high rate of technological advancements in medical-imaging equipment and diagnostic procedures. Establishment of standards on metrics and validation criteria, and consensus on the use of the techniques currently available, could ease the burden of unattainable perfection while satisfying our current requirements, significantly improving the validation process, and yielding meaningful results.

ACKNOWLEDGMENTS

The author would like to thank Robert A. Clark, John J. Heine, Ji-Hyun Lee, Lihua Li, Jerry A. Thomas, Anand Manohar, Joseph Murphy, Angela Salem, and Mugdha Tembey for their valuable discussions and comments on algorithm evaluation issues and their assistance in the preparation of this manuscript.

REFERENCES

1. Giger, M.L., Computer-aided diagnosis, in Syllabus: A Categorical Course in Physics: Technical Aspects of Breast Imaging, Haus, A.G. and Yaffe, M.J., Eds., RSNA, Chicago, IL, 1993, p. 283.
2. Feig, S.A., Clinical evaluation of computer-aided detection in breast cancer screening, Semin. Breast Dis., 5, 223, 2002.
3. Li, L. et al., Improved method for automatic identification of lung regions on chest radiographs, Acad. Radiol., 8, 629, 2001.
4. Vaidyanathan, M. et al., Comparison of supervised MRI segmentation methods for tumor volume determination during therapy, Magn. Resonance Imaging, 13, 719, 1995.
5. Heine, J.J. and Malhotra, P., Mammographic tissue, breast cancer risk, serial image analysis, and digital mammography: Part 1, tissue-related risk factors, Acad. Radiol., 9, 298, 2002.
6. Heine, J.J. and Malhotra, P., Mammographic tissue, breast cancer risk, serial image analysis, and digital mammography: Part 2, serial breast tissue change and related temporal influences, Acad. Radiol., 9, 317, 2002.
7. Clarke, L.P. et al., Hybrid wavelet transform for image enhancement for computer-assisted diagnosis and telemedicine applications, in Time Frequency and Wavelets in Biomedical Signal Processing, Akay, M., Ed., IEEE Press Series in Biomedical Engineering, IEEE, New York, 1998, chap. 21.
8. Yang, Z. et al., Effect of wavelet bases on compressing digital mammograms, IEEE Eng. Med. Biol. Mag., 14, 570, 1995.
9. Masero, V., Leon-Rojas, J.M., and Moreno, J., Volume reconstruction for health care: a survey of computational methods, Ann. N.Y. Acad. Sci., 980, 198, 2000.
10. Sallam, M.Y. and Bowyer, K.W., Registration and difference analysis of corresponding mammogram images, Medical Image Anal., 3, 103, 1999.
11. Deans, S.R. et al., Wavelet transforms, in Encyclopedia of Electrical and Electronics Engineering, Webster, J.G., Ed., J. Wiley & Sons, New York, 1999.
12. Qian, W. et al., Computer-assisted diagnosis for digital mammography, IEEE Eng. Med. Biol. Mag., 14, 561, 1995.
13. Kallergi, M., Computer-aided diagnosis of mammographic microcalcification clusters, Med. Phys., 31, 314, 2004.
14. Shiraishi, J. et al., Computer-aided diagnosis to distinguish benign from malignant solitary pulmonary nodules on radiographs: ROC analysis of radiologists' performance: initial experience, Radiology, 227, 469, 2003.
15. Fryback, D.G. and Thornbury, J.R., The efficacy of diagnostic imaging, Medical Decision Making, 11, 88, 1991.
16. Pepe, M.S., The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, Oxford, U.K., 2002.
17. Methodological issues in diagnostic clinical trials: health services and outcomes research in radiology, Symposium Proceedings, Washington, DC, March 15-16, 1998, Acad. Radiol., 6, Suppl. 1, S1-136, 1999.
18. Hulley, S.B. et al., Designing Clinical Research: An Epidemiologic Approach, 2nd ed., Lippincott, Williams & Wilkins, Philadelphia, PA, 2000.
19. Friedman, L.M., Furberg, C., and Demets, D.L., Fundamentals of Clinical Trials, 3rd ed., Springer-Verlag, Heidelberg, 1998.
20. Zhou, X.H., McClish, D.K., and Obuchowski, N.A., Statistical Methods in Diagnostic Medicine, Wiley, New York, 2002.
21. Gay, J., Clinical Epidemiology & Evidence Based Medicine Glossary: Experimental Design and Statistics Terminology, August 22, 1999, Washington State University; available online at http://www.vetmed.wsu.edu/courses-jmgay/GlossExpDesign.htm, last accessed 3/05.
22. Thornbury, J.R., Intermediate outcomes: diagnostic and therapeutic impact, Acad. Radiol., 6, S58, 1999.
23. Hendee, W.H., Technology Assessment, National Cancer Institute Imaging Sciences Working Group Technology Evaluation Committee, Final Report, December 16, 1997; available online at http://imaging.cancer.gov/reportsandpublications/ReportsandPresentations/ImagingSciencesWorkingGroup/page2, last accessed 3/05.
24. Phelps, C.E. and Mushlin, A.I., Focusing technology assessment using medical decision theory, Medical Decision Making, 8, 270, 1988.
25. Chesters, M.S., Human visual perception and ROC methodology in medical imaging, Phys. Med. Biol., 37, 1433, 1992.
26. Metz, C.E., ROC methodology in radiologic imaging, Invest. Radiol., 21, 720, 1986.
27. Nishikawa, R., Assessment of the performance of computer-aided detection and computer-aided diagnosis systems, Semin. Breast Dis., 5, 217, 2002.
28. Houn, F. et al., Study design in the evaluation of breast cancer imaging technologies, Acad. Radiol., 7, 684, 2000.
29. Wagner, R.F. et al., Assessment of medical imaging and computer-assist systems: lessons from recent experience, Acad. Radiol., 9, 1264, 2002.
30. King, J.L. et al., Identification of superior discriminators during non-ROC studies, Proc. SPIE, 4686, 54, 2002.
32. Medical Image Perception Society, ROC References–ROC & Related Programs to Analyze Observer Performance; available online at http://www.mips.ws, last accessed 3/05.
33. Beytas, E.M., Debatin, J.F., and Blinder, R.A., Accuracy and predictive value as measures of imaging test performance, Invest. Radiol., 27, 374, 1992.
34. Li, H.D. et al., Markov random field for tumor detection in digital mammography, IEEE Trans. Medical Imaging, 14, 565, 1995.
35. Kallergi, M., Interpretation of calcifications in screen/film, digitized, and wavelet-enhanced, monitor-displayed mammograms: a receiver operating characteristic study, Acad. Radiol., 3, 285, 1996.
36. Tembey, M., Computer-Aided Diagnosis for Mammographic Microcalcification Clusters, M.S. thesis, Computer Science Department, College of Engineering, University of South Florida, Tampa, FL, 2003.
37. Jiang, Y., Metz, C.E., and Nishikawa, R.M., A receiver operating characteristic partial area index for highly sensitive diagnostic tests, Radiology, 201, 745, 1996.
38. Kallergi, M., Carney, G., and Gaviria, J., Evaluating the performance of detection algorithms in digital mammography, Med. Phys., 26, 267, 1999.
39. Li, L. et al., False-positive reduction in CAD mass detection using a competitive strategy, Med. Phys., 28, 250, 2001.
40. Mould, R.F., Introductory Medical Statistics, 3rd ed., Institute of Physics, Philadelphia, 1998.
41. Pal, N.R. and Pal, S.K., A review on image segmentation techniques, Pattern Recogn., 26, 1277, 1993.
42. Zhang, Y.J., A survey on evaluation methods for image segmentation, Pattern Recogn., 29, 1335, 1996.
43. Zhang, Y.J., A review of recent evaluation methods for image segmentation, in Proc. 6th Int. Symp. Signal Processing and Its Applications (ISSPA), Kuala Lumpur, Malaysia, August 13-16, 2001, IEEE, Piscataway, NJ, 2001, pp. 148-151.
44. Udupa, J.K. et al., A methodology for evaluating image segmentation algorithms, Proc. SPIE, 4684, 266, 2002.
45. Filippi, M. et al., Intra- and inter-observer variability of brain MRI lesion volume measurements in multiple sclerosis: a comparison of techniques, Brain, 118, 1593, 1995.
46. Kallergi, M. et al., A simulation model of mammographic calcifications based on the ACR BIRADS, Acad. Radiol., 5, 670, 1998.
47. Kallergi, M. et al., Resolution effects on the morphology of calcifications in digital mammograms, in Medicon 98, Proc. VIII Mediterranean Conf. Medical and Biological Engineering and Computing, Lemesos, Cyprus, 1998.
48. United States National Library of Medicine, National Institutes of Health, The Visible Human Project®; available online at http://www.nlm.nih.gov/research/visible/visible_human.html, last accessed 3/05.
49. Zubal, I.G. et al., Computerized three-dimensional segmented human anatomy, Med. Phys., 21, 299, 1994.
50. Gerig, G. et al., A new validation tool for assessing and improving 3-D object segmentation, MICCAI, 2208, 516-528, 2001; available online at http://www.cs.unc.edu/Research/MIDAG/pubs/papers/MICCAI01-gerig-valmet.pdf, last accessed 3/05.
51. Chalana, V. and Kim, Y., A methodology for evaluation of boundary detection algorithms on medical images, IEEE Trans. Medical Imaging, 16, 642, 1997.
52. Kelemen, A., Székely, G., and Gerig, G., Elastic model-based segmentation of 3-D neuroradiological data sets, IEEE Trans. Medical Imaging, 18, 828, 1999.
53. Motulsky, H., Intuitive Biostatistics, Oxford University Press, Oxford, U.K., 1995.
54. Mould, R.F., Introductory Medical Statistics, 3rd ed., Institute of Physics Publishing, Philadelphia, 1998.
55. Bland, J.M. and Altman, D.G., Statistical methods for assessing agreement between two methods of clinical measurement, Lancet, 1, 307, 1986.
56. Yoo, T.S., Ed., Insight into Images: Principles and Practice for Segmentation, Registration, and Image Analysis, A.K. Peters Ltd., Wellesley, MA, 2004; ITK software available online at http://www.itk.org, last accessed 3/05.
57. Goodenough, D.J., Rossman, K., and Lusted, L.B., Radiographic applications of signal detection theory, Radiology, 105, 199, 1972.
58. Chesters, M.S., Human visual perception and ROC methodology in medical imaging, Phys. Med. Biol., 37, 1433, 1992.
59. Dorfman, D.D., Berbaum, K.S., and Lenth, R.V., Multireader, multicase receiver operating characteristic methodology: a bootstrap analysis, Acad. Radiol., 2, 626, 1995.
60. Judy, P.F. et al., Measuring the observer performance of digital systems, in Computed Digital Radiography in Clinical Practice, Green, R.E. and Oestmann, J.W., Eds., Thieme Medical Publishers, New York, 1992, p. 59.
61. Berbaum, K.S., Dorfman, D.D., and Franken, E.A., Measuring observer performance by ROC analysis: indications and complications, Invest. Radiol., 24, 228, 1989.
62. Roe, C.A. and Metz, C.E., Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation, Acad. Radiol., 4, 298, 1997.
63. Dorfman, D.D. et al., Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design, Acad. Radiol., 5, 591, 1998.
64. Beam, C.A., Strategies for improving power in diagnostic radiology research, AJR, 159, 631, 1992.
65. Rockette, H.E., Gur, D., and Metz, C.E., The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques, Invest. Radiol., 27, 169, 1992.
66. Kallergi, M., Hersh, M.R., and Thomas, J.A., Using BIRADS categories in ROC experiments, Proc. SPIE, 4686, 60, 2002.
67. Metz, C.E., Some practical issues of experimental design and data analysis in radiological ROC studies, Invest. Radiol., 24, 234, 1989.
68. Beiden, S.V. et al., Independent vs. sequential reading in ROC studies of computer-assist modalities: analysis of components of variance, Acad. Radiol., 9, 1036, 2002.
69. Chakraborty, D., Counterpoint to analysis of ROC studies of computer-assisted modalities, Acad. Radiol., 9, 1044, 2002.
70. ROC software, Kurt Rossman Laboratories for Radiologic Image Research, Department of Radiology, The University of Chicago; available online at http://wwwradiology.uchicago.edu/krl/roc_soft.htm, last accessed 3/05.
71. ROC Software, The Medical Image Perception Laboratory, Department of Radiology, The University of Iowa; available online at http://perception.radiology.uiowa.edu/, last accessed 3/05.
72. Gatsonis, C. and McNeil, B.J., Collaborative evaluations of diagnostic tests: experience of the radiology diagnostic oncology group, Radiology, 175, 571, 1990.
73. Angelos-Tosteson, A.N. and Begg, C.B., A general regression methodology for ROC curve estimation, Medical Decision Making, 8, 204, 1988.
74. Starr, S.J. et al., Visual detection and localization of radiographic images, Radiology, 116, 533, 1975.
75. Swensson, R.G., Unified measurement of observer performance in detecting and localizing target objects on images, Med. Phys., 23, 1709, 1996.
76. Swensson, R.G. et al., Using incomplete and imprecise localization data on images to improve estimates of detection accuracy, Proc. SPIE, 3663, 74, 1999.
77. Kallergi, M. et al., Improved interpretation of digitized mammography with wavelet processing: a localization response operating characteristic study, AJR, 182, 697, 2004.
78. Chakraborty, D.P., Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data, Med. Phys., 16, 561, 1989.
79. Chakraborty, D.P. and Winter, L.H.L., Free-response methodology: alternate analysis and a new observer-performance experiment, Radiology, 174, 873, 1990.
80. Burgess, A.E., Comparison of receiver operating characteristic and forced choice observer performance measurement methods, Med. Phys., 22, 643, 1995.
81. Pisano, E.D. et al., Radiologists’ preferences for digital mammographic display, the International Digital Mammography Development Group, Radiology, 216, 820, 2000.
82. Strotzer, M. et al., Clinical application of a flat-panel X-ray detector based on amorphous silicon technology: image quality and potential for radiation dose reduction in skeletal radiography, AJR, 172, 835, 1999.
83. Rosen, E.L. and Soo, M.S., Tissue harmonic imaging sonography of breast lesions: improved margin analysis, conspicuity, and image quality compared to conventional ultrasound, Clin. Imaging, 25, 379, 2001.
84. Volk, M. et al., Digital radiography of the skeleton using a large-area detector based on amorphous silicon technology: image quality and potential for dose reduction in comparison with screen-film radiography, Clin. Radiol., 55, 615, 2000.
85. Sivaramakrishna, R. et al., Comparing the performance of mammographic enhancement algorithms: a preference study, AJR, 175, 45, 2000.
86. Kheddache, S. and Kvist, H., Digital mammography using storage phosphor plate technique: optimizing image processing parameters for the visibility of lesions and anatomy, Eur. J. Radiol., 24, 237, 1997.
87. Caldwell, C.B. et al., Evaluation of mammographic image quality: pilot study comparing five methods, AJR, 159, 295, 1992.
88. Davidson, R.R. and Farquhar, P.H., A bibliography on the method of paired comparisons, Biometrics, 32, 241, 1976.
89. Silverstein, D.A. and Farrell, J.E., An efficient method for paired-comparison, J. Electron. Imaging, 10, 394, 2001.
90. Beam, C.A., Fundamentals of clinical research for radiologists: statistically engineering the study for success, AJR, 179, 47, 2002.
91. Beam, C.A., Strategies for improving power in diagnostic radiology research, AJR, 159, 631, 1992.
92. Kallergi, M., Clark, R.A., and Clarke, L.P., Medical-image databases for CAD applications in digital mammography: design issues, Stud. Health Technol. Inform., 43, Pt. B, 601, 1997.
93. Nishikawa, R.M. et al., Effect of case selection on the performance of computer-aided detection schemes, Med. Phys., 21, 265, 1994.
94. Zink, S. and Jaffe, C.C., Medical-imaging databases: a National Institutes of Health workshop, Invest. Radiol., 28, 366, 1993.
95. Noether, G.E., Sample size determination for some common nonparametric tests, JASA, 82, 645, 1987.
96. Gur, D. et al., Practical issues of ROC analysis: selection of controls, Invest. Radiol., 25, 583, 1990.
97. Woodward, M., Epidemiology: Study Design and Data Analysis, Chapman & Hall/CRC Press, Boca Raton, FL, 1999.
98. Kallergi, M. et al., Evaluation of a CCD-based film digitizer for digital mammography, Proc. SPIE, 3032, 282, 1997.
99. Digital Database for Screening Mammography (DDSM), University of South Florida, Digital Mammography Home Page; available online at http://marathon.csee.usf.edu/Mammography/Database.html, last accessed 3/05.
100. Digital Mammographic Imaging Screening Trial, National Cancer Institute; available online at http://cancer.gov/dmist, last accessed 3/05.
101. Lung Image Database Consortium (LIDC), Cancer Imaging Program, National Cancer Institute; available online at http://imaging.cancer.gov/programsandresources/InformationSystems/LIDC, last accessed 3/05.
102. Fifth National Forum on Biomedical Imaging in Oncology, Bethesda, MD, 2004; available online at http://cancer.gov/dctd/forum/summary04.pdf, last accessed 3/05.
103. Target Recognizer Definitions and Performance Measures, Report of the Joint U.S. Department of Defense and Industry Working Group on Automatic Target Recognizer, ATRWG No. 86-001, 1986, Storming Media, Washington, DC.
104. Tourassi, G.D. and Floyd, C.E., The effect of data sampling on the performance evaluation of artificial neural networks in medical diagnosis, Medical Decision Making, 17, 186, 1997.
105. Efron, B., The Jackknife, the Bootstrap, and Other Resampling Plans, CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, 1982.
106. Efron, B. and Tibshirani, R., Improvements on cross-validation: the .632+ bootstrap method, J. Am. Stat. Assoc., 92, 548, 1997.
107. Efron, B. and Tibshirani, R., Statistical data analysis in the computer age, Science, 253, 390, 1991.
108. Bishop, C.M., Neural Networks for Pattern Recognition, Clarendon Press, Oxford, U.K., 1995.
109. Harrell, F.E., Jr., Lee, K.L., and Mark, D.B., Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors, Statistics Med., 15, 361, 1996.
110. Zheng, B. et al., Adequacy testing of training sample sizes in the development of a computer-assisted diagnosis scheme, Acad. Radiol., 4, 497, 1997.
111. Jiang, Y. et al., Improving breast cancer diagnosis with computer-aided diagnosis, Acad. Radiol., 6, 22, 1999.
112. Metz, C.E., Herman, B.A., and Shen, J.H., Maximum-likelihood estimation of receiver operating characteristic (ROC) curves from continuously distributed data, Statistics Med., 17, 1033, 1998.
113. Begg, C.B. and McNeil, B.J., Assessment of radiologic tests: control of bias and other design considerations, Radiology, 167, 565, 1988.
114. Muka, E., Blume, H., and Daly, S., Display of medical images on CRT soft-copy displays: a tutorial, Proc. SPIE, 2431, 341, 1995.
115. Digital Imaging and Communications in Medicine (DICOM) Part 14: Grayscale Standard Display Function, National Electrical Manufacturers Association (NEMA), Rosslyn, VA, 2003; available online at http://medical.nema.org/dicom/2003/03_14PU.PDF, last accessed 3/05.
116. Gohel, H.J. et al., A workstation interface for ROC studies in digital mammography, Proc. SPIE, 3031, 440, 1997.
117. Abdullah, B.J.J. and Ng, K.H., In the eyes of the beholder: what we see is not what we get, BJR, 74, 675, 2001.
118. Diggle, P.J., Liang, K.Y., and Zeger, S.L., Analysis of Longitudinal Data, Oxford University Press, Oxford, U.K., 1994.
119. Chakraborty, D.P. et al., The differential receiver operating characteristic (DROC) method, Proc. SPIE, 3338, 234, 1998.
120. Chakraborty, D.P., Howard, N.S., and Kundel, H.L., The differential receiver operating characteristic (DROC) method: rationale and results of recent experiments, Proc. SPIE, 3663, 82, 1999.
121. Edwards, D.C. et al., Estimating three-class ideal observer decision variables for computerized detection and classification of mammographic mass lesions, Med. Phys., 31, 81, 2004.
122. James, J.J., The current status of digital mammography, Clin. Radiol., 59, 1, 2004.