Prep: the Probability of Replicating an Effect
Peter R. Killeen
Arizona State University
Killeen@asu.edu

Abstract
Prep gives the probability that an equally powered replication attempt will provide supportive evidence: an effect of the same sign as the original, or, if preferred, the probability of a significant effect in replication. Prep is based on a standard Bayesian construct, the posterior predictive distribution. It may be used in three modes: to evaluate evidence, to inform belief, and to guide action. In the first case the simple prep is used; in the second, it is augmented with estimates of realization variance and informed priors; in the third it is embedded in a decision theory. Prep throws new light on replicability intervals, multiple comparisons, traditional α levels, and longitudinal studies. As the area under the diagnosticity vs detectability curve, it constitutes a criterion-free measure of test quality.

The issue
The foundation of science is the replication of experimental effects. But most statistical analyses of experiments test, not whether the results are replicable, but whether they are unlikely if there were truly no effect present. This inverse inference creates many problems of interpretation that have become increasingly evident to the field. One of the consequences has been an uneasy relationship between the science of psychology and its practice. The irritant is the scientific inferential method. Not the method of John Stuart Mill, Michael Faraday, or Charles Darwin, but that of Ronald Aylmer Fisher, Egon Pearson, and Jerzy Neyman. All were great scientists or statisticians, Fisher both. But they grappled with scientific problems on different scales of time, space, and complexity than clinical psychologists. In all cases their goal was to epitomize a phenomenon with simple verbal or mathematical descriptions, and then to show that such a description has legs: that it explains data or predicts outcomes in new situations. But increasingly it is being realized that results in biopsychosocial research are often lame: Effect sizes can wither to irrelevance with subsequent replications, and credible authorities claim that "most published research findings are false". Highly significant effects can have negligible therapeutic value. Something in our methodology has failed. Must clinicians now turn away from such toxic "evidence-based" research, back to clinical intuition?

The historical context
In the physical and biological sciences precise numerical predictions can sometimes be made: The variable A should take the value a. A may have been a geological age, a vacuum permittivity, or the deviation of a planetary orbit. The more precise the experiment, the more difficult it is for errant theories to pass muster. In the behavioral sciences it is rare to be able to make the prediction A -> a. Our questions are typically not "does my model of the phenomenon predict the observed numerical outcome?", but rather "is my candidate causal factor C really affecting the process?"; "Does childhood trauma increase the risk of adult PTSD?" It then becomes a test of two candidate models. No effect: A + C ≈ A -> a; or some effect: A + C -> a + c, where we cannot specify c beforehand, but rather prefer that it be large, and in a particular direction. Since typically we also cannot specify the baseline or control level a, we test to see whether the difference in the effects is reliably different from zero: testing experimental (A + C) and control (A) groups, and asking whether a + c = a; that is, does the difference in outcome between experimental and control groups equal zero: (a + c) - a = 0?
Since the difference will almost always be different from zero due to random variation, the question evolves to: Is it sufficiently larger than zero so that we can have some confidence that the effect is real, that is, that it will replicate? We want to know whether (a + c) - a is larger than some criterion. How large should that criterion be?

It was for such situations that Fisher formalized and extended prior work into the analysis of variance (ANOVA). ANOVA estimates the background levels of variability (error, or noise) by combining the variance within each of the groups studied, and asking whether the variability between groups (the treatment effect, or signal) sufficiently exceeds that noise. The signal-to-noise ratio is the F statistic. If the errors are normally distributed and the groups independent, with no true effect (that is, all are drawn from the same population, so that A + C = A, and thus c ≈ 0), we can say precisely how often the F ratio will exceed a criterion α (alpha). If our treatment effect exceeds that value, it is believed to be unlikely that the assumption of "no effect" is true. Because of its elegance, robustness, and refinement over the decades, ANOVA and its variants are the most popular inferential statistics in psychology. These virtues derive from certain knowledge of the ideal case, the null hypothesis, with deviations being precisely characterized by p-values (significance levels). But there are problems associated with the uncritical use of such null-hypothesis statistical tests (NHST), ones well known to the experts, and repeated anew to every generation of students (e.g., Krueger 2001). Among them: One cannot infer from NHST either the truth or falsity of the null hypothesis; nor can one infer the truth or falsity of the alternative. ANOVA gives the probability of the data assuming the null, not the probability of the null given the data (see, e.g., Nickerson 2000). Yet, rejection of the null is de facto the purpose to which the results are typically put. Even if the null is (illogically) rejected, significance levels do not give a clear indication of how replicable a result is. It was to provide such a measure that prep, the probability of replication, was introduced (Killeen 2005a).

The logic of prep
Prep is a probability derived from a Bayesian posterior predictive distribution (ppd). Assume you have conducted a pilot experiment on a new treatment for alleviating depression, involving 20 control and 20 experimental participants, and found that the means and standard deviations were 40 (12) and 50 (15). The effect size, d, the difference of means divided by the pooled estimate of standard deviation (13.6), is a respectable 0.74. Your t-test reports p < .05, indicating that this result is unlikely under the null hypothesis.
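As a quick check of the arithmetic in this example, the sketch below recomputes d and the t-test from the summary statistics. It is an illustration in Python rather than code from the original exposition, and the variable names are my own.

```python
import numpy as np
from scipy import stats

# Summary statistics from the pilot example: control M = 40 (SD 12), treatment M = 50 (SD 15)
n1 = n2 = 20
m1, s1 = 40.0, 12.0
m2, s2 = 50.0, 15.0

# Pooled standard deviation and the effect size d
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))   # ~ 13.6
d = (m2 - m1) / sp                                                    # ~ 0.74

# Independent-groups t-test reconstructed from the summary statistics
t = d * np.sqrt(n1 * n2 / (n1 + n2))                                  # ~ 2.33
p_two_tailed = 2 * stats.t.sf(t, df=n1 + n2 - 2)                      # ~ .025, so p < .05
print(f"d = {d:.2f}, t = {t:.2f}, p = {p_two_tailed:.3f}")
```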
Is the result replicable? The answer depends on what you consider a replication to be, and what you are willing to assume about the context of the experiment. First the general case, and then the particulars. Most psychologists know that a sampling distribution is the probability of finding a statistic such as the effect size, d, given the "true" value of the population parameter, δ (delta). Under the null, δ = 0: The sampling distribution, typically a normal or t-distribution, is centered on 0. If the experimental and control groups are the same size and sum to n, then the variance of the distribution is approximately 4/(n - 4). This is shown in Figure 1. The area to the right of the initial result, d1 = 0.74, is less than α = .05, so the result qualifies as significant.

To generate a predictive distribution, move the sampling distribution from 0 to its most likely place. Given knowledge of only your data, that is the obtained effect size, d1 = 0.74. If this were the true effect size δ, then that shifted sampling distribution would also give the probability of a replication: The probability that it would be significant is the area under this distribution that lies to the right of the α cut-off, 0.675, approximately 58%. The probability of a replication returning in the wrong direction is the area under this curve to the left of 0, which equals the 1-tailed p-value of the initial study.

Figure 1. The curve centered on 0 is a sampling distribution for effect size, d, under the null hypothesis. Shifted to the right it gives the predicted distribution of effect sizes in replications, in case the true effect size, δ, equals the recorded effect size d1. Since we do not know that δ precisely equals d1, because both the initial study and the replicate will incur sampling error, the variance of the distribution is increased, doubled in the case of an equal-powered replication, to create the posterior predictive distribution (ppd), the intermediate distribution on the right. In the case of a conceptual rather than strict replication, additional realization variance is added, resulting in the lowest ppd. In all cases, the area under the curves to the right of the origin gives the probability of supportive evidence in replication.

If we knew that the true effect size was exactly δ = 0.74, no further experiments would be necessary. But we do not know what δ is; we can only estimate it from the original results. There are thus at least two sources of error: the sampling error in the original, and in the ensuing replication. This leads to a doubling of the variance in prediction, for a systematic replication attempt of the same power. The ppd is located at the obtained estimate of d, d1, and has twice the variance of the sampling distribution: 8/(n - 4). The resulting probability of achieving a significant effect in replication, the area to the right of 0.675, shrinks to 55%.
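The sketch below reproduces these numbers under the assumptions just stated (a normal sampling distribution of d with variance 4/(n - 4), doubled for the ppd). It is an illustrative Python calculation, not code from the original article.

```python
import numpy as np
from scipy.stats import norm

n = 40                       # total participants (20 control + 20 experimental)
d1 = 0.74                    # obtained effect size
d_crit = 0.675               # effect size needed to reach the alpha = .05 cut-off

var_samp = 4 / (n - 4)       # sampling variance of d, ~ 0.111
var_ppd = 2 * var_samp       # ppd variance for an equal-powered replication, ~ 0.222

# Treating d1 as the true delta: probability a replication is significant (~ .58)
p_sig_if_delta_known = norm.sf(d_crit, loc=d1, scale=np.sqrt(var_samp))

# Under the ppd (both studies incur sampling error) that probability shrinks (~ .55)
p_sig_ppd = norm.sf(d_crit, loc=d1, scale=np.sqrt(var_ppd))

# Basic prep: probability the replication returns an effect of the same sign (~ .94)
prep = norm.sf(0.0, loc=d1, scale=np.sqrt(var_ppd))
```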
What constitutes evidence of replication?
What if an ensuing replication found an effect size of 0.5? That is below your estimate of 0.74, and falls short of significance. Is this evidence for or against the original claim? It would probably be reported as a "failure to replicate". But that is misleading: If those data had been part of your original study, the increase of n would have more than compensated for the decrease in d, substantially improving the significance level of the results. The claim was for a causal factor, and the replication attempt, though not significant, returned evidence that (weakly) supports that claim. It is straightforward to compute the probability of finding supporting evidence of any strength in replication. The probability of a positive effect in replication is the area under the ppd to the right of 0. In this case that area is .94, suggesting a very good probability that in replication the result will not go the wrong way and contradict your original results. This is the basic version of prep.

What constitutes a replication?
The above assumed that the only source of error was sampling variability. But there are other sources as well, especially in the most useful case of replication, a conceptual replication involving a different population of participants and different analytic techniques. Call this "random effects" variability realization variance, here σ²R. In social science research it is approximately σ²R = 0.08 across various research contexts. This noise reduces replicability, especially for studies with small effect sizes, by further increasing the spread of the ppd. The median value of σ²R = 0.08 limits all effect sizes less than 0.5 to prep < .90, no matter how many data they are based on. In the case of the above example, it reduces prep from .94 to .88, so that there is about 1 chance in 8 that a conceptual replication will come back in the wrong direction.

What is the best predictor of replicability?
In the above example all that was known were the results of the experiment: We assumed "flat priors", a priori ignorance of the probable effect size. In fact, however, more than that is typically known, or suspected; the experiment comes from a research tradition in which similar kinds of effects have been studied. If the experiment had concerned the effect of the activation of a randomly chosen gene, or of a randomly chosen brain region, on a particular behavior, the prior distribution would be tightly centered close to 0, and the ppd would move down toward 0. If, however, the experiment is studying a large effect that had been reported by other laboratories, the priors would be centered near their average effect size, and the ppd moved up toward them. The distance moved depends on the relative weight of evidence in the priors and in the current data. Exactly how much weight should be given to each is a matter of art and argument. The answer depends largely on which of the following three questions is on the table.

How should I evaluate this evidence?
To avoid capricious and ever-differing evaluations of the replicability of results due to diverse subjective judgments of the weight of more or less relevant priors, prep was presented for the case of flat, ignorance priors. This downplays precision of prediction in the service of stability and generality of evaluation; it decouples the evaluation of new data from the sins, and virtues, of their heritage. It uses only the information in the data at hand, or that augmented with a standardized estimate of realization variance.
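As an illustration of that standardized adjustment, the sketch below adds σ²R = 0.08 to the running example, reducing prep from about .94 to about .88 as noted above. It is a Python sketch under the same assumptions as before (realization variance enters both the original and the replicate, so it is doubled along with the sampling variance).

```python
import numpy as np
from scipy.stats import norm

n, d1 = 40, 0.74
var_samp = 4 / (n - 4)       # sampling variance of d
var_R = 0.08                 # standardized realization variance for conceptual replications

prep_strict = norm.sf(0.0, loc=d1, scale=np.sqrt(2 * var_samp))                # ~ .94
prep_conceptual = norm.sf(0.0, loc=d1, scale=np.sqrt(2 * (var_samp + var_R)))  # ~ .88
```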
What should I believe?
Here priors matter: Limiting judgment to only the data in hand is shortsighted. If a novel experiment provides evidence for extra-sensory pre-cognition, what you should believe should be based on the corpus of similar research, updated by the new data. In this case, it is likely that your priors will dominate what you believe.

What should I do?
NHST is of absolutely no value in guiding action, as it gives neither the probability of the null nor of the alternative, nor can it give the probability of replication, which is central to planning. Prep is designed to predict replicability, and has been developed into a decision theory for action (Killeen 2006). Figure 2 displays a ppd and superimposed utility functions that describe the value, or utility, of various effect sizes. To compute expected value, integrate the product of the utility function with the probability of each outcome, as given by the ppd. The utility shown as dashed lines is 0 until effect size exceeds 0, then immediately steps to 1. Its expected value is prep, the area under the curve to the right of 0. Prep has a 1-to-1 relationship with the p-value. Thus, NHST (and prep when σ²R is 0) is intrinsically indifferent to size of effect, giving equal weighting to all positive effect sizes, and none to negative ones.

Figure 2. A candidate utility function is drawn as a power function of effect size (ogive). The value of an effect increases less than proportionately with its size. The expected utility of a future course of action is the probability of each particular outcome (the ppd) multiplied by the utility function, that is, the integral of the product of the two functions. Because traditional significance tests give no weight to effect size, their implicit utility function is flat (dashed).

If drawn as a line at -7 up to the origin, and then at 1 to the right, the utility function sets a threshold for positive utility in replication at the traditional level α = .05. Other exponents for the utility function return other criteria, such as the Akaike criterion and the Bayesian information criterion. The ogive gives approximately equal weight to effect size and to replicability. If the weight on negative effect sizes were -7, then the expected utility of an effect in replication would be negative for any ppd whose area to the left of the origin was greater than 1/7 of the area to the right. This sets a criterion for positive action that is identical to the α = .05 criterion. Conversely, the traditional criterion α = .05 de facto sets the disutility of a false alarm at seven times the utility of a hit; α = .01 corresponds to a 19/1 valuation of false positives to true positives. This exposition thus rationalizes the α levels traditional in NHST. Economic valuations are never discontinuous like these step functions; rather they look more like the ogive shown in Figure 2, which is a power function of effect size. To raise expected utility above a threshold for action, such ogives require more accuracy, typically larger n, when effect sizes are small than does NHST; conversely, large effect sizes pass criteria with smaller values of n, and lower replicability. Depending on the exponent of the utility function, it will emulate traditional decision rules based on AIC, BIC, and the adjusted coefficient of determination. In the limits, as the exponent approaches 0 it returns the traditional step function of NHST, indifferent to effect size; as it approaches 1, only effect size, not replicability, matters. A power of 1/3 weights them approximately equally. Thus prediction, built upon the ppd, and modulated by the importance of potential effects, can guide behavior; NHST cannot.
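The sketch below makes the expected-utility calculation concrete for the running example by numerically integrating the product of the ppd with a power-function utility. The particular parameterization (exponent 1/3 for positive effects, a constant penalty of -7 for negative ones) is my own combination of the values discussed above, so treat it as an illustrative assumption rather than the article's specification.

```python
import numpy as np
from scipy.stats import norm

n, d1 = 40, 0.74
scale = np.sqrt(2 * 4 / (n - 4))          # ppd standard deviation (sigma^2_R = 0 here)

d = np.linspace(-4.0, 5.0, 4001)          # grid of possible replication effect sizes
step = d[1] - d[0]
ppd = norm.pdf(d, loc=d1, scale=scale)    # posterior predictive density

def utility(d, exponent=1/3, penalty=-7.0):
    # Power-function value for positive effects; constant disutility for negative ones.
    return np.where(d > 0, np.abs(d) ** exponent, penalty)

expected_utility = np.sum(utility(d) * ppd) * step   # act only if this exceeds 0
prep_check = np.sum((d > 0) * ppd) * step            # a 0/1 step utility recovers prep ~ .94
```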
How reliable are predictions of replicability?
Does positive psychology enhance well-being, or ameliorate depressive symptoms? A recent meta-analysis of positive psychology interventions found a mean effect size of 0.3 for both dependent variables over 74 interventions (Sin and Lyubomirsky 2009). With an average of 58 individuals per condition, and setting σ²R = 0.08, prep is .88. Of the 74 studies, 65 should therefore have found a positive effect; 66 found a positive effect. Evaluation of other meta-analyses shows similarly high levels of accuracy for prep's predictions. We may also predict that one of the studies in this ensemble should have gone the wrong way strongly (its prep > 0.85 for a negative effect). What if yours had been one of the studies that showed no or negative effects? The most extreme negative effect had a prep of a severely misleading .88 (for negative replicates)! Prep gives an expected, average estimate of replicability (Cumming 2005); but it, like a p-value, typically has a high associated variance (Killeen 2007). It is because we cannot say beforehand whether you will be one of the unlucky few that some experts (e.g., Miller 2009) have disavowed the possibility of predicting replicability in general, and of individual research results in particular. Those with a more Bayesian perspective are willing to bet that your results will not be the most woeful of the 74, but rather closer to the typical. It is your money, to bet or hold; but as a practitioner, you must eventually recommend a course of action. Whereas reserving judgment is a traditional retreat of the academic, it can be an unethical one for the practitioner. Prep, used cautiously, provides a guide to action.

What else can be done with the ppd?
Replicability intervals
While more informative than p-values, confidence intervals are underused and generally poorly understood. Replicability intervals delimit the values within which a replication is likely to fall. 50% replicability intervals are approximately equal to plus or minus one standard error of the statistic. These traditional measures of stability of estimation may be centered on the statistic, and de facto constitute the values within which replications will fall half the time.

Multiple comparisons
If a number of comparisons have been performed, how do we decide whether the ensemble of results is replicable? We are appropriately warned against alpha inflation in such circumstances, and similar considerations affect prep. But some inferences are straightforward, as the sketch below illustrates. If the tests are independent (as assumed, for example, in ANOVA), then the probability of a replication showing all effects to be in the same direction (or significant, etc.) is simply the product of the replicabilities of all individual tests. The probability that none will again achieve your definition of replication is the product of the complements of each of the preps.
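A minimal sketch of those product rules for independent tests; the three prep values are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

preps = np.array([0.94, 0.88, 0.97])    # hypothetical preps from three independent tests

p_all_replicate = np.prod(preps)        # all effects return in the same direction again
p_none_replicate = np.prod(1 - preps)   # none meets your definition of replication
p_at_least_one = 1 - p_none_replicate   # complement of the product of the complements
```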
Is there a simple way to recalibrate the replicability of one of k tests, post hoc? If all the tests asked exactly the same question, that is, constituted within-study replications, the probability that all would replicate is the focal prep raised to the kth power. This conservative adjustment is similar in spirit to the Šidák correction, and suitably reins in predictions of replicability for a post-hoc test.

Model comparison and longitudinal studies
Ashby and O'Brien (2008) have generalized the use of prep for the situation of multiple trials with a small number of participants, showing how to evaluate alternate models against different criteria (e.g., AIC, BIC). Their analysis is of special interest both to psychophysicists and to clinicians conducting longitudinal studies.

Diagnosticity vs detectability
Tests can succeed in two ways: they can affirm when the state of the world is positive (a hit), and they can deny when it is negative (a correct rejection). Likewise they can fail in two ways: affirm when the state of the world is negative (a false alarm, a Type I error), and deny when it is positive (a miss, a Type II error). The detectability of a test is its hit rate; the diagnosticity is its correct rejection rate. Neither alone is an adequate measure of the quality of a test: Detectability can be perfect if we always affirm, driving the diagnosticity to 0; we can detect 100% of children with ADHD if the test is "Do they move?" A Relative Operating Characteristic, or ROC, gives the hit rate as a function of the false alarm rate. The location on the curve gives the performance for a particular criterion. If the criterion for false alarms is set at α = .05, then the ordinate gives the power of the test. But that criterion is arbitrary. What is needed to evaluate a test is the information it conveys independently of the particular criterion chosen. The area under the ROC curve does just that: It measures the quality of the test independently of the criterion for action. Irwin (2009) has shown that this area is precisely the probability computed by prep: prep thus constitutes a criterion-free measure of the quality of a diagnostic test.

Efficacy vs effectiveness
In the laboratory an intervention may show significant effects of good size (its efficacy), but in the field its impact (its effectiveness) will vary, and will often disappoint. There are many possible reasons for this difference, such as differences in the skills of administering clinicians, the need to accommodate individuals with comorbidities, and so on. These variables increase realization variance, and thus decrease replicability. Finding that effectiveness is generally less than efficacy is but another manifestation of realization variance.

How can prep improve the advice given to patients?
What is the probability that a depressed patient will benefit from a positive psychology intervention? A representative early study found that group therapy was associated with a significant decrease in Beck Depression Inventory scores for a group of mildly to moderately depressed young adults. The effects were enduring, with an effect size of 0.6 at 1-year posttest. Assuming a standard realization variance of 0.08, prep is .81.
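The sketch below reproduces that estimate under the same conventions as before (sampling variance 4/(n - 4), realization variance added to both studies). The study's total sample size of 33 is inferred from the next paragraph's "an n of 1, not 33", so treat it as an assumption.

```python
import numpy as np
from scipy.stats import norm

d1 = 0.6                 # effect size at 1-year posttest
n_total = 33             # assumed total sample size of the group-therapy study
var_R = 0.08             # standard realization variance

var_err = 4 / (n_total - 4)                                          # ~ 0.14
prep = norm.sf(0.0, loc=d1, scale=np.sqrt(2 * (var_err + var_R)))    # ~ .81
```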
But that is for an equal-powered replication. What is the probability that your patient could benefit from this treatment? Here we are replicating with an n of 1, not 33. Instead of doubling the variance of the original study, we must add to it the variance of the sampling distribution for n = 1; that is, the standard deviation of effect size for a single individual, 1. This returns a prep of 0.71. Thus, there is about a 70% chance that positive psychotherapy will help your patient for the ensuing year, which, while not great, may be better than the alternatives. Even when the posterior is based on all of the data in the meta-analysis, n > 4000, it does not change the odds for your patient, as that is here limited by the effect sizes for these interventions, and your case of n = 1. You nonetheless have an estimate to offer her, insofar as it may be in her interest.

Why is prep controversial?
The original exposition contained errors (Doros and Geier 2005), later corrected (Killeen 2005b, 2007). Analyses show that prep is biased when used to predict the coincidence in the sign of the effects of two future experiments; strongly biased when the null is true (Miller and Schwarz 2011) or when the true effect size is stipulated. But prep is not designed to predict the coincidence of future experiments. It is designed to predict the replication of known data, based on those data (Lecoutre and Killeen 2010). If we know, or postulate, that the null is true, then the correct prior in prep is δ = 0, with variance of 0. No matter what the first effect size, the probability of replication is ½, and that is precisely what prep will predict. Prep was developed for what scientists can know, not what statisticians can stipulate; and all that scientists are privy to are data, not parameters.

Prep is often computed incorrectly. It is easily computed from p-values: When σ²R = 0, in Excel code it is prep = NORMSDIST(NORMSINV(1-p)/SQRT(2)). In general, compute a p-value with the standard error increased to SE = [2(σ²Err + σ²R)]^(1/2), where σErr is the standard error of the statistic being evaluated. Then the complement of that p-value is the probability of replication. A common mistake is to use 2-tailed p-values directly, rather than first halving them (Lecoutre, Lecoutre, and Poitevineau 2010). Another problem is that prep does not dictate an absolute criterion such as α = .025. For that, one needs to embed the ppd in a full-fledged decision theory.

The vast majority of individuals who assume the null don't believe it; but they don't know what else to assume. Evaluating the probability of replication avoids that dilemma. To evaluate evidence, use the basic prep, with σ²R set at a standardized value such as 0.08. To know what to believe, augment the simple version of prep with what you know, whether realization variance or priors. If you have a well-defined alternate hypothesis, use Bayesian analyses.
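A Python equivalent of the spreadsheet formula above; this is a sketch, and the function name is mine.

```python
from scipy.stats import norm

def prep_from_p(p_one_tailed):
    """Basic prep (sigma^2_R = 0) from a one-tailed p-value;
    the counterpart of NORMSDIST(NORMSINV(1 - p)/SQRT(2))."""
    return norm.cdf(norm.ppf(1.0 - p_one_tailed) / 2 ** 0.5)

# Remember to halve a two-tailed p-value before converting:
print(prep_from_p(0.05 / 2))   # ~ 0.92 for a two-tailed p of .05
```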
References
Ashby, F. Gregory, and Jeffrey B. O'Brien. 2008. "The prep statistic as a measure of confidence in model fitting." Psychonomic Bulletin & Review 15: 16-27. doi: 10.3758/PBR.15.1.16.
Cumming, Geoff. 2005. "Understanding the average probability of replication: Comment on Killeen (2005)." Psychological Science 16: 1002-1004. doi: 10.1111/j.1467-9280.2005.01650.
Doros, Gheorghe, and Andrew B. Geier. 2005. "Comment on 'An Alternative to Null-Hypothesis Significance Tests'." Psychological Science 16: 1005-1006. doi: 10.1111/j.1467-9280.2005.01651.x.
Irwin, R. John. 2009. "Equivalence of the statistics for replicability and area under the ROC curve." British Journal of Mathematical and Statistical Psychology 62 (3): 485-487. doi: 10.1348/000711008X334760.
Killeen, Peter R. 2005a. "An alternative to null hypothesis significance tests." Psychological Science 16: 345-353. doi: 10.1111/j.0956-7976.2005.01538.
Killeen, Peter R. 2005b. "Replicability, confidence, and priors." Psychological Science 16: 1009-1012. doi: 10.1111/j.1467-9280.2005.01653.x.
Killeen, Peter R. 2006. "Beyond statistical inference: a decision theory for science." Psychonomic Bulletin & Review 13: 549-562. doi: 10.3758/BF03193962.
Killeen, Peter R. 2007. "Replication statistics." In Best practices in quantitative methods, edited by J. W. Osborne, 103-124. Thousand Oaks, CA: Sage.
Krueger, Joachim. 2001. "Null hypothesis significance testing: On the survival of a flawed method." American Psychologist 56: 16-26. doi: 10.1037//0003-066X.56.1.16.
Lecoutre, Bruno, and Peter R. Killeen. 2010. "Replication is not coincidence: Reply to Iverson, Lee, and Wagenmakers (2009)." Psychonomic Bulletin & Review 17 (2): 263-269. doi: 10.3758/PBR.17.2.263.
Lecoutre, Bruno, Marie-Paule Lecoutre, and Jacques Poitevineau. 2010. "Killeen's probability of replication and predictive probabilities: How to compute and use them." Psychological Methods 15: 158-171. doi: 10.1037/a0015915.
Miller, Jeff. 2009. "What is the probability of replicating a statistically significant effect?" Psychonomic Bulletin & Review 16: 617-640. doi: 10.3758/PBR.16.4.617.
Miller, Jeff, and Wolf Schwarz. 2011. "Aggregate and individual replication probability within an explicit model of the research process." Psychological Methods. doi: 10.1037/a0023347.
Nickerson, Raymond S. 2000. "Null hypothesis significance testing: A review of an old and continuing controversy." Psychological Methods 5: 241-301. doi: 10.1037/1082-989X.5.2.241.
Sin, Nancy L., and Sonja Lyubomirsky. 2009. "Enhancing well-being and alleviating depressive symptoms with positive psychology interventions: A practice-friendly meta-analysis." Journal of Clinical Psychology 65 (5): 467-487. doi: 10.1002/jclp.20593.

Further reading
Cumming, Geoff. 2012. Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Edited by L. Harlow, Multivariate Applications. New York, NY: Routledge.
Harlow, Lisa Lavoie, Stanley A. Mulaik, and James H. Steiger. 1997. What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.