Journal of Psychopathology and Behavioral Assessment, Vol. 12, No. 4, 1990

A Decision Tree Approach to Selecting an Appropriate Observation Reliability Index

Hoi K. Suen,$^{1,3}$ Donald Ary,$^{2}$ and Wesley C. Covalt

Accepted: September 17, 1990

Based on the conceptual framework outlined by Cone (1986) and Suen (1988), a practical decision tree is developed as an aid for the selection of observational reliability indices.

KEY WORDS: interobserver agreement; intraobserver reliability.

$^{1}$ 230 CEDAR Building, Department of Educational & School Psychology & Special Education, The Pennsylvania State University, University Park, Pennsylvania 16802.
$^{2}$ Northern Illinois University, De Kalb, Illinois 60115.
$^{3}$ To whom correspondence should be addressed.

© 1990 Plenum Publishing Corporation

The purpose of this brief paper is to suggest a practical process through which the appropriate choice of reliability index can be determined, based on the conceptual framework provided by Cone (1986) and Suen (1988).

There are numerous methods of assessing interobserver agreement and/or reliability today. These methods have been derived from different statistical assumptions, are appropriate for different epistemological paradigms, and yield different types of information. While many authors have compared various indices conceptually or have shown mathematical relationships among them, this paper uses a decision tree format to show when different indices are appropriate.

Observation researchers can be divided according to the paradigm to which they adhere. There are those who position themselves close to the traditional Skinnerian view that only directly observable behaviors can be measured; this is the idiographic-behavior paradigm. On the other hand, there are those researchers who attempt to develop observational measures of psychological constructs or mediating processes which cannot be directly measured; this is the nomothetic-trait paradigm (cf. Cone, 1986; Suen, 1988).

Because the object of measurement is a directly observable behavior, those concerned with idiographic-behavior measures need to show correspondence between the behavior and some external criterion. Thus, a criterion-referenced interpretation of data, with its external criterion, best serves the idiographic-behavior paradigm. The nomothetic-trait view has a wider scope of acceptable events to measure. Therefore, either a criterion-referenced or a norm-referenced interpretation of the data will be appropriate, depending on the standard of assessment selected: a set, external criterion or a subject's relation to a normed group. With these points in mind, observational reliability indices can be discussed.

Table I. A 2 × 2 Table Representing the Observational Data of Two Observers$^{a}$

                        Observer 1
                        0         1
Observer 2      1       $a$       $b$       $p_2$
                0       $c$       $d$       $q_2$
                        $q_1$     $p_1$

$^{a}$ 0 = nonoccurrence of behavior; 1 = occurrence of behavior; $p_1$ = proportion of occurrence reported by Observer 1; $q_1$ = proportion of nonoccurrence reported by Observer 1; $p_2$ = proportion of occurrence reported by Observer 2; $q_2$ = proportion of nonoccurrence reported by Observer 2.

Observational reliability indices can be divided into three groups, labeled interobserver agreement, intraobserver reliability, and intraclass generalizability.

Interobserver agreement indices are appropriate for nomothetic or idiographic paradigms and criterion-referenced interpretations. They are essentially measures of observer interchangeability in situations of multiple observers with a single subject or event. All interobserver agreement indices are omnibus indicators which do not distinguish between systematic and random measurement errors. Some commonly used or frequently recommended interobserver agreement indices include proportion agreement ($P_o$), occurrence agreement ($P_{occ}$), nonoccurrence agreement ($P_{non}$), kappa ($\kappa$), Dice's index ($S_D$), Scott's pi ($\pi$), and Maxwell and Pillemer's $r_{11}$ (Hartmann, 1982). Whereas $\kappa$, $\pi$, and $r_{11}$ adjust for chance agreement, the other interobserver
agreement indices do not. Furthermore, using the conventional cell labels presented in Table I, it can be added that $\pi$ and $r_{11}$ are influenced by b-c and a-d differences; $P_{occ}$, $P_{non}$, and $S_D$ are influenced by b-c differences but not a-d differences; and $\kappa$ and $P_o$ treat all agreements and disagreements the same.

Fig. 1. A decision tree for the selection of observational reliability indices. [Figure not reproduced. Branch labels: nomothetic vs. idiographic orientation; norm-referenced vs. criterion-referenced; single observer vs. average across observers; information on sources of error vs. omnibus indicator; chance corrected vs. not corrected for chance.]

Intraobserver indices are appropriate for the nomothetic paradigm and for norm-referenced orientation only. The intraobserver reliability index is Pearson's $r$ (or the $\phi$ coefficient). Pearson's $r$ (or $\phi$) measures the reliability of a single score from a single observer. Problems arise if the classical parallel tests assumptions are violated (cf. Nunnally, 1978; Suen & Ary, 1989), for then the results are uninterpretable. However, once these assumptions are met, $r$ (or $\phi$) becomes an intraobserver reliability index.

Table II. Formulae for the Computation of Reliability Indices

Given Table I:

$P_o = b + c$
$P_{occ} = b/(a + b + d)$
$P_{non} = c/(a + c + d)$
$\kappa = (b + c - p_1 p_2 - q_1 q_2)/(1 - p_1 p_2 - q_1 q_2)$
$\phi = (bc - ad)/(p_1 p_2 q_1 q_2)^{1/2}$
$S_D = 2b/(p_1 + p_2)$
$\pi = [4(bc - ad) - (a - d)^2]/[(p_1 + p_2)(q_1 + q_2)]$
$r_{11} = 2(bc - ad)/(p_1 q_1 + p_2 q_2)$

Given the following:

$\sigma_s^2 = (MS_s - MS_e)/K$
$\sigma_o^2 = (MS_o - MS_e)/N$
$\sigma_e^2 = MS_e$

where $MS_s$ is subject mean square, $MS_o$ is observer mean square, $MS_e$ is residual mean square, $K$ is number of observers, and $N$ is number of subjects:

$r_n^2 = \sigma_s^2/(\sigma_s^2 + \sigma_e^2)$
$r_c^2 = \sigma_s^2/(\sigma_s^2 + \sigma_o^2 + \sigma_e^2)$
$\alpha_n = K\sigma_s^2/(K\sigma_s^2 + \sigma_e^2)$
$\alpha_c = K\sigma_s^2/(K\sigma_s^2 + \sigma_o^2 + \sigma_e^2)$

The intraclass measures are calculated from an analysis of variance (ANOVA). In an interobserver situation, the subject (event) variance is considered the "true" variance and the other estimable variances are considered error. The exception is when certain conditions (or facets) of measurement are considered fixed. In these situations, variances due to the interaction between the condition and the subject become part of the "true" variance as well (cf. Suen, 1990).

Commonly recommended intraclass indices for interobserver reliability include Hartmann's coefficient ($r_n^2$), Berk's $r_1^2$ and $r_2^2$ (denoted $r_c^2$ and $\alpha_c$, respectively, in this paper), and Cronbach's alpha as suggested by Bakeman and Gottman ($\alpha_n$) (cf. Bakeman & Gottman, 1986; Suen, 1988). The $r_n^2$ index is equivalent to Pearson's $r$ or $\phi$ yet does not require the restrictive classical parallel tests assumptions. Similar to Pearson's $r$, it is appropriate for nomothetic paradigms and norm-referenced interpretations. The $\alpha_n$ index is also appropriate for nomothetic paradigms and norm-referenced orientations, but it indicates the reliability of average scores across observers. It is therefore appropriate only if the unit of analysis in subsequent comparisons and analyses is the average score across a number of observers equal to the number used in the reliability check. The $\alpha_c$ index is the same as $\alpha_n$ but is used for criterion-referenced data and either nomothetic or idiographic paradigms. Finally, the $r_c^2$ index is analogous to $r_n^2$ and is useful for nomothetic or idiographic paradigms and criterion-referenced orientations. Both $\alpha_c$ and $r_c^2$ provide information on both interobserver
agreement and intraobserver reliability yet do not require the restrictive classical parallel tests assumptions. All intraclass indices provide information on the specific sources of measurement errors.

Figure 1 provides a decision tree to aid practitioners in the choice of observational reliability indices. These indices are appropriate when all other conditions of measurement are controlled and standardized and the only dimensions that change are the observer used and the behavior observed. Also presented (Table II) is a list of formulas to calculate each of the above indices.

REFERENCES

Bakeman, R., & Gottman, J. M. (1986). Observing interaction: An introduction to sequential analysis. London: Cambridge University Press.
Cone, J. D. (1986). Idiographic, nomothetic, and related perspectives in behavioral assessment. In R. O. Nelson & S. C. Hayes (Eds.), Conceptual foundations of behavioral assessment (pp. 111-128). New York: Guilford.
Hartmann, D. P. (1982). Assessing the dependability of observational data. In D. P. Hartmann (Ed.), Using observers to study behavior (pp. 51-66). San Francisco, CA: Jossey-Bass.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Suen, H. K. (1988). Agreement, reliability, accuracy, and validity: Toward a clarification. Behavioral Assessment, 10, 343-366.
Suen, H. K. (1990). Principles of test theories. Hillsdale, NJ: Lawrence Erlbaum (in press).
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Hillsdale, NJ: Lawrence Erlbaum.
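As a computational aside, the interobserver agreement formulae of Table II can be sketched in a few lines of Python. This is a minimal illustration and not part of the original article. The function name `agreement_indices` and the dictionary keys are hypothetical. The sketch assumes that b and c are the agreement cells and a and d the disagreement cells of Table I; every index below is symmetric in a and d, so it does not matter which disagreement cell belongs to which observer. Raw interval counts are accepted and converted to proportions, which is a convenience not described in the text.

```python
from math import sqrt


def agreement_indices(a, b, c, d):
    """Interobserver agreement indices from the four cells of Table I.

    b, c: agreement on occurrence / nonoccurrence.
    a, d: the two kinds of disagreement.
    Cells may be proportions or raw counts (normalized here; an assumption).
    Degenerate tables (e.g., a zero marginal) raise ZeroDivisionError.
    """
    total = a + b + c + d
    a, b, c, d = (x / total for x in (a, b, c, d))

    # Marginal proportions for the two observers.
    p1, q1 = b + d, a + c
    p2, q2 = a + b, c + d

    chance = p1 * p2 + q1 * q2  # chance agreement term used by kappa
    return {
        "P_o": b + c,
        "P_occ": b / (a + b + d),
        "P_non": c / (a + c + d),
        "kappa": (b + c - chance) / (1 - chance),
        "phi": (b * c - a * d) / sqrt(p1 * p2 * q1 * q2),
        "S_D": 2 * b / (p1 + p2),
        "pi": (4 * (b * c - a * d) - (a - d) ** 2) / ((p1 + p2) * (q1 + q2)),
        "r_11": 2 * (b * c - a * d) / (p1 * q1 + p2 * q2),
    }


if __name__ == "__main__":
    # Hypothetical data: 100 intervals, 80 agreements, 20 disagreements.
    for name, value in agreement_indices(a=10, b=40, c=40, d=10).items():
        print(f"{name:6s} {value:.3f}")
```

With these made-up data the two observers have identical marginals, so the chance-corrected indices coincide while the uncorrected proportion agreement is higher, which is the contrast the text draws between the two groups of indices.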
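The intraclass formulae of Table II can be sketched in the same way. The following is an illustrative sketch only, under stated assumptions: the function names `mean_squares` and `intraclass_indices` are hypothetical, the data layout (one row per subject or event, one column per observer, one score per cell) is assumed, and the mean squares come from an ordinary two-way ANOVA without replication. Negative variance estimates are not handled.

```python
def mean_squares(scores):
    """Mean squares from a subjects-by-observers table (one score per cell).

    Returns (MS_s, MS_o, MS_e): subject, observer, and residual mean squares.
    """
    n = len(scores)        # N: number of subjects (or events)
    k = len(scores[0])     # K: number of observers
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_s = k * sum((m - grand) ** 2 for m in row_means)
    ss_o = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_e = ss_total - ss_s - ss_o
    return ss_s / (n - 1), ss_o / (k - 1), ss_e / ((n - 1) * (k - 1))


def intraclass_indices(ms_s, ms_o, ms_e, n, k):
    """Variance components and the four intraclass indices of Table II."""
    var_s = (ms_s - ms_e) / k   # subject ("true") variance
    var_o = (ms_o - ms_e) / n   # observer variance
    var_e = ms_e                # residual variance
    return {
        "r_n^2": var_s / (var_s + var_e),
        "r_c^2": var_s / (var_s + var_o + var_e),
        "alpha_n": k * var_s / (k * var_s + var_e),
        "alpha_c": k * var_s / (k * var_s + var_o + var_e),
    }


if __name__ == "__main__":
    # Hypothetical scores: Observer 2 always reports one unit more than
    # Observer 1, so the observers rank subjects identically.
    scores = [[1, 2], [3, 4], [5, 6]]
    ms_s, ms_o, ms_e = mean_squares(scores)
    result = intraclass_indices(ms_s, ms_o, ms_e, n=len(scores), k=len(scores[0]))
    for name, value in result.items():
        print(f"{name:8s} {value:.3f}")
```

In this made-up example the constant offset between observers leaves the norm-referenced indices at their maximum but lowers the criterion-referenced ones, because only the latter count observer variance as error.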