Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid

Behavioural Processes xxx (2011) xxx–xxx. Article in press (G Model BEPROC-2305). Journal homepage: www.elsevier.com/locate/behavproc

Peter R. Killeen*
Department of Psychology, Box 1104, McAllister St., Arizona State University, Tempe, AZ 85287-1104, United States

Article history: Received 15 August 2010; received in revised form 24 December 2010; accepted 27 December 2010.

Keywords: Delay of reinforcement gradients; Discounting; Forced choice paradigms; Magnitude effect; Matching paradigms; Reinforcement learning; Trace decay gradients

Abstract

Behavior such as depression of a lever or perception of a stimulus may be strengthened by consequent behaviorally significant events (BSEs), such as reinforcers. This is the Law of Effect. As time passes since its emission, the ability for the behavior to be reinforced decreases. This is trace decay. It is upon decayed traces that subsequent BSEs operate. If the trace comes from a response, it constitutes primary reinforcement; if from perception of an extended stimulus, it is classical conditioning. This paper develops simple models of these processes. It premises exponentially decaying traces related to the richness of the environment, and conditioned reinforcement as the average of such traces over the extended stimulus, yielding an almost-hyperbolic function of duration. The models account for some data, and reinforce the theories of other analysts by providing a sufficient account of the provenance of these effects. They lead to a linear relation between sooner and later isopreference delays whose slope depends on sensitivity to reinforcement, and intercept on that and the steepness of the delay gradient. Unlike human prospective judgments, all control is vested in either primary or secondary reinforcement
processes; therefore the use of the term discounting, appropriate for humans, may be less descriptive of the behavior of nonverbal organisms.

© 2011 Elsevier B.V. All rights reserved.

Introduction

Pigeons cannot reliably count above (Brannon et al., 2001; Nickerson, 2009; Uttal, 2008), have short time-horizons (Shettleworth and Plowright, 1989), may be stuck in time (Roberts and Feeney, 2009), do not ask for the answers to the questions they are about to be asked (Roberts et al., 2009), and fail to negotiate an amount of reinforcement commensurate with the work that they are about to undertake (Reilly et al., 2011). How do such simple creatures discount future payoffs as a function of their delay? It is the thesis of this paper that they do not; that the orderly data in such studies are the simple result of the dilution of the conditioned reinforcers which support and guide that choice, as a function of the delay to the outcome that they signal.

Classic and generally accepted concepts of causality preclude events from acting backward in time. Then what sense do we make of Fig. 1, a familiar rendition of the control exerted by delayed reinforcers? How do the animals know what is coming?
Only three accounts come to mind. (a) Precognition. But causality rules that out. (b) It is memory of a past choice that makes contact with reinforcement; the figure should be reversed. Or (c) the animals have learned what leads to what. There follows an extended argument that (b) and (c) are both true, and that in novel contexts, (b) typically leads to (c).

* Tel.: +1 480 967 0560; fax: +1 480 965 8544. E-mail address: killeen@asu.edu

When in the course of an animal's behavior a behaviorally significant event (BSE; or phylogenetically important event (Baum, 2005); or more familiarly, incentive, reinforcer, or unconditioned stimulus) occurs, there immediately arises the question of whence. In computer science this is the assignment of credit problem. If the organism, or software, takes into account events in the last instant, there are $r$ potential causes for the BSE, where $r$ is a measure of the richness of context. An additional $r$ events occurred in the prior, penultimate instant. The combination of any one of these with those in the ultimate instant could have been the causal chain that led to the BSE: $r^2$ sequences in toto. Extending the account further, to the antepenultimate instant, raises the pool to $r^3$. Continue this process back and the candidate pool of sequences grows as $r^n$, where $n$ is the depth of query. If each of these instants of apprehension lasts $\delta$ s, then $n = d/\delta$, and the candidate pool grows as $r^{d/\delta}$, where $d$ is the delay between event and consequence. In the continuous limit, this equals $e^{d/\tau}$, where $\tau$ is the time constant of the traces – the inverse of the continuous limit of the richness parameter $r$. This means that the gradients get steeper in rich environments: $\tau = 1/r$. It follows that any one causal path is eligible for $1/e^{d/\tau}$ of the credit for reinforcement, everything being equal. Of course everything is not equal: the priors on some events are higher than on others, either because of their phylogenetic relevance, or their memorability, which may be enhanced by
marking their occurrence with salient stimuli. Allowing for such bias, represented by the parameter $c$, we would expect the causal impact to decrease with time prior as:

$$s' = c\,e^{-d/\tau}, \quad \tau > 0 \tag{1}$$

which is the point association¹ of an event at $d$ seconds remove from the BSE, as seen in Fig. 2. This story for why associability between an event and a subsequent BSE may decay exponentially, retold from Johansen et al. (2009) and Killeen (2005), has some empirical support (Killeen, 2001b; Escobar and Bruner, 2007).

0376-6357/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.beproc.2010.12.016
Please cite this article in press as: Killeen, P.R., Models of trace decay, eligibility for reinforcement, and delay of reinforcement gradients, from exponential to hyperboloid. Behav. Process. (2011), doi:10.1016/j.beproc.2010.12.016

Fig. 1. Traditional delay of reinforcement gradients to two outcomes of different incentive value.

Fig. 2. Reverse Fig. 1 to see these, variously named trace decay, decay of eligibility for causal status, and decay of memory gradients. Gradients are shown for two responses of different memorability.

Eligibility traces play a central role in AI reinforcement learning (Singh and Sutton, 1996). Classic models such as Sutton and Barto's posit a geometrically decreasing representation of events similar to that developed here, and work to reconcile details of instrumental and Pavlovian conditioning with various instantiations of such traces (Sutton and Barto, 1990; Niv et al., 2002). Alternatively, it is possible to simply posit exponential or hyperbolic decay of memory of the stimulus, and also that these traces may or may not vary with the richness of the environment. This has been the productive tactic of most analysts of delay discounting. If this disposition is good enough for you, skip the next pages.

What is the purported mechanism?
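As a rough numerical illustration of the credit-assignment argument behind Eq. (1), the following Python sketch counts the candidate causal sequences that compete for credit and compares the resulting share with an exponential trace. The function names and parameter values are illustrative assumptions, not taken from the paper. The sketch uses the time constant δ/ln r, which is simply the value at which the discrete count and the exponential coincide exactly; the text summarizes the continuous limit as τ = 1/r, and in either reading a richer context (larger r) gives a smaller τ and a steeper gradient.

```python
import math


def candidate_paths(r: float, d: float, delta: float) -> float:
    """Size of the pool of candidate causal sequences reaching back d seconds,
    with r candidate events per instant of length delta: r ** (d / delta)."""
    return r ** (d / delta)


def eligibility(r: float, d: float, delta: float) -> float:
    """Share of credit available to any one sequence, everything being equal."""
    return 1.0 / candidate_paths(r, d, delta)


def trace(d: float, tau: float, c: float = 1.0) -> float:
    """Eq. (1): s' = c * exp(-d / tau)."""
    return c * math.exp(-d / tau)


# Illustrative (assumed) values: 3 candidate events per half-second instant.
r, delta = 3.0, 0.5
tau_equiv = delta / math.log(r)  # assumption of this sketch, see lead-in

for d in (0.0, 1.0, 2.0, 4.0):
    # The two columns agree because r**(-d/delta) == exp(-d * ln(r) / delta).
    print(f"d={d:3.1f}  1/pool={eligibility(r, d, delta):.6f}  "
          f"trace={trace(d, tau_equiv):.6f}")
```

Raising `r` in this sketch shrinks the share available at every delay, which is the sense in which richer environments provide more competition for credit.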
As developed here it is one of stimulus competition, with richer environments and greater interludes providing more opportunities for interference. A stimulus-sampling model of acquisition (Bower, 1994; Estes and Suppes, 1974; Neimark and Estes, 1967; Estes, 1950) provides the basis of a model of acquisition in the face of such contingencies degraded by delay and distraction (Killeen, 2001a). It is not repeated here. Another way to think of Eq. (1) is as a measure of the signal-to-noise ratio of a delay contingency. In the case $c = 1/\tau$, Eq. (1) describes a probability distribution, so that identification of one point from the distribution reduces candidate uncertainty by $\log_2(e\tau)$ bits.

Footnote 1. If the duration of a response is $\delta$ s, then the impact of reinforcement on it is given by the integral of Eq. (1) from $d$ to $d + \delta$. For brief events such as responses, this essentially equals $\delta$ times the right-hand side of Eq. (1). For responses of similar durations this coefficient is absorbed by $c$.

Fig. 3. Eligibility traces of a response at increasing temporal removes from a reinforcer. At greater removes, the right tails have lower associability with reinforcement, as indicated by their height where they intersect the right ordinate. Graphing that height above the temporal distance gives the dashed curve, the delay of reinforcement gradient.

What is the relation between eligibility traces and the delay of reinforcement gradient?
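One way to see the relation numerically is to let each event leave a trace that decays as in Eq. (1), and then read off the height of that trace at the moment of reinforcement. The sketch below does this in Python; the function names are hypothetical, and the time constant is an assumed value chosen so that a trace loses 30% of its height per time step.

```python
import math


def trace_height(elapsed: float, tau: float, c: float = 1.0) -> float:
    """Height of a trace `elapsed` time steps after the event (Eq. 1)."""
    return c * math.exp(-elapsed / tau)


def delay_gradient(delays, tau: float, c: float = 1.0):
    """For each delay d between event and reinforcer, the residual trace
    height at the moment the reinforcer arrives (the process read as product)."""
    return [trace_height(d, tau, c) for d in delays]


# Assumed for illustration: 30% of the trace is lost per time step,
# so exp(-1 / tau) = 0.7 and tau = -1 / ln(0.7).
tau = -1.0 / math.log(0.7)
delays = [0, 1, 2, 3, 4]
gradient = delay_gradient(delays, tau)

for d, g in zip(delays, gradient):
    print(f"event {d} steps before reinforcement: eligibility {g:.2f}")
```

With these assumed values the eligibilities come out near 1.00, 0.70, 0.49, 0.34 and 0.24: plotted against delay, the residual heights trace out an exponential of the same shape as any single decay trace.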
Fig. 3 shows trace gradients for events occurring more and more remote from the BSE. The most proximate occurs at the moment of reinforcement, and is visible only as a dot in the upper right corner; it receives the full credit for which it might be eligible. An event occurring 1 time step earlier has an impact diluted by about 30% by the time of reinforcement, as inferred from where its trace cuts the origin, the zero delay axis at the right of the graph. Draw this measure of eligibility, 0.7, out 1 unit from the right frame, as shown by the arrow, and connect it to the full measure in the corner by a dashed line. The event 2 steps back decays by about 50% at the time of reinforcement; draw a line from there extending to the left at 2, and continue the dashed line to it. When bored of this construction, stop to consider the shape of the delay of reinforcement gradient – the dashed line. When smoothed, it will have exactly the same shape as any of the decay traces, but will be reflected about its new origin at 0.

The distinction between these two representations, one of process and the other of product, is important. As Fig. 3 makes clear, what is present at the time of reinforcement is a decayed trace of a response. Differential reinforcer magnitude can have no retroactive effect on the shape or elevation of those traces. Reinforcers of different magnitudes do not change the decay gradients, but rather act differentially on their tails: a larger reinforcer may be more effective at leveraging the same residual memory than a small one. But those tails may be of different elevation – and thus differentially able to receive the effect of the reinforcement – because they are more or less memorable (reflected in $c$) or because they occur in a richer or bleaker environment (reflected in $\tau$).

Hyperbolic dilemmas

How can gradients be exponential when everyone says that they are hyperbolic?
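A small numerical sketch may help frame the question. If the exponential trace of Eq. (1) is averaged over a stimulus that lasts d seconds, the mean is the integral of c·exp(−t/τ) from 0 to d divided by d, that is c·τ·(1 − exp(−d/τ))/d. The Python below compares that average with a hyperbola of the form c/(1 + d/τ). Function names and the value τ = 2 s are illustrative assumptions; a midpoint-rule integration is included only as a check on the closed form.

```python
import math


def averaged_trace(d: float, tau: float, c: float = 1.0) -> float:
    """Mean of c*exp(-t/tau) over 0 <= t <= d; tends to c as d -> 0."""
    if d == 0:
        return c
    return c * tau * (1.0 - math.exp(-d / tau)) / d


def hyperbola(d: float, tau: float, c: float = 1.0) -> float:
    """Inverse-linear ('hyperbolic') gradient c / (1 + d/tau)."""
    return c / (1.0 + d / tau)


def numeric_average(d: float, tau: float, c: float = 1.0,
                    steps: int = 10_000) -> float:
    """Midpoint-rule estimate of the same mean, as an independent check."""
    dt = d / steps
    total = sum(c * math.exp(-(i + 0.5) * dt / tau) for i in range(steps))
    return total * dt / d


tau = 2.0  # assumed time constant, in seconds
for d in (0.5, 1, 2, 4, 8, 16):
    print(f"d={d:>4}: averaged trace={averaged_trace(d, tau):.3f}  "
          f"hyperbola={hyperbola(d, tau):.3f}")
```

With a shared τ the averaged trace lies on or slightly above the hyperbola, by less than about 0.14c at the delays printed here; allowing τ to differ between the two forms would narrow that gap further.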
The curves in Fig. 2 do not cross, whereas most representations of discounted future events of differing value do. These three figures address the associability of a discrete event at a remove of $t$ from reinforcement. They do not address situations in which that event leads to an immediate change of state signaling a deferred outcome. A signal of change of state marks the precipitating event by immediately singling it out as the precursor of a better (or possibly worse) state of affairs.

Fig. 4. Disks: the decreasing efficacy of a primary reinforcer as a function of the delay between it and a response. The continuous curve is given by Eq. (1); the dashed curve by Eq. (3). Squares: the decreasing efficacy of a conditioned reinforcer as a function of the maximum delay it signals. The continuous curve is given by Eq. (2); the dashed curve superimposed on it by Eq. (3). The data are from Richards (1981).

Consider a response that causes the onset of a stimulus, and after a delay of $d$, a BSE. Assume that each of the temporal elements of the stimulus receives associations as given by Eq. (1), and that these are otherwise equivalent in time (that is, that the parameters of Eq. (1) do not change over the delay). In the case that each element of the stimulus is highly generalizable with the next, these associations add linearly, giving a total associability equal to $\int_0^d c\,e^{-t/\tau}\,dt$. This integral assumes that the temporal elements $dt$ make linearly independent contributions to the total association. Because one element of the stimulus is, per hypothesis, indiscriminable from the next, any one element – in particular the one just following a response – has an average associability given by:
$$\bar{s}_d = \frac{\int_0^d c\,e^{-t/\tau}\,dt}{\int_0^d dt}, \qquad \bar{s}_d = \frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \tag{2}$$

Eq. (2) is not discriminable from the inverse linear relation known as hyperbolic (Killeen, 2001a). Fig. 4 demonstrates this similarity by fitting both Eq. (2), and the hyperbola

$$s_{hyp} = \frac{c}{1 + d/\tau} \tag{3}$$

to data from Richards (1981) that describe the effects of signaled delayed reinforcement on the average response rates of four pigeons. The curves through the squares superimpose. This makes sense, as Eq. (3) is a series approximation² to Eq. (2).

Footnote 2. $e^{-d/\tau} = 1/e^{d/\tau} \approx 1/(1 + d/\tau + \cdots)$, therefore

$$\frac{c\,\tau\,(1 - e^{-d/\tau})}{d} \approx \frac{c\,\tau}{d}\left(1 - \frac{1}{1 + d/\tau}\right), \qquad \bar{s}_d \approx \frac{c}{1 + d/\tau}.$$

The average absolute deviation between Eqs. (2) and (3) over the range from 0.99 to 0.04 is 0.064; however letting the time constant in either equation vary from its value in the other reduces this deviation to 0.023, within experimental error. The exponential term may also be approximated with the more standard Maclaurin series: $e^{-d/\tau} = 1 - d/\tau + (d/\tau)^2/2! - \cdots$, but the first approximation is everywhere more accurate. The latter approximation deviates from Eq. (2) by 4.6 (against 0.06), reduced to 0.33 (against 0.02) by refitting. The limit of Eq. (2) as $d$ goes to 0 is $c$, as may be demonstrated using l'Hôpital's rule.

Fig. 5. The decreasing efficacy of a reinforcer in establishing a new response as a function of the delay between it and a response. The continuous curve is exponential, the dashed curve hyperbolic. Error bars are the standard errors of the means. The data are from Wilkenfield et al. (1992).

Experienced laboratory animals can tell the difference between the start of a long delay and the start of a short one; they are sensitive to time and delay (Moore and Fantino, 1975). The use of Eq. (2) requires that, facing the start of a long delay to food and a stimulus which – in the best of times – is contiguous with food, control by the stimulus dominates that by time. Animals, in other words, are optimists: their behavior is primarily under the control of the most hopeful stimuli rather than some weighted average of predictive stimuli. There is good evidence that this is often the case (Horney and Fantino, 1984; Sanabria and Killeen, 2007; Jenkins and Boakes, 1973).

Also shown in Fig. 4 is the decay trace for unsignaled reinforcement. Under the hypothesis of the prior section, it is given by Eq. (1), an exponential function, shown as the continuous curve passing near the disks, which give response rates for unsignaled (non-resetting) delays. Also shown is the hyperbola, Eq. (3), which apparently gives an inferior fit to these data – although this database is too limited to make secure generalizations. For unsignaled delayed reinforcement, at least in this case, the exponential gradients are, as predicted, competitive with the more traditional hyperbolic gradients. Fig. 4 illustrates Lattal's generalization that “The unsignaled delay gradient is characterized by [generally] lower response rates and a steeper slope than the gradient obtained with otherwise equivalent signaled delays” (Lattal, 2010). Whereas Fig. 4 usefully compares the effects of signaled and unsignaled delays,
because the unsignaled delays were nonresetting, the actually experienced delays were variable and less, by an unspecified amount, than the abscissae. A better test of the sufficiency of Eq. (1) comes from Wilkenfield et al. (1992), using resetting delays, where the abscissae provide accurate representations of the experienced delays. These investigators reported the response rates during acquisition of lever pressing from four groups of rats, nine in each group. Their data from the first 100 of acquisition are shown in Fig. 5. Again, the exponential provides a plausible model.

The simple hyperbolic model has been shown adequate for most discount functions for non-verbal animals (Green and Myerson, 2004; Ong and White, 2004; Green et al., 2004). But unlike its cousin the hyperbola, which is ad hoc, Eq. (2) has some theoretical motivation: it predicts radical changes in preference as a function of the nature and continuity of the stimuli that bridge the delay between response and BSE, and holds out the promise for quantifying those effects. It is consistent with the important role of conditioned reinforcers in preference for delayed outcomes (Williams and Dunn, 1991), and provides a useful refinement to a unified theory of choice (Killeen and Fantino, 1990). In the latter theory, and its precedent (Killeen, 1982a,b), the control by a delayed reinforcer was modeled as the sum of both the primary (i.e., point association with the response; Eq. (1)) and secondary (i.e., terminal link cues; essentially Eq. (2)) reinforcement effects. Eq. (1), and a similar logic for the association of streams of responses with reinforcement, is the heart of the model of
coupling in my theory of schedule effects, MPR (Killeen, 1994).

The presence of stimuli occurring between a response and BSE may not always be beneficial to conditioning the response. Brief stimuli occurring immediately after a response (marking it) may make the response more memorable when the BSE occurs (Lieberman et al., 1979; Thomas et al., 1983) – perhaps by increasing the value of $c$. Alternatively, such stimuli may initiate adjunctive behavior that serves as an extended conditioned stimulus (CS) (Schaal and Branch, 1988). Conversely, brief stimuli occurring just before reinforcement may block control by the response–reinforcer association (Pearce and Hall, 1978). Williams (1999) and Reed and Doughty (2005) demonstrated the power of both effects in the same experiments. Whether the effects of primary and secondary reinforcement add or interfere depends on the correlation of each of the contingencies with the behavior measured by the experimenter: a CS whose presentation is not contingent on behavior will only adventitiously strengthen the target response, and, depending on temporal variables, is as likely to compete with it; furthermore, one which signals non-contingent reinforcement will compete with concurrent instrumental responses (Miczek and Grossman, 1971). A CS presented on the instrumental operandum can enhance response rate, whereas one presented on a different operandum can compete with it (Schwartz, 1976).

As the duration of a marking stimulus extends into the delay interval, integration of Eq. (2) between its endpoints predicts a positively accelerating effectiveness of the stimulus. Schaal and Branch (1990) found the predicted increase, but it was negatively accelerated for of the pigeons. The association of a CS or response with the measured behavior will also depend on the modality of the CS, the modality of the response (Timberlake and Lucas, 1990), and the contingencies that make the correlation tight or weak (Killeen and Bizo, 1998). For the present
argument, these correlations of response and CS with the experimenter's dependent variable are carried by the constant $c$.

The effects of delay on choice

To apply Eq. (2) to experiments in which an animal is choosing between delayed reinforcers of different magnitudes ($a$) requires a scale that maps amount into reinforcing effectiveness. Perhaps the simplest “utility” function for reinforcement amount is the power function, which is the form assumed in the generalized matching law (Rachlin, 1971; Killeen, 1972; Baum, 1979). It has the advantage of simplicity, and fits most of the available data over its limited range. A disadvantage is that it has the effectiveness of reinforcement growing without bound as the amount is increased, which is implausible. Rachlin has derived other forms for utility from first principles (1992); his logarithmic, and my (1985) exponential-integral can also accommodate data, as can Bradshaw and associates' hyperbolic discounting of amount (Bezzina et al., 2007). However, the equations look simpler if we adopt the formalism of the generalized matching law in which the reinforcing power of amount is the power function, $u(a) = a^{\alpha}$. Then the associative strength of a response immediately followed by a stimulus change, and $d$ later a BSE of physical magnitude $a$, is the product of the impact of the BSE, $a^{\alpha}$, on the sum of the primary $s'_d$ and secondary $\bar{s}_d$ effects. Assuming for parsimony that in the cases analysed the relative salience of stimulus elements and responses are comparable, then $c_{primary} \approx c_{secondary} = c$, and:

$$s_{d,a} = a^{\alpha}\, c \left[ e^{-d/\tau} + \frac{\tau\,(1 - e^{-d/\tau})}{d} \right] \tag{4}$$

Fig. 6. Data from an experiment by Green and associates (2004) in which the amount delivered to pigeons immediately (1/2 s delay) was adjusted to indifference with that given after the delay noted on the x axis. The parameter is the magnitude of the delayed reinforcer. The curves are drawn by Eqs. (5) and (6).

3.1 Methods of adjustment

Psychophysical paradigms in which variables are adjusted to cause
indifference in preferences or other judgments – “Matching paradigms” (Farell and Pelli, 1999) – are more secure of interpretation than those involving a psychological scale, such as one of value (Hand, 2004; Uttal, 2000). Their units are physical measurements, and they refer to a unique psychological point, that of equivalence. This may be determined whether the underlying scale is interval, ordinal, or even nominal.

How great must an amount $a_1$ be to balance a different amount $a_2$ at a different delay? Set

$$a_1^{\alpha}\, c \left[ e^{-d_1/\tau} + \frac{\tau\,(1 - e^{-d_1/\tau})}{d_1} \right] = a_2^{\alpha}\, c \left[ e^{-d_2/\tau} + \frac{\tau\,(1 - e^{-d_2/\tau})}{d_2} \right]$$

and solve for $a_1$:

$$a_1 = a_2 \left[ \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{1/\alpha} \tag{5}$$

Eq. (5) gives the relative equivalent value of amount $a_2$ delayed $d_2$, compared to an alternative delayed $d_1$. Typically, $d_1$ is “immediate” – that is, around 1/2 s, and then Eq. (5) gives the relative immediate equivalent amount. With $d_2 > d_1$, this ratio will be less than 1, indicating that a smaller immediate amount, relative to $a_2$, suffices to balance the latter at a remove of $d_2$. Note that neither amount appears in the right hand side; no magnitude effect is predicted: as long as the ratio of delays is the same, the predictions are the same when both amounts are multiplied by a constant. In general, no magnitude effect is found in delay discounting experiments with non-human animals (Green et al., 2004; Ong and White, 2004). Fig. 6 shows the course of Eq. (5), with $\alpha = 1.26$ and $\tau = 2.12$ s, passing near the average data from four pigeons in an experiment where the amount delivered after 1/2 s was adjusted to maintain indifference between it and a larger amount (given by the parameter in the figure) delivered at a delay. The primary and conditioned reinforcing effects are highly correlated; Eq. (5) may be simplified by deleting the primary influence of the reinforcers on the choice responses, to yield:

$$a_1 = a_2 \left[ \frac{d_1\,(1 - e^{-d_2/\tau})}{d_2\,(1 - e^{-d_1/\tau})} \right]^{1/\alpha}, \tag{6}$$

which draws the continuous curve through the data in Fig. 6. But the primary and
secondary effects may be dissociated, and when they are, alternatives with both are preferred to those with just primary reinforcement (Marcattilio and Richards, 1981; Lattal, 1984). The hyperbolic approximation to Eq. (6) provides a decent fit to these data as well, but falls noticeably farther from their average than Eqs. (5) and (6).

In some matching experiments, the delay to one outcome is adjusted, rather than the amount. Eq. (7) yields no simple prediction, but invoking the series expansion of the exponential term² that was used in going from Eq. (2) to Eq. (3):

$$\bar{s}_{d_i,a_i} \approx \frac{a_i^{\alpha}\, c}{1 + d_i/\tau},$$

leads to the simple linear relation of Eq. (7):

$$d_2 = d_1 \left(\frac{a_2}{a_1}\right)^{\alpha} + \tau \left[ \left(\frac{a_2}{a_1}\right)^{\alpha} - 1 \right], \quad d_i > 0 \tag{7}$$

Operations that increase the sensitivity to reinforcement (increase $\alpha$) or flatten the gradient (increase $\tau$) will increase the indifference point, $d_2$. The provenance of the effect can be determined by manipulating $d_1$, as the former will increase both slope and intercept, and the latter only intercept. Some drugs, such as stimulants, may decrease $\alpha$ while increasing $\tau$ (Maguire et al., 2009; Pitts and Febbo, 2004), and their results will thus vary as a function of the balance between the two, largely determined by the value of $d_1$. A linear equation such as (7), based on multiplicative hyperbolic functions of amount and delay, was proposed and validated by Mazur (2001), and independently by Bradshaw's group (Ho et al., 1999; Bezzina et al., 2007; da Costa Araújo et al., 2009). In Bradshaw's model, as in Eq. (7), the slope depends on relative payoffs regulated by the amount amplifier parameter $\alpha$, and the intercept on a multiplicative function of that and delay sensitivity. Their model has also been applied to human delay discounting (Hinvest and Anderson, 2010; Liang et al., 2010).

3.2 Methods of forced choice

An alternative psychophysical procedure involves the measurement of the degree of preference between two fixed alternatives, or the frequency of choosing one over the other. Eq. (4) may be rearranged to predict the outcome of choice experiments in which the delays and outcomes are invariant. The relative associative strength of the alternatives is:

$$\frac{s_{d_1,a_1}}{s_{d_1,a_1} + s_{d_2,a_2}} = \left[ 1 + \left(\frac{a_2}{a_1}\right)^{\alpha} \frac{e^{-d_2/\tau} + \tau\,(1 - e^{-d_2/\tau})/d_2}{e^{-d_1/\tau} + \tau\,(1 - e^{-d_1/\tau})/d_1} \right]^{-1} \tag{8}$$

In the case of unbiased choice there are two free parameters, the rate of diminishing marginal utility for larger amounts, $\alpha$, and the time constant of the memory trace, $\tau$. Note that amounts again appear as a ratio, indicating scale invariance: there is no magnitude effect. Fig. 7 shows this model follows a path similar to the data of Fox et al. (2008), who asked whether rat models of ADHD (SHRs) would show steeper delay gradients than control (WKY) rats. They did. Other investigators (Adriani et al., 2003) did not find steeper gradients for SHR, but observed very large individual differences. As noted by Orduña et al. (2007), the main effect found by Fox and associates may be due to idiosyncrasies of their control rats (Sagvolden et al., 2009).

Fig. 7. Data from the experiment of Fox et al. (2008) studying relative choice of pellets delayed vs immediate in two strains of rats, with curves drawn by Eq. (8). Here “immediate” is set at 1/2 s.

Discussion

Prospective judgments of equivalent amounts by humans, typical in the delay-discounting literature, require computations that are different in kind from those of paradigms in which real delays are conditioned to discriminative stimuli. Humans can be instructed to contemplate the desirability of ten thousand dollars in ten years, and to stipulate how little they would settle for one week hence in lieu of it. The performance entails a scale of future time, the value of an
outcome deferred by that delay, and concatenation of the non-linear time-scale with a non-linear amount scale, from which a variety of results are imaginable (Killeen, 2009; Rachlin, 2006). Little wonder that there are differences in covering models. The only way to so instruct other animals is to expose them to such realities repeatedly. The assertion in the opening of this paper that the future cannot act on non-verbal animals was meant to emphasize this difference: on the one hand verbally presented unexperienced hypotheticals that can control human responses, and on the other the conditioning of behavior reinforced by the presentation of conditioned reinforcers signaling real, experienced, delays, that controls pigeon and rat behavior.

This paper should be read as a grounding of hyperbolic models of delay discounting, not a critique of them. It presented a few ideas. First, it is observed that Fig. 1 is not a model of a process. It is a summary of some other kind of process, such as the one proposed in Fig. The distinction is important, as thinking of Fig. 1 as a process can be misleading. I am not alone in this concern:

In this [Fig. 1] view, reinforcers reach back in time to effect this response in the presence of the remembered stimulus. As a model of how an animal adapts to, or learns about, situations with stimulus–behavior delays and response–reinforcer delays, the model has the problem of reinforcer effects spreading backward in time. Physiologically, the process cannot act in this way, and physiology must require that the memory of an event flows forward in time, rather than the reinforcer effect flowing backwards. But the response-centric view is the dominant view in the study of delayed reinforcers and of self control.

A simpler, much more likely, and physiologically consistent conceptualization of the adaptation to these delays is shown in [Fig. 2]. In this view, at the point at which a reinforcer is delivered, it is the conjunction of the memories of both the stimulus
and the response at the time of reinforcer delivery that is “strengthened” and, I presume, remembered and subsequently accessed and used. This approach suggests a different, and more parsimonious, mechanism for learning and activity that is squarely based on memory. When reinforcers are delayed, it is the residual memory of responses times the value of the reinforcers that will describe the effects of reinforcer delay on behavior. When responses are delayed following stimuli, it is the residual memory of the stimulus times the value of the reinforcer that will describe the stimulus–reinforcer conjunction, providing a role for stimulus–reinforcer relations (as in momentum theory) (Davison, 2006).

The present paper constitutes simply the endorsement of the first paragraph and one realization of the second paragraph.

Another idea expressed in this paper was that the hyperbola might be secondary to conditioning processes, not primary. One colleague felt that such grounding is unnecessary, as the hyperbola is justified by its ubiquitous accuracy in characterizing the ‘discounting’ data; that the rationale supporting hyperbolic discounting does not rely on the validity or even plausibility of any internal mechanism. Rather, it relies on its predictive ability on its own level, the overt behavior of the whole organism, and its applicability in the real world. So, why all the above talk about associations, decaying traces, and assignment of credit?
Because, I plead, it puts some meat on the bones, holds out a hand of translation to AI reinforcement theorists, and turns ‘round a figure got backward. But, chacun à son goût.

A third idea expressed in this paper is that simple processes of decay (Eq. (1)) and average decay (Eq. (2)) represent behavioral processes that are void of cognitive representations. That is not the case for human delay discounting, as the vast majority (though not all) of the data from it involve hypothetical amounts and delays that are communicated verbally, and have never been and will never be experienced by the individual. The present treatment is thoroughly behavioral. The use of mathematics to represent the conditioning processes has been misunderstood by some colleagues as asserting that the animals must perform such computations. That is true in the same sense that a rope suspended at two points evaluates a catenary equation. The calculations of pigeon and rope, such as they are, are embodied, not computed; the mathematical representation derives from the scientist, not from the thing he or she uses it to describe.

The final idea is the importance of the distinction between different mensuration paradigms. The matching paradigm, some of whose results are displayed in Fig. 6, is different in kind from the forced-choice/preference paradigm, some of whose results are displayed in Fig. 7. Why should an animal who prefers alternative A to alternative B not always choose A; but rather choose it, say, only 70% of the time?
It does not suffice to say “because it matches”, which offers a result in the guise of an explanation. To decline the thing you prefer, you must have balancing considerations, such as cost, or novelty; or be confused; or be irrational. Mazur (2010) has shown that in the simple forced-choice paradigm non-exclusive preference may be due to experimental designs that confuse the animal. That possibility is exacerbated in the concurrent-chain version of the forced-choice paradigm. Sub-exclusive preference there occurs not because the other 30% of the time the animal prefers B (how often would you choose $30 over $70, once the pleasure of thwarting the experimenter has paled?), but because the contingencies of reinforcement have made the probability of getting B sufficiently greater at that point in time, primed and awaiting collection, with the preferred A never any closer.³

³ On random-interval VI schedules with mean m, the probability of reinforcement on the same key one second after the last peck is always 1/m, whereas on the other it increases toward 1 as 1 − e^(−t/m), with t the time since the last changeover.

The way in which probabilities on concurrent schedules bend preference from rational exclusivity toward matching was nicely demonstrated by Crowley and Donahoe (2004). But these evolving probabilities are typically treated as externals, measured (e.g., Boutros et al., 2009; Davison and Baum, 2003), analysed (MacDonall, 2000, 2005), and modeled (e.g., Grace et al., 2006) in their own right. Unfortunately, that research seldom changes the interpretation of relative rates as prima facie measures of preference. The dynamically evolving probabilities that concurrent VIs schedule are an intrinsic part of the package the animal must dynamically balance, not a neutral tool to measure it. When the negative feedback inherent in those schedules is eliminated in adjustment paradigms where confusion is minimized, animals just about always choose one outcome over the other, as elegantly
demonstrated by Mazur (2010).

Magnitude, delay, and probability of reinforcement interact to control choice in concurrent schedules (Elliffe et al., 2008). Some interaction is allowed by Eq. (8) due to its many nonlinearities, giving more weight to delay differentials as both delays increase. But Ito and Asaki (1982) found substantial monotonic increases in rats’ preference for the larger vs. the smaller number of pellets as the equal delays to their receipt increased. Ong and White (2004) noted other instances of this effect, and attributed it to increased sensitivity to reinforcer amount when reinforcers are delayed. But it is not clear how that is anything other than a magnitude effect, and thus at odds with the results from matching (adjustment) paradigms. Whether due to discrimination failure in simple forced choice, or negative feedback contingencies in concurrent-chain schedules, non-exclusive preferences are an uncertain metric of what animals value. The application of Eqs. (5)–(7) for matching paradigms is therefore offered with more confidence than Eq. (8) for concurrent-chain interval schedules, which require a more complex model, such as that of Christensen and Grace (2010).

Acknowledgements

I thank Tim Cheung and Ryan Brackney for comments, Robert Kessel for insisting on mathematical precision, and Tony Nevin for insisting on conceptual clarity as well; and I thank all for helping to show me how to achieve those desiderata. The remaining significant deviations are mine.

References

Adriani, W., Caprioli, A., Granstrem, O., Carli, M., Laviola, G., 2003. The spontaneously hypertensive-rat as an animal model of ADHD: evidence for impulsive and nonimpulsive subpopulations. Neurosci. Biobehav. Rev. 27, 639–651.
Baum, W.M., 1979. Matching, undermatching, and overmatching in studies of choice. J. Exp. Anal. Behav. 32, 269–281.
Baum, W.M., 2005. Understanding Behaviorism: Behavior, Culture, and Evolution. Blackwell, Malden, MA, p. 312.
Bezzina, G., Cheung, T.H.C., Asgari, K., Hampson, C.L., Body, S., Bradshaw, C.M., Szabadi, E.,
Deakin, J.F.W., Anderson, I.M., 2007. Effects of quinolinic acid-induced lesions of the nucleus accumbens core on inter-temporal choice: a quantitative analysis. Psychopharmacology 195, 71–84.
Boutros, N., Elliffe, D., Davison, M., 2009. Time versus response indices affect conclusions about preference pulses. Behav. Processes 84, 450–454.
Bower, G.H., 1994. A turning point in mathematical learning theory. Psychol. Rev. 101, 290–300.
Brannon, E.M., Wusthoff, C.J., Gallistel, C.R., Gibbon, J., 2001. Numerical subtraction in the pigeon: evidence for a linear subjective number scale. Psychol. Sci. 12, 238–243.
Christensen, D.R., Grace, R.C., 2010. A decision model for steady-state choice in concurrent chains. J. Exp. Anal. Behav. 94, 227–240.
Crowley, M.A., Donahoe, J.W., 2004. Matching: its acquisition and generalization. J. Exp. Anal. Behav. 82, 143–159.
da Costa Araújo, S., Body, S., Hampson, C.L., Langley, R.W., Deakin, J.F.W., Anderson, I.M., Bradshaw, C.M., Szabadi, E., 2009. Effects of lesions of the nucleus accumbens core on inter-temporal choice: further observations with an adjusting-delay procedure. Behav. Brain Res. 202, 272–277.
Davison, M., 2006. Behavior-centric versus reinforcer-centric descriptions of behavior. PsyCrit 12 (November), 1–3.
Davison, M., Baum, W.M., 2003. Every reinforcer counts: reinforcer magnitude and local preference. J. Exp. Anal. Behav. 80, 95–129.
Elliffe, D., Davison, M., Landon, J., 2008. Relative reinforcer rates and magnitudes do not control concurrent choice independently. J. Exp. Anal. Behav. 90, 169–185.
Escobar, R., Bruner, C.A., 2007. Response induction during the acquisition and maintenance of lever pressing with delayed reinforcement. J. Exp. Anal. Behav. 88, 29–49.
Estes, W.K., 1950. Toward a statistical theory of learning. Psychol. Rev. 57, 94–107.
Estes, W.K., Suppes, P., 1974. Foundations of stimulus sampling theory. In: Contemporary Developments in Mathematical Psychology.
Farell, B., Pelli, D.G., 1999. Psychophysical methods, or how to measure a threshold and why. In: Carpenter, R.H.S.,
Robson, J.G. (Eds.), Vision Research: A Practical Guide to Laboratory Methods. Oxford Univ. Press, New York.
Fox, A.T., Hand, D.J., Reilly, M.P., 2008. Impulsive choice in a rodent model of attention-deficit/hyperactivity disorder. Behav. Brain Res. 187, 146–152.
Grace, R.C., Berg, M.E., Kyonka, E.G.E., 2006. Choice and timing in concurrent chains: effects of initial-link duration. Behav. Processes 71, 188–200.
Green, L., Myerson, J., 2004. A discounting framework for choice with delayed and probabilistic rewards. Psychol. Bull. 130, 769–792.
Green, L., Myerson, J., Holt, D.D., Slevin, J.R., Estle, S.J., 2004. Discounting of delayed food rewards in pigeons and rats: is there a magnitude effect?
J. Exp. Anal. Behav. 81, 39–50.
Hand, D.J., 2004. Measurement Theory and Practice. Oxford University Press, Inc., New York, p. 320.
Hinvest, N.S., Anderson, I.M., 2010. The effects of real versus hypothetical reward on delay and probability discounting. Q. J. Exp. Psychol. 63, 1072–1084.
Ho, M.Y., Mobini, S., Chiang, T.J., Bradshaw, C.M., Szabadi, E., 1999. Theory and method in the quantitative analysis of “impulsive choice” behaviour: implications for psychopharmacology. Psychopharmacology 146, 362–372.
Horney, J., Fantino, E., 1984. Choice for conditioned reinforcers in the signaled absence of primary reinforcement. J. Exp. Anal. Behav. 41, 193–201.
Ito, M., Asaki, K., 1982. Choice behavior of rats in a concurrent-chains schedule: amount and delay of reinforcement. J. Exp. Anal. Behav. 37, 383–392.
Jenkins, H.M., Boakes, R.A., 1973. Observing stimulus sources that signal food or no food. J. Exp. Anal. Behav. 20, 197–207.
Johansen, E.B., Killeen, P.R., Russell, V.A., Tripp, G., Wickens, J.R., Tannock, R., Williams, J., Sagvolden, T., 2009. Origins of altered reinforcement effects in ADHD. Behav. Brain Funct. 5.
Killeen, P.R., 1972. The matching law. J. Exp. Anal. Behav. 17, 489–495.
Killeen, P.R., 1982a. Incentive theory. In: Bernstein, D.J. (Ed.), Nebraska Symposium on Motivation, vol. 1981. Response Structure and Organization, University of Nebraska Press, Lincoln.
Killeen, P.R., 1982b. Incentive theory II: models for choice. J. Exp. Anal. Behav. 38, 217–232.
Killeen, P.R., 1985. Incentive theory IV: magnitude of reward. J. Exp. Anal. Behav. 43, 407–417.
Killeen, P.R., 1994. Mathematical principles of reinforcement. Behav. Brain Sci. 17, 105–172.
Killeen, P.R., 2001a. Modeling games from the 20th century. Behav. Processes 54, 33–52.
Killeen, P.R., 2001b. Writing and overwriting short-term memory. Psychon. Bull. Rev. 8, 18–43.
Killeen, P.R., 2005. Gradus ad parnassum: ascending strength gradients or descending memory traces?
Behav. Brain Sci. 28, 432–434.
Killeen, P.R., 2009. An additive-utility model of delay discounting. Psychol. Rev. 116, 602–619.
Killeen, P.R., Bizo, L.A., 1998. The mechanics of reinforcement. Psychon. Bull. Rev., 221–238.
Killeen, P.R., Fantino, E., 1990. A unified theory of choice. J. Exp. Anal. Behav. 53, 189–200.
Lattal, K.A., 1984. Signal functions in delayed reinforcement. J. Exp. Anal. Behav. 42, 239–253.
Lattal, K.A., 2010. Delayed reinforcement of operant behavior. J. Exp. Anal. Behav. 93, 129–139.
Liang, C.H., Ho, M.Y., Yang, Y.Y., Tsai, C.T., 2010. Testing the applicability of a multiplicative hyperbolic model of inter-temporal and risky choice in human volunteers. Chin. J. Psychol. 52, 189–204.
Lieberman, D.A., McIntosh, D.C., Thomas, G.V., 1979. Learning when reward is delayed: a marking hypothesis. J. Exp. Psychol. Anim. Behav. Process. 5, 224–242.
MacDonall, J.S., 2000. Synthesizing concurrent interval performances. J. Exp. Anal. Behav. 74, 189–206.
MacDonall, J.S., 2005. Earning and obtaining reinforcers under concurrent interval scheduling. J. Exp. Anal. Behav. 84, 167–183.
Maguire, D.R., Rodewald, A.M., Hughes, C.E., Pitts, R.C., 2009. Rapid acquisition of preference in concurrent schedules: effects of D-amphetamine on sensitivity to reinforcement amount. Behav. Processes 81, 238–243.
Marcattilio, A.J.M., Richards, R.W., 1981. Preference for signaled versus unsignaled reinforcement delay in concurrent-chain schedules. J. Exp. Anal. Behav. 36, 221–229.
Mazur, J.E., 2001. Hyperbolic value addition and general models of animal choice. Psychol. Rev. 108, 96–112.
Mazur, J.E., 2010. Distributed versus exclusive preference in discrete-trial choice. J. Exp. Psychol. Anim. Behav. Process. 36, 321–333.
Miczek, K.A., Grossman, S.P., 1971. Positive conditioned suppression: effects of CS duration. J. Exp. Anal. Behav. 15, 243–247.
Moore, J., Fantino, E., 1975. Choice and response contingencies. J. Exp. Anal. Behav. 23, 339–347.
Neimark, E.D., Estes, W.K., 1967. Stimulus Sampling Theory. Holden-Day, San Francisco.
Nickerson, R.S., 2009. Mathematical Reasoning:
Patterns, Problems, Conjectures, and Proofs. Psychology Press, London.
Niv, Y., Joel, D., Meilijson, I., Ruppin, E., 2002. Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adapt. Behav. 10, 5–24.
Ong, E.L., White, K.G., 2004. Amount-dependent temporal discounting? Behav. Processes 66, 201–212.
Orduña, V., Hong, E., Bouzas, A., 2007. Interval bisection in spontaneously hypertensive rats. Behav. Processes 74, 107–111.
Pearce, J.M., Hall, G., 1978. Overshadowing the instrumental conditioning of a lever-press response by a more valid predictor of the reinforcer. J. Exp. Psychol. Anim. Behav. Process. 4, 356–367.
Pitts, R.C., Febbo, S.M., 2004. Quantitative analyses of methamphetamine’s effects on self-control choices: implications for elucidating behavioral mechanisms of drug action. Behav. Processes 66, 213–233.
Rachlin, H., 1971. On the tautology of the matching law. J. Exp. Anal. Behav. 15, 249–251.
Rachlin, H., 1992. Diminishing marginal value as delay discounting. J. Exp. Anal. Behav. 57, 407–415.
Rachlin, H., 2006. Notes on discounting. J. Exp. Anal. Behav. 85, 425–435.
Reed, P., Doughty, A.H., 2005. Within-subject testing of the signaled-reinforcement effect on operant responding as measured by response rate and resistance to change. J. Exp. Anal. Behav. 83, 31–45.
Reilly, M.P., Posadas-Sanchez, D., Kettle, L.C., Killeen, P.R., 2011. Making the trip worthwhile: rats (Rattus norvegicus) and pigeons (Columba livia) forage prospectively? Behav. Processes, in review.
Richards, R.W., 1981. A comparison of signaled and unsignaled delay of reinforcement. J. Exp. Anal. Behav. 35, 145–152.
Roberts, W.A., Feeney, M.C., 2009. The comparative study of mental time travel. Trends Cogn. Sci. 13, 271–277.
Roberts, W.A., Feeney, M.C., McMillan, N., MacPherson, K., Musolino, E., Petter, M., 2009. Do pigeons (Columba livia) study for a test?
J. Exp. Psychol. Anim. Behav. Process. 35, 129–142.
Sagvolden, T., Johansen, E.B., Wøien, G., Walaas, S.I., Storm-Mathisen, J., Bergersen, L.H., Hvalby, Ø., Jensen, V., Aase, H., Russell, V.A., Killeen, P.R., DasBanerjee, T., Middleton, F.A., Faraone, S.V., 2009. The spontaneously hypertensive rat model of ADHD – the importance of selecting the appropriate reference strain. Neuropharmacology 57, 619–626.
Sanabria, F., Killeen, P.R., 2007. Temporal generalization accounts for response resurgence in the peak procedure. Behav. Processes 74, 126–141.
Schaal, D.W., Branch, M.N., 1988. Responding of pigeons under variable-interval schedules of unsignaled, briefly signaled, and completely signaled delays to reinforcement. J. Exp. Anal. Behav. 50, 33–54.
Schaal, D.W., Branch, M.N., 1990. Responding of pigeons under variable-interval schedules of signaled-delayed reinforcement: effects of delay-signal duration. J. Exp. Anal. Behav. 53, 103–121.
Schwartz, B., 1976. Positive and negative conditioned suppression in the pigeon: effects of the locus and modality of the CS. Learn. Motiv. 7, 86–100.
Shettleworth, S.J., Plowright, C., 1989. Time horizons of pigeons on a two-armed bandit. Anim. Behav. 37, 610–623.
Singh, S.P., Sutton, R.S., 1996. Reinforcement learning with replacing eligibility traces. Mach. Learn. 22, 123–158.
Sutton, R.S., Barto, A.G., 1990. Time-derivative models of Pavlovian reinforcement. In: Gabriel, M., Moore, J. (Eds.), Learning and Computational Neuroscience: Foundations of Adaptive Networks. MIT Press, Cambridge, MA.
Thomas, G.V., Lieberman, D.A., McIntosh, D.C., Ronaldson, P., 1983. The role of marking when reward is delayed. J. Exp. Psychol. Anim. Behav. Process. 9, 401–411.
Timberlake, W., Lucas, G.A., 1990. Behavior systems and learning: from misbehavior to general principles. In: Klein, S.B., Mowrer, R.R. (Eds.), Contemporary Learning Theories: Instrumental Conditioning Theory and the Impact of Constraints on Learning. Erlbaum, Hillsdale, NJ.
Uttal, W.R., 2000. The War Between Mentalism and Behaviorism: On the
Accessibility of Mental Processes. Lawrence Erlbaum Associates, Inc., Mahwah, NJ.
Uttal, W.R., 2008. Time, Space, and Number in Physics and Psychology. Sloan Publishing, Cornwall-on-Hudson, NY.
Wilkenfield, J., Nickel, M., Blakely, E., Poling, A., 1992. Acquisition of lever-press responding in rats with delayed reinforcement: a comparison of three procedures. J. Exp. Anal. Behav. 58, 431–443.
Williams, B.A., 1999. Associative competition in operant conditioning: blocking the response–reinforcer association. Psychon. Bull. Rev. 6, 618–623.
Williams, B.A., Dunn, R., 1991. Preference for conditioned reinforcement. J. Exp. Anal. Behav. 55, 37–46.
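
As a rough numerical illustration of the decay (Eq. (1)) and average-decay (Eq. (2)) processes referred to in the discussion, the following Python sketch assumes that the trace has the form e^(−λt) and that the conditioned value of a stimulus of duration d is the mean of that trace over the stimulus. Those forms are a reading of the abstract, not a restatement of the paper's equations. The function names, the rate λ = 0.2 per second, and the comparison hyperbola 1/(1 + λd/2) are hypothetical choices made only for illustration; they are not fitted parameters from the paper.

```python
import math


def trace(t, lam):
    """Exponential trace strength t seconds after the event (assumed form of Eq. 1)."""
    return math.exp(-lam * t)


def average_trace(d, lam):
    """Mean of the exponential trace over a stimulus of duration d (assumed form of Eq. 2).

    Closed form of (1/d) * integral from 0 to d of exp(-lam * t) dt.
    """
    if d == 0:
        return 1.0  # limiting value as d -> 0
    return (1.0 - math.exp(-lam * d)) / (lam * d)


def hyperbola(d, k):
    """Simple hyperbolic gradient, included only for comparison."""
    return 1.0 / (1.0 + k * d)


if __name__ == "__main__":
    lam = 0.2  # hypothetical decay rate per second
    print("d(s)  trace  average  hyperbola")
    for d in (0, 1, 2, 5, 10, 20, 40):
        print(f"{d:4d}  {trace(d, lam):.3f}  {average_trace(d, lam):.3f}    "
              f"{hyperbola(d, lam / 2):.3f}")
```

Under these assumptions the averaged trace lies above the raw exponential at every positive duration and falls off roughly as 1/(λd) at long durations, which is the qualitative sense in which averaging an exponential yields a gradient closer to a hyperbola.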

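
The footnoted contrast between staying and switching on concurrent random-interval schedules can be tabulated in a few lines. This sketch takes the two expressions as read from the footnote (1/m per second on the key just pecked, and 1 − e^(−t/m) on the other key after t seconds away from it); the function names and the mean interval m = 30 s are hypothetical.

```python
import math


def p_same_key(m):
    """Probability of reinforcement on the current key one second after the last peck."""
    return 1.0 / m


def p_other_key(t, m):
    """Probability that a reinforcer is waiting on the other key t seconds after changeover."""
    return 1.0 - math.exp(-t / m)


if __name__ == "__main__":
    m = 30.0  # hypothetical mean interval in seconds
    print("t(s)  same key  other key")
    for t in (1, 5, 10, 30, 60, 120):
        print(f"{t:4d}  {p_same_key(m):.3f}     {p_other_key(t, m):.3f}")
```

The table makes the negative feedback visible: the longer the animal stays on one key, the more likely it is that a reinforcer is primed on the other, whatever the animal's preference between the two outcomes.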