Variability is an important ingredient in learning

Luca Onnis and Morten H. Christiansen
Department of Psychology, Cornell University, USA

Nick Chater
Department of Psychology, University College London, UK

Rebecca Gómez
Department of Psychology, The University of Arizona at Tucson, USA

June 12, 2006
Word count: 10,628 (body)

*Corresponding author: Cornell University, Dept. of Psychology, 245 Uris Hall, Ithaca, NY 14850. Email: lo35@cornell.edu

Abstract

An important aspect of language acquisition involves learning nonadjacent dependencies between words, such as subject/verb agreement for tense or number in English. Although infants and adults can track adjacent relations in order to infer the structure of sequential stimuli, the evidence is not conclusive for learning nonadjacent items. In four experiments, we provide evidence that discovering nonadjacent dependencies is possible, and that it is modulated by the variability of the material intervening between items. Detection of nonadjacencies improves significantly with either zero or large variability, and in such conditions is independent of the embedded material. In addition, variability-mediated learning also applies to abstract visual stimuli, suggesting that similar statistical computations are carried out in different sensory modalities. Learning nonadjacencies is clearly modulated by the statistical properties of the input, although, crucially, the obtained U-shaped learning curve cannot be explained by current associative mechanisms of artificial grammar learning.

That human learners may discover structural dependencies in sequences of stimuli through probabilistic relationships inherent in the input has long been proposed as a potentially important mechanism in language development. In the 1950s, Harris proposed a number of procedures that capitalized on the distributional properties of relevant linguistic entities (such as phonemes and morphemes) to uncover structural information about languages (Harris, 1955). A simple distributional procedure in English, for instance, would detect that articles precede nouns and not vice versa (the window, but not *window the). Harris also proposed that words that occur within similar contexts are semantically similar (Harris, 1968).

Interest in the statistics of language was also great outside linguistics, and spawned fundamental discoveries in information theory. Shannon (1950) derived several important insights from the frequency analysis of single letters (the letter 'e' is by far the most frequent in the English alphabet) and of pairs and triples of letters. He developed encryption systems during World War II based on the statistical structure of sequential information, setting the stage for his pioneering work on the mathematical theory of cryptography and information theory.

At the time that Harris was developing his ideas on the statistics of language, Miller (1958) and Reber (1967) began investigating the processes by which learners respond to the statistical properties of miniature artificial grammars. These early experimental studies – now collectively called artificial grammar learning (AGL) – were taken by some to show that adults become sensitive, after limited and often purely incidental exposure, to sequential structure. Although conducted with adults, the studies were motivated by the desire to gain insight into how children might learn their first language. Thus, the idea that statistical and distributional properties of the input could assist in the discovery of the structures of language has historically been taken as a serious scientific endeavor.
In the decades following the first artificial grammar experiments, efforts to employ statistical and distributional analysis withered in the face of a series of theoretical linguistic arguments (e.g., Chomsky, 1957) and interpretations of mathematical results on the unlearnability of certain classes of languages (Gold, 1967). Consequently, the AGL approach largely abandoned the study of language learning in favor of topics such as implicit learning (e.g., Reber, 1993; Reed & Johnson, 1994). In particular, Chomsky – who had been a student of Harris – postulated that the constituent structure at the core of linguistic knowledge cannot be learned from surface-level associations by any means (e.g., Chomsky, 1957; cf. Lashley, 1951, for a similar criticism leveled independently in the area of sequential action). Chomsky pointed out that key relationships between words and constituents are conveyed in nonadjacent (or remotely connected) structure, as opposed to adjacent associations. In English, for instance, arbitrary amounts of linguistic material may intervene between auxiliaries and inflectional morphemes (e.g., is cooking, has traveled) or between subject nouns and verbs in number agreement, which may be separated by embedded sentences (e.g., the books that sit on the shelf are dusty).

The presence of embedding and nonadjacent dependencies in language represented a point of difficulty for early associationist approaches, which mainly focused on discovering adjacent relations. For instance, a distributional mechanism computing solely neighboring information would parse the sentence *the books that sit on the shelf is dusty as grammatical, because the noun nearest to the verb "is" ("shelf") is singular. A similar case, *the book that sits on the shelves are dusty, suffers from the same problem. Such limitations cast doubt on the usefulness of the statistical properties of the input for discovering structure in language, and contributed to a paradigm shift in studies of language acquisition in the late 1950s and 1960s.

In this paper we re-evaluate the possibility that nonadjacent dependency learning may be mediated by the statistical properties of sequential stimuli. Recently, after some thirty years, there has been a revived interest in so-called statistical learning, as researchers have begun to investigate empirically how infants might identify aspects of linguistic structure. Because much of this research has focused on the learning of adjacent linguistic elements, such as syllables in words, we know little about the conditions under which nonadjacencies may be acquired. The aim of this paper is to explore the learning of nonadjacent dependencies and to discuss the theoretical implications of these results for acquiring language, as well as sequential structure more generally.

Learning nonadjacencies with artificial languages

The use of artificial grammars to tap into mechanisms of language acquisition has recently resurged in language acquisition research under the guise of artificial language learning (e.g., Curtin, Mintz, & Christiansen, 2005; Gómez & Gerken, 1999; Saffran, Aslin, & Newport, 1996; see Gómez & Gerken, 2000, and Saffran, 2003, for reviews). Typically, artificial languages differ from artificial grammars in that they aim to mimic the learnability of certain properties of real natural languages that the experimenter desires to test. Conversely, artificial grammars tend to focus more on abstract sequences of shapes or letters, with less relevance to linguistic structures.
As such, an artificial language typically uses sequences of pseudo-words such as pel wadim jic, presented auditorily, while a typical example of an artificial grammar is MXVXXMX, where an arbitrary set of rules decides the order of randomly selected letters (Reber, 1967). However, both types of experiments tap into the type of knowledge that learners may acquire after exposure to a limited number of examples from the language (Perruchet & Pacton, 2006).

Much of the research so far points to a quick and robust ability of infants and adults to track statistical regularities among adjacent elements, e.g., syllables (Saffran et al., 1996; Saffran, Newport, & Aslin, 1996). These studies suggest that there is a strong natural bias to make immediate statistical computations among adjacent properties of stimuli. Given this bias, tracking nonadjacent probabilities, at least in uncued streams of syllables, has proven elusive (Newport & Aslin, 2004; Onnis, Monaghan, Chater, & Richmond, 2005; Peña, Bonatti, Nespor, & Mehler, 2002).

Gómez (2002; see also Gómez & Maye, 2005) proposed that learners exposed to several sources of information (including adjacent and nonadjacent probabilities) may default to the easier-to-process adjacent probabilities. If, in particular conditions, adjacent information is not useful, then nonadjacent information may become informative, and thus more salient. To test this hypothesis, Gómez exposed infant and adult participants to sentences of an artificial language of the form AXB. The language contained three families of nonadjacent pairs, notably A1_B1, A2_B2, and A3_B3. She manipulated the variability of the middle element X in four conditions by systematically increasing the pool from which the middle element could be drawn (2, 6, 12, or 24 word-like elements), while the number and frequency of Ai_Bi pairs were kept constant across conditions. In the test phase, participants were required to discriminate correct nonadjacent dependencies (e.g., A1XB1) from incorrect ones (*A1XB2). Correct and incorrect sentences differed only in the relation between initial and final elements; the same set of X elements occurred in both grammars, and X elements varied freely without restriction, resulting in identical relations between adjacent AiX and XBi elements in the two types of sentences. Because the sentences are identical with respect to the absolute position of elements and adjacent dependencies, they can only be distinguished by noting the relation between the nonadjacent first and third elements. Therefore, tracking first-order adjacent transitional probabilities P(X|Ai) and P(Bi|X) (Saffran et al., 1996) would not lead to distinguishing correct from incorrect sequences. However, learners might detect the nonadjacencies by tracking trigram information (Perruchet & Pacteau, 1990). In this scenario, the number of unique sentences (trigrams) increased from 6 to 72 as the variability of X increased, and the frequency of each trigram decreased correspondingly from 72 to 6 repetitions. Given plausible memory and processing constraints, if learners were trying to learn trigrams they should be better at learning the AiXBi grammar in conditions of small variability. Surprisingly, Gómez found that learners were significantly better at learning when the variability of X elements was highest, at 24. These results are counterintuitive. They rule out n-gram information, suggesting that learners computed the transitional probabilities of nonadjacent dependencies, P(Bi|Ai_).
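To make this contrast concrete, here is a minimal sketch (ours, not code or analysis from Gómez, 2002; the set-size of 3 and the corpus construction are illustrative) that computes adjacent and nonadjacent transitional probabilities over a miniature AiXBi corpus, showing that bigram statistics are identical for correct strings and foils while the nonadjacent probability separates them:

```python
from collections import Counter
from itertools import product

# Miniature A_i X B_i language in the style of Gomez (2002).
# A/B items are the stimuli named in the Materials; set-size 3 is illustrative.
A = ["pel", "vot", "dak"]
B = ["rud", "jic", "tood"]          # L1 pairings: pel_rud, vot_jic, dak_tood
X = ["wadim", "kicey", "puser"]     # middle pool, set-size 3 for brevity

# Exposure corpus: all L1 strings A_i X B_i (foils would pair A_i with B_j).
L1 = [(a, x, b) for (a, b), x in product(zip(A, B), X)]

bigrams = Counter()
nonadj = Counter()
for a, x, b in L1:
    bigrams[(a, x)] += 1
    bigrams[(x, b)] += 1
    nonadj[(a, b)] += 1

def p_adj(first, second):
    """P(second | first) over adjacent positions."""
    total = sum(n for (f, _), n in bigrams.items() if f == first)
    return bigrams[(first, second)] / total

def p_nonadj(a, b):
    """P(b | a _) over the nonadjacent frame."""
    total = sum(n for (f, _), n in nonadj.items() if f == a)
    return nonadj[(a, b)] / total

# Adjacent statistics are the same for grammatical and foil test strings:
print(p_adj("pel", "wadim"), p_adj("wadim", "rud"))  # 1/3, 1/3 (pel wadim rud)
print(p_adj("pel", "wadim"), p_adj("wadim", "jic"))  # 1/3, 1/3 (*pel wadim jic)
# Only the nonadjacent dependency discriminates:
print(p_nonadj("pel", "rud"))   # 1.0 (grammatical)
print(p_nonadj("pel", "jic"))   # 0.0 (ungrammatical)
```

Because every X follows every Ai and precedes every Bi, any learner restricted to adjacent statistics sees the two test grammars as identical; only P(Bi|Ai_) distinguishes them.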
What is difficult to explain is the mechanism by which the variability of the intervening Xs was beneficial to such computations. Indeed, by most accounts high variability should result in increased noise and thus decreased learning. Gómez proposed that during her task learners were initially focused on adjacent probabilities, and that nonadjacent information (for which P(Bi|Ai_) = 1, with the frequency of Ai_Bi kept constant across conditions) would become relatively more salient only as the probabilities of adjacent information decreased, i.e., as the set-size of X elements increased. Thus, high variability in the large set-size condition acted beneficially to increase the salience of the nonadjacent elements compared to the middle elements, and facilitated learning. Gómez proposed that learning involves a tendency to seek out invariant structure (structure remaining constant across varying contexts; E. J. Gibson, 1969; J. J. Gibson, 1966). Therefore, if the statistical probability of the preferred (adjacent) structure decreases sufficiently, learners should begin to seek out other forms of information (see also Gómez, in press). Consistent with this argument, infants and adults in Gómez (2002) appeared to be focusing on different types of dependencies as a function of their statistical properties.

The variability insight is the starting point of our investigations. We argue that another type of variability could potentially trigger nonadjacent computations, namely when there is no variability at all in the intervening X elements. If the principle of seeking invariant structure in the input is correct, then intuitively, in the case of zero variability, learners should perceive the invariant X element as potentially uninformative, allowing them to focus on the nonadjacent computations P(Bi|Ai_). This hypothesis is tested in Experiment 1, which complements the results of Gómez and shows a counterintuitive U-shape of nonadjacent learning mediated by the variability of the embedded middle elements. Experiments 2 and 3 further confirm that nonadjacent learning is indeed modulated by variability and, crucially, is independent of the embedded items. Experiment 4 tests whether learners carry out similar variability-modulated computations with visual stimuli, suggesting that such computations are not limited to language-like stimuli. This last experiment speaks directly to a debate on the existence of similar sequencing principles involved in language and other cognitive sequential tasks. The overall pattern of results in Experiments 1-4 points to a U-shape, and in the discussion section we dwell extensively on how current associative mechanisms based on n-gram information cannot account for this pattern. The ensuing discussion should be of relevance to researchers in both implicit learning and language acquisition, and suggests a reunification of the two literatures, which were one early on (Miller, 1958; Reber, 1967) and subsequently drifted apart into different areas of cognitive psychology, as briefly discussed in the introduction (see also Perruchet & Pacton, 2006).
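The invariant-structure account suggests a rough quantitative reading of this U-shape. As a crude illustration (ours, not an analysis from the paper), one can compare the entropy of the middle slot with the entropy of the nonadjacent frame slot across set-sizes; the invariant structure should stand out only when one slot is clearly more variable than the other:

```python
import math

# Crude illustration of the "invariant structure stands out" intuition
# (our sketch, not an analysis from the paper). Both slots are assumed
# to be uniformly distributed, so H = log2(number of alternatives).
FRAMES = 3  # A1_B1, A2_B2, A3_B3: always three, uniformly frequent

for set_size in (1, 2, 6, 12, 24):
    h_middle = math.log2(set_size)  # variability of the middle slot
    h_frame = math.log2(FRAMES)     # variability of the frame slot (constant)
    contrast = abs(h_middle - h_frame)
    print(f"set-size {set_size:2}: H(middle) = {h_middle:.2f} bits, "
          f"H(frame) = {h_frame:.2f} bits, contrast = {contrast:.2f}")
# The contrast is large at the extremes (set-sizes 1 and 24) and smallest
# around set-size 2, where frame and middle slots are roughly equally
# variable and neither stands out as invariant.
```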
Experiment 1: Testing the zero variability hypothesis

Initial evidence that the nonadjacent dependencies in an AiXBi language would be learned with high variability of the intervening X elements comes from examples of natural language. There is a peculiar asymmetry in language such that sequences of morphemes often involve some high-frequency items belonging to a relatively small set (such as am, the, -ing, -s, are) interspersed with less frequent items belonging to a very large set (e.g., nouns, verbs, adjectives, adverbs). Gómez (2002) noted that this asymmetry translates into patterns of highly invariant nonadjacent items separated by highly variable material (am cooking, am working, am going, etc.), potentially making function morphemes more detectable (see also Valian & Coulson, 1988). Several important grammatical relations emerge in this way, notably auxiliaries and inflectional morphemes (e.g., am cooking, has traveled) as well as dependencies in number agreement (the books on the shelf are dusty). Thus, to an extent the AiXBi language reflects some structures in natural languages. Examples of a second type of variability also exist in natural languages, namely cases in which the nonadjacencies are variable with respect to a single fixed embedding: several different nonadjacent relations can be interspersed with the same material (e.g., am cooking, has cooked). In Experiment 1 we mimicked this condition with an AiXBi grammar, by exploring what happens when the variability between the end-item pairs and the middle items is reversed in the input.

Gómez attributed the poor results in the small set-sizes to low variability: in these conditions both the nonadjacent dependencies and the middle items vary, but neither considerably more than the other. This may confuse learners, in that it is not clear which structure is invariant. Conversely, with larger set-sizes the middle items are considerably more variable than the first-last item pairings, making the nonadjacent pairs stand out as invariant. We asked what happens when variability in the middle position is eliminated, thus making the nonadjacent items variable and the X item invariant. We replicated Gómez's experiment with adults, adding a new condition – the zero variability condition – in which there is only one middle element (i.e., A1X1B1, A2X1B2, and A3X1B3). Our prediction is that the invariance of the middle item will make the end-items stand out, and make detection of the appropriate nonadjacent relationships easier. The final predicted picture is a U-shaped learning curve in detecting nonadjacent dependencies, consistent with the idea that learning is a flexible and adaptive process.

Method

Participants

Sixty undergraduate and postgraduate students at the University of Warwick participated and were paid £3 each.

Materials

In the training phase participants listened to auditory strings generated by one of two artificial languages (L1 or L2). Strings in L1 had the form A1XB1, A2XB2, and A3XB3; L2 strings had the form A1XB2, A2XB3, and A3XB1. Variability was manipulated in five conditions, by drawing X from a pool of 1, 2, 6, 12, or 24 elements. The strings, recorded by a female voice, were the same that Gómez used in her study and were originally chosen as tokens among several recorded sample strings in order to eliminate talker-induced differences in individual strings. The elements A1, A2, and A3 were instantiated as pel, vot, and dak; B1, B2, and B3 were instantiated as rud, jic, and tood. The 24 X middle items were: wadim, kicey, puser, fengle, coomo, loga, gople, taspu, hiftam, deecha, vamey, skiger, benez, gensim, feenam, laeljeen, chla, roosa, plizet, balip, malsig, suleb, nilbo, and wiffle. Following the design of Gómez (2002), the group of 12 middle elements was drawn from the first 12 words in the list, the set of 6 from the first 6, the set of 2 from the first 2, and the set of 1 from the first word. Three strings in each language were common to all five groups, and these were used as test stimuli. The three L2 items served as foils for the L1 condition and vice versa. In Gómez (2002) there were six test sentences generated by each language, because the smallest set-size had 2 middle items, resulting in 12 test items. To keep the number of test items equal to Gómez's, we presented the test stimuli twice, in two blocks, randomizing within blocks for each participant. Words were separated by 250-ms pauses and strings by 750-ms pauses.

Procedure

Six participants were recruited in each of the five set-size conditions (1, 2, 6, 12, 24) and for each of the two language conditions (L1, L2), resulting in 12 participants per set-size. Learners were asked to listen and pay close attention to sentences of an invented language, and they were told that there would be a series of simple questions relating to the sentences after the listening phase. They were not informed of the existence of rules to be discovered. During training, participants in all conditions listened to the same overall number of strings (a total of 432). This way, frequency of exposure to the nonadjacent dependencies was held constant across conditions. For instance, participants in set-size 24 heard six iterations of each of 72 string types (3 dependencies x 24 middle items), participants in set-size 12 encountered each string twice as often as those exposed to set-size 24, and so forth. Whereas the frequency of the nonadjacent dependencies was held constant (each repeated 144 times across conditions), transitional probabilities decreased as set-size increased. Training lasted 18 minutes, and training trials were presented in three blocks separated by a break. Before the test, participants were told that the sentences they had heard were generated according to a set of rules, and that they would now hear 12 strings, of which 6 would violate the rules. They were asked to press "Y" on a keyboard if they thought a sentence followed the rules and to press "N" otherwise.
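As a check on the exposure arithmetic described above, the following sketch (our reconstruction of the design, not code from the study; variable names are ours) rebuilds the five middle-item pools and tabulates string types, repetitions, per-dependency frequency, and the adjacent transitional probability P(X|Ai) in each condition:

```python
# Reconstruction of the Experiment 1 design described above.
A = ["pel", "vot", "dak"]
B = ["rud", "jic", "tood"]      # L1 pairings: pel_rud, vot_jic, dak_tood
X24 = ["wadim", "kicey", "puser", "fengle", "coomo", "loga",
       "gople", "taspu", "hiftam", "deecha", "vamey", "skiger",
       "benez", "gensim", "feenam", "laeljeen", "chla", "roosa",
       "plizet", "balip", "malsig", "suleb", "nilbo", "wiffle"]

# Each smaller pool is a prefix of the 24-item list.
pools = {size: X24[:size] for size in (1, 2, 6, 12, 24)}

TOTAL_TOKENS = 432              # training strings heard in every condition

for size, xs in pools.items():
    types = len(A) * size            # distinct strings: 3 dependencies x size
    reps = TOTAL_TOKENS // types     # tokens per string type
    per_dep = reps * size            # tokens per A_i_B_i dependency
    p_adj = 1 / size                 # P(X | A_i), with X drawn uniformly
    print(f"set-size {size:2}: {types:2} types x {reps:3} reps; "
          f"{per_dep} per dependency; P(X|Ai) = {p_adj:.3f}")
# Output matches the figures in the text: e.g. set-size 24 gives 72 types
# at 6 repetitions each, and every condition gives 144 tokens per
# dependency, while P(Bi|Ai_) remains 1.0 throughout. Only the adjacent
# statistics are diluted as the middle-item pool grows.
```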
Results and Discussion

We measured total accuracy scores in endorsing grammatical strings and rejecting ungrammatical strings. An analysis of variance with Variability (set-size 1, 2, 6, 12, 24) and Language (L1 vs. L2) as between-subjects variables resulted in a main effect of Variability, F(4, 50) = 14.17, p
