Implicit learning of non-adjacent dependencies: A graded, associative account

Luca Onnis, Arnaud Destrebecqz, Morten H. Christiansen, Nick Chater, & Axel Cleeremans
Nanyang Technological University / Université Libre de Bruxelles / Cornell University / University of Warwick / Université Libre de Bruxelles

Language and other higher cognitive functions require structured sequential behavior, including non-adjacent relations. A fundamental question in cognitive science is what computational machinery can support both the learning and representation of such non-adjacencies, and what properties of the input facilitate such processes. Learning experiments using miniature languages with adults and infants have demonstrated the impact of high variability (Gómez, 2002) as well as nil variability (Onnis, Christiansen, Chater, & Gómez, 2003; submitted) of intermediate elements on the learning of non-adjacent dependencies. Intriguingly, current associative measures cannot explain this U-shaped curve. In this chapter, extensive computer simulations using five different connectionist architectures reveal that Simple Recurrent Networks (SRNs) best capture the behavioral data, by superimposing local and distant information over their internal 'mental' states. These results provide the first mechanistic account of implicit associative learning of non-adjacent dependencies modulated by distributional properties of the input. We conclude that implicit statistical learning might be more powerful than previously anticipated.

Most routine actions that we perform daily, such as preparing to go to work, making a cup of coffee, calling up a friend, or speaking, are performed without apparent effort and yet all involve very complex sequential behavior. Perhaps the most apparent example of sequential behavior, one that we have tirelessly performed since we were children, involves speaking and listening to our fellow humans. Given the relative ease with which children acquire these skills, the complexity of learning sequential behavior may go unseen: at first sight, producing a sentence merely consists of establishing a chain of links between each speech motor action and the next, a simple addition of one word to the next. However, this characterization falls short of one important property of structured sequences. In language, for instance, many syntactic relations such as verb agreement hold between words that may be several words apart, as for instance in the sentence The dog that chased the cats is playful, where the number of the auxiliary is depends on the number of the non-adjacent subject dog, not on the nearer noun cats. The presence of these non-adjacent dependencies in sequential patterns poses a serious conundrum for learning-based theories of language acquisition and sequence processing in general. On the one hand, it appears that children must learn the relationships between words in a specific language by capitalizing on the local properties of the input. In fact, there is increasing empirical evidence that early in infanthood learners become sensitive to such local sequential patterns in the environment: for example, infants can exploit high and low transitional probabilities between adjacent syllables to individuate nonsense words in a stream of unsegmented speech (Saffran, Aslin, & Newport, 1996; Saffran, 2001; Estes, Evans, Alibali, & Saffran, 2007).
Under this characterization, it is possible to learn important relations in language using local information. On the other hand, given the presence of non-adjacent dependencies in language (Chomsky, 1959) as well as in sequential action (Lashley, 1951), associative mechanisms that rely exclusively on adjacent information would appear powerless. For instance, processing an English sentence in a purely local way would result in errors such as *The dog that chased the cats are playful, because the nearest noun to the auxiliary verb are is the plural noun cats. An outstanding question for cognitive science is thus whether it is possible to learn and process serial non-adjacent structure in language and other domains via associative mechanisms alone.

In this paper, we tackle the issue of the implicit learning of linguistic non-adjacencies using a class of associative models, namely connectionist networks. Our starting point is a set of behavioral results on the learning of non-adjacent dependencies initiated by Rebecca Gómez. These results are interesting both because they are counterintuitive and because, to our knowledge, they defy any explicit computational model. Gómez (2002) found that learning non-local Ai_Bi relations in sequences of spoken pseudo-words with structure A X B is a function of the variability of the intervening X items: infants and adults exposed to more word types filling the X category detected the non-adjacent relation between specific Ai and specific Bi words better than learners exposed to a small set of possible X words. In follow-up studies with adult learners, Onnis, Christiansen, Chater, and Gómez (2003; submitted) and Onnis, Monaghan, Christiansen, and Chater (2004) replicated the original Gómez results, and further found that non-adjacencies are also better learned when there is no variability at all of intervening words from the X category. This particular U-shaped learning curve also holds when completely new intervening words are presented at test (e.g., Ai Y Bi), suggesting that learners distinguish non-adjacent relations independently of intervening material, and can generalize their knowledge to novel sentences. In addition, the U shape was replicated using abstract visual shapes, suggesting that similar learning and processing mechanisms may be at play for non-linguistic material presented in a different sensory domain. Crucially, it has been demonstrated that implicit learning of non-adjacent dependencies is significantly correlated with both offline comprehension (Misyak & Christiansen, 2012) and online processing (Misyak, Christiansen, & Tomblin, 2010a, b) of sentences in natural language containing long-distance dependencies.

The above results motivate a reconsideration of the putative mechanisms of non-adjacency learning in two specific directions. First, they suggest that non-adjacency learning may not be an all-or-none phenomenon and can be modulated by specific distributional properties of the input to which learners are exposed. This in turn suggests a role for implicit associative mechanisms, variably described in the literature under terms such as statistical learning, sequential learning, distributional learning, and implicit learning (Perruchet & Pacton, 2006; Frank, Goldwater, Griffiths, & Tenenbaum, 2010). Second, the behavioral U shape results would appear to challenge virtually all current associative models proposed in the literature. In this paper we thus ask whether there is at least one class of implicit associative mechanisms that can capture the behavioral U shape.
This will allow us to understand in more mechanistic terms how the presence of embedded variability facilitates the learning of non-adjacencies, thus filling the current gap in our ability to understand this important phenomenon. Finally, to the extent that our computer simulations can capture the phenomenon without requiring explicit forms of learning, they also provide a proof of concept that implicit learning of non-adjacencies is possible, contributing further to the discussion of what properties of language necessarily need to be learned explicitly.

The plan of the paper is as follows: we first briefly discuss examples of non-adjacent structures in language and review the original experimental studies by Gómez and colleagues, explaining why they challenge associative learning mechanisms. Subsequently we report on a series of simulations using Simple Recurrent Networks (SRNs), because they seem to capture important aspects of serial behavior in language and other domains (Botvinick & Plaut, 2004, 2006; Christiansen & Chater, 1999; Cleeremans, Servan-Schreiber, & McClelland, 1989; Elman, 1991, among others). Further on, we test the robustness of our SRN simulations in an extensive comparison of connectionist architectures and show that only the SRNs capture the human variability results closely. We discuss how this class of connectionist models is able to entertain both local and distant information in graded, superimposed representations on their hidden units, thus providing a plausible implicit associative mechanism for detecting non-adjacencies in sequential learning.

The problem of detecting non-adjacent dependencies in sequential patterns

At a general level, non-adjacent dependencies in sequences are pairs of mutually dependent elements separated by a varying number of embedded elements. We can consider three prototypical cases of non-local constraints (from Servan-Schreiber, Cleeremans, & McClelland, 1991) and ask how an ideal learner could correctly predict the last element (here, a letter) of a sequence, given knowledge of the preceding elements. Consider the three following sequences:

(1) L KPS V versus L KPS M
(2) L KPS V versus P GBP E
(3) L KPS V versus P KPS E

As for (1), it is impossible to predict V versus M correctly because the preceding material "L KPS" is exactly identical. Example (2), on the other hand, is trivial, because the last letter is simply contingent on the penultimate letter ('V' is contingent on 'S' and 'E' is contingent on 'P'). Example (3), the type investigated in Gómez (2002), is more complex: the material 'KPS' preceding 'V' and 'E' does not provide any relevant information for disambiguating the last letter, which is contingent on the initial letter. The problem of maintaining information about the initial item until it becomes relevant is particularly difficult for any local prediction-driven system, when the very same predictions have to be made at each time step in either string for each embedded element, as in (3).

Gómez (2002) noted that many relevant examples of non-local dependencies of type (3) are found in natural languages: they typically involve items belonging to a relatively small set (functor words and morphemes like am, the, -ing, -s, are) interspersed with items belonging to a much larger set (nouns, verbs, adjectives). This asymmetry translates into sequential patterns of highly invariant non-adjacent items separated by highly variable material.
For instance, the present progressive tense in English contains a discontinuous pattern of the type "tensed auxiliary verb + verb stem + -ing suffix" (e.g., am cooking, am working, am going, etc.). This structure is also apparent in number agreement, where information about a subject noun has to be maintained active over a number of irrelevant embedded items before it actually becomes useful when processing the associated main verb. For instance, processing the sentence:

(4) The dog that chased the cats is playful

requires information about the singular subject noun "dog" to be maintained over the relative clause "that chased the cats", to correctly predict that the verb "is" is singular, despite the fact that the subordinate object noun immediately adjacent to it, "cats", is plural. Such cases are problematic for associative learning mechanisms that process local transition probabilities (i.e., from one element to the next) alone, precisely because they can give rise to spurious correlations that would result in erroneously categorizing the following sentence as grammatical:

(5) *The dog that chased the cats are playful

In other words, the embedded material appears to be wholly irrelevant to mastering the non-adjacencies: not only is there an infinite number of possible relative clauses that might separate The dog from is, but also structurally different non-adjacent dependencies might share the very same embedded material, as in (4) above versus

(6) The dogs that chased the cats are playful

Gómez exposed infants and adults to sentences of a miniature language intended to capture such structural properties, namely sentences of the form AiXjBi, instantiated in spoken nonsense words. The language contained three families of non-adjacencies, denoted A1_B1 (pel_rud), A2_B2 (vot_jic), and A3_B3 (dak_tood). The set-size from which the embedded word Xj could be drawn was manipulated in four between-subjects conditions (set-size = 2, 6, 12, or 24; see Figure 1, columns 2–5). At test, participants had to discriminate expressions containing correct non-adjacent dependencies (e.g., A2X1B2, vot vadim jic) from incorrect ones (e.g., *A2X1B1, vot vadim rud). This test thus required fine discriminations to be made, because even though incorrect sentences were novel three-word sequences (or trigrams), both single-word and two-word (bigram) sequences (namely, A2X1, X1B2, X1B1) had appeared in the training phase. In addition, because the same embeddings appeared in all three pairs of non-adjacencies with equal frequency, all bigrams had the same frequency within a given set-size condition. In particular, the transitional probability of any B word given the middle word X was the same, for instance, P(jic|vadim) = P(rud|vadim) = .33, and so it was not possible to predict the correct grammatical string based on knowledge of adjacent transitional probabilities alone.
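To make this distributional point concrete, the short Python sketch below generates the grammatical AiXjBi sentence types for a given set-size and estimates the adjacent transitional probabilities. The word forms, helper names, and the use of sentence types rather than the actual 432 training tokens are illustrative assumptions, not part of the original materials; the point is simply that P(B|X) comes out at .33 in every condition, whereas the non-adjacent mapping from Ai to Bi is deterministic:

    import itertools
    from collections import Counter

    # Hypothetical word forms standing in for the spoken nonsense words.
    A_WORDS = ["pel", "vot", "dak"]       # A1, A2, A3
    B_WORDS = ["rud", "jic", "tood"]      # B1, B2, B3 (Ai always ends with Bi)

    def make_language(set_size):
        """All grammatical AiXjBi sentence types for a given embedding set-size."""
        x_words = [f"x{j}" for j in range(1, set_size + 1)]   # placeholder X forms
        return [(a, x, b) for (a, b), x in itertools.product(zip(A_WORDS, B_WORDS), x_words)]

    def adjacent_tp(sentences):
        """P(next word | current word) estimated from adjacent bigram counts."""
        bigrams, firsts = Counter(), Counter()
        for a, x, b in sentences:
            for w1, w2 in ((a, x), (x, b)):
                bigrams[(w1, w2)] += 1
                firsts[w1] += 1
        return {pair: n / firsts[pair[0]] for pair, n in bigrams.items()}

    for set_size in (1, 2, 6, 12, 24):
        tp = adjacent_tp(make_language(set_size))
        # Any B is equally likely after any X, so adjacent statistics cannot
        # distinguish vot-vadim-jic from *vot-vadim-rud; only the non-adjacent
        # A_B frame, which is deterministic, can.
        print(set_size, round(tp[("x1", "jic")], 2), round(tp[("x1", "rud")], 2))

Repeating each sentence type many times, as in the 432-string training sets, leaves these conditional probabilities unchanged; only P(Xj|Ai) drops as the set-size grows (see Table 1 below).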
Gómez hypothesized that if adjacent transitional probabilities were made weaker, the non-adjacent invariant frame Ai_Bi might stand out as invariant. This should happen when the set-size of the embeddings is larger, hence predicting better learning of the non-adjacent dependencies under conditions of high embedding variability. Her results supported this hypothesis: participants performed significantly better when the set-size of the embedding was largest, i.e., 24 items. An initial verbal interpretation of these findings by Gómez (2002) was that learners detect the non-adjacent dependencies when these become invariant enough with respect to the varying embedded X words. This interpretation thus suggests that, while learners are indeed attuned to distributional properties of the local environment, they also learn which source of information is most likely to be useful, in this case adjacent or non-adjacent dependencies. Gómez proposed that learners may capitalize on the most statistically reliable source of information in an attempt to reduce uncertainty about the input (Gómez, 2002). In the context of sequences of items generated by artificial grammars, the cognitive system's relative sensitivity to the information contained in bigrams, trigrams, or long-distance dependencies may therefore hinge upon the statistical properties of the specific environment that is being sampled.

In follow-up studies, Onnis et al. (2003; submitted) were able to replicate Gómez's experiment with adults, and added a new condition in which there is only one middle element (A1X1B1, A2X1B2, A3X1B3; see Figure 1, column 1). Under this condition, variability in the middle position is simply eliminated, making the X element invariable and the A_B non-adjacent items variable. Onnis et al. found that this flip in what changes versus what stays constant again resulted in successful learning of the non-adjacent contingencies. Interestingly, learning in Onnis et al.'s set-size 1 condition does not seem to be attributable to a different mechanism involving rote learning of whole sentences. In a control experiment, learners were required to learn not three but six non-adjacent dependencies and one X, thus equating the number of unique sentences to be learned to those in set-size 2, in which learning was poor. The logic behind the control was that if learners relied on memorization of whole sentences in both conditions, they should fail to learn the six non-adjacent dependencies in the control set-size. Instead, Onnis et al. found that learners had little problem learning the six non-adjacencies, despite the fact that the control language was more complex (13 different words and six unique dependencies to be learned) than the language of set-size 1 (7 words and three dependencies). This control thus ruled out a process of learning based on mere memorization and suggested that the invariability of X was responsible for the successful learning. A further experiment showed that learners endorsed the correct non-adjacencies even when presented with completely new words at test. For instance, they were able to distinguish A1Y1B1 from A1Y1B2, suggesting that the process of learning non-adjacencies leads to correct generalization to novel sentences. In yet another experiment, they replicated the U shape and generalization findings with visually presented pseudo-shapes.

Taken together, Gómez's and Onnis et al.'s results indicate that learning is best either when there are many possible intervening elements or when there is just one such element, with considerably degraded performance for conditions of intermediate variability (Figure 2). For the sake of simplicity, from here on we collectively refer to all the above results as the 'U shape results'. Before moving to our new set of connectionist simulations, the next section evaluates whether current associative measures of implicit learning can predict the U shape results.
Figure 1. The miniature grammars used by Gómez (2002; columns 2–5) and Onnis et al. (2003; submitted; columns 1–5). Sentences with three non-adjacent dependencies are constructed with an increasing number of syntagmatically intervening X items (|X| = 1, 2, 6, 12, or 24). Gómez used set-sizes 2, 6, 12, and 24; Onnis et al. added the new set-size 1 condition.

Figure 2. Data from Onnis et al. (2003; submitted) incorporating the original Gómez experiment. Learning of non-adjacent dependencies results in a U-shaped curve as a function of the variability of intervening items, in five conditions of increasing variability (1, 2, 6, 12, 24; percent correct between 50% and 100%).

Candidate measures of associative learning

There exist several putative associative mechanisms of artificial grammar and sequence learning, based either on knowledge of fragments and chunks (e.g., Dulany et al., 1984; Perruchet & Pacteau, 1990; Servan-Schreiber & Anderson, 1990) or on learning of whole items (Vokey & Brooks, 1992). Essentially these models propose that subjects acquire knowledge of fragments, chunks, or whole items from the training strings, and that they base their subsequent judgments of correctness (grammaticality) of a new set of sequences on an assessment of the extent to which the test strings are similar to the training strings (e.g., how many chunks a test item shares with the training strings). To find out how well these associative models would fare in accounting for Gómez's and Onnis et al.'s data, we considered a variety of existing measures of chunk strength and of the similarity between training and test exemplars. Based on the existing literature, we considered the following measures: Global Associative Chunk Strength (GCS), Anchor Chunk Strength (ACS), Novelty, Novel Fragment Position (NFP), and Global Similarity (GS), in relation to the data in Experiments 1 and 2 of Onnis et al. These measures are described in detail in Appendix A. Table 1 summarizes descriptive fragment statistics, while the values of each associative measure are reported in Table 2.

Table 1. Descriptive fragment statistics for the bigrams and trigrams contained in the artificial grammar used in Gómez (2002), Experiment 1, and in Onnis et al. (submitted). Note that Experiment 1 of Onnis et al. is a replication of Gómez's (2002) Experiment 1.

    Variability condition               1       1-cntrl   2       6       12      24
    Total number of training strings    432     432       432     432     432     432
    Ai_Bi pair types                    3       6         3       3       3       3
    Ai_Bi pair tokens                   144     72        144     144     144     144
    Xj types                            1       1         2       6       12      24
    Xj tokens                           432     432       216     72      36      18
    AiXjBi types                        3       6         6       18      36      72
    AiXjBi tokens                       144     72        72      24      12      6
    Type/token ratio (AXB)              0.02    0.08      0.08    0.75    3.00    12.00
    AiXj tokens                         144     72        72      24      12      6
    XjBi tokens                         144     72        72      24      12      6
    P(Xj|Ai)                            1.00    1.00      0.50    0.17    0.08    0.04
    P(Bi|Xj)                            0.33    0.16      0.33    0.33    0.33    0.33

Table 2. Predictors of chunk strength and similarity used in the AGL literature (Global Chunk Strength, Anchor Chunk Strength, Novelty, Novel Fragment Position, Global Similarity). Scores refer to the bigrams and trigrams contained in the artificial grammar used in Gómez (2002), Experiment 1, and Onnis et al. (submitted).

    Variability condition                  1      1-cntrl   2      6      12     24
    GCS/ACS for grammatical strings        144    72        72     24     12     6
    GCS/ACS for ungrammatical strings      96     48        48     16     8      4
    Novelty for grammatical strings        0      0         0      0      0      0
    Novelty for ungrammatical strings      1      1         1      1      1      1
    NFP for grammatical strings            0      0         0      0      0      0
    NFP for ungrammatical strings          0      0         0      0      0      0
    GS for grammatical strings             0      0         0      0      0      0
    GS for ungrammatical strings           1      1         1      1      1      1

The condition of null variability (set-size 1) is the only condition that can a priori be accommodated by measures of associative strength.
For this reason, the set-size 1 control was run in Experiment 2. Table 2 shows that the associative measures are the same for the set-size 1 control and set-size 2. However, since performance was significantly better in the set-size 1 control, the above associative measures cannot predict this difference. Overall, since the Novelty, Novel Fragment Position, and Global Similarity values are constant across conditions, they predict that learners would fare equally in all conditions and, to the extent that ungrammatical items were never seen as whole strings during training, that grammatical strings would be easier to recognize across conditions. Taken together, the predictors based on strength and similarity predict either equal performance across conditions or better performance when the set-size of embeddings is small, because the co-occurrence strength of adjacent elements is stronger. Hence, none of these implicit learning measures predicts the observed U shape results. In the next section, we investigate whether connectionist networks can do better, and whether any particular network architecture is best.

Simulation 1: Simple recurrent networks

We have seen that no existing chunk-based model derived from the implicit learning literature appears to capture the U-shaped pattern of performance exhibited by human subjects when trained under conditions of differing variability. Would connectionist models fare better in accounting for these data? One plausible candidate is the Simple Recurrent Network model (SRN; Elman, 1990), because it has been applied successfully to model human sequential behavior in a wide variety of tasks including everyday routine performance (Botvinick & Plaut, 2004), dynamic decision making (Gibson, Fichman, & Plaut, 1997), cognitive development (Munakata, McClelland, & Siegler, 1997), implicit learning (Kinder & Shanks, 2001; Servan-Schreiber, Cleeremans, & McClelland, 1991), and the high-variability condition of the Gómez (2002) non-adjacency learning paradigm (Misyak et al., 2010b). SRNs have also been applied to language processing, including spoken word comprehension and production (Christiansen, Allen, & Seidenberg, 1998; Cottrell & Plunkett, 1995; Dell, Juliano, & Govindjee, 1993; Gaskell, Hare, & Marslen-Wilson, 1995; Plaut & Kello, 1999), sentence processing (Allen & Seidenberg, 1999; Christiansen & Chater, 1999; Christiansen & MacDonald, 2009; Rohde & Plaut, 1999), sentence generation (Takac, Benuskova, & Knott, 2012), lexical semantics (Moss, Hare, Day, & Tyler, 1994), reading (Pacton, Perruchet, Fayol, & Cleeremans, 2001), hierarchical structure (Hinoshita, Arie, Tani, Okuno, & Ogata, 2011), nested and cross-serial dependencies (Kirov & Frank, 2012), grammar and recursion (Miikkulainen & Mayberry III, 1999; Tabor, 2011), phrase and syntactic parsing (Socher, Manning, & Ng, 2010), and syntactic systematicity (Brakel & Frank, 2009; Farkaš & Crocker, 2008; Frank, in press). In addition, recurrent neural networks effectively solve a variety of linguistic engineering problems such as automatic voice recognition (Si, Xu, Zhang, Pan, & Yan, 2012), word recognition (Frinken, Fischer, Manmatha, & Bunke, 2012), text generation (Sutskever, Martens, & Hinton, 2011), and recognition of sign language (Maraqa, Al-Zboun, Dhyabat, & Zitar, 2012). Thus these networks are potentially apt at modeling the difficult task of learning non-adjacencies in the AXB artificial language discussed above.
In particular, SRNs (Figure 3a) are appealing because they come equipped with a pool of context units that represent the temporal context by holding a copy of the hidden units' activation levels at the previous time slice. In addition, they can maintain simultaneous, overlapping, graded representations for different types of knowledge. The gradedness of representations may in fact be the key to learning non-adjacencies. The specific challenge for SRNs in this paper is to show that they can represent graded knowledge of bigrams, trigrams, and non-adjacencies, and that the strength of each such representation is modulated by the variability of embeddings in a similar way to humans. To find out whether associative learning mechanisms can explain the variability effect, we trained SRNs to predict each element of sequences that were structurally identical to Gómez's material. The choice of the SRN architecture, as opposed to a simple feed-forward network, is motivated by the need to simulate the training and test procedure used by Gómez and Onnis et al., who exposed their participants to auditory stimuli, one word at a time. The SRN captures this temporal aspect.
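As a concrete illustration of this training setup, the sketch below implements a minimal Elman-style SRN in Python/NumPy, with localist (one-hot) word coding, a context layer that copies the previous hidden state, and an error-driven weight update at each time slice. The layer sizes, learning rate, sigmoid/softmax choices, and the word coding are assumptions made for the sketch, not the exact parameters of the simulations reported in this chapter:

    import numpy as np

    rng = np.random.default_rng(0)

    class SimpleRecurrentNetwork:
        """Elman-style SRN: context units hold a copy of the previous hidden state."""
        def __init__(self, n_words, n_hidden=10, lr=0.1):
            self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_words))    # input   -> hidden
            self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
            self.W_out = rng.normal(0.0, 0.1, (n_words, n_hidden))   # hidden  -> output
            self.lr = lr
            self.reset()

        def reset(self):
            """Clear the context units (e.g. at the start of a new sequence)."""
            self.context = np.zeros(self.W_ctx.shape[0])

        def step(self, x):
            """Process one word (one-hot vector x) and predict the next word."""
            net_in = self.W_in @ x + self.W_ctx @ self.context
            self.hidden = 1.0 / (1.0 + np.exp(-net_in))                # sigmoid hidden units
            z = self.W_out @ self.hidden
            self.probs = np.exp(z - z.max()); self.probs /= self.probs.sum()   # softmax
            return self.probs

        def learn(self, x, target):
            """One gradient step on the prediction error at this time slice (no BPTT),
            then copy the hidden state into the context units for the next slice."""
            d_out = self.probs - target                                # softmax + cross-entropy
            d_hid = (self.W_out.T @ d_out) * self.hidden * (1.0 - self.hidden)
            self.W_out -= self.lr * np.outer(d_out, self.hidden)
            self.W_in -= self.lr * np.outer(d_hid, x)
            self.W_ctx -= self.lr * np.outer(d_hid, self.context)
            self.context = self.hidden.copy()

    def one_hot(i, n):
        v = np.zeros(n); v[i] = 1.0; return v

    # Toy usage with a hypothetical coding for set-size 2: words 0-2 are A1-A3,
    # 3-4 are X1-X2, 5-7 are B1-B3, and 8 is an end-of-sentence (EOS) marker.
    n_words = 9
    net = SimpleRecurrentNetwork(n_words)
    sentence = [0, 3, 5, 8]                      # A1 X1 B1 EOS
    net.reset()
    for cur, nxt in zip(sentence, sentence[1:]):
        net.step(one_hot(cur, n_words))
        net.learn(one_hot(cur, n_words), one_hot(nxt, n_words))

Training consists of looping this word-by-word prediction over many randomly ordered sentences; endorsement of non-adjacencies can then be read off by comparing the output activation of the correct Bi with that of the competing B words when the Xj is presented.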
Figure. The regions of the space inhabited by the connectionist architectures, plotted as z-scores of Luce ratio differences (condition 1 minus the average of conditions 2, 6, and 12, against condition 24 minus the same average; panels compare the SRN, Buffer, Jordan, and auto-associator (AA) architectures with the human data, HD). Only SRNs group in the upper-right quadrant, where the human data from Onnis et al. (2003; submitted) are located.

Mechanisms of implicit learning in the SRN

Simulation 1 gathered evidence that SRNs trained to predict each element of sequences identical to those used in Gómez (2002) and Onnis et al. (2003; submitted) can master non-adjacencies in a manner that depends on the variability of the intervening material, thus replicating the empirically observed U-shaped relationship between variability and classification performance. The specific interest of the U shape results for this study lies in the fact that no mechanism proposed in the implicit learning literature can readily simulate the human data. To the extent that SRNs are also associative machines, their successful results are also surprising. In this section we attempt to understand how SRNs succeed in learning non-adjacencies.

The key to understanding the SRN's behavior is its ability to represent in its hidden units graded and overlapping representations for both the current stimulus–response mapping and any previous context information. Hidden units adjust at each step of processing and can be thought of as a compressed and context-dependent "re-representation" of the current step in the task. Given, for instance, a network with 10 hidden units, the internal representation of this network can be seen as a point in a 10-dimensional space. As training progresses, the network's representation changes, and a trajectory is traced through the 10-dimensional space. Multidimensional Scaling (MDS) is a technique that reduces an n-dimensional space to a 2-dimensional space of relevant dimensions, and thus allows the visualization of this learning trajectory in a network as a function of training (Figures 7, 8, and 9). In order to predict three different Bi endings correctly, the network has to develop trajectories that are separate enough at the time that an Xj is presented (see also Botvinick & Plaut, 2004).

For the sake of the argument, let us first consider a simpler scenario in which the artificial language is composed of only two items per sentence, i.e., it is an XjBi language. When the input is an X, the hidden units must be shaped so as to predict one of three B elements. This task still requires some considerable learning, because the net has to activate an output node out of all the possible items in the language, including the Xs. What specific B will they predict? The hidden units are modified by both (a) a trace for each of the Xs from the input units at time t, and (b) the EOS (End of Sentence) marker from the context units (this was the information at time t-1). In this case, given that this past information is exactly identical whatever the prediction of B, the hidden unit representations will be similar regardless of any specific Bi continuation. There is therefore absolutely no information in the past items that can help the hidden units to develop separate trajectories for B1, B2, and B3, and the best error reduction is obtained by activating the nodes corresponding to the three Bs with an activation of 0.33 (corresponding to an even probability of predicting one of three elements).

Let us now imagine the scenario of our simulations, in which the language is an AiXBi language. Here the past information that shapes the hidden units is (a) a trace from one of several Xs from the input units at time t; (b) a trace from one of three As at time t-1 from the context units, which is specific to each B prediction; and (c) a trace from the previous EOS (End of Sentence) marker, which has been incorporated in the previous time steps at t-2 and which is the same for all B predictions. The past context for predicting a specific Bi is now partially different, because there is a specific correspondence between an Ai and a Bi in the language. In this scenario the hidden units may develop different trajectories, and thus be able to successfully predict different B continuations. What is the best condition for such dissimilarity?
With low variability of Xs, the traces from each shared X overshadow the traces from the A elements, so that the networks form very similar representations for predicting the B elements. Figure 7 presents the two principal components of a Multidimensional Scaling (MDS) analysis over the SRN hidden units in the set-size 2 condition, at the time of predicting the B element, over 15 different points in training (Footnote 5). Hidden unit trajectories move across training, but they do not separate by the end of training. Contrast this result with Figure 8, the same MDS analysis over the hidden units of a network in set-size 24. Hidden units move together in space at the beginning of training up to a point when they separate into different sub-regions of the space, corresponding to separate representations for A1, A2, and A3. It is evident that the 24 embeddings now each contribute a weaker trace, and this allows the trace from each individual Ai element to be maintained more strongly in the context units, shaping the activation pattern of the hidden units.

Regarding the large difference in performance between set-size 1 and 2, how do SRNs learn to predict the correct B non-adjacency in the former but not in the latter case? The MDS graph of hidden unit trajectories (Figure 9) once again reveals that different trajectories are traversed, ending in three distinct regions of the space, a situation similar to set-size 24. It seems that the networks develop a compressed representation for a general X either with no variability or with a large enough number of Xs, thus leaving computational space for the three distinct A traces to be encoded in the hidden units. Although this explanation is reasonable for set-size 24, one possibility is however that the networks merely memorize the three different strings in set-size 1, suggesting that not one but two different mechanisms are responsible for the U shape: one based on variability in set-size 24 and one based on rote learning in set-size 1.

Footnote 5: Ungrammatical sequences are removed from the graphs, because each produces exactly the same vector over the network's hidden units. Hence each graph displays six trajectories: one each for AX1, AX2, BX1, BX2, CX1, and CX2.

Figure 7. MDS analysis of hidden unit trajectories in the set-size 2 condition (X = 2). A network trained on 2 Xs fails to achieve the needed separation: all trajectories (a_x1, a_x2, b_x1, b_x2, c_x1, c_x2) remain close to each other all the way through the end of training. Hence the network can never form correct predictions of the successor to the X.

Figure 8. MDS analysis of hidden unit trajectories in the set-size 24 condition (X = 24): all trajectories start out, on the left side, from the same small region, and progressively diverge to result in three pairs of two representations.

Figure 9. MDS analysis for a network trained on the set-size 1 condition (X = 1). As in the set-size 24 case, the network is successful in separating out the corresponding internal representations: the terminal points of each trajectory end up in different regions of the space.
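The trajectory plots in Figures 7–9 can be approximated with a simple principal-component projection of recorded hidden-state vectors. The sketch below assumes an SRN object like the hypothetical SimpleRecurrentNetwork shown earlier and only illustrates the general recipe; the original analyses used MDS over hidden unit activations sampled at 15 points during training:

    import numpy as np

    def pca_2d(states):
        """Project (n_samples, n_hidden) hidden-state vectors onto their first two
        principal components, as a rough stand-in for the MDS plots."""
        H = np.asarray(states, dtype=float)
        H = H - H.mean(axis=0)                     # center before taking the SVD
        _, _, vt = np.linalg.svd(H, full_matrices=False)
        return H @ vt[:2].T                        # 2-D coordinates, one row per sample

    def hidden_state_before_B(net, prefix, n_words, one_hot):
        """Run the network over a sentence prefix such as [Ai, Xj] and return the
        hidden vector that has to predict the upcoming Bi word."""
        net.reset()
        for w in prefix:
            net.step(one_hot(w, n_words))
        return net.hidden.copy()

    # Sketch of use: at each training checkpoint, store hidden_state_before_B for
    # every Ai-Xj combination, then plot pca_2d over all stored vectors, grouping
    # the points by Ai. Separate clusters per Ai (as in set-sizes 1 and 24) mean
    # that the context units still carry enough information about the initial word
    # to predict the correct Bi; overlapping clusters (as in set-size 2) mean that
    # the shared X trace has overshadowed the A trace.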
In Onnis et al., this possibility was resolved by showing that learners can endorse correct non-adjacent dependencies in set-size 1 even when presented with a novel X at test (their Experiment 3). They also showed (Experiment 2) that performance was good even when six different A_B pairs had to be learned with one X. Since in this latter control condition the number of string types to be learned was exactly the same as in set-size 2 (and indeed resulted in a more complex language, with 13 words as opposed to 8 words in set-size 2), the difference in performance could not be accounted for by a memory advantage in set-size 1.

Since the MDS analyses cannot disambiguate whether the networks learn by rote in set-size 1 (a result that would differ from human learning), we ran further simulations equivalent to Onnis et al.'s Experiments 2 and 3. SRNs were trained on exactly the same training regime as in Simulation 1, while Ai_Bi and *Ai_Bj frames were presented at test with a completely new X that had never appeared during training. Intriguingly, the networks still recognized the correct non-adjacencies better with null or high variability than in the set-size 2 condition. Figure 10 shows that when presented with novel Xs at test, SRN performance is considerably better in set-sizes 1 and 24 than in set-size 2. Figure 11 shows that this advantage persists when the networks have to learn six non-adjacent dependencies, i.e., when the number of trigrams to be learned is equated in set-size 1 and set-size 2. Crucially, in both set-size 1 and 24, the networks develop a single representation for the X, which leaves compression space for the trace of the distant A elements to be encoded in the hidden units. We believe that these results, coupled with the separation of hidden unit trajectories, form compelling evidence that the learning of non-adjacencies happens independently of specific X embeddings, thus corroborating the idea that what is learned is not a trigram sequence of adjacent elements, but a true discontinuity relation. In fact, the reason why the discontinuities are not learned equally well in the low variability conditions is exactly that the networks find an optimal solution in learning adjacent bigram information in those conditions. Our simulations reveal that a similar variability-driven mechanism is responsible for better learning of non-adjacencies in either zero or high variability, closely matching the human data.

Figure 10. SRN and Human Data (HD) performance (z-scores of differences) in endorsing non-adjacencies in sentences containing novel Xs, in three differing conditions of variability (V1, V2, V24).

Figure 11. Both SRNs and humans learn six non-adjacent frames with one X better than three non-adjacent frames with two Xs, suggesting that there is something special about having only one intervening X.
Type variability or token frequency?

A last confound that has to be disentangled in the current simulations is the possible role of the token frequency of X elements. Because the total number of learning trials is kept constant across conditions, in larger set-size conditions each X element is presented to the network fewer times. It may thus be the case that the trace from the A elements can be better encoded in the hidden units in set-size 24 because the token frequency of each X element decreases. Under this scenario, improved non-adjacent learning in higher-variability conditions would not necessarily be due to the higher variability of X types, but rather to the lower frequency of X tokens, thus perhaps trivializing our results. Therefore, we ran a further set of simulations similar to Gómez (2002), in which the number of total token presentations of each X was kept constant across conditions. Gómez found that learning still improved in set-size 24, thus ruling out the impact of X token frequency. Figure 12 shows that when the number of X tokens is identical across variability conditions to that used in the set-size 24 condition of Simulation 1 (i.e., 15 repetitions), the SRNs learn in the high variability conditions, suggesting that type variability, not token frequency, is indeed the key factor improving performance in set-size 24. Figure 12 also shows that with training brought up to more asymptotic levels (token frequencies of Xs of 150, 360, and 720 repetitions, held constant across set-size conditions) the U shape is restored. These training trajectories are in line with connectionist networks' typical behavior and do not depart from human behavior. Typically, a connectionist network needs a certain amount of training in order to get "off the ground": it starts with low random weights, and needs to configure itself to solve the task at hand. This takes many training items, many more than humans typically need. Arguably, when humans enter the psychologist's lab to participate in a study they do not start with "random connections"; rather, they bring with them considerable knowledge, accumulated over years of experience with sequences of events in the world. Therefore, we expected the networks to require longer training to configure themselves for a particular task. In separate studies, connectionist networks were pre-trained on basic low-level regularities of the training stimuli prior to the actual learning task (Botvinick & Plaut, 2006; Christiansen, Conway, & Curtin, 2000; Destrebecqz & Cleeremans, 2003; Harm & Seidenberg, 1999). As more data are collected on the learning of non-adjacencies, it will be necessary to provide more detailed models. However, our choice of localist representations and no pretraining was motivated by the desire to capture something general about the U shape, as Onnis et al. (submitted; Experiment 4) also obtained similar learning with visually presented pseudo-shapes. Therefore, Figure 12 suggests that when the SRNs receive sufficient training to learn the material in every condition (at least 150 repetitions of each X element) the U-shaped curve is fully restored. These control simulations suggest that the emerging U-shaped curve in learning non-adjacencies is truly mediated by the type frequency of the intervening embedded elements.

Figure 12. SRN simulations controlling for the number of tokens of the embeddings across variability conditions (Luce ratio differences for conditions V1–V24, at 15, 150, 360, and 720 repetitions per X). With a sufficient number of tokens (150) the networks display a U-shaped learning curve that is dependent on the variability of embeddings.
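One possible way to construct such token-frequency-controlled training sets is sketched below: instead of fixing the total corpus at 432 strings (so that each X type is seen 432/set-size times), every X type is repeated a fixed number of times in all conditions, using the repetition counts mentioned above (15, 150, 360, 720). The balancing and shuffling details are assumptions of the sketch rather than a description of the original training regime:

    import random

    A_WORDS = ["a1", "a2", "a3"]
    B_WORDS = ["b1", "b2", "b3"]

    def token_controlled_training_set(set_size, reps_per_x, seed=0):
        """Each of the set_size X words occurs exactly reps_per_x times in total,
        spread evenly over the three Ai_Bi frames, so the corpus grows with the
        set-size instead of holding the total number of strings constant."""
        frames = list(zip(A_WORDS, B_WORDS))
        strings = []
        for j in range(1, set_size + 1):
            for k in range(reps_per_x):
                a, b = frames[k % len(frames)]
                strings.append((a, f"x{j}", b))
        random.Random(seed).shuffle(strings)
        return strings

    for reps in (15, 150, 360, 720):
        sizes = {s: len(token_controlled_training_set(s, reps)) for s in (1, 2, 6, 12, 24)}
        print(reps, "repetitions per X ->", sizes)   # e.g. 15 -> {1: 15, 2: 30, ..., 24: 360}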
Conclusions

Sensitivity to transitional probabilities of various orders, including non-adjacent probabilities, in implicit sequential learning has been observed experimentally in adults and children, suggesting that learners exploit these statistical properties of the input to detect structure. Indeed, individual differences in the ability to detect non-adjacencies in implicit sequential learning tasks have been found to correlate with adults' language skills (Misyak & Christiansen, 2012; Misyak et al., 2010a, b). Detecting non-adjacent structure poses a genuine computational and representational problem for simple associative models based purely on knowledge of adjacent items. Following Gómez (2002), a more elaborate proposal is that human learners may exploit different sources of information, here adjacencies and non-adjacencies, to learn structured sequences. Her original results suggested that non-adjacencies are learned better when adjacent information becomes less informative.

The current work began where the experimental data of Gómez (2002) and Onnis et al. (2003; submitted) concluded. It is a first attempt to provide a mechanistic account of implicit associative learning for a set of human results that the current literature cannot explain. We have compared different connectionist architectures with several different parameter configurations, resulting in 22,500 individual simulations and allowing a comprehensive search over the space of possible network performances. Such extensive modelling allowed us to select, with a good degree of confidence, Simple Recurrent Networks as the best candidates for learning under conditions of differing variability. We have shown that SRNs succeed in accounting for the experimental U shape patterns. This is not an easy feat, because SRNs have initial architectural biases toward local dependencies (Chater & Conkey, 1992; Christiansen & Chater, 1999) and because predictions in SRNs tend to converge towards the optimal conditional probabilities of observing a particular successor to the sequence presented up to that point. This means that minima are located at points in weight space where the activations equal the optimal conditional probabilities. In fact, the activations of the output units corresponding to the three final items to be predicted in set-sizes 2, 6, and 12 settle around .33, which is the optimal conditional probability P(B|X) across conditions. However, n-gram transitional probabilities fail to account for non-adjacent constraints, yielding suboptimal solutions. The networks' ability to predict non-adjacencies is modulated by the variability of the intervening element, under conditions of either nil or high variability, and is achieved by developing separate graded representations in the hidden units. An analysis of hidden unit trajectories over training, together with control simulations in which new embedded elements were presented at test, suggests that the networks' success at the two endpoints of the U curve might be supported by a similar type of learning, thus ruling out a simplistic rote learning explanation for set-size 1. We have presented a connectionist model that can capture in a single representation both local and non-local properties of the input in a superimposed fashion. This permits it to discover structured sequential input in an implicit, associative way.
Together, the experimental and simulation data on the U curve challenge previous AGL accounts based on a single default source of learning. The major implication of this work is that, rather than ruling out associative mechanisms across the board, statistical learning based on distributional information can account for apparently puzzling aspects of human learning of non-adjacent dependencies. Furthermore, to the extent that these models fit the human data without explicit knowledge, they provide a proof of concept that explicit conscious knowledge may not be necessary to acquire long-distance relations.

Acknowledgments

This work was supported by European Commission Grant HPRN-CT-1999-00065, an institutional grant from the Université Libre de Bruxelles, a Human Frontiers Science Program Grant (RGP0177/2001-B), and Nanyang Technological University's StartUp-Fund #M4081274. Axel Cleeremans is a Senior Research Associate of the National Fund for Scientific Research (Belgium).

References

Allen, J., & Seidenberg, M.S. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), The emergence of language (pp. 115–151). Mahwah, NJ: Lawrence Erlbaum Associates.
Botvinick, M., & Plaut, D.C. (2006). Short-term memory for serial order: A recurrent neural network model. Psychological Review, 113, 201–233. DOI: 10.1037/0033-295X.113.2.201
Botvinick, M., & Plaut, D.C. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111, 395–429. DOI: 10.1037/0033-295X.111.2.395
Brakel, P., & Frank, S.L. (2009). Strong systematicity in sentence processing by simple recurrent networks. In N.A. Taatgen, H. van Rijn, J. Nerbonne, & L. Schomaker (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 1599–1604). Austin, TX: Cognitive Science Society.
Chater, N., & Conkey, P. (1992). Finding linguistic structure with recurrent neural networks. In Proceedings of the 14th Annual Conference of the Cognitive Science Society (pp. 402–407). Hillsdale, NJ: Psychology Press.
Chomsky, N. (1959). A review of B.F. Skinner's Verbal Behavior. Language, 35(1), 26–58.
Christiansen, M.H., Allen, J., & Seidenberg, M.S. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13, 221–268. DOI: 10.1080/016909698386528
Christiansen, M.H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 157–205. DOI: 10.1207/s15516709cog2302_2
Christiansen, M.H., Conway, C.M., & Curtin, S. (2000). A connectionist single-mechanism account of rule-like behavior in infancy. In L.R. Gleitman & A.K. Joshi (Eds.), Proceedings of the 22nd Annual Conference of the Cognitive Science Society (pp. 83–88). Philadelphia, PA: University of Pennsylvania.
Christiansen, M.H., & MacDonald, M.C. (2009). A usage-based approach to recursion in sentence processing. Language Learning, 59(Suppl. 1), 126–161. DOI: 10.1111/j.1467-9922.2009.00538.x
Cleeremans, A., Servan-Schreiber, D., & McClelland, J.L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372–381. DOI: 10.1162/neco.1989.1.3.372
Destrebecqz, A., & Cleeremans, A. (2003). Temporal factors in sequence learning. In L. Jiménez (Ed.), Attention and implicit learning. Amsterdam: John Benjamins. DOI: 10.1075/aicr.48.11des
Cottrell, G.W., & Plunkett, K. (1995). Acquiring the mapping from meanings to sounds. Connection Science, 6, 379–412. DOI: 10.1080/09540099408915731
Dell, G.S., Juliano, C., & Govindjee, A. (1993). Structure and content in language production: A theory of frame constraints in phonological speech errors. Cognitive Science, 17, 149–195. DOI: 10.1207/s15516709cog1702_1
Dienes, Z. (1992). Connectionist and memory-array models of artificial grammar learning. Cognitive Science, 23, 53–82. DOI: 10.1207/s15516709cog2301_3
Dulany, D.E., Carlson, R.A., & Dewey, G.I. (1984). A case of syntactical learning and judgement: How conscious and how abstract? Journal of Experimental Psychology: General, 113, 541–555. DOI: 10.1037/0096-3445.113.4.541
Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Elman, J.L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–224.
Estes, K., Evans, J., Alibali, M., & Saffran, J. (2007). Can infants map meaning to newly segmented words? Psychological Science, 18(3), 254. DOI: 10.1111/j.1467-9280.2007.01885.x
Farkaš, I., & Crocker, M.W. (2008). Syntactic systematicity in sentence processing with a recurrent self-organizing network. Neurocomputing, 71(7), 1172–1179. DOI: 10.1016/j.neucom.2007.11.025
Frank, M.C., Goldwater, S., Griffiths, T.L., & Tenenbaum, J.B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117(2), 107–125. DOI: 10.1016/j.cognition.2010.07.005
Frank, S.L. (in press). Getting real about systematicity. In P. Calvo & J. Symons (Eds.), Systematicity and cognitive architecture: Conceptual and empirical issues 25 years after Fodor & Pylyshyn's challenge to connectionism. Cambridge, MA: The MIT Press.
Frinken, V., Fischer, A., Manmatha, R., & Bunke, H. (2012). A novel word spotting method based on recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2), 211–224. DOI: 10.1109/TPAMI.2011.113
Gaskell, M.G., Hare, M., & Marslen-Wilson, W.D. (1995). A connectionist model of phonological representation in speech perception. Cognitive Science, 19, 407–439. DOI: 10.1207/s15516709cog1904_1
Gibson, F.P., Fichman, M., & Plaut, D.C. (1997). Learning in dynamic decision tasks: Computational model and empirical evidence. Organizational Behavior and Human Decision Processes, 71, 1–35. DOI: 10.1006/obhd.1997.2712
Gómez, R. (2002). Variability and detection of invariant structure. Psychological Science, 13, 431–436. DOI: 10.1111/1467-9280.00476
Harm, M.W., & Seidenberg, M.S. (1999). Phonology, reading acquisition, and dyslexia: Insights from connectionist models. Psychological Review, 106, 491–528. DOI: 10.1037/0033-295X.106.3.491
Hinoshita, W., Arie, H., Tani, J., Okuno, H.G., & Ogata, T. (2011). Emergence of hierarchical structure mirroring linguistic composition in a recurrent neural network. Neural Networks, 24(4), 311–320. DOI: 10.1016/j.neunet.2010.12.006
Johnstone, T., & Shanks, D.R. (2001). Abstractionist and processing accounts of implicit learning. Cognitive Psychology, 42, 61–112. DOI: 10.1006/cogp.2000.0743
Jordan, M.I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kinder, A., & Shanks, D.R. (2001). Amnesia and the declarative/procedural distinction: A recurrent network model of classification, recognition, and repetition priming. Journal of Cognitive Neuroscience, 13, 648–669. DOI: 10.1162/089892901750363217
Kirov, C., & Frank, R. (2012). Processing of nested and cross-serial dependencies: An automaton perspective on SRN behaviour. Connection Science, 24(1), 1–24. DOI: 10.1080/09540091.2011.641939
Lashley, K.S. (1951). The problem of serial order in behavior. In L.A. Jeffress (Ed.), Cerebral mechanisms in behavior (pp. 112–146). New York, NY: Wiley.
Luce, R.D. (1963). Detection and recognition. In R.D. Luce, R.R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology. New York, NY: Wiley.
Maraqa, M., Al-Zboun, F., Dhyabat, M., & Zitar, R.A. (2012). Recognition of Arabic Sign Language (ArSL) using recurrent neural networks. Journal of Intelligent Learning Systems and Applications, 4(1), 41–52. DOI: 10.4236/jilsa.2012.41004
Maskara, A., & Noetzel, A. (1992). Forced simple recurrent neural network and grammatical inference. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 420–425). Hillsdale, NJ: Lawrence Erlbaum Associates.
Miikkulainen, R., & Mayberry III, M.R. (1999). Disambiguation and grammar as emergent soft constraints. In B. MacWhinney (Ed.), The emergence of language (pp. 153–176). Mahwah, NJ: Lawrence Erlbaum Associates.
Misyak, J.B., & Christiansen, M.H. (2012). Statistical learning and language: An individual differences study. Language Learning, 62, 302–331. DOI: 10.1111/j.1467-9922.2010.00626.x
Misyak, J.B., Christiansen, M.H., & Tomblin, J.B. (2010a). On-line individual differences in statistical learning predict language processing. Frontiers in Psychology, Sept. 14. DOI: 10.3389/fpsyg.2010.00031
Misyak, J.B., Christiansen, M.H., & Tomblin, J.B. (2010b). Sequential expectations: The role of prediction-based learning in language. Topics in Cognitive Science, 2, 138–153. DOI: 10.1111/j.1756-8765.2009.01072.x
Moss, H.E., Hare, M.L., Day, P., & Tyler, L.K. (1994). A distributed memory model of the associative boost in semantic priming. Connection Science, 6, 413–427. DOI: 10.1080/09540099408915732
Munakata, Y., McClelland, J.L., & Siegler, R.S. (1997). Rethinking infant knowledge: Toward an adaptive process account of successes and failures in object permanence tasks. Psychological Review, 104, 686–713. DOI: 10.1037/0033-295X.104.4.686
Onnis, L., Christiansen, M.H., Chater, N., & Gómez, R. (submitted). Statistical learning of nonadjacent relations. Submitted manuscript.
Onnis, L., Christiansen, M.H., Chater, N., & Gómez, R. (2003). Reduction of uncertainty in human sequential learning: Preliminary evidence from Artificial Grammar Learning. In R. Alterman & D. Kirsh (Eds.), Proceedings of the 25th Annual Conference of the Cognitive Science Society. Boston, MA: Cognitive Science Society.
Onnis, L., Monaghan, P., Christiansen, M.H., & Chater, N. (2004). Variability is the spice of learning, and a crucial ingredient for detecting and generalizing in nonadjacent dependencies. In Proceedings of the 26th Annual Conference of the Cognitive Science Society (pp. 1047–1052). Mahwah, NJ: Lawrence Erlbaum.
Pacton, S., Perruchet, P., Fayol, M., & Cleeremans, A. (2001). Implicit learning out of the lab: The case of orthographic regularities. Journal of Experimental Psychology: General, 130, 401–426. DOI: 10.1037/0096-3445.130.3.401
Perruchet, P., & Pacteau, C. (1990). Synthetic grammar learning: Implicit rule abstraction or explicit fragmentary knowledge? Journal of Experimental Psychology: General, 119, 264–275. DOI: 10.1037/0096-3445.119.3.264
Perruchet, P., & Pacton, S. (2006). Implicit learning and statistical learning: One phenomenon, two approaches. Trends in Cognitive Sciences, 10(5), 233–238. DOI: 10.1016/j.tics.2006.03.006
Plaut, D.C., & Kello, C.T. (1999). The emergence of phonology from the interplay of speech comprehension and production: A distributed connectionist approach. In B. MacWhinney (Ed.), The emergence of language (pp. 381–415). Mahwah, NJ: Lawrence Erlbaum Associates.
Redington, M., & Chater, N. (2002). Knowledge representation and transfer in artificial grammar learning (AGL). In R.M. French & A. Cleeremans (Eds.), Implicit learning and consciousness: An empirical, philosophical, and computational consensus in the making. Hove: Psychology Press.
Rohde, D.L.T., & Plaut, D.C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109. DOI: 10.1016/S0010-0277(99)00031-1
Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928. DOI: 10.1126/science.274.5294.1926
Saffran, J. (2001). Words in a sea of sounds: The output of infant statistical learning. Cognition, 81, 149–169. DOI: 10.1016/S0010-0277(01)00132-9
Servan-Schreiber, D., Cleeremans, A., & McClelland, J.L. (1991). Graded state machines: The representation of temporal dependencies in simple recurrent networks. Machine Learning, 7, 161–193.
Si, Y., Xu, J., Zhang, Z., Pan, J., & Yan, Y. (2012). An improved Mandarin voice input system using recurrent neural network language model. In Computational Intelligence and Security (CIS), Eighth International Conference on (pp. 242–246). IEEE.
Socher, R., Manning, C.D., & Ng, A.Y. (2010). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
Sutskever, I., Martens, J., & Hinton, G. (2011). Generating text with recurrent neural networks. In Proceedings of the 2011 International Conference on Machine Learning (ICML-2011).
Tabor, W. (2011). Recursion and recursion-like structure in ensembles of neural elements. In H. Sayama, A. Minai, D. Braha, & Y. Bar-Yam (Eds.), Unifying themes in complex systems: Proceedings of the VIII International Conference on Complex Systems (pp. 1494–1508). Berlin: Springer.
Takac, M., Benuskova, L., & Knott, A. (2012). Mapping sensorimotor sequences to word sequences: A connectionist model of language acquisition and sentence generation. Cognition, 125, 288–308. DOI: 10.1016/j.cognition.2012.06.006
Vokey, J.R., & Brooks, L.R. (1992). Salience of item knowledge in learning artificial grammar. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 328–344. DOI: 10.1037/0278-7393.18.2.328

Appendix A. Measures of associative learning

Global associative chunk strength (GCS; Knowlton & Squire, 1994) averages the frequencies of all bigrams and trigrams that appear in strings. For instance, one can calculate the GCS for grammatical test items in set-size 2. The form of each test item is AiXjBi, with three Ai_Bi dependencies and two X elements. A specific item, for instance A1X2B1, is composed of two bigrams, A1X2 and X2B1, each repeated 72 times during training, and one trigram, A1X2B1, repeated 72 times. The GCS measure for this item is obtained by dividing the summed frequencies of these n-grams by the number of n-grams:

[freq(A1X2) + freq(X2B1) + freq(A1X2B1)] / 3 = (72 + 72 + 72) / 3 = 72
Likewise, the GCS for an ungrammatical test item in set-size 2, say A1X2B2, is calculated as follows:

[freq(A1X2) + freq(X2B2) + freq(A1X2B2)] / 3 = (72 + 72 + 0) / 3 = 48

The Anchor Associative Chunk Strength measure (ACS; Reber & Allen, 1978) is similar to the Global Chunk Strength measure, but gives greater weight to the salient initial and final symbols of each string. It is computed by averaging the frequencies of the first and last bigrams and trigrams in each string. In this particular case, because the strings contain only three items, the ACS scores are the same as the GCS scores (see Table 2). The first two rows in Table 2 show that GCS/ACS values are always higher for grammatical than for ungrammatical sentences (with a constant ratio of 1.5) and that both values decrease as a function of set-size. Such measures predict that if learners were relying on chunk strength association, their performance should decrease as set-size increases, and thus they do not capture the U shape.

The Novelty measure counts the number of fragments that are new in a sentence presented at test (Redington & Chater, 1996; 2002). This score is 0 for grammatical test strings across conditions, because they do not contain novel fragments, and 1 for ungrammatical test strings, because they contain one new trigram AiXBj. This measure predicts a preference for grammatical strings across conditions, and thus does not capture the U shape either. Yet another measure is Novel Fragment Position (NFP; Johnstone & Shanks, 2001), which counts the number of known fragments in a novel absolute position. This score is 0 for both grammatical and ungrammatical test strings, since no fragment appears in a new position with respect to the training items, and thus it cannot account for any differences in grammaticality judgments across conditions. Lastly, Global Similarity (GS) measures the number of letters in a test string that differ from the nearest training string (Vokey & Brooks, 1992). For grammatical test strings this score is 0, and for ungrammatical test items it is 1. Since this value is the same across conditions, GS predicts a preference for grammatical strings in all conditions, and again fails to capture the U shape.
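The chunk-strength computations above are straightforward to reproduce. The sketch below counts bigram and trigram frequencies in a training corpus and scores a test item as the mean training frequency of its chunks; with the set-size 2 corpus of the worked example it returns 72 for a grammatical item such as A1X2B1 and 48 for an ungrammatical item such as A1X2B2 (for three-word strings the anchor measure, ACS, is identical). The schematic word labels are illustrative only:

    from collections import Counter

    def chunk_counts(training_strings):
        """Frequency of every bigram and trigram occurring in the training corpus."""
        counts = Counter()
        for a, x, b in training_strings:
            counts[(a, x)] += 1          # initial bigram
            counts[(x, b)] += 1          # final bigram
            counts[(a, x, b)] += 1       # whole trigram
        return counts

    def gcs(test_item, counts):
        """Global associative chunk strength: mean training frequency of the
        item's two bigrams and one trigram."""
        a, x, b = test_item
        chunks = [(a, x), (x, b), (a, x, b)]
        return sum(counts[c] for c in chunks) / len(chunks)

    # Set-size 2 corpus: 3 frames x 2 X's = 6 sentence types, each repeated 72 times.
    frames = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
    training = [(a, x, b) for a, b in frames for x in ("x1", "x2")] * 72
    counts = chunk_counts(training)

    print(gcs(("a1", "x2", "b1"), counts))   # grammatical item   -> 72.0
    print(gcs(("a1", "x2", "b2"), counts))   # ungrammatical item -> 48.0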
