Probabilistic Constraints in Language Acquisition

Mark S. Seidenberg, Joseph Allen, & Morten H. Christiansen
Program in Neural, Informational and Behavioral Sciences, University of Southern California
marks, joeallen, morten@gizmo.usc.edu

Abstract

An approach to language acquisition is described in which the foundational questions are about how the child acquires adult-like language processing capacities. This approach assumes that the child's primary task is not to identify the grammar of the target language, but rather to learn to comprehend and produce utterances in the service of communicating with others. In this framework the input to the child is seen as providing a rich source of probabilistic information, and notions of the target language are provided by constraint satisfaction models of adult performance. The methodology involves implementing connectionist models that simulate detailed aspects of performance.

Introduction

Our paper is about a new approach to language acquisition that has begun to emerge over the past few years. The sources of this new approach are the renewed interest in the statistical and probabilistic aspects of language on the part of many language researchers; connectionism, which provides new insights about knowledge representation, learning, and processing; and studies of the remarkable learning capacities of infants and young children. This approach entails different assumptions about what the important questions are in language acquisition research and where the answers are likely to be found. First, we view the child's task as learning to use language, not grammar identification. Second, we ask how the child acquires adult-like processing capacities, not the capacity to distinguish grammatical from ungrammatical sentences. Third, the "target" or "steady state" the child must achieve is provided by constraint satisfaction models of adult performance, not an idealized competence grammar. What does this reorientation accomplish?
For one thing, it provides the framework for a unified account of acquisition and processing. For another, it makes language learning a tractable problem for the child. The standard approach leaves it unclear how a child could ever acquire a language: poverty of the stimulus arguments that have been taken to prove that grammar must be innate (Chomsky 1965) create a paradox because they also apply to language-specific types of knowledge that must be learned from experience. The parameter setting approach attempts to avoid classic learnability problems, but introduces new ones (Gibson & Wexler 1994). In addition, the new approach is beginning to yield fresh analyses of and insights about particular language acquisition puzzles. For example, the approach is yielding a different view of what is innate. The old approach focuses on alternatives that date from 16th century philosophy: either tabula rasa empiricism or innate knowledge of grammar. The newer view takes its account of what is innate from modern developmental neuroscience, which suggests that learning entails both structural and functional changes to the systems responsible for performance, and that these changes are governed by an interaction between the nature of the events to which the learner is exposed and a system structured so as to develop capacities for dealing successfully with particular tasks (Elman, Bates, Johnson, Karmiloff-Smith, Parisi & Plunkett 1996). In other words, brains don't come with knowledge of specific grammars, faces, or objects already wired in; rather, brains are structured so as to facilitate the emergence of certain kinds of knowledge given certain kinds of environmental events. Our approach is concerned with the factors that govern the character of such emergent knowledge.

The Standard Approach

To place this research program in context, consider two of the standard assumptions that are widely held among acquisition researchers: there is a grammatical competence underlying language
behavior, and the task of the child is to discover which of a very restricted set of innately specified grammars her language is described by. In this section we critically examine these assumptions in turn. The starting point for the standard approach is the competence grammar, defined as that knowledge embodied in the ideal speaker-hearer of a given speech community (Chomsky 1965). The competence grammar self-consciously abstracts away from many of the factors that actually govern whether an utterance will be used on any occasion by any individual, including memory limitations, individual differences, facts about the computational system that implements the grammar, and so on. Grammar is a characterization of the knowledge that allows the speaker-hearer to produce and understand an infinite number of sentences. The child's task is then described as grammar identification; acquiring this knowledge allows grammatical sentences to be distinguished from ungrammatical ones (Wexler & Culicover 1980). Poverty of the stimulus arguments are then used to suggest that this knowledge cannot be derived solely from experience (Chomsky 1965). The input is described as impoverished insofar as children's knowledge of language eventually extends far beyond the range of utterances to which they are exposed; the input is also described as too rich insofar as it affords incorrect inductive generalizations that children never make; hence the input alone cannot be the source of core grammatical knowledge. Results from Gold (1967) and others working within the learnability paradigm suggest that grammar identification cannot be achieved unless there are strong constraints on the possible forms of grammatical knowledge. The view that children are born with knowledge of Universal Grammar (UG) is seen as compatible with the results of behavioral studies suggesting that various non-obvious aspects of grammatical knowledge (such as the binding principles) are present in children as young as they can
be tested (Lust, Eisele & Mazuka 1992). It is also compatible with the observation that languages exhibit structures such as empty categories for which there is no evidence in the input (Chomsky 1981, Crain 1991). This theory is said to simultaneously explain facts about linguistic universals and other converging evidence, such as creolization and other examples of language acquisition under atypical circumstances.

An Alternative View

We view each of these issues somewhat differently. The "competence" orientation systematically excludes various factors that affect language acquisition and use, including facts about statistical and probabilistic aspects of language; perceptual and memory capacities that limit what people can produce or comprehend; and the communicative functions of language. These kinds of information therefore are not available to enter into the description of linguistic structure. What is available are the various abstract, domain-specific, theory-internal constructs of linguistic theories. The traditional approach then asks how a grammar of this sort could be acquired. Competence grammar seeks an idealized characterization of language structure, and assumes that it is only with such a characterization of language in hand that one could hope to address how it is acquired, processed, or represented in the brain. Facts about acquisition, processing, the brain bases of language, and evolution are not available to enter into explanations of linguistic phenomena. Since what is left over are the various theory-internal explanatory devices, it comes as no surprise to us that these abstract, domain-specific structures make language seem unlearnable and unrelated to other aspects of cognition. Competence grammar represents an abstraction away from important properties of language, properties that provide the basis for understanding how language learning could actually proceed. The competence approach also promotes systematic overattributions about the nature of
linguistic knowledge by assigning to people capacities that they don't possess. We all know that competence grammar is overly powerful relative to people's actual capacities to comprehend and produce utterances. For example, the grammar permits unlimited amounts of center-embedding, but people's capacities to comprehend or produce such structures are quite limited (for a review, see Hudson 1996). The standard approach postulates extrinsic "performance constraints" to account for the ways in which behavior deviates from what the grammar allows. Our approach attempts to eliminate the grammatical middleman. We want a performance system that handles all and only those structures that people can. The behavior of the system encoding knowledge of language should degrade in exactly the ways that people's does. Performance constraints are embodied by the system responsible for producing and comprehending utterances, not extrinsic to it. The fact that we are trying to account for people's actual capacities rather than the ones assumed by the competence approach is important. Our methodology involves implementing connectionist models that simulate detailed aspects of performance. In the past, criticism of such models has focused on whether they can capture one or another aspect of grammatical competence (e.g., unlimited recursion). However, the goal of the enterprise is to develop models that provide an account of people's attested performance, not the idealized characterization of linguistic knowledge that is competence grammar. Our approach also differs with regard to how we view the task confronting the language learner. The standard approach assumes that the task is grammar identification: the child has to converge on the knowledge structures that allow grammatical sentences to be distinguished from ungrammatical ones. From our perspective this is not what the child is actually trying to accomplish. The task that children are engaged in is learning to use language. In the course of
mastering this task, they develop various types of knowledge representations. The primary function of this knowledge is producing and comprehending utterances. Being able to distinguish grammatical and ungrammatical sentences is merely a derived function. This point can be understood by considering the analogous problem of learning to read. The beginning reader's problem is to learn how to read words. There are various models of how the knowledge relevant to this task is acquired (Seidenberg & McClelland 1989). Once acquired, this knowledge can be used to perform many other tasks. One task that has been studied in scores of experiments is lexical decision: judging whether a stimulus is a word or not (Seidenberg 1995). Even young readers can reliably determine that BOOK is a word but NUST is not. Note, however, that the task confronting the beginning reader is not acquiring the ability to judge the well-formedness of letter strings. By the same token, the task confronting the language learner is not acquiring the ability to judge the grammaticality of sentences. In both cases, knowledge that is acquired for other purposes can eventually be used to perform these secondary tasks. Such tasks may provide a useful way of assessing people's knowledge but should not be construed as the goal of acquisition. Our approach also changes the picture regarding classic poverty of the stimulus arguments. In brief, many standard arguments no longer go through. Some of these arguments concern how the child copes with noisy or variable input. Connectionist models provide a strong hint that this property is not devastating to systems that must converge on stable behavior. One of the important properties of the algorithms used in training such models is their capacity to derive structural regularities from noisy or variable input. In fact, analyses of such networks suggest that noisy input can sometimes facilitate learning, by allowing the system to escape local minima, for example (Minnix 1992). Other
classic arguments turn on the assumption that the child's task is grammar identification. For example, the fact that the input contains both grammatical and ungrammatical utterances that are not labelled as such creates a massive problem for the grammar identification paradigm but not for the performance theory we are proposing. Similarly, our approach does not entail the assumptions that underlie analyses of learnability by Gold and others. For example, our approach suggests that the negative evidence issue has been greatly overemphasized: in the nets we are working with, constraints that act against particular states routinely arise as a by-product of extracting regularities. In such systems, sustained positive evidence for one generalization often turns out to simultaneously provide evidence against other, competing generalizations. The common claim that "negative facts cannot be learned" (Crain 1991) is clearly contradicted by the behavior of very simple networks. Thus, when the task of the child is not grammar identification, the negative evidence issue seems far less relevant. Note also that the actual capacities of the child to learn are discussed very little in the standard literature, where the main emphasis of the approach is the essential unlearnability of language. We think this is a mistake. Claims about what cannot be learned need to be reassessed in light of new discoveries concerning the ability of young children to learn very rapidly from noisy input. Far from being impoverished, the input to the child in our view provides a rich source of probabilistic information. The child's task is to identify those combinatorial and probabilistic aspects of the input that allow comprehension to proceed. Representations of language emerge through the encoding of such regularities. These representations will be similar but not identical to the ones derived from the classical approach. Much of the focus of our research program is on determining the extent to which the input
provides evidence for different types of linguistic structure. For example, are there cues to the types of abstract structures proposed within the standard approach to be found in the statistical and probabilistic information available to children? What type of learning device could extract such cues, and are children devices of that sort? It seems likely that many of the claims concerning the poverty of the stimulus vis-à-vis particular aspects of linguistic structure will prove to be invalid, given the capacities of children to turn correlated cues into structured knowledge. These considerations are relevant to assessing evidence from studies of young children that have been taken as evidence for UG (Crain 1991). Our approach suggests that it will be important to re-examine claims that various aspects of language must be innate because children exhibit knowledge of them "as young as they can be tested." In practice, the subjects in these studies are typically 2½ or 3 years old. We now know that learning about the statistical and probabilistic aspects of language begins much younger. Recent work by Saffran, Aslin, and Newport (1996) demonstrated such learning in 8-month-olds. Other studies, e.g. DeCasper, Lecanuet, Busnel & Maugeais (1994), suggest that language learning begins in utero: newborns show a preference for listening to speech in their mother's language. These studies indicate that children will have learned a considerable amount from experience with language before the age of 2½ or 3. Such studies also raise questions about the claim that children's language exhibits properties for which there is "no evidence" in the input. These claims need to be assessed in terms of the kinds of statistical information available to the child and learning mechanisms that are able to extract non-obvious regularities from it. We are not asserting that these claims are wrong; rather, our focus is on the need to re-examine these phenomena from the perspective provided by the new
approach. For example, contrary to the assertion that languages have properties that cannot be learned because they aren't marked "in the signal", networks create what look like abstract underlying representations in the course of solving a problem. The word segmentation model we discuss in the next section provides a simple example of this. The need to reassess traditional claims in light of the new discoveries about young children's ability to learn holds true for other sorts of converging evidence thought to support the conclusion that grammar is innate. Linguistic universals, for example, are standardly explained by placing these properties of language in the brain of the child. However, there are other sources of constraint on the forms of languages that need to be considered. Languages may exhibit the properties they do because otherwise they could not have evolved, could not be processed, would not fulfill particular communicative functions, or would not be learnable. Together we think these considerations suggest a need to reassess the nature of the biological endowment relevant to language and the evidence for it. We should emphasize that our claim is not that there is no biological endowment relevant to language; rather, it is that the standard arguments and conclusions are called into question by the approach we have described, and that reaching valid conclusions about what is innate requires achieving a better understanding of what can be learned.

Current Research

What kinds of research questions are suggested by this framework?
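A recurring theme in what follows is that several individually unreliable cues can jointly support reliable behavior. As a toy illustration of why combining weak probabilistic cues pays off (the three cue validities below are invented for this sketch, not taken from any of the models cited here):

```python
import random

random.seed(0)

# Three hypothetical cues to some binary linguistic distinction.
# Each is only weakly valid: it reports the correct answer with
# probability 0.65, 0.62, and 0.60 respectively (invented numbers).
VALIDITIES = [0.65, 0.62, 0.60]

def cue_reports(truth):
    """Each cue independently reports the true category or its opposite."""
    return [truth if random.random() < v else 1 - truth for v in VALIDITIES]

def accuracy(decide, trials=20000):
    """Estimate how often a decision rule recovers the true category."""
    correct = 0
    for _ in range(trials):
        truth = random.randint(0, 1)
        correct += decide(cue_reports(truth)) == truth
    return correct / trials

best_single = accuracy(lambda reports: reports[0])            # best cue alone
combined = accuracy(lambda reports: int(sum(reports) >= 2))   # majority vote

# Combining three weak cues is more accurate than the best single cue
# (analytically, roughly 0.68 versus 0.65 for these validities).
assert combined > best_single
```

With cues that are each right only 60-65% of the time, even a crude majority vote outperforms the best cue taken alone; the weighted, graded combination that a trained network computes can in principle do better still.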
The characteristic of the adult system we take to be crucial is the rapid integration of multiple probabilistic constraints involving different types of information. This kind of processing has been explored extensively in the domain of syntactic ambiguity resolution and other aspects of sentence processing revealed by "on-line" processing studies (MacDonald, Pearlmutter & Seidenberg 1994). On our view, the central question in acquisition research is how the child acquires a system with this character. In particular, how do the various levels of linguistic representation emerge, what are their properties, and how is probabilistic knowledge acquired from experience? The general idea is that the "bootstrapping" mechanism that has played a central role in thinking about child language (Landau & Gleitman 1985) is the same as the constraint satisfaction process in adult language comprehension. Bootstrapping is the idea that structures can be derived by exploiting correlations between different sources of information. The phenomenon has been studied with respect to several issues in language acquisition; we think it reflects more general learning capacities. In this section we discuss how the bootstrapping concept is relevant to the acquisition of three representative aspects of linguistic structure: the lexicon, grammatical categories, and verb semantics and argument structures. The first issue concerns how the child identifies the lexical units of a language. The second concerns how the child determines which words are members of which grammatical categories. The third issue concerns how the child simultaneously learns the meanings of verbs and the conditions governing their participation in syntactic structures. All of these problems share important characteristics. In each case, researchers have identified a variety of sources of information in the input that might contribute to solving the problem. However, using these cues to linguistic structure creates classic learnability
questions. First, how does the child know which cues are the relevant ones? Second, the cues are probabilistic rather than absolute: if a cue isn't entirely reliable, how could it be useful? Although there tend to be multiple cues for each type of structure, it is not clear what benefit could possibly arise from combining several unreliable cues. Answers to these questions are provided by coupling these observations about probabilistic cues with computational mechanisms that explain how cues can be identified and combined. Connectionist networks provide a useful tool in this regard. A network that is assigned a task such as identifying the boundaries between words acts as a discovery procedure: it will make use of whatever information in the input facilitates mastering the task. The network is not restricted to using a single type of information in solving a problem; the power of the network derives from its capacity to combine multiple probabilistic cues, even those of low validity. We see this as a general property of human learning and problem solving. Consider first the word segmentation problem. How does the child develop a spoken word vocabulary?
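Part of the answer pursued in what follows is that distributional statistics alone can mark word boundaries even though nothing in the input labels them. Here is a minimal sketch of one such cue, the forward transitional probability between syllables (the three-word vocabulary and the word order are invented for illustration):

```python
from collections import defaultdict

# A toy "speech stream": syllables with no word boundaries marked.
# The vocabulary (bida, kupado, golatu) is invented for this sketch.
WORDS = {0: ["bi", "da"], 1: ["ku", "pa", "do"], 2: ["go", "la", "tu"]}
order = [0, 1, 0, 2, 1, 2] * 30          # fixed word sequence, unmarked
stream = [syl for w in order for syl in WORDS[w]]

# Estimate the forward transitional probability P(next | current).
pair_counts = defaultdict(int)
first_counts = defaultdict(int)
for a, b in zip(stream, stream[1:]):
    pair_counts[(a, b)] += 1
    first_counts[a] += 1

def tp(a, b):
    return pair_counts[(a, b)] / first_counts[a]

# Transitions inside a word are perfectly predictable here; transitions
# that span a word boundary are not. Dips in TP therefore mark candidate
# boundaries, with no boundary information ever given to the learner.
assert tp("bi", "da") == 1.0 and tp("ku", "pa") == 1.0
assert tp("da", "ku") < tp("bi", "da")
```

Nothing in the training data marks boundaries, yet the dips in transitional probability recover them; a boundary-like representation emerges as a byproduct of tracking the statistics.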
In reading, there is a highly reliable cue to word boundaries: the white space between them. In speech there are a variety of probabilistic cues. These cues are thought to include the different transitional probabilities between phonological segments exhibited within and between words; the interaction of this cue with the ends of utterances, in that the end of an utterance is also likely to be the end of a word; regularities in stress patterns, such as the fact that most bisyllabic words in English are trochaic; correlations between stable elements of meaning and sound sequences; as well as the use of known words to assist in parsing new ones out of the speech stream. None of these cues is fully reliable. The observation that such cues are unreliable might be taken as evidence that they cannot contribute to solving the segmentation problem. Yet, although segmentation is a domain in which the role of innate linguistic knowledge is highly circumscribed, children manage to learn vocabularies. The solution to this puzzle is provided by connectionist networks that are able to combine probabilistic cues efficiently. Recent work on this problem (Christiansen, Allen & Seidenberg in press) suggests that conjunctions of cues, even those of low reliability, improve performance in a network considerably. This research has led us to the view that the knowledge underlying the ability to distinguish legal from illegal words develops in concert with, and largely as a byproduct of, the ability to detect words in the speech stream. This suggests we should examine arguments concerning the possibility of learning to discriminate possible from impossible sentences with equal care. The same issues arise in connection with the problem of identifying the grammatical functions of words. Here the important descriptive work has been done by Michael Kelly and his colleagues (Kelly 1992, Kelly & Martin 1994). This research concerns the relationship between phonological properties of words and their grammatical
categories. Kelly points out that most proposals about how children learn which words are members of which categories involve exploiting the imperfect correlations between grammatical category and the syntactic or semantic properties of words: for example, that nouns are often objects and verbs are often actions, or that the distributional contexts in which nouns appear tend to overlap, as do those of verbs. These are noisy cues, of course. Kelly's interest is in the fact that there are also phonological correlates of grammatical class; for example, the fact that bisyllabic nouns tend to receive first-syllable stress, while bisyllabic verbs tend to receive second-syllable stress. Similarly, nouns in English tend to have more syllables than verbs. Kelly's work shows that both adults and children are sensitive to such correlations when processing novel words. Even more important is the finding that processing is facilitated when multiple cues converge. Kelly goes on to describe how this information might play a role in acquisition and processing. A third area in which such issues arise is how children simultaneously learn both the meanings of verbs and how to interpret the arguments of verbs. Allen (this volume) presents a network that combines formal, lexical, and semantic information in mapping utterances to meanings. This work suggests how such cues can be combined in the task of acquiring verb meanings and their privileges of use simultaneously, and how the knowledge of the lexico-semantic neighborhoods detailed in descriptive work in this area (e.g. Levin 1993) might emerge as a consequence of learning to comprehend utterances. This type of work illustrates several important points. First, languages are standardly characterized in terms of multiple distinct levels of representation, but an important aspect of processing theories is that the dependencies among these levels play a significant role in accounting for behavior. Second, the dependencies between these
types of information are probabilistic, not absolute. Third, humans learn these correlations by being exposed to languages that exhibit them, and proceed to use that knowledge in the course of acquisition and processing.

Challenges

The framework we have outlined faces a number of difficult challenges. In closing we discuss several of them and suggest where the solutions are likely to be found. One important challenge for any statistically oriented approach to language is the fact that the input can be analyzed in so many ways. That is, the world seems to provide many potentially misleading cues. This raises the question of how the child would know which aspects of the input to compute statistics over. Our answer is that it is probably not meaningful to talk about a system "making decisions" about which statistics to compute. The learning algorithms we utilize discover generalizations over many levels simultaneously. The result of language learning is that those aspects of the signal that turn out to facilitate the comprehension task end up as detectable sensitivities in the speakers of those languages. The question that typically follows proposals of this sort is the following: why, in language after language, do we see (for example) question formation accomplished in such similar ways (e.g., wh-movement, inversion, or other common formal regularities)? Why are languages constrained to form questions in this limited way, and not in some other unattested way such as reversing all the words? Couldn't your networks learn to do that just as well?
The two approaches we have discussed take different views on what counts as an answer to this question. The standard approach, as far as we can tell, says "these properties exist because that is the (possibly arbitrary and accidental) way the human language capacity evolved." On our view, one of the points of the enterprise is to specify how properties of the underlying computational mechanisms alter the situation regarding what is easy or difficult to learn. In other words, we think it is likely that the sets of regularities that emerge and survive in languages are related both to the contribution such regularities make to the task of comprehension, and to how learnable those regularities are for the type of system doing the learning. One illustration of this idea is the work of Jack Hawkins (Hawkins 1994), suggesting the penetrability of grammars by considerations of what is most efficient for the comprehender. Another challenge is the claim that while these ideas might be relevant to low-level properties of language, there is no evidence that such systems can provide an account of the unlearnable properties of UG. It is true that most of the research to date has dealt with things that everyone agrees have to be learned. Because this research has the character of combining lower level properties in order to understand how higher level properties emerge, we simply do not know as yet how far we can go, or what things will look like when we get there.

Conclusions

We have described a view of language acquisition that calls into question some standard assumptions about how to develop a theory of acquisition. The standard assumptions lead to very strong assertions about the unlearnability of various aspects of language. Our approach questions the characterization of linguistic knowledge that has resulted from this approach and the assumption that language acquisition should be construed in terms of grammar identification. This reframing of the acquisition question, taken with
advances in the understanding of learning and knowledge representation in connectionist networks, leads to a program of research which will both reassess standard arguments regarding the role of language-specific innate knowledge and generate new accounts of the acquisition of specific aspects of language. A common objection to this type of work is the charge that it is merely recycled behaviorism. This assessment is wrong for two reasons. First, the behaviorists were not interested in internal representations, nor in using the properties of such representations to account for performance; we are, and the models we work with provide an alternative account of how such mental representations develop. Second, the networks we use rely on properties of dynamical systems, such as non-linearities in processing and the formation of stable attractors, that played no role in earlier theories. While the standard linguistic approach does accomplish its goal of developing generalizations concerning language regularities, these descriptions provide only one source of constraint on our theorizing. Our interest is in an acquisition theory that accounts for how the child learner becomes the adult user.

Acknowledgements

We would like to acknowledge Maryellen C. MacDonald for helpful discussions concerning the issues addressed in this paper. Supported by NIMH grant 47566 and an NIMH Research Scientist Development Award to MSS.

References

Chomsky, N. (1965), Aspects of the Theory of Syntax, MIT Press, Cambridge, MA.
Chomsky, N. (1981), Lectures on Government and Binding, Foris, Dordrecht.
Christiansen, M., Allen, J. & Seidenberg, M. (in press), 'Word segmentation using multiple cues: A connectionist model', Language and Cognitive Processes.
Crain, S. (1991), 'Language acquisition in the absence of experience', Behavioral and Brain Sciences 14(4), 597–650.
DeCasper, A. J., Lecanuet, J. P., Busnel, M. C. & Maugeais, R. (1994), 'Fetal reactions to recurrent maternal speech', Infant Behavior & Development 17(2),
159–164.
Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D. & Plunkett, K. (1996), Rethinking Innateness: A Connectionist Perspective on Development, Vol. 1, MIT Press, Cambridge, MA.
Gibson, E. & Wexler, K. (1994), 'Triggers', Linguistic Inquiry 25(3), 407–454.
Gold, E. M. (1967), 'Language identification in the limit', Information and Control 10(5), 447–474.
Hawkins, J. A. (1994), A Performance Theory of Order and Constituency, Cambridge University Press, Cambridge.
Hudson, R. (1996), 'The difficulty of (so-called) self-embedded structures', UCL Working Papers in Linguistics 8(1), 283–314.
Kelly, M. H. (1992), 'Using sound to solve syntactic problems: The role of phonology in grammatical category assignments', Psychological Review 99(2), 349–364.
Kelly, M. H. & Martin, S. (1994), 'Domain-general abilities applied to domain-specific tasks: Sensitivity to probabilities in perception, cognition, and language', Lingua 92, 105–140.
Landau, B. & Gleitman, L. (1985), Language and Experience, Harvard University Press, Cambridge, MA.
Levin, B. (1993), English Verb Classes and Alternations, University of Chicago Press, Chicago.
Lust, B., Eisele, J. & Mazuka, R. (1992), 'The binding theory module: Evidence from first language acquisition for Principle C', Language 68(2), 333–358.
MacDonald, M. C., Pearlmutter, N. J. & Seidenberg, M. S. (1994), 'The lexical nature of syntactic ambiguity resolution', Psychological Review 101(4), 676–703.
Minnix, J. (1992), Fault tolerance of the backpropagation neural network trained on noisy inputs, in 'Proceedings of the IEEE International Joint Conference on Neural Networks', Baltimore, Maryland.
Saffran, J., Aslin, R. & Newport, E. (1996), 'Statistical learning by 8-month-old infants', Science 274(5294), 1926–1928.
Seidenberg, M. (1995), Visual word recognition, in J. Miller & P. Eimas, eds, 'Handbook of Perception & Cognition, Vol. 11: Speech, Language & Communication', Academic Press, San Diego.
Seidenberg, M. S. & McClelland, J. L. (1989), 'A distributed,
developmental model of word recognition and naming', Psychological Review 96(4), 523–568.
Wexler, K. & Culicover, P. (1980), Formal Principles of Language Acquisition, MIT Press, Cambridge, MA.