An Encyclopaedia of Language: Language and Computation
of observations or experimental subjects in which the members are more like each other than they are like members of other clusters. In some types of cluster analysis, a tree-like representation shows how tighter clusters combine to form looser aggregates, until at the topmost level all the observations belong to a single cluster. A further useful technique is multidimensional scaling, which aims to produce a pictorial representation of the relationships implicit in the (dis)similarity matrix. In factor analysis, a large number of variables can be reduced to just a few composite variables or ‘factors’. Discussion of various types of multivariate analysis, together with accounts of linguistic studies involving the use of such techniques, can be found in Woods et al. (1986).

The rather complex mathematics required by multivariate analysis means that such work is heavily dependent on the computer. A number of package programs are available for statistical analysis. Of these, almost certainly the most widely used is SPSS (Statistical Package for the Social Sciences), an extremely comprehensive suite of programs available, in various forms, for both mainframe and personal computers. An introductory guide to the system can be found in Norušis (1982), and a description of a version for the IBM PC in Frude (1987). The package will produce graphical representations of frequency distributions (the number of cases with particular values of certain variables), and a wide range of descriptive statistics. It will cross-tabulate data according to the values of particular variables, and perform chi-square tests of independence or association. A range of other non-parametric and parametric tests can also be requested, and multivariate analyses can be performed. Another statistical package which is useful for linguists is MINITAB (Ryan, Joiner and Ryan 1976). Although not as comprehensive as SPSS, MINITAB is rather easier to use, and the most recent version offers a range of basic statistical facilities which is likely to meet the requirements of much linguistic research. Examples of SPSS and MINITAB analyses of linguistic data can be found in Butler (1985b: 155–65), and MINITAB examples also in Woods et al. (1986: 309–13). Specific packages for multivariate analysis, such as MDS(X) and CLUSTAN, are also available.

3. THE COMPUTATIONAL ANALYSIS OF NATURAL LANGUAGE: METHODS AND PROBLEMS

3.1 The textual material

Text for analysis by the computer may be of various kinds, according to the application concerned. For an artificial intelligence researcher building a system which will allow users to interrogate a database, the text for analysis will consist only of questions typed in by the user. Stylisticians and lexicographers, however, may wish to analyse large bodies of literary or non-literary text, and those involved in machine translation are often concerned with the processing of scientific, legal or other technical material, again often in large quantities. For these and other applications the problem of getting large amounts of text into a form suitable for computational analysis is a very real one. As was pointed out in section 1.1, most textual materials have been prepared for automatic analysis by typing them in at a keyboard linked to a VDU. It is advisable to include as much information as is practically possible when encoding texts: arbitrary symbols can be used to indicate, for example, various functions of capitalisation, changes of typeface and layout, and foreign words.
To facilitate retrieval of locational information during later processing, references to important units (pages, chapters, acts and scenes of a play, and so on) should be included. Many word processing programs now allow the direct entry of characters with accents and other diacritics, in languages such as French or Italian. Languages written in non-Roman scripts may need to be transliterated before coding. Increasingly, use is being made of OCR machines such as the KDEM (see section 1.1), which will incorporate markers for font changes, though text references must be edited in during or after the input phase.

Archives of textual materials are kept at various centres, and many of the texts can be made available to researchers at minimal cost. A number of important corpora of English texts have been assembled: the Brown Corpus (Kučera and Francis 1967) consists of approximately 1 million words of written American English made up of 500 text samples from a wide range of material published in 1961; the Lancaster-Oslo-Bergen (LOB) Corpus (see e.g. Johansson 1980) was designed as a British English near-equivalent of the Brown Corpus, again consisting of 500 2,000-word texts written in 1961; the London-Lund Corpus (LLC) is based on the Survey of English Usage conducted under the direction of Quirk (see Quirk and Svartvik 1978). These corpora are available, in various forms, from the International Computer Archive of Modern English (ICAME) in Bergen. Parts of the London-Lund Corpus are available in book form (Svartvik and Quirk 1980). A very large corpus of English is being built up at the University of Birmingham for use in lexicography (see section 4.3) and other areas. The main corpus consists of 7.3 million words (6 million from a wide range of written varieties, plus 1.3 million words of non-spontaneous educated spoken English), and a supplementary corpus is also available, taking the total to some 20 million words. A 1-million-word corpus of materials for the teaching of English as a Foreign Language is also available. For a description of the philosophy behind the collection of the Birmingham Corpus see Renouf (1984, 1987). Descriptive work on these corpora will be outlined in section 4.1. Collections of texts are also available at the Oxford University Computing Service and at a number of other centres.

3.2 Computational analysis in relation to linguistic levels

Problems of linguistic analysis must ultimately be solved in terms of the machine’s ability to recognise a ‘character set’ which will include not only the upper and lower case letters of the Roman alphabet, punctuation marks and numbers, but also a variety of other symbols such as asterisks, percentage signs, etc. (see Chapter 20 below). It is therefore obvious that the difficulty of various kinds of analysis will depend on the ease with which the problems involved can be translated into terms of character sequences.

3.2.1 Graphological analysis

Graphological analyses, such as the production of punctuation counts, word-length and sentence-length profiles, and lists of word forms (i.e. items distinguished by their spelling), are obviously the easiest to obtain. Word forms may be presented as a simple list with frequencies, arranged in ascending or descending frequency order, or in alphabetical order starting from the beginning or end of the word. Alternatively, an index, giving locational information as well as frequency for each chosen word, can be obtained.
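Producing such lists is straightforward with present-day tools. The short sketch below is purely illustrative (it uses Python rather than the packages and languages discussed in this chapter, and the sample text is invented): it counts word forms and records, for each form, the numbers of the lines on which it occurs, which amounts to a frequency list plus a simple index.

```python
import re
from collections import Counter, defaultdict

def word_list_and_index(lines):
    """Return (frequency list, index) for a text supplied as a list of lines.
    A 'word form' here is simply a maximal run of letters, lower-cased."""
    freq = Counter()
    index = defaultdict(list)                  # word form -> line numbers
    for lineno, line in enumerate(lines, start=1):
        for form in re.findall(r"[a-z']+", line.lower()):
            freq[form] += 1
            index[form].append(lineno)
    return freq, index

text = [
    "The boy broke the window.",
    "The window broke.",
]
freq, index = word_list_and_index(text)
for form, n in freq.most_common():             # descending frequency order
    print(f"{form:<10}{n:>3}   lines {index[form]}")
```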
More information still is given by a concordance, which gives not only the location of each occurrence of a word in the text, but also a certain amount of context for each citation. Packages are available for the production of such output, the most versatile being the Oxford Concordance Program (OCP) (see Hockey and Marriott 1980), which runs on a wide range of mainframe computers and on the IBM PC and compatible machines. The CLOC program (see Reed 1977), developed at the University of Birmingham, also allows the user to obtain word lists, indexes and concordances, but is most useful for the production of lists of collocations, or co-occurrences of word forms. For a survey of both OCP and CLOC, with sample output, see Butler (1985a). Neither package produces word-length or sentence-length profiles, but these are easily programmed using a language such as SNOBOL.

3.2.2 Lexical analysis

So far, we have considered only the isolation of word forms, distinguished by consisting of unique sequences of characters. Often, however, the linguist is interested in the occurrence and frequency of lexemes, or ‘dictionary words’ (e.g. RUN), rather than of the different forms which such lexemes can take (e.g. run, runs, ran, running). Computational linguists refer to lexemes as lemmata, and the process of combining morphologically related word forms into a lemma is known as lemmatisation. Lemmatisation is one of the major problems of computational text analysis, since it requires detailed specification of morphological and spelling rules; nevertheless, substantial progress has been made for a number of languages (see also section 3.2.4). A related problem is that of homography, the existence of words which belong to different lemmata but are spelt in the same way. These problems will be discussed further in relation to lexicography in section 4.3.

3.2.3 Phonological analysis

The degree of success achievable in the automatic segmental phonological analysis of texts depends on the ability of linguists to formulate explicitly the correspondences between functional sound units (phonemes) and letter units (graphemes) (on which see Chapter 20 below). Some languages, such as Spanish and Czech, have rather simple phoneme-grapheme relationships; others, including English, present more difficulties because of the many-to-many relationships between sounds and letters. Some success is being achieved, as the feasibility of systems for the conversion of written text to synthetic speech is investigated (see section 4.5.5). For a brief non-technical account see Knowles (1986).

Work on the automatic assignment of intonation contours while processing written-to-be-spoken text is currently in progress in Lund and in Lancaster. The TESS (Text Segmentation for Speech) project in Lund (Altenberg 1986, 1987; Stenström 1986) aims to describe the rules which govern the prosodic segmentation of continuous English speech. The analysis is based on the London-Lund Corpus of Spoken English (see section 4.1), in which tone units are marked. The automatic intonation assignment project in Lancaster (Knowles and Taylor 1986) has similar aims, but is based on a collection of BBC sound broadcasts. Work on the automatic assignment of stress patterns will be discussed in relation to stylistic analysis in section 4.2.1.
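To make the idea of explicit phoneme-grapheme correspondence rules concrete, here is a toy sketch, not taken from any of the systems mentioned above: a handful of ordered, context-sensitive letter-to-sound rules with a Castilian Spanish flavour, applied longest-match first. The rule set and phoneme symbols are drastically simplified assumptions for illustration only.

```python
# A toy, drastically simplified grapheme-to-phoneme converter with a Castilian
# Spanish flavour.  Real systems need far richer rule sets (stress, dialect
# variation, exceptions), but the shape of the problem is the same: ordered,
# context-sensitive letter-to-sound rules applied longest-match-first.
RULES = [
    # (grapheme, letters allowed to follow (or None for any), phoneme)
    ("ch", None, "tʃ"),
    ("ll", None, "ʎ"),
    ("qu", "ei", "k"),
    ("gu", "ei", "g"),
    ("c",  "ei", "θ"),
    ("c",  None, "k"),
    ("g",  "ei", "x"),
    ("z",  None, "θ"),
    ("j",  None, "x"),
    ("ñ",  None, "ɲ"),
    ("v",  None, "b"),
    ("h",  None, ""),     # orthographic h is silent
]

def to_phonemes(word):
    word, out, i = word.lower(), [], 0
    while i < len(word):
        for graph, context, phon in RULES:
            if word.startswith(graph, i):
                nxt = word[i + len(graph): i + len(graph) + 1]
                if context is None or nxt in set(context):
                    if phon:
                        out.append(phon)
                    i += len(graph)
                    break
        else:                      # no rule applied: the letter maps to itself
            out.append(word[i])
            i += 1
    return out

for w in ["queso", "cielo", "gente", "chico", "hola"]:
    print(w, "→", " ".join(to_phonemes(w)))
# queso → k e s o   cielo → θ i e l o   gente → x e n t e
# chico → tʃ i k o  hola → o l a
```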
3.2.4 Syntactic analysis

A brief review of syntactic parsing can be found in de Roeck (1983), and more detailed accounts in Winograd (1983), Harris (1985) and Grishman (1986); particular issues are addressed in more detail in various contributions to King (1983a), Sparck Jones and Wilks (1983) and Dowty et al. (1985). The short account of natural language processing by Gazdar and Mellish (1987) is also useful.

The first stage in parsing a sentence is a combination of morphological analysis (to distinguish the roots of the word forms from any affixes which may be present) and the looking up of the roots in a machine dictionary. An attempt is then made to assign one or more syntactic structures to the sentence on the basis of a grammar. The earliest parsers, developed in the late 1950s and early 1960s, were based on context-free phrase structure grammars, consisting of sets of rules in which ‘non-terminal’ symbols representing particular categories are rewritten in terms of other categories, and eventually in terms of ‘terminal’ symbols for actual linguistic items, with no restriction on the syntactic environment in which the reformulation can occur. For instance, a simple (indeed, over-simplified) context-free grammar for a fragment of English might include the following rewrite rules:

S → NP VP
NP → Art N
VP → V NP
VP → V
V → broke
N → boy
N → window
Art → the

where S is a ‘start symbol’ representing a sentence, NP a noun phrase, VP a verb phrase, N a noun, V a verb, and Art an article. Such a grammar could be used to assign structures to sentences such as The boy broke the window or The window broke, these structures commonly being represented in tree form as illustrated below.

We may use this tree to illustrate the distinction between ‘top-down’ or ‘hypothesis-driven’ parsers and ‘bottom-up’ or ‘data-driven’ parsers. A top-down parser starts with the hypothesis that we have an S, then moves through the set of rules, using them to expand one constituent at a time until a terminal symbol is reached, then checking whether the data string matches this symbol. In the case of the above sentence, the NP symbol would be expanded as Art N, and Art as the, which does match the first word of the string, so allowing the part of the tree corresponding to this word to be constructed. If N is expanded as boy, this also matches, so that the parser can now go on to the VP constituent, and so on. A bottom-up parser, on the other hand, starts with the terminal symbols and attempts to combine them. It may start from the left (finding that the Art the and the N boy combine to give an NP, and so on), or from the right. Some parsers use a combination of approaches, in which the bottom-up method is modified by reference to precomputed sets of tables showing combinations of symbols which can never lead to useful higher constituents, and which can therefore be blocked at an early stage.

A further important distinction is that between non-deterministic and deterministic parsing. Consider the sentence Steel bars reinforce the structure. Since bars can be either a noun or a verb, the computer must make a decision at this point. A non-deterministic parser accepts that multiple analyses may be needed in order to resolve such problems, and may tackle the situation in either of two basic ways.
In a ‘depth-first’ search, one path is pursued first, and if this meets with failure, backtracking occurs to the point where a wrong choice was made, in order to pursue a second path. Such backtracking involves the undoing of any structures which have been built up while the incorrect path was being followed, and this means that correct partial structures may be lost and built up again later. To prevent this, well-formed partial structures may be stored in a ‘chart’ for use when required. An alternative to depth-first parsing is the ‘breadth-first’ method, in which all possible paths are pursued in parallel, so obviating the need for backtracking. If, however, the number of paths is considerable, this method may lead to a ‘combinatorial explosion’ which makes it uneconomic; furthermore, many of the constituents built will prove useless. Deterministic parsers (see Sampson 1983a) attempt to ensure that only the correct analysis for a given string is undertaken. This is achieved by allowing the parser to look ahead, storing information on a small number of constituents beyond the one currently being analysed. (See Chapter 10, section 2.1, above.)

Let us now return to the use of particular kinds of grammar in parsing. Difficulties with context-free parsers led the computational linguists of the mid and late 1960s to turn to Chomsky’s transformational generative (TG) grammar (see Chomsky 1965), which had a context-free phrase structure ‘base’ component, plus a set of rules for transforming base (‘deep structure’) trees into other trees, and ultimately into trees representing the ‘surface’ structures of sentences. The basic task of a transformational parser is to undo the transformations which have operated in the generation of a sentence. This is by no means a trivial job: since transformational rules interact, it cannot be assumed that the rules for generation can simply be reversed for analysis; furthermore, deletion rules in the forward direction cause problems, since in the reverse direction there is no indication of what should be inserted (see King 1983b for further discussion).

Faced with the problems of transformational parsing, the computational linguists of the 1970s began to examine the possibility of returning to context-free grammars, but augmenting them to overcome some of their shortcomings. The most influential of these types of grammar was the Augmented Transition Network (ATN) framework developed by Woods (1970). An ATN consists of a set of ‘nodes’ representing the states in which the system can be, linked by ‘arcs’ representing transitions between the states, and leading ultimately to a ‘final state’. A brief, clear and non-technical account of ATNs can be found in Ritchie and Thompson (1984), from which the following example is taken. The label on each arc consists of a test and an action to be taken if that test is passed: for instance, the arc leading from state NP0 specifies that if the next word to be analysed is a member of the Article category, NP-Action 1 is to be performed, and a move to state NP1 is to be made. The tests and actions can be much more complicated than these examples suggest: for instance, a match for a phrasal category (e.g. NP) can be specified, in which case the current state of the network is ‘pushed’ on to a data structure known as a ‘stack’, and a subnetwork for that particular type of phrase is activated. When the subnetwork reaches its final state, a return is made to the main network.
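As an illustration of the top-down, depth-first strategy with backtracking, the following sketch (in Python, and not taken from the chapter or from any of the systems it describes) parses the toy grammar of section 3.2.4 by expanding one rewrite rule at a time and abandoning any expansion that fails to match the input; because every successful analysis is returned, ambiguity simply shows up as more than one tree.

```python
# A minimal sketch of a top-down, depth-first parser with backtracking for the
# toy grammar given above.  Failed expansions are abandoned and the next
# rewrite rule is tried; each complete analysis is yielded as a nested tree.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Art", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "V":   [["broke"]],
    "N":   [["boy"], ["window"]],
    "Art": [["the"]],
}

def parse(symbol, words, start):
    """Yield (tree, next_position) for every way `symbol` can cover a prefix
    of words[start:].  Symbols with no rewrite rules are terminals."""
    if symbol not in GRAMMAR:
        if start < len(words) and words[start] == symbol:
            yield symbol, start + 1
        return
    for expansion in GRAMMAR[symbol]:          # hypothesis-driven: expand first,
        for kids, end in parse_seq(expansion, words, start):   # check words later
            yield (symbol, kids), end

def parse_seq(symbols, words, start):
    """Yield (subtrees, next_position) for a sequence of symbols, left to right."""
    if not symbols:
        yield [], start
        return
    for subtree, mid in parse(symbols[0], words, start):
        for rest, end in parse_seq(symbols[1:], words, mid):
            yield [subtree] + rest, end

def parse_sentence(sentence):
    words = sentence.lower().split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]

for tree in parse_sentence("The boy broke the window"):
    print(tree)   # a nested (category, [children]) structure standing in for the tree diagram
```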
Values relevant to the analysis (for instance, yes/no values reflecting the presence or absence of particular features, or partial structures) may be stored in a set of ‘registers’ associated with the network, and the actions specified on arcs may relate to the changing of these values. ATNs have formed the basis of many of the syntactic parsers developed in recent years, and may also be used in semantic analysis (see section 3.2.5).

Recently, context-free grammars have attracted attention again within linguistics, largely due to the work of Gazdar and his colleagues on a model known as Generalised Phrase Structure Grammar (GPSG) (see Gazdar et al. 1985). Unlike Chomsky, Gazdar believes that context-free grammars are adequate as models of human language. This claim, and its relevance to parsing, is discussed by Sampson (1983b). A parser which will analyse text using a user-supplied GPSG and a dictionary has been described by Phillips and Thompson (1985).

3.2.5 Semantic analysis

For certain kinds of application (e.g. for some studies in stylistics) a semantic analysis of a text may consist simply of isolating words from particular semantic fields. This can be done by manual scanning of a word list for appropriate items, perhaps followed by the production of a concordance. In other work, use has been made of computerised dictionaries and thesauri for sorting word lists into semantically based groupings. More will be said about these analyses in section 4.2.

For many applications, however, a highly selective semantic analysis is insufficient. This is particularly true of work in artificial intelligence, where an attempt is made to produce computer programs which will ‘understand’ natural language, and which therefore need to perform detailed and comprehensive semantic analysis. Three approaches to the relationship between syntactic and semantic analysis can be recognised. One approach is to perform syntactic analysis first, followed by a second pass which converts the syntactic tree to a semantic representation. The main advantage of this approach is that the program can be written as separate modules for the two kinds of analysis, with no need for a complex control structure to integrate them. On the negative side, however, this is implausible as a model of human processing. Furthermore, it denies the possibility of using semantic information to guide syntactic analysis where the latter could give rise to more than one interpretation.

A second approach is to minimise syntactic parsing and to emphasise semantic analysis. This approach can be seen in some of the parsers of the late 1960s and 1970s, which make no distinction between the two types of analysis. One form of knowledge representation which proved useful in these ‘homogeneous’ systems is the conceptual dependency framework of Schank (1972). This formalism uses a set of putatively universal semantic primitives, including a set of actions, such as transfer of physical location, transfer of a more abstract kind, movement of a body part by its owner, and so on, out of which representations of more complex actions can be constructed. Actions, objects and their modifiers can also be related by a set of dependencies. Conceptualisations of events can be modified by information relating to tense, mood, negativity, etc.
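The flavour of such a representation can be suggested by a small, hypothetical sketch, far simpler than Schank’s own notation; the slot names below are invented, and ATRANS is the conceptual dependency primitive for transfer of an abstract relationship such as possession.

```python
# A hypothetical, highly simplified conceptual-dependency-style structure for
# "John gave Mary a book".  The slot names are invented stand-ins for the full
# dependency notation; ATRANS marks transfer of an abstract relationship
# (here, possession).
give_event = {
    "primitive": "ATRANS",
    "actor":     "JOHN",
    "object":    "BOOK",
    "from":      "JOHN",
    "to":        "MARY",
    "tense":     "past",
    "mode":      "affirmative",   # negation, mood, etc. would be recorded here
}

# "Mary took the book from John" would share the same ATRANS structure, with
# MARY as actor -- the kind of generalisation the primitives are meant to capture.
take_event = dict(give_event, actor="MARY")
print(give_event["primitive"], give_event["to"])
```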
A further type of homogeneous analyser is based on the ‘preference semantics’ of Wilks (1975), in which semantic restrictions between items are treated not as absolute, but in terms of preference. For instance, although the verb eat preferentially takes an animate subject, inanimate ones are not ruled out (as in My printer just eats paper). Wilks’s system, like Schank’s, uses a set of semantic primitives. These are grouped into trees, giving a formula for each word sense. Sentences for analysis are fragmented into phrases, which are then matched against a set of templates made up of the semantic primitives. When a match is obtained, the template is filled, and links are then sought between these filled templates in order to construct a semantic representation for the whole sentence. Burton (1976) and Woods et al. (1976) proposed the use of Augmented Transition Networks for semantic analysis. In such systems, the arcs and nodes of an ATN can be labelled with semantic as well as syntactic categories, and thus represent a kind of ‘semantic grammar’ in which the two types of patterning are mixed.

A third approach is to interleave semantic analysis with syntactic parsing. The aim of such systems is to prevent the fruitless building of structures which would prove semantically unacceptable, by allowing some form of semantic feedback to the parsing process.

3.2.6 From sentence analysis to text analysis

So far, we have dealt only with the analysis of sentences. Clearly, however, the meaning of a text is more than the sum of the meanings of its individual sentences. To understand a text, we must be able to make links between sentence meanings, often over a considerable distance. This involves the resolution of anaphora (for instance, the determination of the correct referent for a pronoun), a problem which can occur even in the analysis of individual sentences, and which is discussed from a computational perspective by Grishman (1988:124–34). It also involves a good deal of inferencing, during which human beings call upon their knowledge of the world. One of the most difficult problems in the computational processing of natural language texts is how to represent this knowledge in such a way that it will be useful for analysis. We have already met two kinds of knowledge representation formalism: conceptual dependency and semantic ATNs. In recent years, other types of representation have become increasingly important; some of these are discussed below.

A knowledge representation structure known as the frame, introduced by Minsky (1975), makes use of the fact that human beings normally assimilate information in terms of a prototype with which they are familiar. For instance, we have internalised representations of what for us is a prototypical car, house, chair, room, and so forth. We also have prototypes for situations, such as buying a newspaper. Even in cases where a particular object or situation does not exactly fit our prototype (e.g. a car with three wheels instead of four), we are still able to conceptualise it in terms of deviations from the norm. Each frame has a set of slots which specify properties, constituents, participants, etc., whose values may be numbers, character strings or other frames. The slots may be associated with constraints on what type of value may occur there, and there may be a default value which is assigned when no value is provided by the input data.
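The slot-and-default idea can be sketched as follows; the frame, its slot names and its default values are invented for the illustration and are not taken from Minsky or from any particular system.

```python
# An illustrative frame for a prototypical car: each slot has a simple type
# constraint and an optional default that is used when the input text supplies
# no value.  Slot names and defaults are invented for the example.
CAR_FRAME = {
    "wheels":         {"type": int,  "default": 4},
    "steering_wheel": {"type": bool, "default": True},
    "colour":         {"type": str,  "default": None},   # no default: stays unknown
    "owner":          {"type": str,  "default": None},
}

def instantiate(frame, observed):
    """Fill a frame from whatever the text yielded, checking the type
    constraints and falling back on defaults for unfilled slots."""
    instance = {}
    for slot, spec in frame.items():
        if slot in observed:
            value = observed[slot]
            if not isinstance(value, spec["type"]):
                raise TypeError(f"slot {slot!r} expects {spec['type'].__name__}")
            instance[slot] = value
        else:
            instance[slot] = spec["default"]
    return instance

# The text mentioned only a red car, yet the instance still "knows" that it
# has four wheels and a steering wheel -- information supplied by the frame.
print(instantiate(CAR_FRAME, {"colour": "red"}))
```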
This means that a frame can provide information which is not actually present in the text to be analysed, just as a human processor can assume, for example, that a particular car will have a steering wheel, even though (s)he may not be able to see it from where (s)he is standing. Analysis of a text using frames requires that a semantic analysis be performed in order to extract actions, participants, and the like, which can then be matched against the stored frame properties. If a frame appears to be only partially applicable, those parts which do match can be saved, and stored links between frames may suggest new directions to be explored.

Scripts, developed by Schank and his colleagues at Yale (see Schank and Abelson 1975, 1977), are in some ways similar to frames, but are intended to model stereotyped sequences of events in narratives. For instance, when we go to a restaurant, there is a typical sequence of events, involving entering the restaurant, being seated, ordering, getting the food, eating it, paying the bill and leaving. As with frames, the presence of particular types of people and objects, and the occurrence of certain events, can be predicted even if not explicitly mentioned in the text. Like frames, scripts consist of a set of slots for which values are sought, default values being available for at least some slots. The components of a script are of several kinds: a set of entry conditions which must be satisfied if the script is to be activated; a result which will normally ensue; a set of props representing objects typically involved; and a set of roles for the participants in the sequence of events. The script describes the sequence of events in terms of ‘scenes’ which, in Schank’s scheme, are specified in conceptual dependency formalism. The scenes are organised into ‘tracks’, representing subtypes of the general type of script (e.g. going to a coffee bar as opposed to an expensive restaurant). There may be a number of alternative paths through such a track.

Scripts are useful only in situations where the sequence of events is predictable from a stereotype. For the analysis of novel situations, Schank and Abelson (1977) proposed the use of ‘plans’ involving means-ends chains. A plan consists of an overall goal, alternative sequences of actions for achieving it, and preconditions for applying the particular types of sequence. More recently, Schank has proposed that scripts should be broken down into smaller units (memory organisation packets, or MOPs) in such a way that similarities between different scripts can be recognised. Other developments include the work of Lehnert (1982) on plot units, and of Sager (1978) on a columnar ‘information format’ formalism for representing the properties of texts in particular fields (such as subfields of medicine or biochemistry) where the range of semantic relations is often rather restricted.

So far, we have concentrated on the analysis of language produced by a single communicator. Obviously, however, it is important for natural language understanding systems to be able to deal with dialogue, since many applications involve the asking of questions and the giving of replies. As Grishman (1986:154) points out, the easiest such systems to implement are those in which either the computer or the user has unilateral control over the flow of the discourse.
For instance, the computer may ask the user to supply information which is then added to a data base; or the user may interrogate a data base held in the computer system. In such situations, the computer can be programmed to know what to expect. The more serious problems arise when the machine has to be able to adapt to a variety of linguistic tactics on the part of the user, such as answering one question with another. Some ‘mixed-initiative’ systems of this kind have been developed, and one will be mentioned in section 4.5.3. One difficult aspect of dialogue analysis is the indirect expression of communicative intent, and it is likely that work by linguists and philosophers on indirect speech acts (see Grice 1975, Searle 1975, and Chapter 6 above) will become increasingly important in computational systems (Allen and Perrault 1980).

4. USES OF COMPUTATIONAL LINGUISTICS

4.1 Corpus linguistics

There is a considerable and fast-growing body of work in which text corpora are being used in order to find out more about language itself. For a long time linguistics has been under the influence of a school of thought which arose in connection with the ‘Chomskyan revolution’ and which regards corpora as inappropriate sources of data, because of their finiteness and degeneracy. However, as Aarts and van den Heuvel (1985) have persuasively argued, the standard arguments against corpus linguistics rest on a misunderstanding of the nature and current use of corpus studies. Present-day corpus linguists proceed in the same manner as other linguists in that they use intuition, as well as the knowledge about the language which has been accumulated in prior studies, in order to formulate hypotheses about language; but they go beyond what many others attempt, in testing the validity of their hypotheses on a body of attested linguistic data.

The production of descriptions of English has been furthered recently by the automatic tagging of the large corpora mentioned in section 3.1 with syntactic labels for each word. The Brown Corpus, tagged using a system known as TAGGIT, was later used as a basis for the tagging of the LOB Corpus. The LOB tagging programs (see Garside and Leech 1982; Leech, Garside and Atwell 1983; Garside 1987) use a combination of wordlists, suffix removal and special routines for numbers, hyphenated words and idioms, in order to assign a set of possible grammatical tags to each word. Selection of the ‘correct’ tag from this set is made by means of a ‘constituent likelihood grammar’ (Atwell 1983, 1987), based on information, derived from the Brown Corpus, on the transitional probabilities of all possible pairs of successive tags. A success rate of 96.5–97 per cent has been claimed. Possible future developments include the use of tag probabilities calculated for particular types of text; the manual tagging of the corpus with sense numbers from the Longman Dictionary of Contemporary English is already under way.

The suite of programs used for the tagging of the London-Lund Corpus of Spoken English (Svartvik and Eeg-Olofsson 1982, Eeg-Olofsson and Svartvik 1984, Eeg-Olofsson 1987, Altenberg 1987, Svartvik 1987) first splits the material up into tone units, then analyses these at word, phrase, clause and discourse levels. Word class tags are assigned by means of an interactive program using lists of high-frequency words and of suffixes, together with probabilities of tag sequences.
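The underlying idea of choosing among candidate tags by means of tag-sequence probabilities can be illustrated with a small sketch. The miniature lexicon, the probability values and the search procedure below are invented for the example (the tag names loosely follow Brown/LOB conventions); the real systems derived their figures from the tagged Brown Corpus.

```python
# A toy illustration of tag selection by transitional probabilities between
# successive tags, using the example sentence from section 3.2.4.
LEXICON = {              # candidate tags for each word form
    "steel": ["NN"],
    "bars": ["NNS", "VBZ"],
    "reinforce": ["VB"],
    "the": ["AT"],
    "structure": ["NN"],
}
TRANS = {                # P(next tag | current tag), illustrative values only
    ("START", "NN"): 0.4, ("START", "AT"): 0.5,
    ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.2, ("NN", "NN"): 0.1,
    ("NNS", "VB"): 0.4, ("VBZ", "VB"): 0.01,
    ("VB", "AT"): 0.5, ("AT", "NN"): 0.7,
}

def best_tags(words):
    """Keep, for every candidate tag of the current word, the most probable
    tag sequence reaching it (a Viterbi-style search over tag bigrams)."""
    paths = {"START": (1.0, [])}
    for word in words:
        new_paths = {}
        for tag in LEXICON[word.lower()]:
            new_paths[tag] = max(
                ((p * TRANS.get((prev, tag), 1e-6), seq + [tag])
                 for prev, (p, seq) in paths.items()),
                key=lambda item: item[0],
            )
        paths = new_paths
    return max(paths.values(), key=lambda item: item[0])[1]

print(best_tags("Steel bars reinforce the structure".split()))
# -> ['NN', 'NNS', 'VB', 'AT', 'NN']: 'bars' is resolved to a plural noun
```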
A set of ordered, cyclical rules assigns phrase tags, and these are then given clause function labels (Subject, Complement, etc.). Discourse markers, after marking at word level, are treated separately. These tagged corpora have been used for a wide variety of analyses, including work on relative clauses, verb-particle combinations, ellipsis, genitives in -s, modals, connectives in object noun clauses, negation, causal relations and contrast, topicalisation, discourse markers, etc. Accounts of these and other studies can be found in Johansson (1982), Aarts and Meijs (1984, 1986), Meijs (1987) and various volumes of ICAME News, produced by the International Computer Archive of Modern English in Bergen.

4.2 Stylistics

Enkvist (1964) has highlighted the essentially quantitative nature of style, regarding it as a function of the ratios between the frequencies of linguistic phenomena in a particular text or text type and their frequencies in some contextually related norm. Critics have at times been rather sceptical of statistical studies of literary style, on the grounds that simply counting linguistic items can never capture the essence of literature in all its creativity. Certainly the ability of the computer to process vast amounts of data and produce simple or sophisticated statistical analyses can be a danger if such analyses are viewed as an end in themselves. If, however, we insist that quantitative studies should be closely linked with literary interpretation, then automated analysis can be a most useful tool in obtaining evidence to reject or support the stylistician’s subjective impressions, and may even reveal patterns which were not previously recognised and which may have some literary validity, permitting an enhanced rereading of the text. Since the style of a text can be influenced by many factors, the choice of appropriate text samples for study is crucial, especially in comparative studies. For an admirably sane treatment of the issue of quantitation in the study of style see Leech and Short (1981), and for a discussion of difficulties in achieving a synthesis of literary criticism and computing see Potter (1988).

Computational stylistics can conveniently be discussed under two headings: firstly ‘pure’ studies, in which the object is simply to investigate the stylistic traits of a text, an author or a genre; and secondly ‘applied’ studies, in which similar techniques are used with the aim of resolving problems of authorship, chronology or textual integrity. The literature in this field is very extensive, and only the principles, together with a few selected examples, are discussed below.

4.2.1 ‘Pure’ computational stylistics

Many studies in ‘pure’ computational stylistics have employed word lists, indexes or concordances, with or without lemmatisation. Typical examples are: Adamson’s (1977, 1979) study of the relationship of colour terms to characterisation and psychological factors in Camus’s L’Etranger; Burrows’s (1986) extremely interesting and persuasive analysis of modal verb forms in relation to characterisation, the distinction between narrative and dialogue, and different types of narrative, in the novels of Jane Austen; and Burrows’s later (1987) wide-ranging computational and statistical study of Austen’s style.
Word lists have also been used to investigate the type-token ratio (the ratio of the number of different words to the total number of running words), which can be valuable as an indicator of the vocabulary richness of texts (that is, the extent to which an author uses new words rather than repeating ones which have already been used). Word and sentence length profiles have also been found useful in stylistics, and punctuation analysis can also provide valuable information, provided that the possible effects of editorial changes are borne in mind. For an example of the use of a number of these techniques see Butler (1979) on the evolution of Sylvia Plath’s poetic style.

Computational analysis of style at the phonological level is well illustrated by Logan’s work on English poetry. Logan (1982) built up a phonemic dictionary by entering transcriptions manually for one text, then using the results to process a further text, adding any additional codings which were necessary, and so on. The transcriptions so produced acted as a basis for automatic scansion. Logan (1976, 1985) has also studied the ‘sound texture’ of poetry by classifying each phoneme with a set of binary distinctive features. These detailed transcriptions were then analysed to give frequency lists of sounds, lists of lines with repeated sounds, percentages of the various distinctive features in each line of poetry, and so on. Sounds were also placed on a number of scales of ‘sound colour’, such as hardness vs. softness, sonority vs. thinness, openness vs. closeness, backness vs. frontness (on which see Chapters 1 and 2 above), and lines of poetry, as well as whole poems, were then assigned overall values for each scale, which were correlated with literary interpretations. Alliteration and stress assignment programs have been developed for Old English by Hidley (1986).

Much computational stylistic analysis involving syntactic patterns has employed manual coding of syntactic categories, the computer being used merely for the production of statistical information. A recent example is Birch’s (1985) study of the works of Thomas More, in which it was shown that scores on a battery of syntactic variables correlated with classifications based on contextual and bibliographical criteria. Other studies have used the EYEBALL syntactic analysis package written by Ross and Rasche (see Ross and Rasche 1972), which produces information on word classes and functions, attempts to parse sentences, and gives tables showing the number of syllables per word, words per sentence, type/token ratio, etc. Jaynes (1980) used EYEBALL to produce word class data on samples from the early, middle and late output of Yeats, and to show that, contrary to much critical comment, the evolution in Yeats’s style seems to be more lexical than syntactic.

Increasingly, computational stylistics is making use of recent developments in interactive syntactic tagging and parsing techniques. For instance, the very impressive work of Hidley (1986), mentioned earlier in relation to phonological analysis of Old English texts, builds in a system which suggests to the user tags based on a number of phonological, morphological and syntactic rules. Hidley’s suite of programs also generates a database containing the information gained from the lexical, phonological and syntactic analysis of the text, and allows the exploration of this database in a flexible way, to isolate combinations of features and plot the correlations between them.
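The simpler of these measures are easy to compute. The sketch below is illustrative only (it is not part of any of the packages mentioned in this chapter): it derives a type-token ratio and a sentence-length profile from a sample of text.

```python
import re
from collections import Counter

def stylometric_profile(text):
    """Type-token ratio and sentence-length profile for a text sample.
    Sentence boundaries are crudely approximated by ., ! and ?; a real study
    would need a more careful treatment of abbreviations and the like."""
    tokens = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    ttr = len(set(tokens)) / len(tokens)               # type-token ratio
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return ttr, Counter(lengths)                       # sentence length -> count

sample = "The boy broke the window. The window broke. Then he ran."
ttr, profile = stylometric_profile(sample)
print(f"type-token ratio: {ttr:.2f}")
print("sentence-length profile:", dict(sorted(profile.items())))
```

In a real study the tokenisation and the treatment of sentence boundaries would of course need much more care, and, as Enkvist’s formulation implies, the resulting figures only become meaningful when compared with those for a contextually related norm.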
Although, as we have seen, much work on semantic patterns in literary texts has used simple graphologically-based tools such as word lists and concordances, more ambitious studies can also be found. A recent example is Martindale’s (1984) work on poetic texts, which makes use of a semantically-based dictionary for the analysis of thematic patterns. In such work, as in, for instance, the programs devised by Hidley, the influence of artificial intelligence techniques begins to emerge. Further developments in this area will be outlined in section 4.5.4.

4.2.2 ‘Applied’ computational stylistics

The ability of the computer to produce detailed statistical analyses of texts is an obvious attraction for those interested in solving problems of disputed authorship and chronology in literary works. The aim in such studies is to isolate textual features which are characteristic of an author (or, in the case of chronology, particular periods in the author’s output), and then to apply these ‘fingerprints’ to the disputed text(s). Techniques of this kind, though potentially very powerful, are, as we shall see, fraught with pitfalls for the unwary, since an author’s style may be influenced by a large number of factors other than his or her own individuality.

Two basic approaches to authorship studies can be discerned: tests based on word and/or sentence length, and those concerned with word frequency. Some studies have combined the two types of approach. Methods based on word and sentence length have been reviewed by Smith (1983), who concludes that word length is an unreliable predictor of authorship, but that sentence length, although not a strong measure, can be a useful adjunct to other methods, provided that the punctuation of the text can safely be assumed to be original, or that all the texts under comparison have been prepared by the same editor. The issue of punctuation has been one source of controversy in the work of Morton (1965), who used differences in sentence length distribution as part of the evidence for his claim that only four of the fourteen ‘Pauline’ epistles in the New Testament were probably written by Paul, the other ten being the work of at least six other authors. It was pointed out by critics, however, that it is difficult to know what should be taken as constituting a sentence in Greek prose. Morton (1978:99–100) has countered this criticism by claiming that editorial variations cause no statistically significant differences which would lead to the drawing of incorrect conclusions. Morton’s early work on Greek has been criticised on other grounds too: he attempts to explain away exceptions by means of the kinds of subjective argument which his method is meant to make unnecessary; and it is claimed that the application of his techniques to certain other groups of texts can be shown to give results which are contrary to the historical and theological evidence.

Let us turn now to studies in which word frequency is used as evidence for authorship. The simplest case is where one of the writers in a pair of possible candidates can be shown to use a certain word, whereas the other does not. For instance, Mosteller and Wallace (1964), in their study of The Federalist papers, a set of eighteenth-century propaganda documents, showed that certain words, such as enough, upon and while, occurred quite frequently in undisputed works by one of the possible authors, Hamilton, but were rare or non-existent in the work of the other contender, Madison.
Investigation of the disputed papers revealed Madison as the more likely author on these grounds. It might be thought that the idiosyncrasies of individual writers would be best studied in the ‘lexical’ or ‘content’ words they use. Such an approach, however, holds a number of difficulties for the computational stylistician. Individual lexical items often occur with frequencies which are too low for reliable statistical analysis. Furthermore, the content vocabulary is obviously strongly conditioned by the subject matter of the writing. In view of these difficulties, much recent work has concentrated on the high-frequency grammatical words, on the grounds that these are not only more amenable to statistical treatment, but are also less dependent on subject matter and less under the conscious control of the writer than the lexical words.

Morton has also argued for the study of high-frequency individual items, as well as word classes, in developing techniques of ‘positional stylometry’, in which the frequencies of words are investigated, not simply for texts as wholes, but for particular positions in defined units within the text. A detailed account of Morton’s methods and their applicability can be found in Morton (1978), in which, in addition to examining word frequencies at particular positions in sentences (typically the first and last positions), he claims discriminatory power for ‘proportional pairs’ of words (e.g. the frequency of no divided by the total frequency for no and not, or that divided by that plus this), and also for collocations of contiguous words or word classes, such as as if, and the, or a plus an adjective. Comparisons between texts are made by means of the chi-square test. Morton applies these techniques to the Elizabethan drama Pericles, providing evidence against the critical view that only part of it is by Shakespeare. Morton also discusses the use of positional stylometry to aid in the assessment of whether a statement made by a defendant in a legal case was actually made in his or her own words.

Morton’s methods have been taken up by others, principally in the area of Elizabethan authorship: for instance, a lively and inconclusive debate has recently taken place between Merriam (1986, 1987) and Smith (1986, 1987) on the authorship of Henry VIII and of Sir Thomas More. Despite Smith’s reservations about the applicability of the techniques as used by Morton and Merriam, he does believe that an expansion of these methods to include a wider range of tests could be a valuable preliminary step to a detailed study of authorship. Recently, Morton (1986) has claimed that the number of words occurring only once in a text (the ‘hapax legomena’) is also useful in authorship determination.

So far, we have examined the use of words at the two ends of the frequency spectrum. Ule (1983) has developed methods for authorship study which make use of the wider vocabulary structure of texts. One useful measure is the ‘relative vocabulary overlap’ between texts, defined as the ratio of the actual number of words the texts have in common to the number which would be expected if the texts had been composed by drawing words at random from the whole of the author’s published work (or some equivalent corpus of material).
A second technique is concerned with the distribution of words which appear in only one of a set of texts, and a further method is based on a procedure which allows the calculation of the expected number of word types for texts of given length, given a reference corpus of the author’s works. These methods proved useful in certain cases of disputed Elizabethan authorship.

As a final example of authorship attribution, we shall examine briefly an extremely detailed and meticulous study, by Kjetsaa and his colleagues, of the charge of plagiarism levelled at Sholokhov by a Soviet critic calling himself D*. A detailed account of this work can be found in Kjetsaa et al. (1984). D*’s claim, which was supported in a preface by Solzhenitsyn and had a mixed critical reaction, was that the acclaimed novel The Quiet Don was largely written not by Sholokhov but by a Cossack writer, Fedor Kryukov. Kjetsaa’s group set out to provide stylometric evidence which might shed light on the matter. Two pilot studies on restricted samples suggested that stylometric techniques would indeed differentiate between the two contenders, and that The Quiet Don was much more likely to be by Sholokhov than by Kryukov. The main study, using much larger amounts of the disputed and reference texts, bore out the predictions of the pilot work, by demonstrating that The Quiet Don differed significantly from Kryukov’s writings, but not from those of Sholokhov, with respect to sentence length profile, lexical profile, type-token ratio (on both lemmatised and unlemmatised text, very similar results being obtained in each case), and word class sequences, with additional suggestive evidence from collocations.

4.3 Lexicography and lexicology

In recent years, the image of the traditional lexicographer, poring over thousands of slips of paper neatly arranged in seemingly countless boxes, has receded, to be replaced by that of the ‘new lexicographer’, making full use of computer technology. We shall see, however, that the skills of the human expert are by no means redundant, and Chapter 19, below, should be read in this connection. The theories which lexicographers make use of in solving their problems are sometimes said to belong to the related field of lexicology, and here too the computer has had a considerable impact.

The first task in dictionary compilation is obviously to decide on the scope of the enterprise, and this involves a number of interrelated questions. Some dictionaries aim at a representative coverage of the language as a whole; others (e.g. the Dictionary of American Regional English) are concerned only with non-standard dialectal varieties, and still others with particular diatypic varieties (e.g. dictionaries of German or Russian for chemists or physicists). Some are intended for native speakers or very advanced students of a language; others, such as the Oxford Advanced Learner’s Dictionary of English and the new Collins COBUILD English Language Dictionary produced by the Birmingham team, are designed specifically for foreign learners. Some are monolingual, others bilingual. These factors will clearly influence the nature of the materials upon which the dictionary is based.

As has been pointed out by Sinclair (1985), the sources of information for dictionary compilation are of three main types. First, it would be folly to ignore the large amount of descriptive information which is already available and organised in the form of existing dictionaries, thesauri, grammars, and so on.
Though useful, such sources suffer from several disadvantages: certain words or usages may have disappeared and others may have appeared; and because existing materials may be based on particular ways of looking at language, it may be difficult simply to incorporate into them new insights derived from rapidly developing branches of linguistics such as pragmatics and discourse analysis. A second source of information for lexicography, as for other kinds of descriptive linguistic activity, is the introspective judgements of informants, including the lexicographer himself. It is well known, however, that introspection is often a poor guide to actual usage. Sinclair therefore concludes that the main body of evidence, at least in the initial stages of dictionary making, should come from the analysis of authentic texts.

The use of textual material for citation purposes has, of course, been standard practice in lexicography for a very long time. Large dictionaries such as the Oxford English Dictionary relied on the amassing of enormous numbers of instances sent in by an army of voluntary readers. Such a procedure, however, is necessarily unsystematic. Fortunately, the revolution in computer technology which we are now witnessing is, as we have already seen, making the compilation and exhaustive lexical analysis of textual corpora a practical possibility. Corpora such as the LOB, London-Lund and Birmingham collections provide a rich source which is already being exploited for lexicographical purposes. Although most work in computational lexicography to date has used mainframe computers, developments in microcomputer technology mean that work of considerable sophistication is now possible on smaller machines (see Paikeday 1985, Brandon 1985).

The most useful informational tools for computational lexicography are word lists and concordances, arranged in alphabetical order of the beginnings or ends of words, in frequency order, or in the order of appearance in texts. Both lemmatised and unlemmatised listings are useful, since the relationship between the lemma and its variant forms is of considerable potential interest. For the recently published COBUILD dictionary, for instance, a decision was made to treat the most frequently occurring form of a lemma as the headword for the dictionary entry. Clearly, such a decision relies on the availability of detailed information on the frequencies of word forms in large amounts of text, which only a computational analysis can provide (see Sinclair 1985). The COBUILD dictionary project worked with a corpus of some 7.3 million words; even this, however, is a small figure when compared with the vast output produced by the speakers and writers of a language, and it has been argued that a truly representative and comprehensive dictionary would have to use a database of much greater size still, perhaps as large as 500 million words. For a comprehensive account of the COBUILD project, see Sinclair (1987).

The lemmatisation problem has been tackled in various ways in different dictionary projects. Lexicographers on the Dictionary of Old English project in Toronto (Cameron 1977) lemmatised one text manually, then used this to lemmatise a second text, adding new lemmata for any word forms which had not been present in the first text. In this way, an ever more comprehensive machine dictionary was built up, and the automatic lemmatisation of texts became increasingly efficient.
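The bootstrapping idea can be sketched very simply; the following is an illustration only, not the Toronto software, and the tiny form-to-lemma table is invented. Text is lemmatised by dictionary look-up, and any forms the dictionary does not yet contain are reported so that a lexicographer can assign them to lemmata and enlarge the dictionary before the next text is processed.

```python
# Illustrative only: dictionary-based lemmatisation with a manual fall-back.
# In a real project the form->lemma table would grow text by text as
# lexicographers assign each unknown form to a lemma.
FORM_TO_LEMMA = {
    "run": "RUN", "runs": "RUN", "ran": "RUN", "running": "RUN",
    "window": "WINDOW", "windows": "WINDOW",
}

def lemmatise(tokens, dictionary):
    lemmatised, unknown = [], []
    for form in tokens:
        lemma = dictionary.get(form.lower())
        if lemma is None:
            unknown.append(form.lower())      # queue for manual assignment
            lemmatised.append(form.lower())   # leave the raw form in place
        else:
            lemmatised.append(lemma)
    return lemmatised, sorted(set(unknown))

tokens = "The boy ran to the windows".split()
lemmas, todo = lemmatise(tokens, FORM_TO_LEMMA)
print(lemmas)   # ['the', 'boy', 'RUN', 'to', 'the', 'WINDOW']
print(todo)     # ['boy', 'the', 'to']  -- forms still awaiting lemmata
```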
Another technique was used in the production of a historical dictionary of Italian at the Accademia della Crusca in Florence: a number was assigned to each successive word form in the texts, and the machine was then instructed to allocate particular word numbers to particular lemmata. A further method, used in the Trésor de la Langue Française (TLF) project in Nancy and Chicago, is to use a machine dictionary of the most common forms, with their lemmata.

Associated with lemmatisation are the problems of homography (the existence of words with the same spellings but quite different meanings) and polysemy (the possession of a range of meanings which are to some extent related). In some such cases (e.g. bank, meaning a financial institution or the edge of a river), it is clear that we have homography, and that two quite separate lemmata are therefore involved; in many instances, however, the distinction between homography and polysemy is not obvious, and the lexicographer must make a decision about the number of separate lemmata to be used (see Moon 1987). Although the computer cannot take over such decisions from the lexicographer, it can provide a wealth of information which, together with other considerations such as etymology, can be used as the basis for decision. Concordances are clearly useful here, since they can provide the context needed for the disambiguation of word senses. Decisions must be made concerning the minimum amount of context which will be useful: for discussion see de Tollenaere (1973). A second very powerful tool for exploring the linguistic context, or ‘co-text’, of lexical items is automated collocational analysis. The use of this technique in lexicography is still in its infancy (see Martin, Al and van Sterkenburg 1983): some collocational information was gathered in the TLF and COBUILD projects.

We have seen that at present an important role of the computer is the presentation of material in a form which will aid the lexicographer in the task of deciding on lemmata, definitions, citations, etc. However, as Martin, Al and van Sterkenburg (1983) point out, advances in artificial intelligence techniques could well make the automated semantic analysis of text routinely available, if methods for the solution of problems of ambiguity can be improved. The final stage of dictionary production, in which the headwords, pronunciations, senses, citations and possibly other information (syntactic, collocational, etc.) are printed according to a specified format, is again one in which computational techniques are important (see e.g. Clear 1987). The lexicographer can type, at a terminal, codes referring to particular [...]

[...] contains a useful collection of papers covering various aspects of machine translation. The process of MT consists basically of an analysis of the source language (SL) text to give a representation which will allow synthesis of a corresponding text in the target language (TL). The procedures and problems involved in analysis and synthesis are, of course, largely those we [...]

[...] components analysis are useful. A number of papers relating to manuscript grouping can be found in Irigoin and Zarri (1979). The central activity in textual editing is the attempted reconstruction of the ‘original’ text by the selection of appropriate variants, and the preparation of an apparatus criticus containing other variants and notes. Although the burden of this task falls [...]
[...] for translation from Russian to English, using only rather rudimentary syntactic and semantic analysis. This system was the forerunner of SYSTRAN, which has features of both direct and transfer approaches (see below), and has been used for Russian-English translation by the US Air Force, by the National Aeronautics and Space Administration, and by EURATOM in Italy. Versions of SYSTRAN for other language [...]

[...] to the analysis and generation of single languages. In general, as we might expect from previous discussion, the analysis of the SL is a rather harder task than the generation of the TL text. The words of the SL text must be identified by morphological analysis and dictionary look-up, and problems of multiple word meaning must be resolved. Enough of the syntactic structure of the SL text must be analysed [...]

[...] the understanding of natural language. Generation has received much less attention from computational linguists than language understanding; paradoxically, this is partly because it presents fewer problems. The problem of building a language understanding system is to provide the ability to analyse the vast variety of structures and lexical items which can occur in a [...]

[...] University of Grenoble (1961–71), which used what was effectively a semantic representation as its ‘pivot’ language in translating, mainly between Russian and French. The rigidity of design and the inefficiency of the parser used caused the abandonment of the interlingual approach in favour of a transfer type of design. Transfer systems differ from interlingual systems in interposing separate SL and TL transfer [...]

[...] but soon this was abandoned and the more ambitious scheme proposed of compiling an entirely new historical dictionary. The plan was to record all words that had been in the language since the middle of the twelfth century, to provide a full perspective of their changing forms and meanings, and to fix as accurately as possible the point of entry of new words and the point of departure of obsolete ones. All [...]

[...] area of computers and writing, see Selfe and Wahlstrom (1988).

5. COMPUTERS AND LANGUAGE: AN INFLUENCE ON ALL

This review will, it is hoped, have shown that no serious student of language today can afford to ignore the immense impact made by the computer in a wide range of linguistic areas. Computational linguistics is of direct relevance to stylisticians, textual critics, translators, lexicographers and [...]

[...] stores of knowledge, often in great quantity, and organised in complex ways. Ultimately, of course, this knowledge derives from that of human beings. An extremely important area of artificial intelligence is the development of expert systems, which use large bodies of knowledge concerned with particular domains, acquired from human experts, to solve problems within those [...]

[...] number of languages (and possibly all), so facilitating synthesis of the TL text. Such a system would clearly be more economical than a series of direct systems in an environment, such as the administrative organs of the European Economic Community, where there is a need to translate from and into a number of languages. Various interlinguas have been suggested: deep structure representations of the type [...]

[...] process of language teaching and learning. Despite a good deal of scepticism (some of it quite understandable) on the part of language teachers, there can be little doubt that computer-assisted language [...]
[...] accounts of the author’s own progress and problems in the area; and Leech and Candlin (1986) and Fox (1986) contain a number of articles on various aspects of CALL. CALL can offer substantial advantages [...]
