Computational Methods for Corpus Annotation and Analysis

Xiaofei Lu
Department of Applied Linguistics
The Pennsylvania State University
University Park, Pennsylvania, USA

ISBN 978-94-017-8644-7
ISBN 978-94-017-8645-4 (eBook)
DOI 10.1007/978-94-017-8645-4
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014931404
© Springer Science+Business Media Dordrecht 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor
the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Dedicated to my wife, Xiaomeng, and our daughter, Jasmine

Preface

This book grew out of sets of lecture notes for a graduate course on computational methods for corpus annotation and analysis that I have taught in the Department of Applied Linguistics at The Pennsylvania State University since 2006. After several iterations of the course, my students and I realized that while there is an abundance of introductory sources on the fundamentals of corpus linguistics, most of them do not provide the types of detailed and systematic instructions that are necessary to help language and linguistics researchers get off the ground with using computational tools other than concordancing programs for automatic corpus annotation and analysis. A large proportion of the students taking the course were not yet ready to embark on learning to program, and to them the introductory sources on programming for linguistics, natural language processing, and computational linguistics appeared overwhelming. What seemed to be lacking was something in the middle ground, something that enables novice language and linguistics researchers to use more sophisticated and powerful corpus annotation and analysis tools than concordancing programs and yet still does not require programming. This book was written with the aim of providing that middle ground.

I owe a special thanks to all the students who have taken the course with me at The Pennsylvania State University. This book could not have been written without their inspiration. In particular, I want to thank Brody Bluemel and Ben Pin-Yun Wang for providing very detailed feedback on earlier drafts of several chapters; Edie Furniss, Qianwen Li, and many others for pointing
me to various stylistic issues in the book; Haiyang Ai, Brody Bluemel, Tracy Davis, Alissa Hartig, Shibamouli Lahiri, Kwanghyun Park, Jie Zhang, and Xian Zhang for numerous discussions about the lecture notes that the book grew out of while taking and/or co-teaching the course with me.

It would be difficult to thank all the people who have influenced the ideas behind this book. I am deeply indebted to Detmar Meurers and Chris Brew, who first introduced me to the field of computational linguistics. I have also learned tremendously from a large number of other colleagues, directly or indirectly. To name just a few: Gabriella Appel, Stacey Bailey, Douglas Biber, Donna Byron, Marjorie Chan, Richard Crouch, Markus Dickinson, Nick Ellis, Anna Feldman, Eric Fosler-Lussier, ZhaoHong Han, Jirka Hana, Erhard Hinrichs, Tracy Holloway King, James Lantolf, Michael McCarthy, Lourdes Ortega, Richard Sproat, Hongyin Tao, Steven Thorne, Mike White, Richard Xiao, and many others.

Last but not least, I would like to sincerely thank Jolanda Voogd at Springer for her vision, enthusiasm and patience, Helen van der Stelt at Springer for her continued support, and the anonymous reviewers for their insightful and constructive comments.

Contents

1 Introduction
  1.1 Objectives and Rationale of the Book
  1.2 Why Do We Need to Go Beyond Raw Corpora
  1.3 What Is Corpus Annotation
  1.4 Organization of the Book
  References
2 Text Processing with the Command Line Interface
  2.1 The Command Line Interface
  2.2 Basic Commands
    2.2.1 Notational Conventions
    2.2.2 Printing the Current Working Directory
    2.2.3 Listing Files and Subdirectories
    2.2.4 Making New Directories
    2.2.5 Changing Directory Locations
    2.2.6 Creating and Editing Text Files with UTF-8 Encoding
    2.2.7 Viewing, Renaming, Moving, Copying, and Removing Files
    2.2.8 Copying, Moving, and Removing Directories
    2.2.9 Using Shell Meta-Characters for File Matching
    2.2.10 Manual Pages, Command History, and Command Line Completion
  2.3 Tools for Text Processing
    2.3.1 Searching for a String with egrep
    2.3.2 Regular Expressions
    2.3.3 Character Translation with tr
    2.3.4 Editing Files from the Command Line with sed
    2.3.5 Data Filtering and Manipulation Using awk
    2.3.6 Task Decomposition and Pipes
  2.4 Summary
  References
3 Lexical Annotation
  3.1 Part-of-Speech Tagging
    3.1.1 What is Part-of-Speech Tagging
    3.1.2 Understanding Part-of-Speech Tagsets
    3.1.3 The Stanford Part-of-Speech Tagger
  3.2 Lemmatization
    3.2.1 What is Lemmatization and Why is it Useful
    3.2.2 The TreeTagger
  3.3 Additional Tools
    3.3.1 The Stanford Tokenizer
    3.3.2 The Stanford Word Segmenter for Arabic and Chinese
    3.3.3 The CLAWS Tagger for English
    3.3.4 The Morpha Lemmatizer for English
  3.4 Summary
  References
4 Lexical Analysis
  4.1 Frequency Lists
    4.1.1 Working with Output Files from the TreeTagger
    4.1.2 Working with Output Files from the Stanford POS Tagger and Morpha
    4.1.3 Analyzing Frequency Lists with Text Processing Tools
  4.2 N-Grams
  4.3 Lexical Richness
    4.3.1 Lexical Density
    4.3.2 Lexical Variation
    4.3.3 Lexical Sophistication
    4.3.4 Tools for Lexical Richness Analysis
  4.4 Summary
  References
5 Syntactic Annotation
  5.1 Syntactic Parsing Overview
    5.1.1 What is Syntactic Parsing and Why is it Useful?
    5.1.2 Phrase Structure Grammars
    5.1.3 Dependency Grammars
  5.2 Syntactic Parsers
    5.2.1 The Stanford Parser
    5.2.2 Collins’ Parser
  5.3 Summary
  References
6 Syntactic Analysis
  6.1 Querying Syntactically Parsed Corpora
    6.1.1 Tree Relationships
    6.1.2 Tregex
  6.2 Syntactic Complexity Analysis
    6.2.1 Measures of Syntactic Complexity
    6.2.2 Syntactic Complexity Analyzers
  6.3 Summary
  References
7 Semantic, Pragmatic and Discourse Analysis
  7.1 Semantic Field Analysis
    7.1.1 The UCREL Semantic Analysis System
    7.1.2 Profile in Semantics-Lexical in Computerized Profiling
  7.2 Analysis of Propositions
    7.2.1 Computerized Propositional Idea Density Rater
    7.2.2 Analysis of Propositions in Computerized Profiling
  7.3 Conversational Act Analysis in Computerized Profiling
  7.4 Coherence and Cohesion Analysis in Coh-Metrix
    7.4.1 Referential Cohesion Features
    7.4.2 Features Based on Latent Semantic Analysis
    7.4.3 Features Based on Connectives
    7.4.4 Situation Model Features
    7.4.5 Word Information Features
  7.5 Text Structure Analysis
  7.6 Summary
  References
8 Summary and Outlook
  8.1 Summary of the Book
  8.2 Future Directions in Computational Corpus Analysis
    8.2.1 Computational Analysis of Language Meaning and Use
    8.2.2 Computational Analysis of Learner Language
    8.2.3 Computational Analysis Based on Specific Language Theories
  References
Appendix

References

Fellbaum, C., ed. 1998. WordNet: An electronic lexical database. Cambridge: The MIT Press.
Ferstl, E. C., and D. Y. von Cramon. 2001. The role of coherence and cohesion in text comprehension: An event-related fMRI study. Cognitive Brain Research 11:325–340.
Fey, M. E. 1986. Language intervention with young children. San Diego: College-Hill Press.
Finkel, J. R., T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the Forty-Third Annual Meeting of the Association for Computational Linguistics, 363–370. Stroudsburg: Association for Computational Linguistics.
Foltz, P. W., W. Kintsch, and T. K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25:285–307.
Frederickson, M. S., K. L. Chapman, and M. Hardin-Jones. 2006. Conversational skills of children with cleft lip and palate: A replication and extension. Cleft Palate-Craniofacial Journal 43:179–188.
Garside, R. 1996. The robust tagging of unrestricted text: The BNC experience. In Using corpora for language research: Studies in
the honour of Geoffrey Leech, eds. J. Thomas and M. Short, 167–180. London: Longman.
Graesser, A. C., D. S. McNamara, and M. M. Louwerse. 2003. What readers need to learn in order to process coherence relations in narrative and expository text. In Rethinking reading comprehension, eds. A. P. Sweet and C. E. Snow, 82–98. New York: Guilford Publications.
Graesser, A. C., S. Lu, G. T. Jackson, H. Mitchell, M. Ventura, A. Olney, and M. M. Louwerse. 2004a. AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, and Computers 36:180–193.
Graesser, A. C., D. S. McNamara, M. M. Louwerse, and Z. Cai. 2004b. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers 36:193–202.
Graesser, A. C., D. S. McNamara, and J. Kulikowich. 2011. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher 40:223–234.
Greenbaum, S., ed. 1996. Comparing English worldwide: The International Corpus of English. Oxford: Clarendon.
Halliday, M. A. K., and R. Hasan. 1976. Cohesion in English. London: Longman.
Hempelmann, C. F., D. Dufty, P. McCarthy, A. C. Graesser, Z. Cai, and D. S. McNamara. 2005. Using LSA to automatically identify givenness and newness of noun-phrases in written discourse. In Proceedings of the Twenty-Seventh Annual Meeting of the Cognitive Science Society, 941–946. Mahwah: Erlbaum.
Holmes, R. 1997. Genre analysis, and the social sciences: An investigation of the structure of research article discussion sections in three disciplines. English for Specific Purposes 16:321–337.
Johnston, J. R., and A. G. Kamhi. 1984. Syntactic and semantic aspects of the utterances of language-impaired children: The same can be less. Merrill-Palmer Quarterly 30:65–86.
Kamhi, A. G., and J. R. Johnston. 1992. Semantic assessment: Determining propositional complexity. In Best Practices in School Speech-Language Pathology, Vol. 2, ed. J. S. Damico, 99–105. San Antonio: The Psychological Corporation.
Kincaid, J. P., R. P. Fishburne Jr., R. L. Rogers, and B. S. Chissom. 1975. Derivation
of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Millington: Naval Technical Training, U.S. Naval Air Station. http://www.dtic.mil/dtic/tr/fulltext/u2/a006655.pdf. Accessed 11 May 2013.
Kintsch, W. 1974. The representation of meaning in memory. Hillsdale: Erlbaum.
Kintsch, W. 1998. Comprehension: A paradigm for cognition. Cambridge: Cambridge University Press.
Kintsch, W., and J. Keenan. 1973. Reading rate and retention as a function of the number of propositions in the base structure of sentences. Cognitive Psychology 5:257–274.
Lahey, M. 1988. Language disorders and language development. New York: Macmillan.
Landauer, T. K., P. W. Foltz, and D. Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes 25:259–284.
Landauer, T., D. S. McNamara, S. Dennis, and W. Kintsch, eds. 2007. Handbook of Latent Semantic Analysis. Mahwah: Erlbaum.
Lee, H., C. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics 39:885–916.
Leech, G., R. Garside, and M. Bryant. 1994. CLAWS4: The tagging of the British National Corpus. In Proceedings of the Fifteenth International Conference on Computational Linguistics, 622–628. Stroudsburg: Association for Computational Linguistics.
Liu, H. 2004. MontyLingua: A free, commonsense-enriched natural language understander for English. Cambridge: Massachusetts Institute of Technology, MIT Media Lab. http://web.media.mit.edu/~hugo/montylingua. Accessed 11 May 2013.
Long, S. H., M. E. Fey, and R. W. Channell. 2008. Computerized Profiling, Version 9.7.0. Cleveland: Case Western Reserve University. http://www.computerizedprofiling.org. Accessed 11 May 2013.
Longo, B. 1994. The role of metadiscourse in persuasion. Technical Communication 41:348–352.
Manning, C. D., and H. Schütze. 1999. Foundations of statistical natural language processing. Cambridge: The MIT
Press.
McArthur, T. 1981. Longman lexicon of contemporary English. London: Longman.
McCarthy, P. M., D. Dufty, C. Hempelmann, Z. Cai, A. C. Graesser, and D. S. McNamara. 2012. Newness and givenness of information: Automated identification in written discourse. In Applied natural language processing and content analysis: Identification, investigation, and resolution, eds. P. M. McCarthy and C. Boonthum, 457–478. Hershey: IGI Global.
McIntyre, D., and B. Walker. 2010. How can corpora be used to explore the language of poetry and drama? In The Routledge handbook of corpus linguistics, eds. M. McCarthy and A. O’Keeffe, 516–530. London: Routledge.
McNamara, D. S., E. Kintsch, N. B. Songer, and W. Kintsch. 1996. Are good texts always better? Text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction 14:1–43.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3:235–244.
O’Halloran, K. A. 2011. Limitations of the logico-rhetorical module: Inconsistency in argument, online discussion forums and electronic deconstruction. Discourse Studies 13:797–806.
Ozturk, I. 2007. The textual organisation of research article introductions in applied linguistics: Variability within a single discipline. English for Specific Purposes 26:25–38.
Punyakanok, V., D. Roth, and W. Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34:257–287.
Rayson, P. 2003. Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Unpublished doctoral dissertation, Lancaster University.
Rayson, P. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13:519–549.
Rayson, P. 2009. Wmatrix: A web-based corpus processing environment. Lancaster: Lancaster University, University Center for Computer Corpus Research on Language. http://ucrel.lancs.ac.uk/wmatrix. Accessed 11 May 2013.
Sanders, T. J. M., and L.
G. M. Noordman. 2000. The role of coherence relations and their linguistic markers in text processing. Discourse Processes 29:37–60.
Snowdon, D. A., S. J. Kemper, J. A. Mortimer, L. H. Greiner, D. R. Wekstein, and W. R. Markesbery. 1996. Linguistic ability in early life and cognitive function and Alzheimer’s disease in late life: Findings from the Nun Study. Journal of the American Medical Association 275:528–532.
Taylor, H. A., and B. Tversky. 1997. Indexing events in memory: Evidence for index dominance. Memory 5:509–542.
Toutanova, K., D. Klein, C. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of Human Language Technologies: The 2003 Conference of the North American Chapter of the Association for Computational Linguistics, 252–259. Stroudsburg: Association for Computational Linguistics.
Turner, A. A., and E. Greene. 1978. The construction and use of a propositional text base. Boulder: University of Colorado, Institute for the Study of Intellectual Behavior. http://www.colorado.edu/ics/sites/default/files/attached-files/87-02.pdf. Accessed 11 May 2013.
Wilson, M. D. 1988. The MRC psycholinguistic database: Machine readable dictionary, Version 2. Behavior Research Methods, Instruments, and Computers 20:6–11.
Xiao, Z. 2009. Multidimensional analysis and the study of world Englishes. World Englishes 28:421–450.
Zwaan, R. A., and G. A. Radvansky. 1998. Situation models in language comprehension and memory. Psychological Bulletin 123:162–185.

Chapter 8
Summary and Outlook

Abstract This chapter summarizes the computational tools for corpus annotation and analysis that have been covered in the book and concludes the book with a discussion of some future directions in computational corpus analysis, focusing in particular on the analysis of language meaning and use, learner language analysis, and analysis based on specific language theories.

8.1 Summary of the Book

This book has covered a large set of computational systems and software packages for
automating or assisting the annotation and/or analysis of language samples at the lexical, syntactic, semantic, pragmatic, and discourse levels. The full set of tools introduced in the book is summarized in Table 8.1. In the current state of affairs, there is no single way to access all of the tools in the most effective and efficient manner. Rather, they are available on diverse platforms: some are accessible through the command line interface in UNIX-like systems, e.g., Collins’ Parser (Collins 1999) and D-Level Analyzer (Lu 2009); some through a graphical user interface (GUI) that runs either in the Windows operating system alone, e.g., AntMover (Anthony 2003) and Gramulator (McCarthy et al. 2012), or in multiple operating systems, e.g., MATTR (Covington and McFall 2010); some only through a web interface, e.g., Coh-Metrix (Graesser et al. 2004, 2011); and some on multiple platforms but possibly with different functionality constraints for each platform, e.g., Tregex (Levy and Andrew 2006). Whereas graphical user interfaces and web-based interfaces may be generally more user friendly and easier to get your hands on right away, we have shown that basic knowledge of the command line interface will allow you to tap into a larger set of powerful corpus annotation and analysis tools. In addition, whereas some GUI-based and web-based tools only allow one text to be processed at a time (e.g., Coh-Metrix), this restriction is generally not applicable to tools that are invoked from the command line interface.

X. Lu, Computational Methods for Corpus Annotation and Analysis, DOI 10.1007/978-94-017-8645-4_8, © Springer Science+Business Media Dordrecht 2014

Table 8.1 Summary of corpus annotation and analysis tools introduced

Function | Tools
Lexical annotation and analysis (Chapters 3 and 4)
POS tagging | Stanford POS Tagger; TreeTagger; CLAWS Tagger
Lemmatization | TreeTagger; Morpha
Tokenization | Stanford Tokenizer
Word segmentation | Stanford Word Segmenter
Frequency lists and N-grams | UNIX tools
Lexical richness | Lexical Complexity Analyzer; vocd utility in CLAN; MATTR; Gramulator; RANGE; VocabProfile
Syntactic annotation and analysis (Chapters 5 and 6)
Syntactic parsing | Stanford Parser; Collins’ Parser; MaltParser
Syntactic tree querying | Tregex
Syntactic complexity analysis | D-Level Analyzer; L2 Syntactic Complexity Analyzer; DSS utility in CLAN; IPSyn, DSS, and BESS modules of CP; Coh-Metrix
Semantic, pragmatic and discourse annotation and analysis (Chapter 7)
Semantic field analysis | USAS (via Wmatrix); PRISM-L module of CP
Analysis of propositions | CPIDR; APRON module of CP
Conversational act analysis | CAP module of CP
Cohesion and coherence | Coh-Metrix
Text structure analysis | AntMover

Annotated corpora provide richer linguistic information that facilitates more fine-grained and more diverse types of linguistic analysis than raw corpora. We have covered a wide range of linguistic annotations in this book. Part-of-speech (POS) tagging allows us to differentiate between uses of the same lexical form with different lexical categories; lemmatization makes it possible for us to recognize different inflectional forms of the same word and analyze them as a unified form when appropriate; syntactic parsing facilitates the retrieval and quantification of occurrences of specific syntactic structures or patterns; semantic field annotation is necessary for the examination of the distribution of words of different semantic and conceptual categories; conversational act annotation allows for the profiling of conversational behavior of a speaker; and structural step or move annotation makes it easy for us to analyze and compare text structures at the discourse level.

To take advantage of the linguistic information encoded in annotated corpora in corpus analysis, we need effective ways to query such corpora to retrieve the information that is directly useful for our own research. We have discussed tools for querying corpora annotated with different types
of linguistic information. For example, POS-tagged and lemmatized corpora can be queried in very fine-grained ways using command line tools and concordancing programs. Syntactically parsed corpora can be queried using tools for matching patterns in phrase structure trees, such as Tregex. Corpora annotated with semantic field information can be queried much in the same way as POS-tagged corpora; they can also be analyzed using the Wmatrix interface (Rayson 2003, 2008, 2009).

We have also introduced a number of corpus analytic tools that can be used to perform specific types of corpus analysis, such as lexical richness analysis, syntactic complexity analysis, the analysis of propositions, and the analysis of coherence and cohesion. Some analytic tools require the input texts to have been annotated with specific types of information. For example, the Lexical Complexity Analyzer (Ai and Lu 2010; Lu 2012) requires the input texts to have been POS-tagged and lemmatized, and the D-Level Analyzer requires the input texts to have been parsed with Collins’ Parser. Some tools perform the annotation necessary for the target analyses themselves, such as the L2 Syntactic Complexity Analyzer (Lu 2010), the Computerized Propositional Idea Density Rater (Brown et al. 2008; Covington 2012), and Coh-Metrix. Some tools neither require nor perform any annotation, but analyze raw texts directly, such as RANGE (Heatley et al. 2002) and a few other tools for analyzing lexical richness.

It is important to note that the computational methods for corpus annotation and analysis introduced here provide the means for, not the goals of, linguistic research. The appropriate types of corpus annotation and analysis and the tools they call for depend on one’s research questions and analytical goals.

An important implication of the computational methods covered in this book for corpus linguistics research is that, with the possibility of annotating
and analyzing large amounts of text with diverse types of linguistic information, it has become increasingly easy to use large sets of text samples to investigate the relationships and interactions between different linguistic features and how these relationships and interactions may vary as a function of different linguistic and sociolinguistic factors, such as language variety, text type, genre, and domain, among others. The study on variation among different language varieties and registers by Xiao (2009), discussed in Sect. 7.1.1, constitutes a good example of this type of research.

As you apply the computational methods introduced here in your own research, you may come to realize that not all your analytic needs can be readily met by existing corpus analytic tools. You will find yourself much better equipped to deal with these issues with knowledge of a scripting or programming language, such as Python (see, e.g., Bird et al. 2009; Downey 2012; Perkins 2010) or Perl (see, e.g., Schwartz et al. 2005; Weisser 2010). R has also been increasingly used as a programming language and a statistical program for corpus linguistics research (see, e.g., Gries 2009).

8.2 Future Directions in Computational Corpus Analysis

The benefits of robust corpus analytical tools to corpus linguistics researchers cannot be overestimated. However, despite the large number of computational tools that already exist, there are still many types of linguistic annotation and analysis for which we do not yet have reliable, fully functional, and publicly available computational tools. Development of such computational tools entails an interdisciplinary effort by applied and computational linguists. In this section, we will discuss some of the areas in which such effort is needed and is emerging.

8.2.1 Computational Analysis of Language Meaning and Use

The development of computational tools that can reliably automate the analysis of language meaning and use is a direction that is
especially welcomed by the corpus linguistics community. Some examples of specific areas in which such tools are highly desirable include:

• Disambiguation of the meanings or senses of polysemous words, discourse particles, etc.
• Classification of clause types and conversational acts
• Interpretation of speaker commitment, attitude, and sentiment
• Differentiation of literal vs. non-literal uses of language
• Interpretation of the degree of pragmatic appropriateness

The natural language processing community has already carried out substantial research in several relevant areas, such as word sense disambiguation (e.g., Chen et al. 2009; Lu 2008; Sinha and Mihalcea 2007; Stevenson 2003), sentiment analysis (e.g., Mihalcea et al. 2007; Taboada et al. 2011), and dialogue act analysis (e.g., Sridhar et al. 2009; Stolcke et al. 2000; Sun and Morency 2012). As research in these and other related areas moves further ahead, we hope to see more functional, user-friendly tools for analyzing language meaning and use become publicly available in the near future.

8.2.2 Computational Analysis of Learner Language

Annotation and analysis of learner corpora, i.e., corpora consisting of spoken and/or written language samples produced by non-native language learners, remain a challenging but active area of research. The analysis of learner language has important implications for second and foreign language pedagogy as well as second language acquisition research. Large-scale learner corpora such as the International Corpus of Learner English (ICLE, Granger et al. 2009) and the Spoken and Written Corpus of Chinese Learners (SWCCL, Wen et al. 2008) have become increasingly available. However, language samples produced by language learners, particularly beginning and intermediate level learners and to a lesser extent advanced learners, tend to contain various types of errors (e.g., spelling errors, lexical errors, morphological errors, collocation errors, and grammatical errors, among others) that
could affect the reliability of the annotation and analysis generated by corpus analytic tools that were originally designed for processing native language corpora. Several lines of research are being pursued in an effort to tackle this challenge.

A critical first step towards addressing the challenge learner language presents is to gain a systematic understanding of the types and amounts of errors that exist in language samples produced by learners with different proficiency levels and first language backgrounds. One of the earliest and most influential studies on this front was done by Granger (2003), who presented a three-tiered error annotation system for annotating the French Interlanguage Database corpus and demonstrated the types of information the error-annotated corpus could reveal. Some more recent efforts have focused on annotating specific types of errors with the aim of facilitating the training of automatic error detection systems. For example, Lee et al. (2009) discussed an error annotation scheme for annotating particle errors in corpora of learner Korean and investigated the properties of particle errors with the aim of extracting heuristic patterns of particle errors for automatic identification of such errors.

Not only do we need to understand the types of errors that exist in learner corpora, but we also need to evaluate the effect such errors have on the reliability of existing corpus analytical tools. Not all learner errors affect automatic corpus analysis in the same way, and this evaluation helps us better understand the magnitude as well as the sources of the problem. For example, Pilar and Ibañez (2011) evaluated the reliability of two POS taggers on second language Spanish, investigated the most frequent tagger errors, and assessed the impact of learner errors on the performance of the taggers. Lu (2010) also discussed the effect of grammatical and syntactic errors on the L2 Syntactic Complexity
Analyzer.

Some researchers have taken a step further by developing systems for automatic error detection and correction, adapting existing natural language processing tools for learner language analysis, or modifying corpus annotation to support learner language analysis. However, most of the work so far has focused on specific types of errors or relatively constrained sets of errors. For example, Dickinson and Lee (2009) proposed a framework for modifying the annotation found in native corpora for the state-of-the-art parsing models to work with learner corpora and illustrated the framework by providing a parsing model that obtained accurate information about Korean postpositional particles. De Felice and Pulman (2008) proposed a classifier-based approach to the automatic detection and correction of preposition and determiner errors in L2 English writing. Gamon et al. (2009) designed a system that automatically detects and suggests corrections for a range of errors made by English learners. Heift and Schulze (2007), Nagata (2009), and Amaral et al. (2011) discussed different ways in which current natural language processing technology can be adapted to detect learner errors in the context of intelligent computer-assisted language learning. As we continue to develop our understanding of learner errors, their effect on the reliability of existing natural language processing technology, and the ways in which they can be handled effectively, more robust systems for automatic annotation and analysis of learner corpora will likely emerge in the future.

8.2.3 Computational Analysis Based on Specific Language Theories

Evidently, corpora have been used in linguistic research from many different theoretical perspectives. McEnery and Hardie (2011) provided a fairly thorough discussion of the use of corpus methods in linguistic research from functional-cognitive and psycholinguistic perspectives. Much research along these lines involved automatic or
semi-automatic retrieval of potentially relevant constructions from corpora followed by manual analysis and classification of the retrieved instances. Automatic annotation and analysis of linguistic data based on specific language theories will reduce the labor involved in manual analysis and greatly facilitate larger-scale linguistic research within those theoretical paradigms. This is another important area in which further research is needed. In this section, we will briefly illustrate the situation with two such theories or theoretical approaches, namely, Systemic Functional Linguistics (SFL, Halliday and Matthiessen 2004) and Cognitive Linguistics (e.g., Lakoff 1987; Lakoff and Johnson 1980b; Langacker 1987, 1991).

8.2.3.1 Systemic Functional Linguistics

As a function-oriented theory of language, SFL places the functions of language at the center. A core concept in SFL is that of system networks, which represent the system of choices one has in generating an utterance. In analyzing language use, the primary focus is not on the structural realizations of the choices the speakers make (although they certainly need to be identified), but on the meanings and functions that they express (including ideational, interpersonal, and textual meanings) and the context that underlies these choices (e.g., the social relationships between the participants and the channel of communication). SFL has been used in a large number of research fields, including child language acquisition, translation studies, critical discourse analysis, educational linguistics, and second language learning and teaching, among others (see, e.g., Byrnes 2012).

In the past two decades, substantial progress has been made in developing computational tools for research within the SFL framework. Most tools that are currently available, however, are for sentence generation using computational grammars in the SFL formalism, such as the KPML development environment (Bateman 1997), or for assisting
manual lexico-grammatical text analysis within the SFL framework, such as Systemics (O’Halloran and Judd 2002). Fully functional and publicly available tools for automatic annotation or analysis of texts using the SFL formalism are scarce, although some preliminary research and prototypes have been reported. For example, Kappagoda (2009) discussed the rationale for using SFL in automated text mining, developed a grammatical annotation scheme to enrich text corpora, and trained a machine learner for automatic annotation of word functions in the group. Schwarz et al. (2008) designed a rule-based system for automatically identifying simple Themes (i.e., Themes consisting of one ideational element only). Park and Lu (2011) also reported preliminary work on developing an automatic thematic structure analyzer for written English texts. More research in developing computational tools for automating the annotation and analysis of texts using the SFL formalism is needed in the future.

8.2.3.2 Cognitive Linguistics

Cognitive Linguistics is not a single, unified theory of language like SFL. Rather, it constitutes a family of partially overlapping approaches to linguistics. Cognitive Linguistics views language as “a means for organizing, processing, and conveying” information, and “a structured collection of meaningful categories that help us deal with new experiences and store information about old ones” (Geeraerts and Cuyckens 2007, p.
5). Within the Cognitive Linguistics framework, the mappings between linguistic structures and the meanings they express are a prime subject of linguistic analysis. Linguistic meaning is seen to be encyclopedic and perspectival in nature: it is encyclopedic because language categorizes the world, and it is perspectival because this categorization does not objectively reflect reality but imposes a structure on the world (Geeraerts and Cuyckens 2007). The topics that are of interest for Cognitive Linguistics are well summarized by Geeraerts and Cuyckens (2007) in their introduction to The Oxford Handbook of Cognitive Linguistics:

“Because Cognitive Linguistics sees language as embedded in the overall cognitive capacities of man, topics of special interest for Cognitive Linguistics include: the structural characteristics of natural language categorization (such as prototypicality, systematic polysemy, cognitive models, mental imagery and metaphor); the functional principles of linguistic organization (such as iconicity and naturalness); the conceptual interface between syntax and semantics (as explored by cognitive grammar and construction grammar); the experiential and pragmatic background of language-in-use; and the relationship between language and thought, including questions about relativism and conceptual universals.” (p.
4)

Cognitive Linguistics researchers have made increasing use of corpora as a source of empirical data to assist the investigation of hypotheses formulated about the relationship between language and cognition, and the findings derived from corpora have sometimes been subsequently subjected to experimental validation (Arppe et al. 2010; Gries and Stefanowitsch 2006; McEnery and Hardie 2011). For example, in the area of conceptual metaphor analysis, one of the most revolutionary aspects of the cognitive approaches to semantics (Lakoff and Johnson 1980a, b), a large number of studies have adopted corpus analytical methods (e.g., Deignan 2005; Steen et al. 2010; Stefanowitsch and Gries 2006).

Cognitively oriented corpus linguistic analysis generally involves identifying occurrences of relevant linguistic structures or forms and interpreting their behaviors, functions, and relationships to conceptual structures. Computational tools that can automatically identify and interpret occurrences of such linguistic structures or forms will significantly facilitate this type of analysis. Research in this area, notably for conceptual metaphor identification and interpretation, is emerging (e.g., Birke and Sarkar 2006; Krishnakumaran and Zhu 2007; Shutova 2010; Shutova et al. 2013), and fully functional, publicly available tools of this kind will hopefully emerge in the near future.

References

Ai, H., and X. Lu. 2010. A web-based system for automatic measurement of lexical complexity. Paper presented at the Twenty-Seventh Annual Symposium of the Computer-Assisted Language Instruction Consortium. Amherst, MA.
Amaral, A., D. Meurers, and R. Ziai. 2011. Analyzing learner language: Towards a flexible NLP architecture for intelligent language tutors. Computer Assisted Language Learning 24:1–16.
Arppe, A., G. Gilquin, D. Glynn, M. Hilpert, and R. Zeschel. 2010. Cognitive corpus linguistics: Five points of debate on current theory and methodology. Corpora 5:1–27.
Anthony, L.
2003. AntMover, Version 1.0. Tokyo, Japan: Waseda University. http://www.antlab.sci.waseda.ac.jp Accessed 11 May 2013.
Bateman, J. A. 1997. Enabling technology for multilingual natural language generation: The KPML development environment. Journal of Natural Language Engineering 3:15–55.
Bird, S., E. Klein, and E. Loper. 2009. Natural language processing with Python. Sebastopol: O’Reilly.
Birke, J., and A. Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 329–336. Stroudsburg: Association for Computational Linguistics.
Brown, C., T. Snodgrass, S. J. Kemper, R. Herman, and M. A. Covington. 2008. Automatic measurement of propositional idea density from part-of-speech tagging. Behavior Research Methods 40:540–545.
Byrnes, H. 2012. Systemic Functional Linguistics. In The Routledge handbook of second language acquisition, ed. P. Robinson, 622–644. London: Routledge.
Chen, P., W. Ding, C. Bowes, and D. Brown. 2009. A fully unsupervised word sense disambiguation method using dependency knowledge. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 28–36. Stroudsburg: Association for Computational Linguistics.
Collins, M. 1999. Head-driven statistical models for natural language parsing. Unpublished doctoral dissertation, University of Pennsylvania.
Covington, M. A. 2012. CPIDR 5.1 user manual. Athens: University of Georgia, Institute for Artificial Intelligence. http://www.ai.uga.edu/caspr/CPIDR-5-Manual.pdf Accessed 11 May 2013.
Covington, M. A., and J. D. McFall. 2010. Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics 17:94–100.
De Felice, R., and S. Pulman. 2008. A classifier-based approach to preposition and determiner error correction in L2 English. In Proceedings of the Twenty-Second International Conference on
Computational Linguistics, 169–176. Manchester: The COLING 2008 Organizing Committee.
Dickinson, M., and C. M. Lee. 2009. Modifying corpus annotation to support the analysis of learner language. CALICO Journal 26:545–561.
Deignan, A. 2005. Metaphor and corpus linguistics. Amsterdam: John Benjamins.
Downey, A. B. 2012. Think Python: How to think like a computer scientist. Cambridge: O’Reilly Media.
Gamon, M., C. Leacock, C. Brockett, W. B. Dolan, J. Gao, D. Belenko, and A. Klementiev. 2009. Using statistical techniques and web search to correct ESL errors. CALICO Journal 26:491–511.
Geeraerts, D., and H. Cuyckens. 2007. Introducing cognitive linguistics. In The Oxford Handbook of Cognitive Linguistics, eds. D. Geeraerts and H. Cuyckens, 3–21. New York: Oxford University Press.
Graesser, A. C., D. S. McNamara, and J. Kulikowich. 2011. Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher 40:223–234.
Graesser, A. C., D. S. McNamara, M. M. Louwerse, and Z. Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, and Computers 36:193–202.
Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal 20:465–480.
Granger, S., E. Dagneaux, F. Meunier, and M. Paquot. 2009. International Corpus of Learner English, Version Louvain-la-Neuve: Presses universitaires de Louvain.
Gries, S. T., and A. Stefanowitsch, eds. 2006. Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis. Berlin: Mouton de Gruyter.
Gries, S. T. 2009. Quantitative corpus linguistics with R: A practical introduction. London: Routledge.
Halliday, M. A. K., and C. M. I. M. Matthiessen. 2004. Introduction to functional grammar, 3rd ed. London: Edward Arnold.
Heatley, A., I. S. P. Nation, and A. Coxhead. 2002. RANGE and FREQUENCY programs. Wellington: Victoria University of Wellington. http://www.victoria.ac.nz/lals/resources/range Accessed 11 May 2013.
Heift, T., and M. Schulze. 2007. Errors and intelligence in computer-assisted language learning:
Parsers and pedagogues. New York: Routledge.
Kappagoda, A. 2009. The use of Systemic-Functional Linguistics in automated text mining. Edinburgh, Australia: Defense Science and Technology Organization.
Krishnakumaran, S., and X. Zhu. 2007. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, 13–20. Stroudsburg: Association for Computational Linguistics.
Lakoff, G. 1987. Women, fire, and dangerous things. What categories reveal about the mind. Chicago: University of Chicago Press.
Lakoff, G., and M. Johnson. 1980a. Conceptual metaphor in everyday language. The Journal of Philosophy 77:453–486.
Lakoff, G., and M. Johnson. 1980b. Metaphors we live by. Chicago: University of Chicago Press.
Langacker, R. W. 1987. Foundations of cognitive grammar, Vol. I: Theoretical prerequisites. Stanford: Stanford University Press.
Langacker, R. W. 1991. Foundations of cognitive grammar, Vol. II: Descriptive application. Stanford: Stanford University Press.
Lee, S.-H., S. B. Jang, and S.-K. Seo. 2009. Annotation of Korean learner corpora for particle error detection. CALICO Journal 26:529–544.
Levy, R., and G. Andrew. 2006. Tregex and Tsurgeon: Tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2231–2234. Paris: ELRA.
Lu, X. 2008. Hybrid models for sense guessing of Chinese unknown words. International Journal of Corpus Linguistics 13:99–128.
Lu, X. 2009. Automatic measurement of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics 14:3–28.
Lu, X. 2010. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics 15:474–496.
Lu, X. 2012. The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal 96:190–208.
McCarthy, P. M., S. Watanabe, and T. A. Lamkin. 2012. The Gramulator: A tool to identify differential linguistic features of
correlative text types. In Applied natural language processing and content analysis: Identification, investigation, and resolution, eds. P. M. McCarthy and C. Boonthum, 312–333. Hershey: IGI Global.
McEnery, T., and A. Hardie. 2011. Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.
Mihalcea, R., C. Banea, and J. Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Proceedings of the Forty-Fifth Annual Meeting of the Association for Computational Linguistics, 976–983. Stroudsburg: Association for Computational Linguistics.
Nagata, N. 2009. Robo-Sensei’s NLP-based error detection and feedback generation. CALICO Journal 26:562–579.
O’Halloran, K. L., and K. Judd. 2002. Systemics, Version 1.0. Singapore: Singapore University Press.
Park, K., and X. Lu. 2011. A corpus-driven investigation of thematic structure in expert and novice academic writing. Paper presented at the 2011 Annual Meeting of the American Association for Applied Linguistics. Chicago, IL.
Perkins, J. 2010. Python text processing with NLTK 2.0 cookbook. Sebastopol: O’Reilly.
Pilar, M., and V. Ibañez. 2011. An evaluation of part of speech tagging on written second language Spanish. In Proceedings of the Twelfth International Conference on Computational Linguistics and Intelligent Text Processing, Part I, 214–226. Berlin: Springer-Verlag.
Rayson, P. 2003. Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Unpublished doctoral dissertation, Lancaster University.
Rayson, P. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13:519–549.
Rayson, P. 2009. Wmatrix: A web-based corpus processing environment. Lancaster: Lancaster University, University Center for Computer Corpus Research on Language. http://ucrel.lancs.ac.uk/wmatrix Accessed 11 May 2013.
Schwarz, L., S. Bartsch, R. Eckart, and E. Teich. 2008. Exploring automatic theme identification: A rule-based approach. In Text resources and
lexical knowledge. Selected papers from the Ninth Conference on Natural Language Processing, eds. A. Storrer, A. Geyken, A. Siebert, and K.-M. Würzner, 15–26. Berlin: Mouton de Gruyter.
Schwartz, R. L., T. Phoenix, and B. D. Foy. 2005. Learning Perl, 4th ed. Cambridge: O’Reilly Media.
Shutova, E. 2010. Models of metaphor in NLP. In Proceedings of Forty-Eighth Annual Meeting of the Association for Computational Linguistics, 688–697. Stroudsburg: Association for Computational Linguistics.
Shutova, E., S. Teufel, and A. Korhonen. 2013. Statistical metaphor processing. Computational Linguistics 39:301–353.
Sinha, R., and R. Mihalcea. 2007. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the 2007 International Conference on Semantic Computing, 363–369. Washington: IEEE Computer Society.
Sridhar, V. K. R., S. Bangalore, and S. Narayanan. 2009. Combining lexical, syntactic and prosodic cues for improved online dialog act tagging. Computer Speech and Language 23:407–422.
Steen, G. J., A. G. Dorst, J. B. Herrmann, A. A. Kaal, T. Krennmayr, and T. Pasma. 2010. A method for linguistic metaphor identification: From MIP to MIPVU. Amsterdam: John Benjamins.
Stefanowitsch, A., and S. T. Gries, eds. 2006. Corpus-based approaches to metaphor and metonymy. Berlin: Mouton de Gruyter.
Stevenson, M. 2003. Word sense disambiguation. Stanford: CSLI Publications.
Stolcke, A., K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. van Ess-Dykema, and M. Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26:339–373.
Sun, C., and L.-P. Morency. 2012. Dialogue act recognition using reweighted speaker adaptation. In Proceedings of the Thirteenth Annual Meeting of the Special Interest Group on Discourse and Dialogue, 118–125. Stroudsburg: Association for Computational Linguistics.
Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics
37:267–307.
Weisser, M. 2010. Essential programming for linguistics. Edinburgh: Edinburgh University Press.
Wen, Q., M. Liang, and X. Yan. 2008. Spoken and Written Corpus of Chinese Learners, Version 2.0. Beijing: Foreign Language Teaching and Research Press.
Xiao, Z. 2009. Multidimensional analysis and the study of world Englishes. World Englishes 28:421–450.

Appendix

Summary of commands used in the book

     Command   Function
 1   awk       Scans and processes patterns
 2   cat       Displays, concatenates or creates text files
 3   cd        Changes the current working directory
 4   comm      Compares two sorted files line by line
 5   cp        Makes a copy of a file
 6   echo      Displays a line of text
 7   egrep     Searches a file for a pattern using regular expressions
 8   flex      Generates a program that recognizes lexical patterns in text based on description provided in user-specified input files
 9   gcc       Compiles programs written in C language
10   gunzip    Decompresses compressed files created by gzip, zip, compress, or pack
11   head      Displays the first 10 lines of a file, unless otherwise specified
12   jar       Extracts the contents of a JAR file
13   java      Invokes a java program
14   ls        Lists the files and directories in a directory
15   make      Determines which components of the source program need to be recompiled and issues commands to recompile those components
16   man       Displays the online manual of a UNIX command
17   mkdir     Creates a new directory
18   more      Displays the content of a file one screen at a time
19   mv        Moves a file to a different directory or renames a file
20   python    Invokes a python script
21   paste     Merges files horizontally
22   pwd       Displays the path of the current working directory
23   rm        Removes a file or a directory
24   sed       Edits a file from the command line
25   sh        Invokes a shell script
26   sort      Sorts lines in a text file
27   tail      Displays the last 10 lines of a file, unless otherwise specified
28   tar       Creates or decompresses a tape archive (TAR) file
29   tr        Translates a set of characters to a different set of characters
30   uniq      Deletes repeated lines in a text file
31   unzip     Unzips a zipped file
32   wc        Displays the number of lines, words and characters in a file
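Several of the commands summarized in the appendix (tr, sed, sort, uniq, awk, and head) can be chained with pipes to build a simple word frequency list. The following is a minimal sketch under stated assumptions rather than a procedure taken from the book: the function name word_freq and its two arguments are hypothetical, and a "word" is assumed to be any unbroken run of ASCII letters, which is too crude for many corpora.

```shell
#!/bin/sh
# word_freq FILE [N]: print the N most frequent word forms in FILE (default 10).
# Hypothetical helper for illustration; the name and arguments are assumptions.
word_freq() {
    file=$1
    top=${2:-10}
    tr -cs 'A-Za-z' '\n' < "$file" |  # replace every run of non-letters with one newline
        tr 'A-Z' 'a-z' |              # fold all tokens to lowercase
        sed '/^$/d' |                 # drop an empty first line, if any
        sort |                        # bring identical tokens together
        uniq -c |                     # collapse repeated lines and count them
        awk '{print $1, $2}' |        # normalize the output to "count word"
        sort -k1,1nr -k2,2 |          # highest count first; ties in alphabetical order
        head -n "$top"                # keep only the top N lines
}
```

Calling word_freq mytext.txt 20, where mytext.txt stands for any plain text file of your own, would list the 20 most frequent word forms with their counts. The first sort is needed because uniq only collapses repeated lines that are adjacent to each other.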