Natural Language Annotation for Machine Learning
James Pustejovsky and Amber Stubbs

Copyright © 2013 James Pustejovsky and Amber Stubbs. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Julie Steele and Meghan Blanchette
Production Editor: Kristen Borg
Copyeditor: Audrey Doyle
Proofreader: Linley Dolby
Indexer: WordCo Indexing Services
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

October 2012: First Edition

Revision History for the First Edition:
2012-10-10 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449306663 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Natural Language Annotation for Machine Learning, the image of a cockatiel, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30666-3
[LSI]

Table of Contents

Preface

1. The Basics
    The Importance of Language Annotation
    The Layers of Linguistic Description
    What Is Natural Language Processing?
    A Brief History of Corpus Linguistics
    What Is a Corpus?
    Early Use of Corpora
    Corpora Today
    Kinds of Annotation
    Language Data and Machine Learning
    Classification
    Clustering
    Structured Pattern Induction
    The Annotation Development Cycle
    Model the Phenomenon
    Annotate with the Specification
    Train and Test the Algorithms over the Corpus
    Evaluate the Results
    Revise the Model and Algorithms
    Summary

2. Defining Your Goal and Dataset
    Defining Your Goal
    The Statement of Purpose
    Refining Your Goal: Informativity Versus Correctness
    Background Research
    Language Resources
    Organizations and Conferences
    NLP Challenges
    Assembling Your Dataset
    The Ideal Corpus: Representative and Balanced
    Collecting Data from the Internet
    Eliciting Data from People
    The Size of Your Corpus
    Existing Corpora
    Distributions Within Corpora
    Summary

3. Corpus Analytics
    Basic Probability for Corpus Analytics
    Joint Probability Distributions
    Bayes Rule
    Counting Occurrences
    Zipf’s Law
    N-grams
    Language Models
    Summary

4. Building Your Model and Specification
    Some Example Models and Specs
    Film Genre Classification
    Adding Named Entities
    Semantic Roles
    Adopting (or Not Adopting) Existing Models
    Creating Your Own Model and Specification: Generality Versus Specificity
    Using Existing Models and Specifications
    Using Models Without Specifications
    Different Kinds of Standards
    ISO Standards
    Community-Driven Standards
    Other Standards Affecting Annotation
    Summary

5. Applying and Adopting Annotation Standards
    Metadata Annotation: Document Classification
    Unique Labels: Movie Reviews
    Multiple Labels: Film Genres
    Text Extent Annotation: Named Entities
    Inline Annotation
    Stand-off Annotation by Tokens
    Stand-off Annotation by Character Location
    Linked Extent Annotation: Semantic Roles
    ISO Standards and You
    Summary

6. Annotation and Adjudication
    The Infrastructure of an Annotation Project
    Specification Versus Guidelines
    Be Prepared to Revise
    Preparing Your Data for Annotation
    Metadata
    Preprocessed Data
    Splitting Up the Files for Annotation
    Writing the Annotation Guidelines
    Example 1: Single Labels—Movie Reviews
    Example 2: Multiple Labels—Film Genres
    Example 3: Extent Annotations—Named Entities
    Example 4: Link Tags—Semantic Roles
    Annotators
    Choosing an Annotation Environment
    Evaluating the Annotations
    Cohen’s Kappa (κ)
    Fleiss’s Kappa (κ)
    Interpreting Kappa Coefficients
    Calculating κ in Other Contexts
    Creating the Gold Standard (Adjudication)
    Summary

7. Training: Machine Learning
    What Is Learning?
    Defining Our Learning Task
    Classifier Algorithms
    Decision Tree Learning
    Gender Identification
    Naïve Bayes Learning
    Maximum Entropy Classifiers
    Other Classifiers to Know About
    Sequence Induction Algorithms
    Clustering and Unsupervised Learning
    Semi-Supervised Learning
    Matching Annotation to Algorithms
    Summary

8. Testing and Evaluation
    Testing Your Algorithm
    Evaluating Your Algorithm
    Confusion Matrices
    Calculating Evaluation Scores
    Interpreting Evaluation Scores
    Problems That Can Affect Evaluation
    Dataset Is Too Small
    Algorithm Fits the Development Data Too Well
    Too Much Information in the Annotation
    Final Testing Scores
    Summary

9. Revising and Reporting
    Revising Your Project
    Corpus Distributions and Content
    Model and Specification
    Annotation
    Training and Testing
    Reporting About Your Work
    About Your Corpus
    About Your Model and Specifications
    About Your Annotation Task and Annotators
    About Your ML Algorithm
    About Your Revisions
    Summary

10. Annotation: TimeML
    The Goal of TimeML
    Related Research
    Building the Corpus
    Model: Preliminary Specifications
    Times
    Signals
    Events
    Links
    Annotation: First Attempts
    Model: The TimeML Specification Used in TimeBank
    Time Expressions
    Events
    Signals
    Links
    Confidence
    Annotation: The Creation of TimeBank
    TimeML Becomes ISO-TimeML
    Modeling the Future: Directions for TimeML
    Narrative Containers
    Expanding TimeML to Other Domains
    Event Structures
    Summary

11. Automatic Annotation: Generating TimeML
    The TARSQI Components
    GUTime: Temporal Marker Identification
    EVITA: Event Recognition and Classification
    GUTenLINK
    Slinket
    SputLink
    Machine Learning in the TARSQI Components
    Improvements to the TTK
    Structural Changes
    Improvements to Temporal Entity Recognition: BTime
    Temporal Relation Identification
    Temporal Relation Validation
    Temporal Relation Visualization
    TimeML Challenges: TempEval-2
    TempEval-2: System Summaries
    Overview of Results
    Future of the TTK
    New Input Formats
    Narrative Containers/Narrative Times
    Medical Documents
    Cross-Document Analysis
    Summary

12. Afterword: The Future of Annotation
    Crowdsourcing Annotation
    Amazon’s Mechanical Turk
    Games with a Purpose (GWAP)
    User-Generated Content
    Handling Big Data
    Boosting
    Active Learning
    Semi-Supervised Learning
    NLP Online and in the Cloud
    Distributed Computing
    Shared Language Resources
    Shared Language Applications
    And Finally

A. List of Available Corpora and Specifications
B. List of Software Resources
C. MAE User Guide
D. MAI User Guide
E. Bibliography
Index
... order to augment a computer’s capability to perform Natural Language Processing (NLP). In particular, we examine how information can be added to natural language text through annotation in order to ... later on ...

This book details the multistage process for building your own annotated natural language dataset (known as a corpus) in order to train ...