Mastering Regular Expressions
Powerful Techniques for Perl and Other Tools

Jeffrey E. F. Friedl

O'REILLY
Cambridge • Köln • Paris • Sebastopol • Tokyo

Mastering Regular Expressions
by Jeffrey E. F. Friedl

Copyright © 1997 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.

Editor: Andy Oram
Production Editor: Jeffrey Friedl

Printing History:
    January 1997: First Edition.
    March 1997: Second printing; Minor corrections.
    May 1997: Third printing; Minor corrections.
    July 1997: Fourth printing; Minor corrections.
    November 1997: Fifth printing; Minor corrections.
    August 1998: Sixth printing; Minor corrections.
    December 1998: Seventh printing; Minor corrections.

Nutshell Handbook and the Nutshell Handbook logo are registered trademarks and The Java Series is a trademark of O'Reilly & Associates, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Table of Contents

Preface

1: Introduction to Regular Expressions
    Solving Real Problems
    Regular Expressions as a Language
    The Filename Analogy
    The Language Analogy
    The Regular-Expression Frame of Mind
    Searching Text Files: Egrep
    Egrep Metacharacters
    Start and End of the Line
    Character Classes
    Matching Any Character—Dot
    Alternation
    Word Boundaries
    In a Nutshell
    Optional Items
    Other Quantifiers: Repetition
    Ignoring Differences in Capitalization
    Parentheses and Backreferences
    The Great Escape
    Expanding the Foundation
    Linguistic Diversification
    The Goal of a Regular Expression
    A Few More Examples
    Regular Expression Nomenclature
    Improving on the Status Quo
    Summary
    Personal Glimpses

2: Extended Introductory Examples
    About the Examples
    A Short Introduction to Perl
    Matching Text with Regular Expressions
    Toward a More Real-World Example
    Side Effects of a Successful Match
    Intertwined Regular Expressions
    Intermission
    Modifying Text with Regular Expressions
    Automated Editing
    A Small Mail Utility
    That Doubled-Word Thing

3: Overview of Regular Expression Features and Flavors
    A Casual Stroll Across the Regex Landscape
    The World According to Grep
    The Times They Are a Changin'
    At a Glance
    POSIX
    Care and Handling of Regular Expressions
    Identifying a Regex
    Doing Something with the Matched Text
    Other Examples
    Care and Handling: Summary
    Engines and Chrome Finish
    Chrome and Appearances
    Engines and Drivers
    Common Metacharacters
    Character Shorthands
    Strings as Regular Expression
    Class Shorthands, Dot, and Character Classes
    Anchoring
    Grouping and Retrieving
    Quantifiers
    Alternation
    Guide to the Advanced Chapters
    Tool-Specific Information

4: The Mechanics of Expression Processing
    Start Your Engines!
    Two Kinds of Engines
    New Standards
    Regex Engine Types
    From the Department of Redundancy Department
    Match Basics
    About the Examples
    Rule 1: The Earliest Match Wins
    The "Transmission" and the Bump-Along
    Engine Pieces and Parts
    Rule 2: Some Metacharacters Are Greedy
    Regex-Directed vs. Text-Directed
    NFA Engine: Regex-Directed
    DFA Engine: Text-Directed
    The Mysteries of Life Revealed
    Backtracking
    A Really Crummy Analogy
    Two Important Points on Backtracking
    Saved States
    Backtracking and Greediness
    More About Greediness
    Problems of Greediness
    Multi-Character "Quotes"
    Laziness?
    Greediness Always Favors a Match
    Is Alternation Greedy?
    Uses for Non-Greedy Alternation
    Greedy Alternation in Perspective
    Character Classes vs. Alternation
    NFA, DFA, and POSIX
    "The Longest-Leftmost"
    POSIX and the Longest-Leftmost Rule
    Speed and Efficiency
    DFA and NFA in Comparison

standard formula for matching delimited text 129, 169
standards (see POSIX)
star 17
    can always match 108
    non-greedy 83
star and friends (see quantifier)
start of line 8, 81
start 286, 288
start, of word (see word anchor)
states (see backtracking)
stclass 287
Steinbach, Paul 75
stock pricing example 46, 110
Stok, Mike xxiii
string (also see line)
    anchor (see line anchor)
    doublequoted (see doublequoted string)
    as regex
        in awk 187
        in Emacs 69, 75
        in Perl 219-224
        in Python 75
        in Tcl 75, 189, 193
string anchor optimization 92, 119, 158, 232, 287 (see also line)
string-oriented case-insensitive implementation 278
study 155, 254, 280, 287-289
    bug 289
    and the transmission 288
sub
    in awk 68, 187
subexpression
    counting 19, 44, 97, 229
    defined 25
    named 228, 305
substitution (see sub; gsub; regsub; replace-match)
    in Perl (see Perl, substitution)
    to remove text 111
    removing a partial match 47
SubstrHash module 278
symbolic group names 228, 305
syntax class 78
    and isearch-forward 192
    summary chart 195
Sys::Hostname module 278
syslog module 278

T

tabs (see \t (tab)) xxiii
Takasaki, Yukio xxiii
Tcl 90, 188-192
    \1 191
    \9 190 191
    -all 68, 173, 175, 191
    backslash substitution processing 189
    case-insensitive match 68, 190-191
    chart of shorthands 73
    flavor chart 189
    format 35
    hexadecimal escapes 190
    home page 312
    -indices 135, 191, 304
    line anchors 82
    \n uncertainty 189
    -nocase 68, 190-191
    non-interpolative quoting 176
    null character 190
    octal escapes 190
    optimization 192
        compile caching 160
    support for POSIX locale 66
    regexp 135, 159, 189-190
    regsub 68, 189, 191
    snippet
        filename and path 133-135
        removing C comments 173-176
        Subject 96
        [Tt]ubby 159
    strings as regexes 75, 189, 193
    \x0xddd… 190
temperature conversion 33-43, 199, 281
Term::Cap module 278
Test::Harness module 278
testlib module 278
text (see literal text)
Text module 278
text-directed matching 99-100
    efficiency 118
    regex appearance 101, 106
Text::ParseWords module 207, 278
Text::Wrap module 278
theory of an NFA 104
thingamajiggy as a technical term 71
Thompson, Ken xxii, 60-61, 78, 90
Tie::Hash module 278
Tie::Scalar module 278
Tie::SubstrHash module 278
time of day 23
Time::Local module 278
time-now 197
toothpicks scattered 69, 193
tortilla 65, 81
TPJ (see Perl)
trailing context 120
transmission (also see backtracking)
    and pseudo-backtrack 106
    keeping in synch 173, 206, 236-237, 290
    and study 288
    (see also backtracking)
Tubby 159-160, 165, 218, 285-287
twists and turns of optimization 177
type, global variable (see Perl, variable)
type, private variable (see Perl, variable)

U

uc 245
ucfirst 245
Ullman, Jeffrey 104
Understanding CJKV Information Processing 26
Understanding Japanese Information Processing xxi, 26
Unicode encoding 26
unlimited matches (see quantifier, star)
unmatching (see backtracking)
unrolling the loop 162-172, 175, 265, 301
    checkpoint 164-166
    normal* (special normal*)* 164-165, 301
URL fetcher 73

V

\v99 77
variable
    in doublequoted strings 33, 219-224, 232, 245, 247, 256, 268
    global vs. private 211-216
    interpolation 33, 219, 232, 246, 248, 255, 266, 283, 295, 304-305, 308
    names 22
vars module 278
vertical tab (see \v (vertical tab))
vi 90
Vietnamese text processing 26
Virtual Software Library 310
Vromans, Johan 199, 312

W

Wall, Larry xxii-xxiii, 33, 85, 90, 110, 203, 208-209, 225, 258, 304
warnings 34
    -w 34, 213, 285
    ($^W variable) 213, 235
    setting $* 235
    temporarily turning off 213, 235
webget 73
Weinberger, Peter 90, 183
Welinder, Morten 312
while vs. foreach vs. if 256
whitespace
    allowing flexible 40
    allowing optional 17
    and awk 188
    in an email address 294
    as a regex delimiter 306
    removing 282, 290-291
    and split 263, 308
Windows-NT 185
Wine, Hal 73
Wood, Tom xxii
word anchor 14
    as lookbehind 230
    mechanics of matching 93
    in Perl 45, 240-241, 292
    sample line with positions marked 14
World Wide Web
    common CGI utility 33
    HTML 1, 9, 17-18, 69, 109, 127, 129, 229, 264
wrap module 278
WWW (see World Wide Web)

Y

Yahoo! 310

Z

zero or more (see quantifier, star)
ZIP (Zone Improvement Plan) code example 236-240

About the Author

Jeffrey Friedl was raised in the countryside of Rootstown, Ohio, and had aspirations of being an astronomer until one day noticing a TRS-80 Model I sitting unused in the corner of the chem lab (bristling with a full 16k RAM, no less). He eventually began using UNIX (and regular expressions) in 1980. With degrees in computer science from Kent (B.S.) and the University of New Hampshire (M.S.), he is now an engineer with Omron Corporation, Kyoto, Japan. He lives in Nagaokakyou-city with Tubby, his long-time friend and Teddy Bear, in a tiny apartment designed for a (Japanese) family of three.

Jeffrey applies his regular-expression know-how to make the world a safer place for those not bilingual in English and Japanese. He built and maintains the World Wide Web Japanese-English dictionary server, http://www.itc.omron.com/cgi-bin/j-e, and is active in a variety of language-related projects, both in print and on the Web.

When faced with the daunting task of filling his copious free time, Jeffrey enjoys riding through the mountainous countryside of Japan on his Honda CB-1. At the age of 30, he finally decided to put his 6'4" height to some use, and joined the Omron company basketball team. While finalizing the manuscript for Mastering Regular Expressions, he took time out to appear in his first game, scoring five points in nine minutes of play, which he feels is pretty darn good for a geek.

When visiting his family in The States, Jeffrey enjoys dancing a two-step with his mom, binking old coins with his dad, and playing schoffkopf with his brothers and sisters.

Colophon

The birds featured on the cover of Mastering Regular Expressions are owls. There are two families and approximately 180 species of these birds of prey distributed throughout the world, with the exception of Antarctica. Most species of owl are
nocturnal hunters, feeding entirely on live animals, ranging in size from insects to hares.

Because they have little ability to move their large, forward-facing eyes, owls must move their entire heads in order to look around. They can rotate their heads up to 270 degrees, and some can turn their heads completely upside down.

Among the physical adaptations that enhance owls' effectiveness as hunters is their extreme sensitivity to the frequency and direction of sounds. Many species of owl have asymmetrical ear placement, which enables them to more easily locate their prey in dim or dark light. Once they've pinpointed the location, the owl's soft feathers allow them to fly noiselessly and thus to surprise their prey.

While people have traditionally anthropomorphized birds of prey as evil and cold-blooded creatures, owls are viewed differently in human mythology. Perhaps because their large eyes give them the appearance of intellectual depth, owls have been portrayed in folklore through the ages as wise creatures.

Edie Freedman designed this cover and the entire UNIX bestiary that appears on Nutshell Handbooks, using a 19th-century engraving from the Dover Pictorial Archive. The cover layout was produced with Quark XPress 3.3 using the ITC Garamond font.

The text was prepared by Jeffrey Friedl in a hybrid markup of his own design, mixing SGML, raw troff, raw PostScript, and his own markup. A home-grown filter translated the latter to the other, lower-level markups, the result of which was processed by a locally modified version of O'Reilly's SGML tools (this step requiring upwards of an hour of raw processing time, and over 75 megabytes of process space, just for Chapter 7!).
That result was then processed by a locally-modified version of James Clark's gtroff producing camera-ready PostScript for O'Reilly.

The text was written and processed on an IBM ThinkPad 755 CX, provided by Omron Corporation, running Linux, the X Windows System, and Mule (Multilingual Emacs). A notoriously poor speller, Jeffrey made heavy use of ispell and its Emacs interface. For imaging during development, Jeffrey used Ghostscript (from Aladdin Enterprises, Menlo Park, California), as well as an Apple Color Laser Writer 12/600PS provided by Omron. Test prints at 1270dpi were kindly provided by Ken Lunde, of Adobe Systems, using a Linotronic L300-J.

Ken Lunde also provided a number of special font and typesetting needs, including custom-designed characters and Japanese characters from Adobe Systems's Heisei Mincho W3 typeface.

The figures were originally created by Jeffrey using xfig, as well as Thomas Williams's and Colin Kelley's gnuplot. They were then greatly enhanced by Chris Reilley using Macromedia Freehand.

The text is set in ITC Garamond Light; code is set in Constant Willison; figure labels are in Helvetica Black.

... Modes
7-7 Regex Shorthands and Special-Character Encodings
7-8 String and Regex-Operand Case-Modification Constructs
7-9 Examples of m/…/g with a Can-Match-Nothing Regex
7-10 Standard...
... and Their Regex Engines
5-1 Match Efficiency for a Traditional NFA
5-2 Unrolling-The-Loop Example Cases
5-3 Unrolling-The-Loop Components for C Comments
6-1 A Superficial Survey of...
...
6-6 GNU Emacs's String Metacharacters
6-7 Emacs's NFA Regex Flavor
6-8 Emacs Syntax Classes
7-1 Overview of Perl's Regular-Expression Language
7-2 Overview of Perl's Regex-Related