
Effective awk Programming: Universal Text Processing and Pattern Matching (Arnold Robbins)


Document information

Pages: 602
File size: 3.14 MB

Contents

Effective awk Programming
Arnold Robbins

To my parents, for their love, and for the wonderful example they set for me.
To my wife Miriam, for making me complete. Thank you for building your life together with me.
To our children Chana, Rivka, Nachum, and Malka, for enrichening our lives in innumerable ways.

Foreword to the Third Edition
Michael Brennan, Author of mawk

Arnold Robbins and I are good friends. We were introduced in 1990 by circumstances, and by our favorite programming language, awk. The circumstances started a couple of years earlier. I was working at a new job and noticed an unplugged Unix computer sitting in the corner. No one knew how to use it, and neither did I. However, a couple of days later, it was running, and I was root and the one-and-only user. That day, I began the transition from statistician to Unix programmer.

On one of many trips to the library or bookstore in search of books on Unix, I found the gray awk book, a.k.a. Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger’s The AWK Programming Language (Addison-Wesley, 1988). awk’s simple programming paradigm (find a pattern in the input and then perform an action) often reduced complex or tedious data manipulations to a few lines of code. I was excited to try my hand at programming in awk.

Alas, the awk on my computer was a limited version of the language described in the gray book. I discovered that my computer had “old awk” and the book described “new awk.” I learned that this was typical; the old version refused to step aside or relinquish its name. If a system had a new awk, it was invariably called nawk, and few systems had it. The best way to get a new awk was to ftp the source code for gawk from prep.ai.mit.edu. gawk was a version of new awk written by David Trueman and Arnold, and available under the GNU General Public License.

(Incidentally, it’s no longer difficult to find a new awk. gawk ships with GNU/Linux, and you can download binaries or source code for almost any system; my wife uses gawk on her VMS box.)

My Unix system started out unplugged from the wall; it certainly was not plugged into a network. So, oblivious to the existence of gawk and the Unix community in general, and desiring a new awk, I wrote my own, called mawk. Before I was finished, I knew about gawk, but it was too late to stop, so I eventually posted to a comp.sources newsgroup.

A few days after my posting, I got a friendly email from Arnold introducing himself. He suggested we share design and algorithms and attached a draft of the POSIX standard so that I could update mawk to support language extensions added after publication of The AWK Programming Language.

Frankly, if our roles had been reversed, I would not have been so open and we probably would have never met. I’m glad we did meet. He is an awk expert’s awk expert and a genuinely nice person. Arnold contributes significant amounts of his expertise and time to the Free Software Foundation.

This book is the gawk reference manual, but at its core it is a book about awk programming that will appeal to a wide audience. It is a definitive reference to the awk language as defined by the 1987 Bell Laboratories release and codified in the 1992 POSIX Utilities standard.

On the other hand, the novice awk programmer can study a wealth of practical programs that emphasize the power of awk’s basic idioms: data-driven control flow, pattern matching with regular expressions, and associative arrays. Those looking for something new can try out gawk’s interface to network protocols via special /inet files.

The programs in this book make clear that an awk program is typically much smaller and faster to develop than a counterpart written in C. Consequently, there is often a payoff to prototyping an algorithm or design in awk to get it running quickly and expose problems early. Often, the interpreted performance is adequate and the awk prototype becomes the product.

The new pgawk (profiling gawk) produces program execution counts. I recently experimented with an algorithm that for n lines of input exhibited ∼ Cn² performance, while theory predicted ∼ Cn log n behavior. A few minutes poring over the awkprof.out profile pinpointed the problem to a single line of code. pgawk is a welcome addition to my programmer’s toolbox.

Arnold has distilled over a decade of experience writing and using awk programs, and developing gawk, into this book. If you use awk or want to learn how, then read this book.

Foreword to the Fourth Edition
Michael Brennan, Author of mawk

Some things don’t change. Thirteen years ago I wrote: “If you use awk or want to learn how, then read this book.” True then, and still true today.

Learning to use a programming language is about more than mastering the syntax. One needs to acquire an understanding of how to use the features of the language to solve practical programming problems. A focus of this book is many examples that show how to use awk.

Some things do change. Our computers are much faster and have more memory. Consequently, speed and storage inefficiencies of a high-level language matter less. Prototyping in awk and then rewriting in C for performance reasons happens less, because more often the prototype is fast enough.

Of course, there are computing operations that are best done in C or C++. With gawk 4.1 and later, you do not have to choose between writing your program in awk or in C/C++. You can write most of your program in awk and the aspects that require C/C++ capabilities can be written in C/C++, and then the pieces glued together when the gawk module loads the C/C++ module as a dynamic plug-in. Chapter 16 has all the details, and, as expected, many examples to help you learn the ins and outs.

I enjoy programming in awk and had fun (re)reading this book. I think you will, too.

Effective awk Programming
Table of Contents

Dedication
Foreword to the Third Edition
Foreword to the Fourth Edition
Preface
  History of awk and gawk
  A Rose by Any Other Name
  Using This Book
  Typographical Conventions
    Dark Corners
  The GNU Project and This Book
  How to Stay Current
  Using Code Examples
  Safari® Books Online
  How to Contact Us
  Acknowledgments

I. The awk Language

1. Getting Started with awk
  How to Run awk Programs
    One-Shot Throwaway awk Programs
    Running awk Without Input Files
    Running Long Programs
    Executable awk Programs
    Comments in awk Programs
    Shell Quoting Issues
      Quoting in MS-Windows batch files
  Datafiles for the Examples
  Some Simple Examples
  An Example with Two Rules
  A More Complex Example
  awk Statements Versus Lines
  Other Features of awk
  When to Use awk
  Summary

2. Running awk and gawk
  Invoking awk
  Command-Line Options
  Other Command-Line Arguments
  Naming Standard Input
  The Environment Variables gawk Uses
    The AWKPATH Environment Variable
    The AWKLIBPATH Environment Variable
    Other Environment Variables
  gawk’s Exit Status
  Including Other Files into Your Program
  Loading Dynamic Extensions into Your Program
  Obsolete Options and/or Features
  Undocumented Options and Features
  Summary

3. Regular Expressions
  How to Use Regular Expressions
  Escape Sequences
  Regular Expression Operators
  Using Bracket Expressions
  How Much Text Matches?
  Using Dynamic Regexps
  gawk-Specific Regexp Operators
  Case Sensitivity in Matching
  Summary

4. Reading Input Files
  How Input Is Split into Records
    Record Splitting with Standard awk
    Record Splitting with gawk
  Examining Fields
  Nonconstant Field Numbers
  Changing the Contents of a Field
  Specifying How Fields Are Separated
    Whitespace Normally Separates Fields
    Using Regular Expressions to Separate Fields
    Making Each Character a Separate Field
    Setting FS from the Command Line
    Making the Full Line Be a Single Field
    Field-Splitting Summary
  Reading Fixed-Width Data
  Defining Fields by Content
  Multiple-Line Records
  Explicit Input with getline
    Using getline with No Arguments
    Using getline into a Variable
    Using getline from a File
    Using getline into a Variable from a File
    Using getline from a Pipe
    Using getline into a Variable from a Pipe
    Using getline from a Coprocess
    Using getline into a Variable from a Coprocess
    Points to Remember About getline
    Summary of getline Variants
  Reading Input with a Timeout
  Directories on the Command Line
  Summary

5. Printing Output
  The print Statement
  print Statement Examples
  Output Separators
  Controlling Numeric Output with print
  Using printf Statements for Fancier Printing
    Introduction to the printf Statement
    Format-Control Letters
    Modifiers for printf Formats
    Examples Using printf
  Redirecting Output of print and printf
  Special Files for Standard Preopened Data Streams
  Special Filenames in gawk
    Accessing Other Open Files with gawk
    Special Files for Network Communications
    Special Filename Caveats
  Closing Input and Output Redirections
  Summary

6. Expressions
  Constants, Variables, and Conversions
    Constant Expressions
      Numeric and string constants
      Octal and hexadecimal numbers
      Regular expression constants
    Using Regular Expression Constants
    Variables
      Using variables in a program
      Assigning variables on the command line
    Conversion of Strings and Numbers
      How awk converts between strings and numbers
      Locales can influence conversion
  Operators: Doing Something with Values
    Arithmetic Operators
    String Concatenation
    Assignment Expressions
    Increment and Decrement Operators
  Truth Values and Conditions
    True and False in awk
    Variable Typing and Comparison Expressions
      String type versus numeric type
      Comparison operators
      String comparison with POSIX rules
    Boolean Expressions
    Conditional Expressions
  Function Calls
  Operator Precedence (How Operators Nest)
  Where You Are Makes a Difference
  Summary

7. Patterns, Actions, and Variables
  Pattern Elements
    Regular Expressions as Patterns
    Expressions as Patterns
    Specifying Record Ranges with Patterns
    The BEGIN and END Special Patterns
      Startup and cleanup actions
      Input/output from BEGIN and END rules
    The BEGINFILE and ENDFILE Special Patterns
    The Empty Pattern
  Using Shell Variables in Programs
  Actions
  Control Statements in Actions
    The if-else Statement
    The while Statement
    The do-while Statement
    The for Statement
    The switch Statement
    The break Statement
    The continue Statement
    The next Statement
    The nextfile Statement
    The exit Statement
  Predefined Variables
    Built-in Variables That Control awk
    Built-in Variables That Convey Information
    Using ARGC and ARGV
  Summary

8. Arrays in awk
  The Basics of Arrays
    Introduction to Arrays
    Referring to an Array Element
    Assigning Array Elements
    Basic Array Example
    Scanning All Elements of an Array
    Using Predefined Array Scanning Orders with gawk
  Using Numbers to Subscript Arrays
  Using Uninitialized Variables as Subscripts
  The delete Statement
  Multidimensional Arrays
    Scanning Multidimensional Arrays
  Arrays of Arrays
  Summary

9. Functions
  Built-in Functions
    Calling Built-in Functions
    Numeric Functions
    String-Manipulation Functions
      More about ‘\’ and ‘&’ with sub(), gsub(), and gensub()
    Input/Output Functions
    Time Functions
    Bit-Manipulation Functions
    Getting Type Information
    String-Translation Functions
  User-Defined Functions
    Function Definition Syntax
    Function Definition Examples
    Calling User-Defined Functions
      Writing a function call
      Controlling variable scope
      Passing function arguments by value or by reference
    The return Statement
    Functions and Their Effects on Variable Typing
  Indirect Function Calls
  Summary

II. Problem Solving with awk

10. A Library of awk Functions
  Naming Library Function Global Variables
  General Programming
    Converting Strings to Numbers
    Assertions
    Rounding Numbers
    The Cliff Random Number Generator
    Translating Between Characters and Numbers
    Merging an Array into a String
    Managing the Time of Day
    Reading a Whole File at Once
    Quoting Strings to Pass to the Shell
  Datafile Management
    Noting Datafile Boundaries
    Rereading the Current File
    Checking for Readable Datafiles
    Checking for Zero-Length Files
    Treating Assignments as Filenames
  Processing Command-Line Options
  Reading the User Database
  Reading the Group Database
  Traversing Arrays of Arrays
  Summary

11. Practical awk Programs
  Running the Example Programs
  Reinventing Wheels for Fun and Profit
    Cutting Out Fields and Columns
    Searching for Regular Expressions in Files
    Printing Out User Information
    Splitting a Large File into Pieces
    Duplicating Output into Multiple Files
    Printing Nonduplicated Lines of Text
    Counting Things
  A Grab Bag of awk Programs
    Finding Duplicated Words in a Document
    An Alarm Clock Program
    Transliterating Characters
    Printing Mailing Labels
    Generating Word-Usage Counts
    Removing Duplicates from Unsorted Text
    Extracting Programs from Texinfo Source Files
    A Simple Stream Editor
    An Easy Way to Use Library Functions
    Finding Anagrams from a Dictionary
    And Now for Something Completely Different
  Summary

III. Moving Beyond Standard awk with gawk

12. Advanced Features of gawk
  Allowing Nondecimal Input Data
  Controlling Array Traversal and Array Sorting
    Controlling Array Traversal
    Sorting Array Values and Indices with gawk
  Two-Way Communications with Another Process
  Using gawk for Network Programming
  Profiling Your awk Programs
  Summary

13. Internationalization with gawk
  Internationalization and Localization
  GNU gettext
  Internationalizing awk Programs
  Translating awk Programs
    Extracting Marked Strings
    Rearranging printf Arguments
    awk Portability Issues
  A Simple Internationalization Example
  gawk Can Speak Your Language
  Summary

14. Debugging awk Programs
  Introduction to the gawk Debugger
    Debugging in General
    Debugging Concepts
    awk Debugging
  Sample gawk Debugging Session
    How to Start the Debugger
    Finding the Bug
  Main Debugger Commands
    Control of Breakpoints
    Control of Execution
    Viewing and Changing Data
    Working with the Stack
    Obtaining Information About the Program and the Debugger State
    Miscellaneous Commands
  Readline Support
  Limitations
  Summary

15. Arithmetic and Arbitrary-Precision Arithmetic with gawk
  A General Description of Computer Arithmetic
  Other Stuff to Know
  Arbitrary-Precision Arithmetic Features in gawk
  Floating-Point Arithmetic: Caveat Emptor!
    Floating-Point Arithmetic Is Not Exact
      Many numbers cannot be represented exactly
      Be careful comparing values
      Errors accumulate
    Getting the Accuracy You Need
    Try a Few Extra Bits of Precision and Rounding
    Setting the Precision
    Setting the Rounding Mode
  Arbitrary-Precision Integer Arithmetic with gawk
  Standards Versus Existing Practice
  Summary

16. Writing Extensions for gawk
  Introduction
  Extension Licensing
  How It Works at a High Level
  API Description
    Introduction
    General-Purpose Data Types
    Memory Allocation Functions and Convenience Macros
    Constructor Functions
    Registration Functions
      Registering an extension function
      Registering an exit callback function
      Registering an extension version string
      Customized input parsers
      Customized output wrappers
      Customized two-way processors
    Printing Messages
    Updating ERRNO
    Requesting Values
    Accessing and Updating Parameters
    Symbol Table Access
      Variable access and update by name
      Variable access and update by cookie
      Creating and using cached values
    Array Manipulation
      Array data types
      Array functions
      Working with all the elements of an array
      How to create and populate arrays
    API Variables
      API version constants and variables
      Informational variables
    Boilerplate Code
  How gawk Finds Extensions
  Example: Some File Functions
    Using chdir() and stat()
    C Code for chdir() and stat()
    Integrating the Extensions
  The Sample Extensions in the gawk Distribution
    File-Related Functions
    Interface to fnmatch()
    Interface to fork(), wait(), and waitpid()
    Enabling In-Place File Editing
    Character and Numeric values: ord() and chr()
    Reading Directories
    Reversing Output
    Two-Way I/O Example
    Dumping and Restoring an Array
    Reading an Entire File
    Extension Time Functions
    API Tests
  The gawkextlib Project
  Summary

IV. Appendices

A. The Evolution of the awk Language
  Major Changes Between V7 and SVR3.1
  Changes Between SVR3.1 and SVR4
  Changes Between SVR4 and POSIX awk
  Extensions in Brian Kernighan’s awk
  Extensions in gawk Not in POSIX awk
  Common Extensions Summary
  Regexp Ranges and Locales: A Long Sad Story
  Major Contributors to gawk
  Summary

B. Installing gawk
  The gawk Distribution
    Getting the gawk Distribution
    Extracting the Distribution
    Contents of the gawk Distribution
  Compiling and Installing gawk on Unix-Like Systems
    Compiling gawk for Unix-Like Systems
    Additional Configuration Options
    The Configuration Process
  Installation on Other Operating Systems
    Installation on PC Operating Systems
      Compiling gawk for PC operating systems
      Testing gawk on PC operating systems
      Using gawk on PC operating systems
      Using gawk in the Cygwin environment
      Using gawk in the MSYS environment
    Compiling and Installing gawk on Vax/VMS and OpenVMS
      Compiling gawk on VMS
      Compiling gawk dynamic extensions on VMS
      Installing gawk on VMS
      Running gawk on VMS
      The VMS GNV project
      Some VMS systems have an old version of gawk
  Reporting Problems and Bugs
  Other Freely Available awk Implementations
  Summary

C. GNU General Public License

Index
Colophon
Copyright ...
Chapter 1, Getting Started with awk, provides the essentials you need to know to begin using awk.
Chapter 2, Running awk and gawk, describes how to run gawk, the meaning of its command-line options, and how it finds awk program source files.
Chapter 3, Regular Expressions, introduces regular expressions in general, and in...
Chapter 7, Patterns, Actions, and Variables, describes how to write patterns for matching records, actions for doing something when a record is matched, and the predefined variables awk and gawk use.

Posted: 20 March 2018, 09:12