Mastering Algorithms with Perl
by Jon Orwant, Jarkko Hietaniemi, and John Macdonald

Copyright © 1999 O'Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.

Cover illustration by Lorrie LeJeune, Copyright © 1999 O'Reilly & Associates, Inc.

Published by O'Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.

Editors: Andy Oram and Jon Orwant
Production Editor: Melanie Wang

Printing History:
August 1999: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a wolf and the topic of Perl algorithms is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 1-56592-398-7 [1/00] [M]

Table of Contents

Preface
1. Introduction
    What Is an Algorithm?
    Efficiency
    Recurrent Themes in Algorithms
2. Basic Data Structures
    Perl's Built-in Data Structures
    Build Your Own Data Structure
    A Simple Example
    Perl Arrays: Many Data Structures in One
3. Advanced Data Structures
    Linked Lists
    Circular Linked Lists
    Garbage Collection in Perl
    Doubly-Linked Lists
    Infinite Lists
    The Cost of Traversal
    Binary Trees
    Heaps
    Binary Heaps
    Janus Heap
    The Heaps Module
    Future CPAN Modules
4. Sorting
    An Introduction to Sorting
    All Sorts of Sorts
    Sorting Algorithms Summary
5. Searching
    Hash Search and Other Non-Searches
    Lookup Searches
    Generative Searches
6. Sets
    Venn Diagrams
    Creating Sets
    Set Union and Intersection
    Set Differences
    Counting Set Elements
    Set Relations
    The Set Modules of CPAN
    Sets of Sets
    Multivalued Sets
    Sets Summary
7. Matrices
    Creating Matrices
    Manipulating Individual Elements
    Finding the Dimensions of a Matrix
    Displaying Matrices
    Adding or Multiplying Constants
    Transposing a Matrix
    Multiplying Matrices
    Extracting a Submatrix
    Combining Matrices
    Inverting a Matrix
    Computing the Determinant
    Gaussian Elimination
    Eigenvalues and Eigenvectors
    The Matrix Chain Product
    Delving Deeper
8. Graphs
    Vertices and Edges
    Derived Graphs
    Graph Attributes
    Graph Representation in Computers
    Graph Traversal
    Paths and Bridges
    Graph Biology: Trees, Forests, DAGS, Ancestors, and Descendants
    Edge and Graph Classes
    CPAN Graph Modules
9. Strings
    Perl Builtins
    String-Matching Algorithms
    Phonetic Algorithms
    Stemming and Inflection
    Parsing
    Compression
10. Geometric Algorithms
    Distance
    Area, Perimeter, and Volume
    Direction
    Intersection
    Inclusion
    Boundaries
    Closest Pair of Points
    Geometric Algorithms Summary
    CPAN Graphics Modules
11. Number Systems
    Integers and Reals
    Strange Systems
    Trigonometry
    Significant Series
12. Number Theory
    Basic Number Theory
    Prime Numbers
    Unsolved Problems
13. Cryptography
    Legal Issues
    Authorizing People with Passwords
    Authorization of Data: Checksums and More
    Obscuring Data: Encryption
    Hiding Data: Steganography
    Winnowing and Chaffing
    Encrypted Perl Code
    Other Issues
14. Probability
    Random Numbers
    Events
    Permutations and Combinations
    Probability Distributions
    Rolling Dice: Uniform Distributions
    Loaded Dice and Candy Colors: Nonuniform Discrete Distributions
    If the Blue Jays Score Six Runs: Conditional Probability
    Flipping Coins over and Over: Infinite Discrete Distributions
    How Much Snow? Continuous Distributions
    Many More Distributions
15. Statistics
    Statistical Measures
    Significance Tests
    Correlation
16. Numerical Analysis
    Computing Derivatives and Integrals
    Solving Equations
    Interpolation, Extrapolation, and Curve Fitting
A. Further Reading
B. ASCII Character Set
Index

Preface

Perl's popularity has soared in recent years. It owes its appeal first to its technical superiority: Perl's unparalleled portability, speed, and expressiveness have made it the language of choice for a million programmers worldwide. Those programmers have extended Perl in ways unimaginable with languages controlled by committees or companies. Of all languages, Perl has the largest base of free utilities, thanks to the Comprehensive Perl Archive Network (abbreviated CPAN; see http://www.perl.com/CPAN/). The modules and scripts you'll find there have made Perl the most popular language for web, text, and database programming.

But Perl can do more than that. You can solve complex problems in Perl more quickly, and in fewer lines, than in any other language. This ease of use makes Perl an excellent tool for exploring algorithms.
Computer science embraces complexity; the essence of programming is the clean dissection of a seemingly insurmountable problem into a series of simple, computable steps. Perl is ideal for tackling the tougher nuggets of computer science because its liberal syntax lets the programmer express his or her solution in the manner best suited to the task. (After all, Perl's motto is There's More Than One Way To Do It.) Algorithms are complex enough; we don't need a computer language making it any tougher.

Most books about computer algorithms don't include working programs. They express their ideas in quasi-English pseudocode instead, which allows the discussion to focus on concepts without getting bogged down in implementation details. But sometimes the details are what matter: the inefficiencies of a bad implementation sometimes cancel the speedup that a good algorithm provides. The devil is in the details.

And while converting ideas to programs is often a good exercise, it's also just plain time-consuming. So, in this book we've supplied you with not just explanations, but implementations as well. If you read this book carefully, you'll learn more about both algorithms and Perl.

About This Book

This book is written for two kinds of people: those who want cut-and-paste solutions and those who want to hone their programming skills. You'll see how we solve some of the classic problems of computer science and why we solved them the way we did.

Theory or Practice?

Like the wolf featured on the cover, this book is sometimes fierce and sometimes playful. The fierce part is the computer science: we'll often talk the way computer scientists talk and discuss problems that matter little to the practical Perl programmer. Other times, we'll playfully explain the problem and simply tell you about ready-made solutions you can find on the Internet (almost always on CPAN). Deciding when to be fierce and when to be playful hasn't been easy for us.
For instance, every algorithms textbook has a chapter on all of the different ways to sort a collection of items. So do we, even though Perl provides its own sort() function that might be all you ever need. We do this for four reasons. First, we don't want you thinking you've Mastered Algorithms without understanding the algorithms covered in every college course on the subject. Second, the concepts, processes, and strategies underlying those algorithms will come in handy for more than just sorting. Third, it helps to know how Perl's sort() works under the hood, why its particular algorithm (quicksort) was used, and how to avoid some of the inefficiencies that even experienced Perl programmers fall prey to. Finally, sort() isn't always the best solution! Someday, you might need another of the techniques we provide.

When it comes to the inevitable tradeoffs between theory and practice, programmers' tastes vary. We have chosen a middle course, swiftly pouncing from one to the other with feral abandon. If your tastes are exclusively theoretical or practical, we hope you'll still appreciate the balanced diet you'll find here.

Organization of This Book

The chapters in this book can be read in isolation; they typically don't require knowledge from previous chapters. However, we do recommend that you read at least Chapter 1, Introduction, and Chapter 2, Basic Data Structures, which provide the basic material necessary for understanding the rest of the book.

Chapter 1 describes the basics of Perl and algorithms, with an emphasis on speed and general problem-solving techniques.

Chapter 2 explains how to use Perl to create simple and very general representations, like queues and lists of lists.

Chapter 3, Advanced Data Structures, shows how to build the classic computer science data structures.

Chapter 4, Sorting, looks at techniques for ordering data and compares the advantages of each technique.
Chapter 5, Searching, investigates ways to extract individual pieces of information from a larger collection.

Chapter 6, Sets, discusses the basics of set theory and Perl implementations of set operations.

Chapter 7, Matrices, examines techniques for manipulating large arrays of data and solving problems in linear algebra.

Chapter 8, Graphs, describes tools for solving problems that are best represented as a graph: a collection of nodes connected by edges.

Chapter 9, Strings, explains how to implement algorithms for searching, filtering, and parsing strings of text.

Chapter 10, Geometric Algorithms, looks at techniques for computing with two- and three-dimensional constructs.

Chapter 11, Number Systems, investigates methods for generating important constants, functions, and number series, as well as manipulating numbers in alternate coordinate systems.

Chapter 12, Number Theory, examines algorithms for factoring numbers, modular arithmetic, and other techniques for computing with integers.

Chapter 13, Cryptography, demonstrates Perl utilities to conceal your data from prying eyes.

Chapter 14, Probability, discusses how to use Perl for problems involving chance.

Chapter 15, Statistics, describes methods for analyzing the accuracy of hypotheses and characterizing the distribution of data.

Chapter 16, Numerical Analysis, looks at a few of the more common problems in scientific computing.

Appendix A, Further Reading, contains an annotated bibliography.

Appendix B, ASCII Character Set, lists the seven-bit ASCII character set used by default when Perl sorts strings.

Conventions Used in This Book

Italic
    Used for filenames, directory names, URLs, and occasional emphasis.

Constant width
    Used for elements of programming languages, text manipulated by programs, code examples, and output.

Constant width bold
    Used for user input and for emphasis in code.

Constant width italic
    Used for replaceable values.

[...]
                   1,000,000
N log N            13,815,510
N^2                1,000,000,000,000
N^3                1,000,000,000,000,000,000
2^N                A number with 693,148 digits

Figure 1-1 shows how these functions compare when N varies from 1 to 2.

Figure 1-1. Orders of growth between 1 and 2

In Figure 1-1, all these orders of growth seem comparable. But see how they diverge as we extend N to 15 in Figure 1-2.

Figure 1-2. Orders of growth between 1 and 15...

    Benchmark: timing 10000 iterations of bruteforce, quadratic
    bruteforce: 53 secs (12.07 usr 0.05 sys = 12.12 cpu)
     quadratic:  5 secs ( 1.17 usr 0.00 sys =  1.17 cpu)

This tells us that computing the quadratic formula isn't just more elegant, it's also 10 times faster, using only 1.17 CPU seconds compared to the for loop's sluggish 12.12 CPU seconds.

Some tips for using the Benchmark module:

• Any test that takes...

... this to the perlbug mailing list:

    Hi, I'd appreciate if this is a known bug and if a patch is available. int of (2.4/0.2) returns 11 instead of the expected 12.

It would seem that this poor fellow is correct: perl -e 'print int(2.4/0.2)' indeed prints 11. You might expect it to print 12, because two-point-four divided by oh-point-two is twelve, and the integer part of 12 is 12. Must be a bug in Perl, right?...

        @result;
    }

    @numbers = (1..1000);

    timethese (1000, { no_temp => 'logbase1( 10, \@numbers )',
                       temp    => 'logbase2( 10, \@numbers )' });

Here, we compute the logs of all the numbers between 1 and 1000. logbase1() and logbase2() are nearly identical, except that logbase2() stores the log of 10 in $logbase so that it doesn't need to compute it each time. The result:

    Benchmark: timing 1000 iterations of no_temp,...
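The bodies of logbase1() and logbase2() are elided in the excerpt above, so the following is only a sketch of what such a comparison might look like, not the book's listing. It assumes each subroutine takes a base and an array reference and returns the list of logarithms; the loop bodies are guesses based on the description. The subroutines are passed to timethese() as code references rather than strings so that the lexical @numbers stays visible under use strict.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Assumed shape: recompute log($base) for every element.
sub logbase1 {
    my ($base, $numbers) = @_;
    my @result;
    for my $n (@$numbers) {
        push @result, log($n) / log($base);
    }
    return @result;
}

# Assumed shape: cache log($base) in a temporary variable, $logbase.
sub logbase2 {
    my ($base, $numbers) = @_;
    my $logbase = log($base);
    my @result;
    for my $n (@$numbers) {
        push @result, log($n) / $logbase;
    }
    return @result;
}

my @numbers = (1 .. 1000);

# Time 1000 calls of each variant; timings depend on the machine.
timethese(1000, {
    no_temp => sub { my @r = logbase1(10, \@numbers) },
    temp    => sub { my @r = logbase2(10, \@numbers) },
});
```

Both variants return the same values; only the number of calls to log() differs, which is what the benchmark is meant to expose.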
... were computed on a 255-MHz DEC Alpha with 96 megabytes of RAM running Perl 5.004_01. Each printable character was fed to the subroutines 5,000 times:

    Benchmark: timing 5000 iterations of compute, lookup_array, lookup_hash
         compute: 24 secs (19.28 usr 0.08 sys = 19.37 cpu)
    lookup_array: 16 secs (15.98 usr 0.03 sys = 16.02 cpu)
     lookup_hash: 16 secs (15.70 usr 0.02 sys = 15.72 cpu)

The lookup hash is slightly...

    ... linearly until we find a good-enough choice
    my ($low, $high) = @_;
    my $x;
    for ($x = $low; $x [...] 'quadratic(1, 1, -1)',
                             bruteforce => 'bruteforce(0, 1)' });

After including the Benchmark module with use Benchmark, this program defines two subroutines. The first computes the larger root of any quadratic equation...

... popular algorithms for sorting a collection of elements. Quicksort is O(N^2) worst case and O(N log N) average case. You'll learn about quicksort in Chapter 4.

In case this all seems pedantic, consider how growth functions compare. Table 1-4 lists eight growth functions and their values given a million data points.

Table 1-4. An Order of Growth Sampler

Growth Function    Value for N = 1,000,000
1                  1
log N              13.8
                   1000...

... is a regular array, with a name, cubs, whereas ['Winken', 'Blinken', 'Nod'] refers to an anonymous array. The syntax for both is shown in Table 1-1.

Table 1-1. Items to Which References Can Point

Type          Assigning a Reference     Assigning a Reference to an
              to a Variable             Anonymous Variable
scalar        $ref = \$scalar           $ref = \1
list          $ref = \@arr              $ref = [ 1, 2, 3 ]
hash          $ref = \%hash             $ref = { a=>1, b=>2, c=>3 }
subroutine...

    ... low ← 0
    high ← length[A]
    while low < high
        do try ← int((low + high) / 2)
           if A[try] > w
               then high ← try
               else if A[try] < w
                   then low ← try + 1
                   else return try
               end if
           end if
        end do
    return NO_ELEMENT

And now the Perl program. Not only is it shorter, it's an honest-to-goodness working subroutine.

    # $index = binary_search( \@array, $word )
    # @array is a list of lowercase strings...
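Because the Perl listing is cut off after its opening comments, here is a minimal sketch that follows the pseudocode step for step. It is an illustration under stated assumptions rather than the book's subroutine: it assumes the array is already sorted in ascending string order, and it returns undef where the pseudocode returns NO_ELEMENT.

```perl
use strict;
use warnings;

# $index = binary_search( \@array, $word )
#   @array is assumed to be sorted in ascending string order.
#   Returns the index of $word, or undef if $word is not present.
sub binary_search {
    my ($array, $word) = @_;
    # Search the half-open range [low, high), as in the pseudocode.
    my ($low, $high) = (0, scalar @$array);

    while ($low < $high) {
        my $try = int(($low + $high) / 2);
        if    ($array->[$try] gt $word) { $high = $try }      # look in lower half
        elsif ($array->[$try] lt $word) { $low  = $try + 1 }  # look in upper half
        else                            { return $try }       # found it
    }
    return;    # stands in for NO_ELEMENT
}

my @words = sort qw(nod blinken winken cub wolf);
my $index = binary_search(\@words, 'nod');
print defined $index ? "found at index $index\n" : "not found\n";
# prints "found at index 2"
```

Each pass through the loop halves the range still under consideration, which is why the search needs only about log2(N) comparisons for N elements.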
... Numbers" in Chapter 11, Number Systems.) The volume_var() subroutine assigns (n/2)! to a temporary variable, $denom; the volume_novar() subroutine returns the result directly.

    use constant pi => 3.14159265358979;

    sub volume_var {
        my ($r, $n) = @_;
        my $denom;
        if ($n % 2) {
            $denom = sqrt(pi) * factorial(2 * (int($n / 2)) + 2) /
                     factorial(int($n / 2) + 1) / (4 ** (int($n / 2) + 1));
        } else {
            $denom ...
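The volume_var() fragment above breaks off inside the else branch. The sketch below shows one way the subroutine might be completed, assuming it computes the volume of an n-dimensional sphere of radius $r as pi**(n/2) * r**n / (n/2)!. The else branch, the return expression, and the factorial() helper are all assumptions added for illustration; only the odd-n expression for $denom comes from the excerpt.

```perl
use strict;
use warnings;
use constant pi => 3.14159265358979;

# Hypothetical helper: plain integer factorial.
sub factorial {
    my ($n) = @_;
    my $result = 1;
    $result *= $_ for 2 .. $n;
    return $result;
}

# Assumed purpose: volume of an n-dimensional sphere of radius $r.
sub volume_var {
    my ($r, $n) = @_;
    my $half = int($n / 2);
    my $denom;
    if ($n % 2) {
        # Odd n: (n/2)! is a half-integer factorial (from the excerpt).
        $denom = sqrt(pi) * factorial(2 * $half + 2)
               / factorial($half + 1) / (4 ** ($half + 1));
    } else {
        # Even n: (n/2)! is an ordinary factorial (assumed).
        $denom = factorial($half);
    }
    return $r ** $n * pi ** ($n / 2) / $denom;    # assumed return expression
}

printf "circle of radius 1: %.5f\n", volume_var(1, 2);
printf "sphere of radius 1: %.5f\n", volume_var(1, 3);
```

With these assumptions the familiar cases come out as expected: n = 2 gives the area of a circle (pi * r**2) and n = 3 gives the volume of a sphere (4/3 * pi * r**3).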