Structures of String Matching and Data Compression

N. Jesper Larsson

Department of Computer Science, Lund University
Box 118, S-221 00 Lund, Sweden

Copyright © 1999 by Jesper Larsson
CODEN: LUNFD6/(NFCS-1015)/1–130/(1999)
ISBN 91-628-3685-4

Abstract

This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretical results and practical implementations.

We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words.

By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source modelling, we show that a PPM* style model can be maintained in linear time using arbitrarily bounded storage space.

We also consider the related problem of suffix sorting, applicable to suffix array construction and block sorting compression. We present an algorithm that eliminates superfluous processing of previous solutions while maintaining robust worst-case behaviour. We experimentally show favourable performance for a wide range of natural and degenerate inputs, and present a complete implementation.

Block sorting compression using BWT, the Burrows-Wheeler transform, has implicit structure closely related to context trees used in predictive modelling. We show how an explicit BWT context tree can be efficiently generated as a subset of the
corresponding suffix tree and explore the central problems in using this structure. We experimentally evaluate the prediction capabilities of the tree and consider representing it explicitly as part of the compressed data, arguing that a conscious treatment of the context tree can combine the compression performance of predictive modelling with the computational efficiency of BWT.

Finally, we explore offline dictionary-based compression, and present a semi-static source modelling scheme that obtains excellent compression, yet is also capable of high decoding rates. The amount of memory used by the decoder is flexible, and the compressed data has the potential of supporting direct search operations.

Between theory and practice, some talk as if they were two – making a separation and difference between them. Yet wise men know that both can be gained in applying oneself whole-heartedly to one.
Bhagavad-Gītā 5:4

Short-sighted programming can fail to improve the quality of life. It can reduce it, causing economic loss or even physical harm. In a few extreme cases, bad programming practice can lead to death.
P. J. Plauger, Computer Language, Dec. 1990

Contents

Foreword

Chapter One  Fundamentals
  1.1 Basic Definitions  10
  1.2 Trie Storage Considerations  12
  1.3 Suffix Trees  13
  1.4 Sequential Data Compression  19

Chapter Two  Sliding Window Indexing  21
  2.1 Suffix Tree Construction  22
  2.2 Sliding the Window  24
  2.3 Storage Issues and Final Result  32

Chapter Three  Indexing Word-Partitioned Data  33
  3.1 Definitions  34
  3.2 Wasting Space: Algorithm A  36
  3.3 Saving Space: Algorithm B  36
  3.4 Extensions and Variations  40
  3.5 Sublinear Construction: Algorithm C  41
  3.6 Additional Notes on Practice  45

Chapter Four  Suffix Sorting  48
  4.1 Background  50
  4.2 A Faster Suffix Sort  52
  4.3 Time Complexity  56
  4.4 Algorithm Refinements  59
  4.5 Implementation and Experiments  63

Chapter Five  Suffix Tree Source Models  71
  5.1 Ziv-Lempel Model  71
  5.2 Predictive Modelling  73
  5.3 Suffix Tree PPM* Model  74
  5.4 Finite PPM* Model  76
  5.5 Non-Structural Operations  76
  5.6 Conclusions  78

Chapter Six  Burrows-Wheeler Context Trees  79
  6.1 Background  80
  6.2 Context Trees  82
  6.3 The Relationship between Move-to-front Coding and Context Trees  86
  6.4 Context Tree BWT Compression Schemes  87
  6.5 Final Comments  89

Chapter Seven  Semi-Static Dictionary Model  91
  7.1 Previous Approaches  93
  7.2 Recursive Pairing  94
  7.3 Implementation  95
  7.4 Compression Effectiveness  101
  7.5 Encoding the Dictionary  102
  7.6 Tradeoffs  105
  7.7 Experimental Results  106
  7.8 Future Work  110

Appendix A  Sliding Window Suffix Tree Implementation  111
Appendix B  Suffix Sorting Implementation  119
Appendix C  Notation  125

Bibliography  127

Foreword

Originally, my motivation for studying computer science was most likely spawned by a calculator I bought fourteen years ago. This gadget could store a short sequence of operations, including a conditional jump to the start, which made it possible to program surprisingly intricate computations. I soon realized that this simple mechanism had the power to replace the tedious repeated calculations I so detested with an intellectual exercise: to find a general method to solve a specific problem (something I would later learn to refer to as an algorithm) that could be expressed by pressing a sequence of calculator keys. My fascination for this process still remains.

With more powerful computers, programming is easier, and more challenging problems are needed to keep the process interesting. Ultimately, in algorithm theory, the bothers of producing an actual program are completely skipped over. Instead, the final product is an explanation of how an idealized machine could be programmed to solve a problem efficiently. In this abstract world, program elements are represented as mathematical objects that interact as if they were physical. They can be chained together, piled on top of each other, or linked together to any level of complexity. Without these data structures, which can be combined into
specialized tools for solving the problem at hand, producing large or complicated programs would be infeasible. However, they do not exist any further than in the programmer’s mind; when the program is to be written, everything must again be translated into more basic operations. In my research, I have tried to maintain this connection, seeing algorithm theory not merely as mathematics, but ultimately as a programming tool.

At a low level, computers represent everything as sequences of numbers, albeit with different interpretations depending on the context. The main topic in this thesis is algorithms and data structures – most often tree shaped structures – for finding patterns and repetitions in long sequences, strings, of similar items. Examples of typical strings are texts (strings of letters and punctuation marks), programs (strings of operations), and genetic data (strings of amino acids). Even two-dimensional data, such as pictures, are represented as strings at a lower level. One area particularly explored in the thesis is storing strings compactly, compressing them, by recording repetition and systematically introducing abbreviations for repeating patterns.

The result is a collection of methods for organizing, searching, and compressing data. Its creation has deepened my insights in computer science enormously, and I hope some of it can make a lasting contribution to the computing world as well.

Numerous people have influenced this work. Obviously, my coauthors for different parts of the thesis, Arne Andersson, Alistair Moffat, Kunihiko Sadakane, and Kurt Swanson, have had a direct part in its creation, but many others have contributed in a variety of ways. Without attempting to name them all, I would like to express my gratitude to all the central and peripheral members of the global research community who have supported and assisted me.

The influence of my advisor Arne Andersson goes beyond the work where he stands as an author. He brought me into the research
community from his special angle, and imprinted me with his views and visions. His notions of what is relevant research, and how it should be presented, have guided me through these last five years.

Finally, I wish to specifically thank Alistair Moffat for inviting me to Melbourne and collaborating with me for three months, during which time I was accepted as a full member of his dynamic research group. This gave me a new perspective, and a significant push towards completing the thesis.

Malmö, August 1999
Jesper Larsson

Chapter One  Fundamentals

The main theme of this work is the organization of sequential data to find and exploit patterns and regularities. This chapter defines basic concepts, formulates fundamental observations and theorems, and presents an efficient suffix tree representation. Following chapters frequently refer and relate to the material given in this chapter.

The material and much of the text in this current work is taken primarily from the following five previously presented writings:

• Extended Application of Suffix Trees to Data Compression, presented at the IEEE Data Compression Conference 1996 [42]. A revised and updated version of this material is laid out in chapters two and five, and to some extent in §1.3.

• Suffix Trees on Words, written in collaboration with Arne Andersson and Kurt Swanson, published in Algorithmica, March 1998 [4]. A preliminary version was presented at the Seventh Annual Symposium on Combinatorial Pattern Matching in June 1996. This is presented in chapter three, with some of the preliminaries given in §1.2.

• The Context Trees of Block Sorting Compression, presented at the IEEE Data Compression Conference 1998 [43]. This is the basis of chapter six.

• Offline Dictionary-Based Compression, written with Alistair Moffat of the University of Melbourne, presented at the IEEE Data Compression Conference 1999 [44]. An extended version of this work is presented in chapter seven.

• Faster Suffix Sorting, written with Kunihiko
Sadakane of the University of Tokyo; technical report, submitted [45]. This work is reported in chapter four. Some of its material has been presented in a preliminary version as A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation by Kunihiko Sadakane [59].

1.1 Basic Definitions

We assume that the reader is familiar with basic conventional definitions regarding strings and graphs, and do not attempt to completely define all the concepts used. However, to resolve differences in the literature concerning the use of some concepts, we state the definitions of not only our specialized concepts, but also of some more general ones. For quick reference to our specialized notations, appendix C on pages 125–126 summarizes terms and symbols used in each of the chapters of the thesis.

Notational Convention  For notation regarding asymptotic growth of functions and similar concepts, we adopt the general tradition in computer science; see, for instance, Cormen, Leiserson, and Rivest [20]. All logarithms in the thesis are assumed to be base two, except where otherwise stated.

1.1.1 Symbols and Strings  The input of each of the algorithms described in this thesis is a sequence of items which we refer to as symbols. The interpretation of these symbols as letters, program instructions, amino acids, etc., is generally beyond our scope. We treat a symbol as an abstract element that can represent any kind of unit in the actual implementation – although we provide several examples of practical uses, and often aim our efforts at a particular area of application.

Two basic sets of operations for symbols are common. Either the symbols are considered atomic – indivisible units subject to only a few predefined operations, of which pairwise comparison is a common example – or they are assumed to be represented by integers, and thereby possible to manipulate with all the common arithmetic operations. We adopt predominantly the latter approach, since our primary goal is to develop
practically useful tools, and in present computers everything is always, at the lowest level, represented as integers. Thus, restricting allowed operations beyond the set of arithmetic ones often introduces an unrealistic impediment. We denote the size of the input alphabet, the set of possible values

Appendix A

/* Function advancefront: Moves front, the right endpoint of the window,
   forward by positions positions, increasing its size.*/
void advancefront(int positions)
{
   int s, u, v;                 /* nodes.*/
   int j;
   SYMB b, c;

   while (positions--) {
      v=0; c=x[front];
      while (1) {
         CANONIZE(r, a, ins, proj);
         if (proj==0) {         /* active point at a node.*/
            if (r) {            /* if ins has a child for c.*/
               a=c;             /* a is first symbol in (ins, r) label.*/
               break;           /* endpoint found.*/
            } else
               u=ins;           /* will add child below u.*/
         } else {               /* active point on edge.*/
            j=(r>=mmax ? MM(r-mmax+nodes[ins].depth) : nodes[r].pos);
            b=x[MM(j+proj)];    /* next symbol in (ins, r) label.*/
            if (c==b)           /* if same as front symbol.*/
               break;           /* endpoint found.*/
            else {              /* edge must be split.*/
               u=freelist;      /* u is new node.*/
               freelist=next[u];
               nodes[u].depth=nodes[ins].depth+proj;
               nodes[u].pos=M0(front-proj);
               nodes[u].child=0;
               nodes[u].suf=SIGN; /* emulate update (skipped below).*/
               DELETEEDGE(ins, r, a);
               CREATEEDGE(ins, u, a);
               CREATEEDGE(u, r, b);
               /* [...] */

Appendix B

   /* [...] end of function sort_split: */
   if ((s=pd-pc)>(t=pn-pd-1))
      s=t;
   for (pl=pb, pm=pn-s; s; --s, ++pl, ++pm)
      SWAP(pl, pm);

   s=pb-pa; t=pd-pc;
   if (s>0) sort_split(p, s);
   update_group(p+s, p+n-t-1);
   if (t>0) sort_split(p+n-t, t);
}

/* Function bucketsort: Bucketsort for first iteration.
   Input: x[0...n-1] holds integers in the range [1, k-1], all of which
   appear at least once. x[n] is 0. (This is the corresponding output of
   transform.) k must be at most n+1. p is array of size n+1 whose
   contents are disregarded.
   Output: x is V and p is I after the initial sorting stage of the
   refined suffix sorting algorithm.*/
static void bucketsort(int *x, int *p, int n, int k)
{
   int *pi, i, c, d, g;

   for (pi=p; pi<p+k; ++pi)
      *pi=-1;                   /* mark linked lists empty.*/
   for (i=0; i<=n; ++i) {
      x[i]=p[c=x[i]];           /* insert in linked list.*/
      p[c]=i;
   }
   for (pi=p+k-1, i=n; pi>=p; --pi) {
      d=x[c=*pi];               /* c is position, d is next in list.*/
      x[c]=g=i;                 /* last position equals group number.*/
      if (d>=0) {               /* if more than one element in group.*/
         p[i--]=c;              /* p is permutation for the sorted x.*/
         do {
            d=x[c=d];           /* next in linked list.*/
            x[c]=g;             /* group number in x.*/
            p[i--]=c;           /* permutation in p.*/
         } while (d>=0);
      } else
         p[i--]=-1;             /* one element, sorted group.*/
   }
}

/* Function transform: Transforms the alphabet of x by attempting to
   aggregate several symbols into one, while preserving the suffix order
   of x. The alphabet may also be compacted, so that x on output comprises
   all integers of the new alphabet with no skipped numbers.
   Input: x is an array of size n+1 whose first n elements are positive
   integers in the range [l, k-1]. p is array of size n+1, used for
   temporary storage. q controls aggregation and compaction by defining
   the maximum value for any symbol during transformation: q must be at
   least k-l; if k-l>n, compaction is never done; if q is INT_MAX, the
   maximum number of symbols are aggregated into one.
   Output: Returns an integer j in the range [1, q] representing the size
   of the new alphabet. If [...] */
static int transform(int *x, int *p, int n, int k, int l, int q)
{
   int b, c, d, e, i, j, m, s;
   int *pi, *pl;

   for (s=0, i=k-l; i; i>>=1)
      ++s;                      /* s is number of bits in old symbol.*/
   e=INT_MAX>>s;                /* e is for overflow checking.*/
   for (b=d=r=0; r