Document: Algorithms (Part 28) pdf

PATTERN MATCHING

... is the number of the actual initial state. (Note the special representation used for null states with 0 or 1 exits.) Since we often will want to access states just by number, the most suitable organization for the machine is to use the array representation. We’ll use the three arrays

    ch: array [0..Mmax] of char;
    next1, next2: array [0..Mmax] of integer;

Here Mmax is the maximum number of states (twice the maximum pattern length). It would be possible to get by with two-thirds this amount of space, since each state really uses only two meaningful pieces of information, but we’ll forsake this improvement for the sake of clarity and also because pattern descriptions are not likely to be particularly long.

We’ve seen how to build up machines from regular expression pattern descriptions and how such machines might be represented as arrays. However, to write a program to do the translation from a regular expression to the corresponding nondeterministic machine representation automatically is quite another matter. In fact, even writing a program to determine if a given regular expression is legal is challenging for the uninitiated. In the next chapter, we’ll study this operation, called parsing, in much more detail. For the moment, we’ll assume that this translation has been done, so that we have available the ch, next1, and next2 arrays representing a particular nondeterministic machine which corresponds to the regular expression pattern description of interest.

Simulating the Machine

The last step in the development of a general regular-expression pattern-matching algorithm is to write a program which somehow simulates the operation of a nondeterministic pattern-matching machine. The idea of writing a program which can “guess” the right answer seems ridiculous. However, in this case it turns out that we can keep track of all possible matches in a systematic way, so that we do eventually encounter the correct one.

One possibility would be to develop a recursive program which mimics the nondeterministic machine (but tries all possibilities rather than guessing the right one). Instead of using this approach, we’ll look at a nonrecursive implementation which exposes the basic operating principles of the method by keeping the states under consideration in a rather peculiar data structure called a deque, described in some detail below.

The idea is to keep track of all states that could possibly be encountered while the machine is “looking at” the current input character. Each of these states is processed in turn: null states lead to two (or fewer) states, states for characters which do not match the current input are eliminated, and states for characters which do match the current input lead to new states for use when the machine is looking at the next input character. Thus, we maintain a list of all the states that the nondeterministic machine could possibly be in at a particular point in the text: the problem is to design an appropriate data structure for this list.

Processing null states seems to require a stack, since we are essentially postponing one of two things to be done, just as when we removed the recursion from Quicksort (so the new state should be put at the beginning of the current list, lest it get postponed indefinitely). Processing the other states seems to require a queue, since we don’t want to examine states for the next input character until we’ve finished with the current character (so the new state should be put at the end of the current list). Rather than choosing between these two data structures, we’ll use both! Deques (“double-ended queues”) combine the features of stacks and queues: a deque is a list to which items can be added at either end.
(Actually, we use an “output-restricted deque,” since we always remove items from the beginning, not the end: that would be “dealing from the bottom of the deck.”)

A crucial property of the machine is that there are no “loops” consisting of just null states: otherwise it could decide nondeterministically to loop forever. It turns out that this implies that the number of states on the deque at any time is less than the number of characters in the pattern description.

The program given below uses a deque to simulate the actions of a nondeterministic pattern-matching machine as described above. While examining a particular character in the input, the nondeterministic machine can be in any one of several possible states: the program keeps track of these in a deque dq. One pointer (head) to the head of the deque is maintained so that items can be inserted or removed at the beginning, and another pointer (tail) to the tail of the deque is maintained so that items can be inserted at the end. If the pattern description has M characters, the deque can be implemented in a “circular” manner in an array of M integers. The contents of the deque are the elements “between” head and tail (inclusive): if head<=tail, the meaning is obvious; if head>tail we take the elements that would fall between head and tail if the elements of dq were arranged in a circle: dq[head], dq[head+1], ..., dq[M-1], dq[0], dq[1], ..., dq[tail]. This is quite simply implemented by using head:=(head+1) mod M to increment head and similarly for tail. Similarly, head:=(head+M-1) mod M refers to the element before head in the array: this is the position at which an element should be added to the beginning of the deque.

The main loop of the program removes a state from the deque (by incrementing head mod M and then referring to dq[head]) and performs the action required.
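
The circular indexing just described can be tried out in isolation. The following is a minimal sketch in Python rather than the book's Pascal. The class name CircularDeque, its method names, and the count field used to detect a full or empty deque are assumptions made for illustration, and the head pointer here always refers to the first item, which is a slightly simpler convention than the one the book's program uses.

```python
class CircularDeque:
    """Output-restricted deque of integers kept in a fixed-size circular list.

    Hypothetical helper for illustration: items can be added at either end
    but are only ever removed from the head.
    """

    def __init__(self, m):
        self.m = m                  # capacity (M in the text)
        self.dq = [None] * m
        self.head = 1 % m           # index of the first item
        self.tail = 0               # index of the last item
        self.count = 0              # assumption: explicit size to tell full from empty

    def add_tail(self, x):
        if self.count == self.m:
            raise OverflowError("deque is full")
        self.tail = (self.tail + 1) % self.m        # step forward around the circle
        self.dq[self.tail] = x
        self.count += 1

    def add_head(self, x):
        if self.count == self.m:
            raise OverflowError("deque is full")
        self.head = (self.head + self.m - 1) % self.m   # the slot just before head
        self.dq[self.head] = x
        self.count += 1

    def remove_head(self):
        if self.count == 0:
            raise IndexError("deque is empty")
        x = self.dq[self.head]
        self.head = (self.head + 1) % self.m
        self.count -= 1
        return x

    def items(self):
        """Contents from head to tail, following the circle."""
        return [self.dq[(self.head + i) % self.m] for i in range(self.count)]
```

Adding M-1 before taking the remainder, instead of subtracting 1, keeps the intermediate value nonnegative, which matters in languages where mod of a negative number is negative.
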
If a character is to be matched, the input is checked for the required character: if it is found, the state transition is effected by putting the new state at the end of the deque (so that all states involving the current character are processed before those involving the next one). If the state is null, the two possible states to be simulated are put at the beginning of the deque. The states involving the current input character are kept separated from those involving the next by a marker scan=-1 in the deque: when scan is encountered, the pointer into the input string is advanced. The loop terminates when the end of the input is reached (no match found), state 0 is reached (legal match found), or only one item, the scan marker, is left on the deque (no match found). This leads directly to the following implementation:

    function match(j: integer): integer;
    const scan=-1;
    var head, tail, n1, n2: integer;
        dq: array [0..Mmax] of integer;
      procedure addhead(x: integer);
        begin dq[head]:=x; head:=(head+M-1) mod M end;
      procedure addtail(x: integer);
        begin tail:=(tail+1) mod M; dq[tail]:=x end;
    begin
    head:=1; tail:=0;
    addtail(next1[0]); addtail(scan);
    match:=j-1;
    repeat
      if dq[head]=scan then
        begin j:=j+1; addtail(scan) end
      else if ch[dq[head]]=a[j] then
        addtail(next1[dq[head]])
      else if ch[dq[head]]=' ' then
        begin
        n1:=next1[dq[head]]; n2:=next2[dq[head]];
        addhead(n1); if n1<>n2 then addhead(n2)
        end;
      head:=(head+1) mod M
    until (j>N) or (dq[head]=0) or (head=tail);
    if dq[head]=0 then match:=j-1;
    end;

This function takes as its argument the position j in the text string a at which it should start trying to match. It returns the index of the last character in the match found (if any, otherwise it returns j-1).

The following table shows the contents of the deque each time a state is removed when our sample machine is run with the text string AABD.
(For clarity, the details involving head, tail, and the maintenance of the circular deque are suppressed in this table: each line shows those elements in the deque between the head and tail pointers.) The characters appear in the lefthand column in the table at the point when the program has finished scanning them.

         5 scan
         2 6 scan
         1 3 6 scan
         3 6 scan 2
         6 scan 2
    A    scan 2 7
         2 7 scan
         1 3 7 scan
         3 7 scan 2
         7 scan 2
    A    scan 2
         2 scan
         1 3 scan
         3 scan
    B    scan 4
         4 scan
         8 scan
    D    scan 9
         9 scan
         0 scan

Thus, we start with State 5 while scanning the first character. First State 5 leads to States 2 and 6, then State 2 leads to States 1 and 3, all of which need to scan the same character and are on the beginning of the deque. Then State 1 leads to State 2, but at the end of the deque (for the next input character). State 3 only leads to another state while scanning a B, so it is ignored while an A is being scanned. When the “scan” sentinel finally reaches the front of the deque, we see that the machine could be either in State 2 or State 7 after scanning an A. Continuing, the program eventually ends up in the final state, after considering all transitions consistent with the text string.

The running time of this program obviously depends very heavily on the pattern being matched. However, for each of the N input characters, it processes at most M states of the machine, so the worst case running time is proportional to MN. For sure, not all nondeterministic machines can be simulated so efficiently, as discussed in more detail in Chapter 40, but the use of a simple hypothetical pattern-matching machine in this application leads to a quite reasonable algorithm for a quite difficult problem.

However, to complete the algorithm, we need a program which translates arbitrary regular expressions into “machines” for interpretation by the above code.
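
To experiment with the simulation, here is a minimal sketch in Python rather than the book's Pascal. It is not a line-by-line translation of match: it uses the standard collections.deque in place of the circular array, 0-based text positions, and it keeps processing null states after the last character so that a match ending at the end of the text is reported. The CH, NEXT1 and NEXT2 tables are an assumption: they are reconstructed from the trace table above, with the transition out of State 7 (on C, to State 8) inferred from the pattern (A*B+AC)D.

```python
from collections import deque

SCAN = -1  # marker separating states for the current character from the next

# Assumed sample machine for (A*B+AC)D, reconstructed from the trace table.
# State 0 is the final state, NEXT1[0] is the initial state, ' ' marks a null state.
CH    = [' ', 'A', ' ', 'B', ' ', ' ', 'A', 'C', 'D', ' ']
NEXT1 = [5, 2, 3, 4, 8, 6, 7, 8, 9, 0]
NEXT2 = [5, 2, 1, 4, 8, 2, 7, 8, 9, 0]


def match(text, j=0):
    """Try to match starting at text[j] (0-based).

    Returns the index of the last character of the match found,
    or j - 1 (for the original j) if there is no match.
    """
    start = j
    dq = deque([NEXT1[0], SCAN])
    while True:
        state = dq.popleft()            # items are only removed from the front
        if state == 0:
            return j - 1                # final state reached: legal match
        if state == SCAN:
            if not dq:
                return start - 1        # only the marker was left: no match
            j += 1                      # advance to the next input character
            dq.append(SCAN)
        elif CH[state] == ' ':
            n1, n2 = NEXT1[state], NEXT2[state]
            dq.appendleft(n1)           # null state: successors go to the front
            if n1 != n2:
                dq.appendleft(n2)
        elif j < len(text) and CH[state] == text[j]:
            dq.append(NEXT1[state])     # character matched: successor goes to the end
        # otherwise the state does not match the current character and is dropped
```

With these tables, match("AABD") returns 3 and removes states in the same order as the trace table, while match("AAC") returns -1.
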
In the next chapter, we’ll look at the implementation of such a program in the context of a more general discussion of compilers and parsing techniques.

Exercises

1. Give a regular expression for recognizing all occurrences of four or fewer consecutive 1’s in a binary string.
2. Draw the nondeterministic pattern matching machine for the pattern description (A+B)*+C.
3. Give the state transitions your machine from the previous exercise would make to recognize ABBAC.
4. Explain how you would modify the nondeterministic machine to handle the “not” function.
5. Explain how you would modify the nondeterministic machine to handle “don’t-care” characters.
6. What would happen if match were to try to simulate the following machine?
7. Modify match to handle regular expressions with the “not” function and “don’t-care” characters.
8. Show how to construct a pattern description of length M and a text string of length N for which the running time of match is as large as possible.
9. Why must the deque in match have only one “scan” sentinel in it?
10. Show the contents of the deque each time a state is removed when match is used to simulate the example machine in the text with the text string ACD.

21. Parsing

Several fundamental algorithms have been developed to recognize legal computer programs and to decompose their structure into a form suitable for further processing. This operation, called parsing, has application beyond computer science, since it is directly related to the study of the structure of language in general. For example, parsing plays an important role in systems which try to “understand” natural (human) languages and in systems for translating from one language to another. One particular case of interest is translating from a “high-level” computer language like Pascal (suitable for human use) to a “low-level” assembly or machine language (suitable for machine execution). A program for doing such a translation is called a compiler.

Two general approaches are used for parsing. Top-down methods look for a legal program by first looking for parts of a legal program, then looking for parts of parts, etc. until the pieces are small enough to match the input directly. Bottom-up methods put pieces of the input together in a structured way making bigger and bigger pieces until a legal program is constructed. In general, top-down methods are recursive, bottom-up methods are iterative; top-down methods are thought to be easier to implement, bottom-up methods are thought to be more efficient.

A full treatment of the issues involved in parser and compiler construction would clearly be beyond the scope of this book. However, by building a simple “compiler” to complete the pattern-matching algorithm of the previous chapter, we will be able to consider some of the fundamental concepts involved. First we’ll construct a top-down parser for a simple language for describing regular expressions. Then we’ll modify the parser to make a program which translates regular expressions into pattern-matching machines for use by the match procedure of the previous chapter.

Our intent in this chapter is to give some feeling for the basic principles of parsing and compiling while at the same time developing a useful pattern matching algorithm. Certainly we cannot treat the issues involved at the level of depth that they deserve. The reader should be warned that subtle difficulties are likely to arise in applying the same approach to similar problems, and advised that compiler construction is a quite well-developed field with a variety of advanced methods available for serious applications.

Context-Free Grammars

Before we can write a program to determine whether a program written in a given language is legal, we need a description of exactly what constitutes a legal program.
This description is called a grammar: to appreciate the terminology, think of the language as English and read “sentence” for “program” in the previous sentence (except for the first occurrence!). Programming languages are often described by a particular type of grammar called a context-free grammar. For example, the context-free grammar which defines the set of all legal regular expressions (as described in the previous chapter) is given below.

    <expression> ::= <term> | <term> + <expression>
    <term> ::= <factor> | <factor><term>
    <factor> ::= (<expression>) | v | <factor>*

This grammar describes regular expressions like those that we used in the last chapter, such as (1+01)*(0+1) or (A*B+AC)D. Each line in the grammar is called a production or replacement rule. The productions consist of terminal symbols (, ), + and * which are the symbols used in the language being described (“v,” a special symbol, stands for any letter or digit); nonterminal symbols <expression>, <term>, and <factor> which are internal to the grammar; and metasymbols ::= and | which are used to describe the meaning of the productions. The ::= symbol, which may be read “is a,” defines the left-hand side of the production in terms of the right-hand side; and the | symbol, which may be read as “or,” indicates alternative choices. The various productions, though expressed in this concise symbolic notation, correspond in a simple way to an intuitive description of the grammar.
For example, the second production in the example grammar might be read “a <term> is a <factor> or a <factor> followed by a <term>.” One nonterminal symbol, in this case <expression>, is distinguished in the sense that a string of terminal symbols is in the language described by the grammar if and only if there is some way to use the productions to derive that string from the distinguished nonterminal by replacing (in any number of steps) a nonterminal symbol by any of the “or” clauses on the right-hand side of a production for that nonterminal symbol.

One natural way to describe the result of this derivation process is called a parse tree: a diagram of the complete grammatical structure of the string being parsed. For example, the following parse tree shows that the string (A*B+AC)D is in the language described by the above grammar. The circled internal nodes labeled E, F, and T represent <expression>, <factor>, and <term>, respectively. Parse trees like this are sometimes used for English, to break down a “sentence” into “subject,” “verb,” “object,” etc.

The main function of a parser is to accept strings which can be so derived and reject those that cannot, by attempting to construct a parse tree for any given string. That is, the parser can recognize whether a string is in the language described by the grammar by determining whether or not there exists a parse tree for the string. Top-down parsers do so by building the tree starting with the distinguished nonterminal at the top, working down towards the string to be recognized at the bottom; bottom-up parsers do this by starting with the string at the bottom, working backwards up towards the distinguished nonterminal at the top.

As we’ll see, if the strings being recognized also have meanings implying further processing, then the parser can convert them into an internal representation which can facilitate such processing.

Another example of a context-free grammar may be found in the appendix of the Pascal User Manual and Report: it describes legal Pascal programs. The principles considered in this section for recognizing and using legal expressions apply directly to the complex job of compiling and executing Pascal programs. For example, the following grammar describes a very small subset of Pascal, arithmetic expressions involving addition and multiplication.

    <expression> ::= <term> | <term> + <expression>
    <term> ::= <factor> | <factor> * <term>
    <factor> ::= (<expression>) | v

Again, v is a special symbol which stands for any letter, but in this grammar the letters are likely to represent variables with numeric values. Examples of legal strings for this grammar are A+(B*C) and (A+B*C)*D*(A+(B+C)).

As we have defined things, some strings are perfectly legal both as arithmetic expressions and as regular expressions. For example, A*(B+C) might mean “add B to C and multiply the result by A” or “take any number of A’s followed by either B or C.” This points out the obvious fact that checking whether a string is legally formed is one thing, but understanding what it means is quite another. We’ll return to this issue after we’ve seen how to parse a string to check whether or not it is described by some grammar.

Each regular expression is itself an example of a context-free grammar: any language which can be described by a regular expression can also be described by a context-free grammar. The converse is not true: for example, the concept of “balancing” parentheses can’t be captured with regular expressions. Other types of grammars can describe languages which can’t be described by context-free grammars. For example, context-sensitive grammars are the same as those above except that the left-hand sides of productions need not be single nonterminals.
The differences between classes of languages and a hierarchy of grammars for describing them have been very carefully worked out and form a beautiful theory which lies at the heart of computer science.

Top-Down Parsing

One parsing method uses recursion to recognize strings from the language described exactly as specified by the grammar. Put simply, the grammar is such a complete specification of the language that it can be turned directly into a program!

Each production corresponds to a procedure with the name of the nonterminal on the left-hand side. Nonterminals on the right-hand side of the production correspond to (possibly recursive) procedure calls; terminals correspond to scanning the input string. For example, the following procedure is part of a top-down parser for our regular expression grammar:
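
The extracted text breaks off before the book's procedure. As a stand-in, and not the book's Pascal code, here is a minimal sketch in Python of how the productions of the regular expression grammar can be turned into mutually recursive procedures. The function name is_legal, the peek helper, and the use of SyntaxError to signal failure are assumptions made for illustration. Since the production for a starred factor is left-recursive, the sketch handles trailing stars with a loop after the factor has been recognized instead of calling factor recursively.

```python
def is_legal(pattern):
    """Return True if pattern can be derived from <expression> in the grammar
    <expression> ::= <term> | <term> + <expression>
    <term>       ::= <factor> | <factor><term>
    <factor>     ::= (<expression>) | v | <factor>*
    where v stands for any letter or digit."""
    pos = 0  # position of the next unscanned character

    def peek():
        # Current character, or '' once the whole pattern has been scanned.
        return pattern[pos] if pos < len(pattern) else ''

    def expression():
        nonlocal pos
        term()
        if peek() == '+':               # <term> + <expression>
            pos += 1
            expression()

    def term():
        factor()
        if peek() == '(' or peek().isalnum():
            term()                      # <factor><term>

    def factor():
        nonlocal pos
        if peek() == '(':               # (<expression>)
            pos += 1
            expression()
            if peek() != ')':
                raise SyntaxError("expected ')'")
            pos += 1
        elif peek().isalnum():          # v: any letter or digit
            pos += 1
        else:
            raise SyntaxError("expected '(' or a letter or digit")
        while peek() == '*':            # <factor>*, handled iteratively
            pos += 1

    try:
        expression()
    except SyntaxError:
        return False
    return pos == len(pattern)          # legal only if all input was consumed
```

Each nested function mirrors one production: a nonterminal on the right-hand side becomes a call, and a terminal becomes a test of the current character followed by advancing pos. For instance, is_legal("(A*B+AC)D") is True, while is_legal("(A") and is_legal("A+") are False.
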

Posted: 26/01/2014, 14:20
