Tài liệu Thuật toán Algorithms (Phần 29) pdf

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang	10
Dung lượng	74,83 KB

Nội dung

PARSING 273 procedure expression; begin term ; if plj]=‘+’ then begin j:=j+ 1; expression end end ; An array p contains the regular expre:;sion being parsed, with an index j pointing to the character currently begin examined. To parse a given regular expression, we put it in p[l M], (with a sentinel character in p[M+l] which is not used in the grammar) set j to 1, and call expression. If this results in j being set to M+1, then the regular ex 3ression is in the language described by the grammar. Otherwise, we’ll see below how various error conditions are handled. The first thing that expression does is call term, which has a slightly more complicated implementation: procedure term ; begin fact x-; if (1: b]=‘( ‘) or letter(ptj]) then term; end A direct translation from the grammar would simply have term call factor and then term. This obviously won’t work because it leaves no way to exit from term: this program would go into an infinite recursive loop if called. (Such loops have particularly unpleasant effects in many systems.) The implementation above gets around this by first checking the input to decide whether term should be called. l’he first thing that term does is call factor, which is the only one of the proc:dures that could detect a mismatch in the input. From the grammar, we know that when factor is called, the current input character must be either :L “(” or an input letter (represented by u). This process of checking the nez- t character (without incrementing j to decide what to do is called lookahead. For some grammars, this is not necessary; for others even more lookahead is required. Now, the implementation of factor fallows directly from the grammar. If the input character being scanned is not a “(” or an input letter, a procedure error is called to handle the error condit on: 274 CHAPTER 21 procedure factor; begin if pb]=‘(‘then begin j:=j+l; expression ; if p b] = ‘) ’ then j : =j+ 1 else error end else if letter(plj]) then j:=j+l else error; if pb]=‘*‘then j:=j+l; end ; Another error condition occurs when a “)” is missing. These procedures are obviously recursive; in fact they are so intertwined that they can’t be compiled in Pascal without using the forward construct to get around the rule that a procedure can’t be used without first being declared. The parse tree for a given string gives the recursive cal! structure during parsing. The reader may wish to refer to the tree above and trace through the operation of the above three procedures when p contains (A*B+AC)D and expression is called with j=1. This makes the origin of the “top-down” name obvious. Such parsers are also often called recursive descent parsers because they move down the parse tree recursively. The top-down approach won’t work for all possible context-free grammars. For example, if we had the production (expression) ::= v 1 (expression) + (term) then we would have procedure badexpression ; begin if letter(pb]) then j:=j+l else begin badexpression ; if p b] < > ‘+ ’ then error else begin j:=j+l; term end end end ; If this procedure were called with plj] a nonletter (as in our example, for j=l) then it would go into an infinite recursive loop. Avoiding such loops is a principal difficulty in the implementation of recursive descent parsers. For PARSING 275 term, we used lookahead to avoid such a loop; in this case the proper way to get around the problem is to switch the grammar to say (term)+(expression). The occurrence of a nonterminal as the first thing on the right hand side of a replacement rule for itself is called left recursion. Actually, the problem is more subtle, because the left recursion can arise indirectly: for example if we were to have the productions (expression) ::= (term) and (term) ::= v 1 (expression) + (term). Recursive descent parsers won’t work for such grammars: they have to be transformed to equivalent grammars without left recursion, or some other parsing method has to be used. In general, there is an intimate and very widely studied connection between parsers and the grammars they recognize. The choice of a parsing technique is often dictated by the characteristics of the grammar to be parsed. Bottom- Up Parsing Though there are several recursive calls in the programs above, it is an instructive exercise to remove the recursion systematically. Recall from Chapter 9 (where we removed the recursion from Quicksort) that each procedure call can be replaced by a stack push and each procedure return by a stack pop, mimicking what the Pascal system does to implement recursion. A reason for doing this is that many of the calls which seem recursive are not truly recursive. When a procedure call is the last action of a procedure, then a simple goto can be used. This turns expression and term into simple loops, which can be incorporated together and combined with factor to produce a single procedure with one true recursive call (the call to expression within factor). This view leads directly to a quite simple way to check whether regular expressions are legal. Once all the procedure calls are removed, we see that each terminal symbol is simply scanned as it is encountered. The only real processing done is to check whether there is a right parenthesis to match each left parenthesis and whether each ‘I+” is followed by either a letter or a “(I’. That is, checking whether a regular expression is legal is essentially equivalent to checking for balanced parentheses. This can be simply implemented by keeping a counter, initialized to 0, which is incremented when a left parenthesis is encountered, decremented when a right parenthesis is encountered. If the counter is zero when the end of the expression is reached, and each ‘I+” of the expression is followed by either a letter or a “(“, then the expression was legal. Of course, there is more to parsing than simply checking whether the input string is legal: the main goal is to build the parse tree (even if in an implicit way, as in the top-down parser) for other processing. It turns out to be possible to do this with programs with the same essential structure as the parenthesis checker described in the previous paragraph. One type of parser 276 CHAPTER 21 which works in this way is the ‘so-called shift-reduce parser. The idea is to maintain a pushdown stack which holds terminal and nonterminal symbols. Each step in the parse is either a shift step, in which the next input character is simply pushed onto the stack, or a reduce step, in which the top characters on the stack are matched to the right-hand side of some production in the grammar and “reduced to” (replaced by) the nonterminal on the left side of that production. Eventually all the input characters get shifted onto the stack, and eventually the stack gets reduced to a single nonterminal symbol. The main difficulty in building a shift-reduce parser is deciding when to shift and when to reduce. This can be a complicated decision, depending on the grammar. Various types of shift-reduce parsers have been studied in great detail, an extensive literature has been developed on them, and they are quite often preferred over recursive descent parsers because they tend to be slightly more efficient and significantly more flexible. Certainly we don’t have space here to do justice to this field, and we’ll forgo even the details of an implementation for our example. Compilers A compiler may be thought of as a program which translates from one language to another. For example, a Pascal compiler translates programs from the Pascal language into the machine language of some particular computer. We’ll illustrate one way that this might be done by continuing with our regular-expression pattern-matching example, where we wish to translate from the language of regular expressions to a “language” for pattern-matching machines, the ch, nextl, and next2 arrays of the match program of the previous chapter. Essentially, the translation process is “one-to-one”: for each character in the pattern (with the exception of parentheses) we want to produce a state for the pattern-matching machine (an entry in each of the arrays). The trick is to keep track of the information necessary to fill in the next1 and next2 arrays. To do so, we’ll convert each of the procedures in our recursive descent parser into functions which create pattern-matching machines. Each function will add new states as necessary onto the end of the ch, nextl, and next2 arrays, and return the index of the initial state of the machine created (the final state will always be the last entry in the arrays). For example, the function given below for the (expression) production creates the “or” states for the pattern matching machine. PARSING 277 function expression : integer; var tl, t2: integer; begin tl : = term ; expression : = tl ; if plj]=‘+’ then begin j:=j+l; state:=state+I; t2:=state; expression:=t2; state:=state+l; setstate(t2, ’ ‘, expression, tl ) ; setstate(t2-I, ’ ‘, state, state); end ; end ; This function uses a procedure setstate which simply sets the ch, nextl, and next2 array entries indexed by the first argument to the values given in the second, third, and fourth arguments, respectively. The index state keeps track of the “current” state in the machine being built. Each time a new state is created, state is simply incremented. Thus, the state indices for the machine corresponding to a particular procedure call range between the value of state on entry and the value of state on exit. The final state index is the value of state on exit. (We don’t actually “create” the final state by incrementing state before exiting, since this makes it easy to “merge” the final state with later initial states, as we’ll see below.) With this convention, it is easy to check (beware of the recursive call!) that the above program implements the rule for composing two machines with the “or” operation as diagramed in the previous chapter. First the machine for the first part of the expression is built (recursively), then two new null states are added and the second part of the expression built. The first null state (with index t2 1) is the final state of the machine of the first part of the expression which is made into a “no-op” state to skip to the final state for the machine for the second part of the expression, as required. The second null state (with index t2) is the initial state, so its index is the return value for expression and its next1 and next2 entries are made to point to the initial states of the two expressions. Note carefully that these are constructed in the opposite order than one might expect, because the value of state for the no-op state is not known until the recursive call to expression has been made. The function for (term) first builds the machine for a (factor) then, if necessary, merges the final state of that machine with the initial state of the machine for another (term). This is easier done than said, since state is the final state index of the call to factor. A call to term without incrementing state does the trick: 278 CHAPTER 21 function term ; var t: integer; begin term :=factor; if (pb]=‘(‘) or letter(p[j]) then t:=term end ; (We have no use for the initial state index returned by the second call to term, but Pascal requires us to put it, somewhere, so we throw it away in a temporary variable t.) The function for (factor) uses similar techniques to handle its three cases: a parenthesis calls for a recursive call on expression; a v calls for simple concatenation of a new state; and a * calls for operations similar to those in expression, according to the closure diagram from the previous section: function factor; var tl, t2: integer; begin tl :=state; if plj]=‘(‘then begin j:=j+l; t2:=expression; if p b] = ‘) ’ then j := j+ 1 else error end else if letter(pb]) then begin setstate(state,plj], state+l, 0); t2:=state; j:=j+l; state:=state+I end else error; if p[j]<>‘*‘then factor:=t2 else begin setstate(state, ’ ‘, state+l, t2); factor:=state; next1 [tl-I] :=state; j:=j+l; state:=state+l; end ; end ; The reader may find it instructive to trace through the construction of the machine for the pattern (A*B+AC)D given in the previous chapter. PARSING 279 The final step in the development >f a general regular expression pattern matching algorithm is to put these procedures together with the match procedure, as follows: j:==l; state:=l; ne Ytl [0] :=expression; setstate(state, ’ ‘, 0,O); foI i:=l to N-l do if match(i)>=i then writeln(i); This program will print out all character positions in a text string a[l . N] where a pattern p[l .M] leads to a match. Compiler-Compilers The program for general regular expresr:ion pattern matching that we have developed in this and the previous chapter is efficient and quite useful. A version of this program with a few added capabilities (for handling “don’t- care” characters and other amenities) is likely to be among the most heavily used utilities on many computer systems. It is interesting (some might say confusing) to reflect on this algorithm from a more philosophical point of view. In this chapter, we have considered parsers for unraveling the structure of regular expressions, based on a formal description of regular expressions using a context-free grammar. Put another way, we used the context-free gramma]’ to specify a particular “pattern”: sequences of characters with legally balz.nced parentheses. The parser then checks to see if the pattern occurs in the input (but only considers a match legal if it covers the entire input string). Thus parsers, which check that an input string is in the set of strings defined by some context-free grammar, and pattern matchers, which check that an input string is in the set of strings defined by some regular expression, are essentially performing the same function! The principal difference is that context-free grammars are capable of describing a much wider class of strings. For example, the set of all regular expressions can’t be described with regular expressions. Another difference in the way we’ve implemented the programs is that the context-free grammar is “built in” to the parser, while the match procedure is “table-driven”: the same program wol,ks for all regular expressions, once they have been translated into the propel. format. It turns out to be possible to build parsers which are table-driven In the same way, so that the same program can be used to parse all language 3 which can be described by context- free grammars. A parser generator is a program which takes a grammar as input and produces a parser for the language described by that grammar as 280 CHAPTER 21 output. This can be carried one step further: it is possible to build compilers which are table-driven in terms of both the input and the output languages. A compiler-compiler is a program which takes two grammars (and some formal specification of the relationships between them) as input and produces a compiler which translates strings from one language to the other as output. Parser generators and compiler-compilers are available for general use in many computing environments, and are quite useful tools which can be used to produce efficient and reliable parsers and compilers with a relatively small amount of effort. On the other hand, top-down recursive descent parsers of the type considered here are quite serviceable for simple grammars which arise in many applications. Thus, as with many of the algorithms we have considered, we have a straightforward method which can be used for applications where a great deal of implementation effort might not be justified, and several ad- vanced methods which can lead to significant performance improvements for large-scale applications. Of course, in this case, this is significantly understat- ing the point: we’ve only scratched the surface of this extensively researched PARSING 281 Exercises 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. How does the recursive descent parser find an error in a regular expression such as (A+B)*BC+ which is incomplete? Give the parse tree for the regular expression ((A+B)+(Ct-D)*)*. Extend the arithmetic expression grammar to include exponentiation, div and mod. Give a context-free grammar to d’:scribe all strings with no more than two consecutive 1’s. How many procedure calls are used by the recursive descent parser to recognize a regular expression in terms of the number of concatenation, or, and closure operations and the number of parentheses? Give the ch, next1 and next2 arrays that result from building the pattern matching machine for the pattern ((A+B)+(C+D)*)*. Modify the regular expression grammar to handle the “not” function and “don’t-care” characters. Build a general regular expression pattern matcher based on the improved grammar in your answer to the previous question. Remove the recursion from the recursive descent compiler, and simplify the resulting code as much as possible. Compare the running time of the nonrecursive and recursive methods. Write a compiler for simple arithmetic expressions described by the grammar in the text. It should produce a list of ‘*instructions” for a machine capable of three operations: Pugh the value of a variable onto a stack; add the top two values on the stick, removing them from the stack, then putting the result there; and mt.ltiply the top two values on the stack, in the same way. . simple grammars which arise in many applications. Thus, as with many of the algorithms we have considered, we have a straightforward method which can be

Ngày đăng: 26/01/2014, 14:20

Xem thêm