PARSING 273
procedure expression;
begin
term ;
if
plj]=‘+’
then
begin j:=j+ 1; expression end
end
;
An array p contains the regular expre:;sion being parsed, with an index j
pointing to the character currently begin examined. To parse a given regular
expression, we put it in p[l M], (with a sentinel character in
p[M+l]
which
is not used in the grammar) set j to 1, and call expression. If this results in
j being set to M+1, then the regular ex
3ression
is in the language described
by the grammar. Otherwise, we’ll see below how various error conditions are
handled.
The first thing that expression does is call term, which has a slightly more
complicated implementation:
procedure term ;
begin
fact
x-;
if
(1:
b]=‘(
‘)
or letter(ptj]) then term;
end
A direct translation from the grammar would simply have term call factor
and then term. This obviously won’t work because it leaves no way to
exit from term: this program would
go
into an infinite recursive loop if
called. (Such loops have particularly unpleasant effects in many systems.)
The implementation above gets around this by first checking the input to
decide whether term should be called. l’he first thing that term does is call
factor, which is the only one of the proc:dures that could detect a mismatch
in the input. From the grammar, we know that when factor is called, the
current input character must be either
:L
“(”
or an input letter (represented
by
u).
This process of checking the
nez-
t character (without incrementing j
to decide what to do is called lookahead. For some grammars, this is not
necessary; for others even more lookahead is required.
Now, the implementation of factor fallows directly from the grammar. If
the input character being scanned is not a
“(”
or an input letter, a procedure
error is called to handle the error condit on:
274
CHAPTER 21
procedure factor;
begin
if pb]=‘(‘then
begin
j:=j+l;
expression
;
if p
b]
=
‘)
’
then j : =j+
1
else error
end
else if letter(plj]) then
j:=j+l
else error;
if pb]=‘*‘then
j:=j+l;
end
;
Another error condition occurs when a
“)”
is missing.
These procedures are obviously recursive; in fact they are so intertwined
that they can’t be compiled in Pascal without using the forward construct
to get around the rule that a procedure can’t be used without first being
declared.
The parse tree for a given string gives the recursive cal! structure during
parsing. The reader may wish to refer to the tree above and trace through
the operation of the above three procedures when p contains (A*B+AC)D and
expression is called with
j=1.
This makes the origin of the “top-down” name
obvious. Such parsers are also often called recursive descent parsers because
they move down the parse tree recursively.
The top-down approach won’t work for all possible context-free gram-
mars. For example, if we had the production (expression) ::=
v
1 (expression)
+ (term) then we would have
procedure
badexpression ;
begin
if letter(pb]) then
j:=j+l
else
begin
badexpression
;
if p
b]
< >
‘+
’
then error else
begin
j:=j+l;
term end
end
end ;
If this procedure were called with plj]
a nonletter (as in our example, for
j=l)
then it would go into an infinite recursive loop. Avoiding such loops is
a principal difficulty in the implementation of recursive descent parsers. For
PARSING
275
term, we used lookahead to avoid such a loop; in this case the proper way to
get around the problem is to switch the grammar to say (term)+(expression).
The occurrence of a nonterminal as the first thing on the right hand side of
a replacement rule for itself is called left recursion. Actually, the problem
is more subtle, because the left recursion can arise indirectly: for example
if we were to have the productions (expression) ::= (term) and (term) ::=
v
1 (expression) + (term). Recursive descent parsers won’t work for such
grammars: they have to be transformed to equivalent grammars without left
recursion, or some other parsing method has to be used. In general, there
is an intimate and very widely studied connection between parsers and the
grammars they recognize. The choice of a parsing technique is often dictated
by the characteristics of the grammar to be parsed.
Bottom- Up Parsing
Though there are several recursive calls in the programs above, it is an in-
structive exercise to remove the recursion systematically. Recall from Chapter
9 (where we removed the recursion from Quicksort) that each procedure call
can be replaced by a stack push and each procedure return by a stack pop,
mimicking what the Pascal system does to implement recursion. A reason
for doing this is that many of the calls which seem recursive are not truly
recursive. When a procedure call is the last action of a procedure, then a
simple
goto
can be used. This turns expression and term into simple loops,
which can be incorporated together and combined with factor to produce a
single procedure with one true recursive call (the call to expression within
factor).
This view leads directly to a quite simple way to check whether regular
expressions are legal. Once all the procedure calls are removed, we see that
each terminal symbol is simply scanned as it is encountered. The only real
processing done is to check whether there is a right parenthesis to match each
left parenthesis and whether each
‘I+”
is followed by either a letter or a
“(I’.
That is, checking whether a regular expression is legal is essentially equivalent
to checking for balanced parentheses. This can be simply implemented by
keeping a counter, initialized to 0, which is incremented when a left paren-
thesis is encountered, decremented when a right parenthesis is encountered.
If the counter is zero when the end of the expression is reached, and each
‘I+”
of the expression is followed by either a letter or a
“(“,
then the expression
was legal.
Of course, there is more to parsing than simply checking whether the
input string is legal: the main goal is to build the parse tree (even if in an
implicit way, as in the top-down parser) for other processing. It turns out to
be possible to do this with programs with the same essential structure as the
parenthesis checker described in the previous paragraph. One type of parser
276
CHAPTER 21
which works in this way is the ‘so-called shift-reduce parser. The idea is to
maintain a pushdown stack which holds terminal and nonterminal symbols.
Each step in the parse is either a shift step, in which the next input character
is simply pushed onto the stack, or a reduce step, in which the top characters
on the stack are matched to the right-hand side of some production in the
grammar and “reduced to” (replaced by) the nonterminal on the left side
of that production. Eventually all the input characters get shifted onto the
stack, and eventually the stack gets reduced to a single nonterminal symbol.
The main difficulty in building a shift-reduce parser is deciding when to
shift and when to reduce. This can be a complicated decision, depending
on the grammar. Various types of shift-reduce parsers have been studied in
great detail, an extensive literature has been developed on them, and they are
quite often preferred over recursive descent parsers because they tend to be
slightly more efficient and significantly more flexible. Certainly we don’t have
space here to do justice to this field, and we’ll forgo even the details of an
implementation for our example.
Compilers
A compiler may be thought of as a program which translates from one lan-
guage to another. For example, a Pascal compiler translates programs from
the Pascal language into the machine language of some particular computer.
We’ll illustrate one way that this might be done by continuing with our
regular-expression pattern-matching example, where we wish to translate
from the language of regular expressions to a “language” for pattern-matching
machines, the ch,
nextl,
and next2 arrays of the match program of the pre-
vious chapter.
Essentially, the translation process is “one-to-one”: for each character in
the pattern (with the exception of parentheses) we want to produce a state
for the pattern-matching machine (an entry in each of the arrays). The trick
is to keep track of the information necessary to fill in the next1 and next2
arrays. To do so, we’ll convert each of the procedures in our recursive descent
parser into functions which create pattern-matching machines. Each function
will add new states as necessary onto the end of the ch, nextl, and next2
arrays, and return the index of the initial state of the machine created (the
final state will always be the last entry in the arrays).
For example, the function given below for the (expression) production
creates the “or” states for the pattern matching machine.
PARSING
277
function expression : integer;
var
tl,
t2: integer;
begin
tl : = term
;
expression : = tl
;
if
plj]=‘+’
then
begin
j:=j+l;
state:=state+I;
t2:=state; expression:=t2;
state:=state+l;
setstate(t2,
’
‘,
expression,
tl
) ;
setstate(t2-I,
’
‘,
state, state);
end ;
end
;
This function uses a procedure setstate which simply sets the ch, nextl, and
next2 array entries indexed by the first argument to the values given in the
second, third, and fourth arguments, respectively. The index state keeps track
of the “current” state in the machine being built. Each time a new state is
created, state is simply incremented. Thus, the state indices for the machine
corresponding to a particular procedure call range between the value of state
on entry and the value of state on exit.
The final state index is the value
of state on exit. (We don’t actually “create” the final state by incrementing
state before exiting, since this makes it easy to “merge” the final state with
later initial states, as we’ll see below.)
With this convention, it is easy to check (beware of the recursive call!)
that the above program implements the rule for composing two machines with
the “or” operation as diagramed in the previous chapter. First the machine
for the first part of the expression is built (recursively), then two new null
states are added and the second part of the expression built. The first null
state (with index t2 1) is the final state of the machine of the first part of
the expression which is made into a “no-op” state to skip to the final state for
the machine for the second part of the expression, as required. The second
null state (with index t2) is the initial state, so its index is the return value
for expression and its next1 and next2 entries are made to point to the initial
states of the two expressions. Note carefully that these are constructed in the
opposite order than one might expect, because the value of state for the no-op
state is not known until the recursive call to expression has been made.
The function for (term) first builds the machine for a (factor) then, if
necessary, merges the final state of that machine with the initial state of the
machine for another (term). This is easier done than said, since state is the
final state index of the call to factor. A call to term without incrementing
state does the trick:
278
CHAPTER 21
function term ;
var
t:
integer;
begin
term :=factor;
if (pb]=‘(‘) or letter(p[j]) then
t:=term
end
;
(We have no use for the initial state index returned by the second call to
term, but Pascal requires us to put it, somewhere, so we throw it away in a
temporary variable t.)
The function for (factor) uses similar techniques to handle its three cases:
a parenthesis calls for a recursive call on expression; a
v
calls for simple
concatenation of a new state; and a
*
calls for operations similar to those in
expression, according to the closure diagram from the previous section:
function factor;
var
tl,
t2: integer;
begin
tl :=state;
if
plj]=‘(‘then
begin
j:=j+l;
t2:=expression;
if p b] =
‘)
’
then j := j+
1
else error
end
else if letter(pb]) then
begin
setstate(state,plj], state+l, 0);
t2:=state;
j:=j+l;
state:=state+I
end
else error;
if
p[j]<>‘*‘then
factor:=t2 else
begin
setstate(state,
’
‘,
state+l, t2);
factor:=state; next1
[tl-I]
:=state;
j:=j+l;
state:=state+l;
end
;
end ;
The reader may find it instructive to trace through the construction of
the machine for the pattern (A*B+AC)D given in the previous chapter.
PARSING 279
The final step in the development
>f
a general regular expression pat-
tern matching algorithm is to put these procedures together with the match
procedure, as follows:
j:==l;
state:=l;
ne Ytl [0]
:=expression;
setstate(state,
’
‘,
0,O);
foI
i:=l
to N-l do
if
match(i)>=i
then writeln(i);
This program will print out all character positions in a text string
a[l
. N]
where a pattern
p[l
.M]
leads to a match.
Compiler-Compilers
The program for general regular expresr:ion pattern matching that we have
developed in this and the previous chapter is efficient and quite useful. A
version of this program with a few added capabilities (for handling
“don’t-
care” characters and other amenities) is likely to be among the most heavily
used utilities on many computer systems.
It is interesting (some might say confusing) to reflect on this algorithm
from a more philosophical point of view. In this chapter, we have considered
parsers for unraveling the structure of regular expressions, based on a formal
description of regular expressions using a context-free grammar. Put another
way, we used the context-free gramma]’ to specify a particular “pattern”:
sequences of characters with legally
balz.nced
parentheses. The parser then
checks to see if the pattern occurs in the input (but only considers a match
legal if it covers the entire input string). Thus parsers, which check that an
input string is in the set of strings defined by some context-free grammar,
and pattern matchers, which check that an input string is in the set of
strings defined by some regular expression, are essentially performing the same
function! The principal difference is that context-free grammars are capable
of describing a much wider class of
strings.
For example, the set of all regular
expressions can’t be described with regular expressions.
Another difference in the way we’ve implemented the programs is that the
context-free grammar is “built in” to the parser, while the match procedure
is “table-driven”: the same program
wol,ks
for all regular expressions, once
they have been translated into the
propel.
format. It turns out to be possible
to build parsers which are table-driven
In
the same way, so that the same
program can be used to parse all language
3
which can be described by context-
free grammars. A parser generator is a program which takes a grammar as
input and produces a parser for the language described by that grammar as
280 CHAPTER 21
output. This can be carried one step further: it is possible to build compilers
which are table-driven in terms of both the input and the output languages. A
compiler-compiler is a program which takes two grammars (and some formal
specification of the relationships between them) as input and produces a
compiler which translates strings from one language to the other as output.
Parser generators and compiler-compilers are available for general use in
many computing environments, and are quite useful tools which can be used
to produce efficient and reliable parsers and compilers with a relatively small
amount of effort. On the other hand, top-down recursive descent parsers of the
type considered here are quite serviceable for simple grammars which arise in
many applications. Thus, as with many of the algorithms we have considered,
we have a straightforward method which can be used for applications where
a great deal of implementation effort might not be justified, and several ad-
vanced methods which can lead to significant performance improvements for
large-scale applications. Of course, in this case, this is significantly understat-
ing the point: we’ve only scratched the surface of this extensively researched
PARSING
281
Exercises
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
How does the recursive descent parser find an error in a regular expression
such as (A+B)*BC+ which is incomplete?
Give the parse tree for the regular expression ((A+B)+(Ct-D)*)*.
Extend the arithmetic expression grammar to include exponentiation, div
and mod.
Give a context-free grammar to
d’:scribe
all strings with no more than
two consecutive
1’s.
How many procedure calls are used by the recursive descent parser to
recognize a regular expression in terms of the number of concatenation,
or, and closure operations and the number of parentheses?
Give the ch, next1 and next2 arrays that result from building the pattern
matching machine for the pattern ((A+B)+(C+D)*)*.
Modify the regular expression grammar to handle the “not” function and
“don’t-care” characters.
Build a general regular expression pattern matcher based on the improved
grammar in your answer to the previous question.
Remove the recursion from the recursive descent compiler, and simplify
the resulting code as much as possible. Compare the running time of the
nonrecursive and recursive methods.
Write a compiler for simple arithmetic expressions described by the gram-
mar in the text. It should produce a list of ‘*instructions” for a machine
capable of three operations:
Pugh
the value of a variable onto a stack;
add the top two values on the stick, removing them from the stack, then
putting the result there; and mt.ltiply the top two values on the stack, in
the same way.
. simple grammars which arise in
many applications. Thus, as with many of the algorithms we have considered,
we have a straightforward method which can be