PATTERN MATCHING
is the number of the actual initial state. (Note the special representation used
for null states with 0 or 1 exits.)
Since we often will want to access states just by number, the most suitable
organization for the machine is to use the array representation. We’ll use the
three arrays
    ch: array [0..Mmax] of char;
    next1, next2: array [0..Mmax] of integer;
Here Mmax is the maximum number of states (twice the maximum pattern
length). It would be possible to get by with two-thirds this amount of space,
since each state really uses only two meaningful pieces of information, but
we’ll forsake this improvement for the sake of clarity and also because pattern
descriptions are not likely to be particularly long.
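To make the three-array representation concrete, here is a small sketch in Python (used instead of the book's Pascal so that it can be run directly). The table values are an assumption: they were chosen to agree with the sample machine for (A*B+AC)D that is traced later in this section, with a blank in ch marking a null state. The helper name successors is hypothetical.

```python
# Assumed machine for (A*B+AC)D. State 0 holds the number of the actual
# initial state; a blank in ch marks a null state.
ch    = [' ', 'A', ' ', 'B', ' ', ' ', 'A', 'C', 'D', ' ']
next1 = [5, 2, 3, 4, 8, 6, 7, 8, 9, 0]
next2 = [5, 2, 1, 4, 8, 2, 7, 8, 9, 0]

def successors(state):
    """Return the distinct states reachable from `state` in one step."""
    exits = [next1[state]]
    if next2[state] != next1[state]:
        exits.append(next2[state])   # only null states with two exits differ
    return exits
```

A state with a single exit simply stores the same number in both next1 and next2, which is why the helper compares the two entries before reporting a second exit.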
We’ve seen how to build up machines from regular expression pattern
descriptions and how such machines might be represented as arrays. However,
to write a program to do the translation from a regular expression to the
corresponding nondeterministic machine representation automatically is quite
another matter. In fact, even writing a program to determine if a given regular
expression is legal is challenging for the uninitiated. In the next chapter, we’ll
study this operation, called parsing, in much more detail. For the moment,
we’ll assume that this translation has been done, so that we have available
the ch, next1, and next2 arrays representing a particular nondeterministic
machine which corresponds to the regular expression pattern description of
interest.
Simulating the Machine
The last step in the development of a general regular-expression pattern-matching
algorithm is to write a program which somehow simulates the operation
of a nondeterministic pattern-matching machine. The idea of writing a
program which can “guess” the right answer seems ridiculous. However, in
this case it turns out that we can keep track of all possible matches in a
systematic way, so that we do eventually encounter the correct one.
One possibility would be to develop a recursive program which mimics
the nondeterministic machine (but tries all possibilities rather than guessing
the right one). Instead of using this approach, we’ll look at a nonrecursive
implementation which exposes the basic operating principles of the method
by keeping the states under consideration in a rather peculiar data structure
called a deque, described in some detail below.
The idea is to keep track of all states that could possibly be encountered
while the machine is “looking at” the current input character. Each of these
states is processed in turn: null states lead to two (or fewer) states, states for
characters which do not match the current input are eliminated, and states
for characters which do match the current input lead to new states for use
when the machine is looking at the next input character. Thus, we maintain
a list of all the states that the nondeterministic machine could possibly be in
at a particular point in the text: the problem is to design an appropriate data
structure for this list.
Processing null states seems to require a stack, since we are essentially
postponing one of two things to be done, just as when we removed the
recursion from Quicksort (so the new state should be put at the beginning
of the current list, lest it get postponed indefinitely). Processing the other
states seems to require a queue, since we don’t want to examine states for the
next input character until we’ve finished with the current character (so the
new state should be put at the end of the current list). Rather than choosing
between these two data structures, we’ll use both! Deques (“double-ended
queues”) combine the features of stacks and queues: a deque is a list to which
items can be added at either end. (Actually, we use an “output-restricted
deque,” since we always remove items from the beginning, not the end: that
would be “dealing from the bottom of the deck.”)
A crucial property of the machine is that there are no “loops” consisting of
just null states: otherwise it could decide nondeterministically to loop forever.
It turns out that this implies that the number of states on the deque at any
time is less than the number of characters in the pattern description.
The program given below uses a deque to simulate the actions of a nondeterministic
pattern-matching machine as described above. While examining
a particular character in the input, the nondeterministic machine can be
in any one of several possible states: the program keeps track of these in
a deque dq. One pointer (head) to the head of the deque is maintained so
that items can be inserted or removed at the beginning, and another pointer
(tail) to the tail of the deque is maintained so that items can be inserted
at the end. If the pattern description has M characters the deque can be
implemented in a “circular” manner in an array of M integers. The con-
tents of the deque are the elements “between” head and tail (inclusive): if
head<=tail, the meaning is obvious; if head>tail we take the elements that
would fall between head and tail if the elements of dq were arranged in a
circle: dq[head], dq[head+1], ..., dq[M-1], dq[0], dq[1], ..., dq[tail]. This is
quite simply implemented by using head:=(head+1) mod M to increment head
and similarly for tail. Similarly, (head+M-1) mod M refers to the element
before head in the array: this is the position at which an element should
be added to the beginning of the deque.
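The circular arrangement can be sketched as follows, in Python rather than the book's Pascal. This is an illustration under stated assumptions, not the book's code: the class name CircularDeque and its method names are hypothetical, and the sketch tracks a count of stored items instead of a tail pointer, which makes the empty and full cases easy to tell apart.

```python
class CircularDeque:
    """Output-restricted deque held in a fixed array of m slots (a sketch)."""

    def __init__(self, m):
        self.m = m
        self.dq = [None] * m
        self.head = 0     # index of the first item
        self.count = 0    # number of items currently stored

    def add_head(self, x):
        assert self.count < self.m, "deque is full"
        self.head = (self.head + self.m - 1) % self.m   # step back, wrapping
        self.dq[self.head] = x
        self.count += 1

    def add_tail(self, x):
        assert self.count < self.m, "deque is full"
        self.dq[(self.head + self.count) % self.m] = x  # slot after the last item
        self.count += 1

    def remove_head(self):
        assert self.count > 0, "deque is empty"
        x = self.dq[self.head]
        self.head = (self.head + 1) % self.m            # step forward, wrapping
        self.count -= 1
        return x

    def items(self):
        """Items from head to tail, following the circle."""
        return [self.dq[(self.head + i) % self.m] for i in range(self.count)]
```

Adding at the head of a deque whose head is at index 0 wraps around to index m-1, which is the (head+M-1) mod M step described above.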
The main loop of the program removes a state from the deque (by
incrementing head mod M and then referring to dq[head]) and performs the
action required. If a character is to be matched, the input is checked for the
required character: if it is found, the state transition is effected by putting
the new state at the end of the deque (so that all states involving the current
character are processed before those involving the next one). If the state is
null, the two possible states to be simulated are put at the beginning of the
deque. The states involving the current input character are kept separated
from those involving the next by a marker scan=-1 in the deque: when
scan is encountered, the pointer into the input string is advanced. The loop
terminates when the end of the input is reached (no match found), state 0 is
reached (legal match found), or only one item, the scan marker, is left on the
deque (no match found). This leads directly to the following implementation:
    function match(j: integer): integer;
      const scan=-1;
      var head, tail, n1, n2: integer;
          dq: array [0..Mmax] of integer;
      procedure addhead(x: integer);
        begin dq[head]:=x; head:=(head+M-1) mod M end;
      procedure addtail(x: integer);
        begin tail:=(tail+1) mod M; dq[tail]:=x end;
      begin
      head:=1; tail:=0;
      addtail(next1[0]); addtail(scan);
      match:=j-1;
      repeat
        if dq[head]=scan then
          { all states for the current character are done: advance }
          begin j:=j+1; addtail(scan) end
        else if ch[dq[head]]=a[j] then
          { character matched: new state waits for the next character }
          addtail(next1[dq[head]])
        else if ch[dq[head]]=' ' then
          { null state: its exits are handled for the current character }
          begin
          n1:=next1[dq[head]]; n2:=next2[dq[head]];
          addhead(n1); if n1<>n2 then addhead(n2)
          end;
        head:=(head+1) mod M
      until (j>N) or (dq[head]=0) or (head=tail);
      if dq[head]=0 then match:=j-1;
      end;
This function takes as its argument the position j in the text string a at which
it should start trying to match. It returns the index of the last character in
the match found (if any; otherwise it returns j-1).
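For readers who want to experiment, here is a Python sketch of the same simulation (Python rather than the book's Pascal). It is not a line-for-line port, and the machine tables are an assumption chosen to agree with the sample machine for (A*B+AC)D traced in this section. It departs from the listing in two places, both labeled in the comments: null states are tested before character states, so that a blank in the text cannot be mistaken for a null state, and the loop stops when the scan marker is reached with no input left rather than as soon as j exceeds N. A hand trace of the listing on a text that ends exactly where the match ends suggests that the (j>N) test fires before the last null states are followed; treat the change as an assumption of this sketch, not as the book's method.

```python
SCAN = -1
M = 10   # assumed deque size: the pattern (A*B+AC)D has 10 characters

# Assumed tables for the sample machine; a blank in ch marks a null state.
ch    = [' ', 'A', ' ', 'B', ' ', ' ', 'A', 'C', 'D', ' ']
next1 = [5, 2, 3, 4, 8, 6, 7, 8, 9, 0]
next2 = [5, 2, 1, 4, 8, 2, 7, 8, 9, 0]

def match(text, j):
    """Try to match starting at 1-based position j; return the index of the
    last matched character, or j-1 if there is no match."""
    n = len(text)
    a = ' ' + text            # pad so that a[1] is the first character
    start = j
    dq = [0] * M
    head, tail = 1, 0

    def addhead(x):
        nonlocal head
        dq[head] = x          # overwrites the state just processed
        head = (head + M - 1) % M

    def addtail(x):
        nonlocal tail
        tail = (tail + 1) % M
        dq[tail] = x

    addtail(next1[0])
    addtail(SCAN)
    while True:
        state = dq[head]
        if state == SCAN:
            if j > n:         # departure: stop only when no input is left
                break
            j += 1
            addtail(SCAN)
        elif ch[state] == ' ':            # departure: null test comes first
            n1, n2 = next1[state], next2[state]
            addhead(n1)
            if n1 != n2:
                addhead(n2)
        elif j <= n and ch[state] == a[j]:
            addtail(next1[state])
        head = (head + 1) % M
        if dq[head] == 0 or head == tail:
            break
    return j - 1 if dq[head] == 0 else start - 1
```

With these tables, match("AABD", 1) follows the same sequence of states as the trace table in this section.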
The following table shows the contents of the deque each time a state is
removed when our sample machine is run with the text string AABD. (For
clarity, the details involving head, tail, and the maintenance of the circular
deque are suppressed in this table: each line shows those elements in the deque
between the head and tail pointers.) The characters appear in the lefthand
column in the table at the point when the program has finished scanning
them.
         5  scan
         2  6  scan
         1  3  6  scan
         3  6  scan  2
         6  scan  2
    A    scan  2  7
         2  7  scan
         1  3  7  scan
         3  7  scan  2
         7  scan  2
    A    scan  2
         2  scan
         1  3  scan
         3  scan
    B    scan  4
         4  scan
         8  scan
    D    scan  9
         9  scan
         0  scan
Thus, we start with State 5 while scanning the first character. First State 5
leads to States 2 and 6, then State 2 leads to States 1 and 3, all of which need
to scan the same character and are on the beginning of the deque. Then State
1 leads to State 2, but at the end of the deque (for the next input character).
State 3 only leads to another state while scanning a B, so it is ignored while
an A is being scanned. When the “scan” sentinel finally reaches the front of
the deque, we see that the machine could be either in State 2 or State 7 after
scanning an A. Continuing, the program eventually ends up in the final state,
after considering all transitions consistent with the text string.
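The walk-through above can be replayed mechanically. The following Python sketch (not the book's Pascal) records the deque contents each time an item is about to be removed. It uses the standard library's collections.deque instead of a circular array, and the machine tables are again an assumption chosen to agree with the sample machine for (A*B+AC)D; the function name trace is hypothetical.

```python
from collections import deque

SCAN = "scan"

# Assumed tables for the sample machine; a blank in ch marks a null state.
ch    = [' ', 'A', ' ', 'B', ' ', ' ', 'A', 'C', 'D', ' ']
next1 = [5, 2, 3, 4, 8, 6, 7, 8, 9, 0]
next2 = [5, 2, 1, 4, 8, 2, 7, 8, 9, 0]

def trace(text):
    """Return the deque contents seen before each removal."""
    dq = deque([next1[0], SCAN])
    rows = [list(dq)]
    j = 0                                  # 0-based index of current character
    while True:
        state = dq.popleft()
        if state == SCAN:
            if j >= len(text) or not dq:   # no input left, or no live states
                break
            j += 1
            dq.append(SCAN)
        elif ch[state] == ' ':             # null state: exits go on the front
            n1, n2 = next1[state], next2[state]
            dq.appendleft(n1)
            if n1 != n2:
                dq.appendleft(n2)
        elif j < len(text) and ch[state] == text[j]:
            dq.append(next1[state])        # match: new state goes on the end
        rows.append(list(dq))
        if dq[0] == 0:                     # state 0 reached: match found
            break
    return rows

for row in trace("AABD"):
    print(*row)
```

Each printed line corresponds to one row of the table, starting with the deque holding State 5 and the scan marker.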
The running time of this program obviously depends very heavily on
the pattern being matched. However, for each of the N input characters, it
processes at most M states of the machine, so the worst-case running time
is proportional to MN. Certainly, not all nondeterministic machines can be
simulated so efficiently, as discussed in more detail in Chapter 40, but the use
of a simple hypothetical pattern-matching machine in this application leads
to a quite reasonable algorithm for a quite difficult problem. However, to
complete the algorithm, we need a program which translates arbitrary regular
expressions into “machines” for interpretation by the above code. In the next
chapter, we’ll look at the implementation of such a program in the context of
a more general discussion of compilers and parsing techniques.
Exercises
1. Give a regular expression for recognizing all occurrences of four or fewer
   consecutive 1’s in a binary string.
2. Draw the nondeterministic pattern matching machine for the pattern
   description (A+B)* +C.
3. Give the state transitions your machine from the previous exercise would
   make to recognize ABBAC.
4. Explain how you would modify the nondeterministic machine to handle
   the “not” function.
5. Explain how you would modify the nondeterministic machine to handle
   “don’t-care” characters.
6. What would happen if match were to try to simulate the following machine?
7. Modify match to handle regular expressions with the “not” function and
   “don’t-care” characters.
8. Show how to construct a pattern description of length M and a text
   string of length N for which the running time of match is as large as
   possible.
9. Why must the deque in match have only one “scan” sentinel in it?
10. Show the contents of the deque each time a state is removed when match
    is used to simulate the example machine in the text with the text string
    ACD.
21. Parsing
Several fundamental algorithms have been developed to recognize legal
computer programs and to decompose their structure into a form suitable
for further processing. This operation, called parsing, has application beyond
computer science, since it is directly related to the study of the structure
of language in general. For example, parsing plays an important role in sys-
tems which try to “understand” natural (human) languages and in systems
for translating from one language to another. One particular case of interest
is translating from a “high-level” computer language like Pascal (suitable
for human use) to a “low-level” assembly or machine language (suitable for
machine execution). A program for doing such a translation is called a com-
piler.
Two general approaches are used for parsing. Top-down methods look
for a legal program by first looking for parts of a legal program, then looking
for parts of parts, etc. until the pieces are small enough to match the input
directly. Bottom-up methods put pieces of the input together in a structured
way making bigger and bigger pieces until a legal program is constructed.
In general, top-down methods are recursive, bottom-up methods are iterative;
top-down methods are thought to be easier to implement, bottom-up methods
are thought to be more efficient.
A full treatment of the issues involved in parser and compiler construction
would clearly be beyond the scope of this book. However, by building a simple
“compiler” to complete the pattern-matching algorithm of the previous chapter,
we will be able to consider some of the fundamental concepts involved.
First we’ll construct a top-down parser for a simple language for describing
regular expressions. Then we’ll modify the parser to make a program which
translates regular expressions into pattern-matching machines for use by the
match procedure of the previous chapter.
Our intent in this chapter is to give some feeling for the basic principles
of parsing and compiling while at the same time developing a useful pattern
matching algorithm. Certainly we cannot treat the issues involved at the
level of depth that they deserve. The reader should be warned that subtle
difficulties are likely to arise in applying the same approach to similar prob-
lems, and advised that compiler construction is a quite well-developed field
with a variety of advanced methods available for serious applications.
Context-Free Grammars
Before we can write a program to determine whether a program written in
a given language is legal, we need a description of exactly what constitutes
a legal program. This description is called a grammar: to appreciate the ter-
minology, think of the language as English and read “sentence” for “program”
in the previous sentence (except for the first occurrence!). Programming languages
are often described by a particular type of grammar called a context-free
grammar. For example, the context-free grammar which defines the set
of all legal regular expressions (as described in the previous chapter) is given
below.
    <expression> ::= <term> | <term> + <expression>
    <term> ::= <factor> | <factor><term>
    <factor> ::= (<expression>) | v | <factor>*
This grammar describes regular expressions like those that we used in the last
chapter, such as (1+01)*(0+1) or (A*B+AC)D. Each line in the grammar is
called a production or replacement rule. The productions consist of terminal
symbols (, ), + and * which are the symbols used in the language being
described (“v,” a special symbol, stands for any letter or digit); nonterminal
symbols <expression>, <term>, and <factor> which are internal to the grammar;
and metasymbols ::= and | which are used to describe the meaning of the
productions. The ::= symbol, which may be read “is a,” defines the left-hand
side of the production in terms of the right-hand side; and the | symbol, which
may be read as “or,” indicates alternative choices. The various productions,
though expressed in this concise symbolic notation, correspond in a simple
way to an intuitive description of the grammar. For example, the second
production in the example grammar might be read “a <term> is a <factor>
or a <factor> followed by a <term>.” One nonterminal symbol, in this case
<expression>, is distinguished in the sense that a string of terminal symbols is
in the language described by the grammar if and only if there is some way to
use the productions to derive that string from the distinguished nonterminal
by replacing (in any number of steps) a nonterminal symbol by any of the “or”
clauses on the right-hand side of a production for that nonterminal symbol.
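The derivation process can be made concrete with a small Python sketch (Python rather than the book's Pascal). The dictionary encoding of the productions and the function name derive are assumptions made for illustration. At each step the leftmost nonterminal is replaced by one of its “or” clauses, selected by number (counting from 0).

```python
# Productions of the regular-expression grammar; each nonterminal maps to
# its list of alternatives, and each alternative is a list of symbols.
GRAMMAR = {
    "<expression>": [["<term>"], ["<term>", "+", "<expression>"]],
    "<term>":       [["<factor>"], ["<factor>", "<term>"]],
    "<factor>":     [["(", "<expression>", ")"], ["v"], ["<factor>", "*"]],
}

def derive(choices, start="<expression>"):
    """Apply the chosen alternatives to the leftmost nonterminal, in order,
    and return every intermediate string of the derivation."""
    form = [start]
    history = ["".join(form)]
    for choice in choices:
        # position of the leftmost nonterminal still present
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        history.append("".join(form))
    return history

for step in derive([0, 1, 2, 1, 0, 1]):
    print(step)
```

The choices shown derive v*v, so a string such as A*B (any number of A's followed by a B) is in the language: every remaining symbol is a terminal once the last step has been applied.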
One natural way to describe the result of this derivation process is called
a parse tree: a diagram of the complete grammatical structure of the string
being parsed. For example, the following parse tree shows that the string
(A*B+AC)D is in the language described by the above grammar.
The circled internal nodes labeled E, F, and T represent <expression>, <factor>,
and <term>, respectively. Parse trees like this are sometimes used for English,
to break down a “sentence” into “subject,” “verb,” “object,” etc.
The main function of a parser is to accept strings which can be so derived
and reject those that cannot, by attempting to construct a parse tree for
any given string. That is, the parser can recognize whether a string is in
the language described by the grammar by determining whether or not there
exists a parse tree for the string. Top-down parsers do so by building the
tree starting with the distinguished nonterminal at the top, working down
towards the string to be recognized at the bottom; bottom-up parsers do this
by starting with the string at the bottom, working backwards up towards the
distinguished nonterminal at the top.
As we’ll see, if the strings being recognized also have meanings implying
further processing, then the parser can convert them into an internal repre-
sentation which can facilitate such processing.
Another example of a context-free grammar may be found in the appendix
of the Pascal User Manual and Report: it describes legal Pascal programs.
The principles considered in this section for recognizing and using legal ex-
pressions apply directly to the complex job of compiling and executing Pascal
programs. For example, the following grammar describes a very small subset
of Pascal, arithmetic expressions involving addition and multiplication.
    <expression> ::= <term> | <term> + <expression>
    <term> ::= <factor> | <factor> * <term>
    <factor> ::= (<expression>) | v
Again, v is a special symbol which stands for any letter, but in this grammar
the letters are likely to represent variables with numeric values. Examples of
legal strings for this grammar are A+(B*C) and (A+B*C)*D*(A+(B+C)).
As we have defined things, some strings are perfectly legal both as arith-
metic expressions and as regular expressions. For example, A*(B+C) might
mean “add B to C and multiply the result by A” or “take any number of A’s
followed by either B or C.” This points out the obvious fact that checking
whether a string is legally formed is one thing, but understanding what it
means is quite another. We’ll return to this issue after we’ve seen how to
parse a string to check whether or not it is described by some grammar.
Each regular expression is itself an example of a context-free grammar:
any language which can be described by a regular expression can also be
described by a context-free grammar. The converse is not true: for example,
the concept of “balancing” parentheses can’t be captured with regular ex-
pressions.
Other types of grammars can describe languages which can’t be
described by context-free grammars. For example, context-sensitive grammars
are the same as those above except that the left-hand sides of productions
need not be single nonterminals. The differences between classes of languages
and a hierarchy of grammars for describing them have been very carefully
worked out and form a beautiful theory which lies at the heart of computer
science.
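The remark about balanced parentheses can be illustrated with a short Python sketch (Python rather than the book's Pascal). The grammar used here, S ::= (S)S | empty, is a common textbook choice and an assumption of this example, not something taken from the text above; the function names are hypothetical. The recursion lets the recognizer remember arbitrarily deep nesting, which is what a regular expression cannot do.

```python
def balanced(s):
    """Recognize balanced parentheses using S ::= ( S ) S | empty."""

    def parse_s(i):
        # Return the index just past the S that starts at i, or -1 on error.
        if i < len(s) and s[i] == '(':
            i = parse_s(i + 1)                 # the S inside the parentheses
            if i < 0 or i >= len(s) or s[i] != ')':
                return -1                      # missing closing parenthesis
            return parse_s(i + 1)              # the S that follows
        return i                               # the empty alternative

    return parse_s(0) == len(s)
```

For instance, balanced("(()())") holds, while balanced("(()") and balanced(")(") do not.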
Top-Down Parsing
One parsing method uses recursion to recognize strings from the language
described exactly as specified by the grammar. Put simply, the grammar is
such a complete specification of the language that it can be turned directly
into a program!
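As a rough illustration of that claim, the following Python sketch (not the book's Pascal, and not the author's procedure) writes one function per nonterminal of the regular-expression grammar given earlier in this chapter. The names recognize, ParseError, and peek are hypothetical. One assumption deserves a loud label: the alternative in which a factor is a factor followed by * is left-recursive, so the sketch handles it with a loop that consumes any trailing * characters instead of a recursive call.

```python
class ParseError(Exception):
    """Raised when the input cannot be derived from the grammar."""

def recognize(p):
    """Return True if p is a legal regular expression under the grammar."""
    pos = 0

    def peek():
        return p[pos] if pos < len(p) else ''

    def expression():      # <expression> ::= <term> | <term> + <expression>
        nonlocal pos
        term()
        if peek() == '+':
            pos += 1
            expression()

    def term():            # <term> ::= <factor> | <factor><term>
        factor()
        if peek() == '(' or peek().isalnum():
            term()

    def factor():          # <factor> ::= (<expression>) | v | <factor>*
        nonlocal pos
        if peek() == '(':
            pos += 1
            expression()
            if peek() != ')':
                raise ParseError("expected ')'")
            pos += 1
        elif peek().isalnum():       # v: any letter or digit
            pos += 1
        else:
            raise ParseError("expected '(' or a letter or digit")
        while peek() == '*':         # assumed handling of <factor>*
            pos += 1

    try:
        expression()
    except ParseError:
        return False
    return pos == len(p)             # all of the input must be consumed
```

Calling recognize("(A*B+AC)D") accepts the pattern used throughout the previous chapter, while a string with an unclosed parenthesis such as "(A*B+AC" is rejected.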
Each production corresponds to a procedure with the name of the nonterminal
on the left-hand side. Nonterminals on the right-hand side of the
production correspond to (possibly recursive) procedure calls; terminals correspond
to scanning the input string. For example, the following procedure is part of
a top-down parser for our regular expression grammar: