• Language implementation systems must analyze source code, regardless of the specific implementation approach
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)
Copyright © 2006 Addison-Wesley All rights reserved 1-3
Using BNF to Describe Syntax
• Provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain
Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
Reasons to Separate Lexical and Syntax Analysis
• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser
• Efficiency - separation allows optimization of the lexical analyzer
• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable
Lexical Analysis
• A lexical analyzer is a pattern matcher for
character strings
• A lexical analyzer is a “front-end” for the parser
• Identifies substrings of the source program that belong together - lexemes
– Lexemes match a character pattern, which is associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• The lexical analysis process also:
– Includes skipping comments, tabs, newlines, and blanks
– Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table
– Saves source locations (file, line, column) for error messages
– Detects and reports syntactic errors in tokens, such as ill-formed floating-point literals, to the user
Pragmas
• Provide directives or hints to the compiler
• Directives:
– Turn various kinds of run-time checks on or off
– Turn certain code improvements on or off (performance vs compilation speed)
– Turn performance profiling on or off
• Hints:
– Variable x is very heavily used (to keep it in a register)
– Subroutine S is not recursive (its storage may be statically allocated)
– 32 bits of precision (instead of 64) suffice for floating-point variable x
• Lexical analysis is (often) responsible for dealing with pragmas
Lexical Analysis (cont.)
• Three main approaches to building a scanner:
1. Write a formal description of the tokens and use a software tool that constructs lexical analyzers given such a description
2. Design a state diagram that describes the token patterns and write a program that implements the diagram*
3. Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram
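The third approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the slides: the state names, the token codes, and the classify() and scan_token() functions are all hypothetical, and the table covers only identifiers and integer literals.

```c
#include <ctype.h>
#include <stddef.h>

/* Character classes (hypothetical names). */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };

/* DFA states; S_ERR means "no transition on this class". */
enum { S_START, S_IDENT, S_INT, NUM_STATES, S_ERR = -1 };

/* Token codes returned by scan_token() (hypothetical). */
enum { T_NONE, T_IDENT, T_INT };

/* Transition table: next_state[state][class]. */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_INT,   S_ERR },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_ERR },
    /* S_INT   */ { S_ERR,   S_INT,   S_ERR },
};

/* Token recognized when the automaton stops in each state. */
static const int accept_token[NUM_STATES] = { T_NONE, T_IDENT, T_INT };

static int classify(char c) {
    if (isalpha((unsigned char)c)) return C_LETTER;
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return C_OTHER;
}

/* Scans one token at the start of s and stores its length in *len. */
int scan_token(const char *s, size_t *len) {
    int state = S_START;
    size_t i = 0;
    while (s[i] != '\0') {
        int next = next_state[state][classify(s[i])];
        if (next == S_ERR) break;   /* character cannot continue the token */
        state = next;
        i++;
    }
    *len = i;
    return accept_token[state];
}
```

The scanning loop never changes when tokens are added; only the two tables grow, which is the main appeal of a table-driven design.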
The “longest possible token” rule
• The scanner returns to the parser only when the next character cannot be used to continue the current token
– The next character will generally need to be saved for the next token
• In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed
– In Pascal, when you have a 3 and you see a ‘.’
  • do you proceed (in hopes of getting 3.14)? or
  • do you stop (in fear of getting 3..5)?
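The Pascal case can be sketched in C as below. This is only an illustration: scan_number() and the token codes TOK_INT and TOK_REAL are hypothetical names, and exponents are ignored. A ‘.’ is accepted as part of the number only when a digit follows it, which is where the second character of look-ahead comes in.

```c
#include <ctype.h>
#include <stddef.h>

enum { TOK_INT, TOK_REAL };   /* hypothetical token codes */

/* Scans a numeric literal starting at s (s[0] is assumed to be a digit).
   Stores the lexeme length in *len and returns the token code. */
int scan_number(const char *s, size_t *len) {
    size_t i = 0;
    while (isdigit((unsigned char)s[i])) i++;        /* integer part */

    /* A '.' continues the token only if a digit follows it;
       deciding this needs a second character of look-ahead. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;                                         /* consume '.' */
        while (isdigit((unsigned char)s[i])) i++;    /* fraction part */
        *len = i;
        return TOK_REAL;
    }
    *len = i;   /* any '.' is left in the input for the next token */
    return TOK_INT;
}
```

With this rule, 3.14 is scanned as one real literal, while in 3..5 the scanner stops after the 3 and leaves both dots for the following token.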
The rule …
• In messier cases, you may not be able to get by with any fixed amount of look-ahead. In Fortran, for example, we have
DO 5 I = 1,25   loop
DO 5 I = 1.25   assignment
• Here, we need to remember we were in a potentially final state, and save enough information that we can back up to it, if we get stuck later
State Diagram Design
• Suppose we need a lexical analyzer that only recognizes program names, reserved words, and integer literals
• A naïve state diagram would have a transition from every state on every character in the source language - such a diagram would be very large!
State Diagram Design (cont.)
• In many cases, transitions can be combined to simplify the state diagram
– When recognizing an identifier, all uppercase and lowercase letters are equivalent - use a character class that includes all letters
– Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word)
State Diagram Design (cont.)
• Convenient utility subprograms:
– getChar - gets the next character of input, puts it in global variable nextChar, determines its class, and puts the class in charClass
– addChar - appends the character in nextChar to the string being accumulated in lexeme
– lookup - determines whether the string in lexeme is a reserved word (returns a code)
State Diagram
Lexical Analysis - Implementation
    /* Parse integer literals */
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
  } /* End of switch */
} /* End of function lex() */
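Only the tail of lex() survives above. One way the utility subprograms and lex() might fit together is sketched below. Several things here are assumptions rather than material from the slides: the input is read from a string, and the LETTER, UNKNOWN, and END class codes, the IDENT, RESERVED, EOF_TOK, and ERROR_TOK token codes, the reserved-word list, and the set_input() helper are all hypothetical. The names getChar, addChar, lookup, nextChar, charClass, lexeme, DIGIT, and INT_LIT come from the slides.

```c
#include <ctype.h>
#include <string.h>

/* Character classes; DIGIT is from the slides, the rest are assumed. */
enum { LETTER, DIGIT, UNKNOWN, END };
/* Token codes; INT_LIT is from the slides, the rest are assumed. */
enum { INT_LIT = 10, IDENT, RESERVED, EOF_TOK, ERROR_TOK };

static const char *input;   /* assumed: source text held in a string */
static char nextChar;
static int charClass;
static char lexeme[100];
static int lexLen;

/* getChar - gets the next character and determines its class. */
static void getChar(void) {
    nextChar = *input;
    if (nextChar == '\0') { charClass = END; return; }
    input++;
    if (isalpha((unsigned char)nextChar)) charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else charClass = UNKNOWN;
}

/* addChar - appends nextChar to lexeme (overlong lexemes are truncated). */
static void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup - returns RESERVED if lexeme is a reserved word, else IDENT. */
static int lookup(void) {
    static const char *reserved[] = { "if", "while", "for" };  /* illustrative */
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i]) == 0) return RESERVED;
    return IDENT;
}

/* set_input - hypothetical helper that primes the look-ahead character. */
void set_input(const char *text) { input = text; getChar(); }

int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass != END && isspace((unsigned char)nextChar)) getChar();
    switch (charClass) {
    case LETTER:                 /* identifiers and reserved words */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) { addChar(); getChar(); }
        return lookup();
    case DIGIT:                  /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) { addChar(); getChar(); }
        return INT_LIT;
    case END:
        return EOF_TOK;
    default:                     /* any other character */
        addChar(); getChar();
        return ERROR_TOK;
    }
}
```

After set_input("while sum 42"), successive calls to lex() yield a reserved word, an identifier, an integer literal, and then the end-of-input code. Reserved words need no states of their own because lookup() decides after the identifier has been read.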
A part of a Pascal scanner
• We read the characters one at a time with look-ahead
• If it is one of the one-character tokens
{ ( ) [ ] < > , ; = + - }
we announce that token
• If it is a ‘.’, we look at the next character
– If that is a dot, we announce ‘..’
– Otherwise, we announce ‘.’ and reuse the look-ahead
A part of …
• If it is a ‘<’, we look at the next character
– if that is a ‘=’, we announce ‘<=’
– otherwise, we announce ‘<’ and reuse the look-ahead, etc.
• If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore
– then we check to see if it is a reserved word
• If it is a digit, we keep reading until we find a non-digit
– if that is not a ‘.’, we announce an integer
– otherwise, we keep looking for a real number
– if the character after the ‘.’ is not a digit, we announce an integer and reuse the ‘.’ and the look-ahead
State Diagram
we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
    and maybe underscores until we can’t anymore;
    then we check to see if it is a keyword
        if so we return the keyword
        otherwise we return an identifier
    in either case we reuse the character beyond the end of the token
if it is a digit we keep reading until we find a nondigit
    if that is not a .
        we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and
            reuse the . and the look-ahead
etc.
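A few branches of the scanner logic above translate to C roughly as follows. This is a sketch under assumptions, not a full Pascal scanner: next_token() and the token codes are hypothetical, the input is a string, and only white space, (* … *) comments, ‘(’, ‘.’, ‘..’, ‘<’, and ‘<=’ are handled. Everything else is reported as T_OTHER.

```c
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum { T_LPAREN, T_DOT, T_DOTDOT, T_LT, T_LE, T_EOF, T_OTHER };

/* Scans one token starting at *pp and advances *pp past it. */
int next_token(const char **pp) {
    const char *p = *pp;
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\n') p++;  /* skip white space */
        if (p[0] == '(' && p[1] == '*') {                   /* comment */
            const char *end = strstr(p + 2, "*)");
            p = end ? end + 2 : p + strlen(p);  /* unterminated: skip to end */
            continue;
        }
        break;
    }
    int tok;
    switch (*p) {
    case '\0': tok = T_EOF; break;
    case '(':  tok = T_LPAREN; p++; break;
    case '.':
        if (p[1] == '.') { tok = T_DOTDOT; p += 2; }
        else             { tok = T_DOT;    p++; }   /* look-ahead is reused */
        break;
    case '<':
        if (p[1] == '=') { tok = T_LE; p += 2; }
        else             { tok = T_LT; p++; }       /* look-ahead is reused */
        break;
    default:   tok = T_OTHER; p++; break;
    }
    *pp = p;
    return tok;
}
```

"Reusing the look-ahead" shows up here as simply not advancing p past the character that was inspected but not consumed.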
The Parsing Problem
• Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an appropriate diagnostic message, and recover quickly
– Produce the parse tree, or at least a trace of the parse tree, for the program
The Parsing Problem (cont.)
• Two categories of parsers
– Top down - produce the parse tree, beginning at the root
  • Order is that of a leftmost derivation
  • Traces the parse tree in preorder
– Bottom up - produce the parse tree, beginning at the leaves
  • Order is that of the reverse of a rightmost derivation
• Parsers look only one token ahead in the input
The Set of Notational Conventions
• Terminal symbols - Lowercase letters at the beginning of the alphabet (a, b, …)
• Nonterminal symbols - Uppercase letters at the beginning of the alphabet (A, B, …)
• Terminals or nonterminals - Uppercase letters at the end of the alphabet (W, X, Y, Z)
• Strings of terminals - Lowercase letters at the end of the alphabet (w, x, y, z)
• Mixed strings (terminals and/or nonterminals) - Lowercase Greek letters (α, β, δ, γ)
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A
• The most common top-down parsing algorithms:
– Recursive descent - a coded implementation
– LL parsers - table-driven implementation (1st L stands for left-to-right, 2nd L stands for leftmost derivation)
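The Parsing Problem (cont.)
• Bottom-up parsers
– The most common bottom-up parsing algorithms are in the LR family (L stands for left-to-right, R stands for rightmost derivation)
Recursive-Descent Parsing
• There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal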
– EBNF is ideally suited for being the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )
Recursive-Descent Parsing (cont.)
• Assume we have a lexical analyzer named lex, which puts the next token code in nextToken
• The coding process when there is only one RHS:
– For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error
– For each nonterminal symbol in the RHS, call its associated parsing subprogram
Function expr()
/* Function expr()
   Parses strings in the language
   generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
  /* Parse the first term */
  term();
Function expr() (cont.)
  /* As long as the next token is + or -, call
     lex() to get the next token, and parse the
     next term */
  while (nextToken == PLUS_CODE ||
         nextToken == MINUS_CODE) {
    lex();
    term();
  }
}
Recursive-Descent Parsing (cont.)
• A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse
– The correct RHS is chosen on the basis of the next token of input
– The next token is compared with the first token that can be generated by each RHS until a match is found
– If no match is found, it is a syntax error
Function factor() (cont.)
  /* If the RHS is (<expr>) – call lex() to pass
     over the left parenthesis, call expr(), and
     check for the right parenthesis */
      else error();
    }
    else error(); /* Neither RHS matches */
  }
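The fragments of expr() and factor() above can be placed in a complete, self-contained sketch of a recursive-descent recognizer for the expression grammar. Several pieces are assumptions rather than the slides' code: the simplified string-based lex(), every token code except PLUS_CODE and MINUS_CODE, the errorCount counter, and the parse() driver are hypothetical, and an id is taken to be a run of letters.

```c
#include <ctype.h>

/* PLUS_CODE and MINUS_CODE appear in the slides; the rest are assumed. */
enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src;   /* assumed: input held in a string */
static int nextToken;
static int errorCount;

static void error(void) { errorCount++; }

/* lex - simplified scanner; puts the next token code in nextToken. */
static void lex(void) {
    while (*src == ' ') src++;
    if (isalpha((unsigned char)*src)) {
        while (isalpha((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    switch (*src) {
    case '+':  nextToken = PLUS_CODE; break;
    case '-':  nextToken = MINUS_CODE; break;
    case '*':  nextToken = MULT_CODE; break;
    case '/':  nextToken = DIV_CODE; break;
    case '(':  nextToken = LEFT_PAREN_CODE; break;
    case ')':  nextToken = RIGHT_PAREN_CODE; break;
    case '\0': nextToken = END_CODE; return;   /* stay at the end */
    default:   nextToken = BAD_CODE; break;
    }
    src++;
}

static void expr(void);

/* <factor> → id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE)
        lex();
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    }
    else error();   /* neither RHS matches */
}

/* <term> → <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) { lex(); factor(); }
}

/* <expr> → <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) { lex(); term(); }
}

/* parse - hypothetical driver: returns 1 if text is a complete <expr>. */
int parse(const char *text) {
    src = text;
    errorCount = 0;
    lex();
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Each nonterminal maps to one function, each EBNF repetition {…} maps to a while loop, and the choice between the two RHSs of <factor> is made by inspecting nextToken alone.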
The LL Grammar Class
• The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser
– A grammar can be modified to remove left recursion
• Example: consider the following rule
A → A + B
– A recursive-descent parser subprogram for A immediately calls itself to parse the first symbol in its RHS …
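As a worked illustration of removing direct left recursion, assume the grammar also contains a non-recursive alternative A → B (the slide shows only the recursive rule, so this second alternative is an assumption). The standard rewrite introduces a new nonterminal, here called A′:

```latex
A \rightarrow A + B \mid B
\quad\Longrightarrow\quad
\begin{aligned}
A  &\rightarrow B\,A' \\
A' &\rightarrow +\,B\,A' \mid \varepsilon
\end{aligned}
```

Both grammars generate B, B + B, B + B + B, and so on, but in the rewritten form the subprogram for A first parses a B and only then decides whether to continue. In EBNF the same language is A → B {+ B}, which corresponds to the while loop in expr().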
Pairwise Disjointness Test
• The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness
– The inability to determine the correct RHS on the basis of one token of lookahead
– FIRST(α) = {a | α ⇒* aβ} (If α ⇒* ε, ε ∈ FIRST(α))
• Pairwise Disjointness Test
– For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A → αi and A → αj, it must be true that:
FIRST(αi) ∩ FIRST(αj) = ∅
• Example 1: A → aB | aAb
– The FIRST sets for the RHSs in these rules are {a} and {a}, which are clearly not disjoint. So, these rules fail the pairwise disjointness test
• Example: consider the rules
<variable> → identifier | identifier [<expression>]
– The two rules can be replaced by
<variable> → identifier <new>
<new> → ε | [<expression>]
– or
<variable> → identifier [[<expression>]]
(the outer brackets are metasymbols of EBNF)
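The pairwise disjointness test can be sketched in C for the simple case where every RHS begins with a terminal, so that FIRST of the RHS is just that terminal. The functions first_of() and pairwise_disjoint() are hypothetical, terminals are assumed to be single lowercase ASCII letters, and an RHS that begins with a nonterminal is not handled, since its FIRST set depends on the rest of the grammar.

```c
#include <stddef.h>

/* A FIRST set, stored as a bitmask over the terminals 'a'..'z'. */
typedef unsigned long first_set;

/* FIRST of an RHS that begins with a terminal (a lowercase letter).
   Returns the empty set for any other RHS: computing FIRST for an RHS
   that begins with a nonterminal is outside this sketch. */
first_set first_of(const char *rhs) {
    if (rhs[0] >= 'a' && rhs[0] <= 'z')
        return 1UL << (rhs[0] - 'a');
    return 0;
}

/* Returns 1 if the FIRST sets of all n RHSs are pairwise disjoint. */
int pairwise_disjoint(const char *rhs[], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (first_of(rhs[i]) & first_of(rhs[j]))
                return 0;   /* two RHSs share a first terminal */
    return 1;
}
```

For Example 1, the RHSs "aB" and "aAb" both map to {a}, so the test fails. A rule set such as a | bB | cAb (an illustrative example, not taken from the slides) passes because the three first terminals differ.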
Bottom-up Parsing
• The process of bottom-up parsing produces the reverse of a rightmost derivation
• A bottom-up parser starts with the input sentence and produces the sequence of sentential forms from there until all that remains is the start symbol
• In each step, the task of the bottom-up parser is finding the correct RHS in a right sentential form to reduce to get the previous right sentential form in the derivation
is the correct one to be rewritten
– If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id. But E * id is not a legal right sentential form for the given grammar
• β is the handle of the right sentential form γ = αβw if and only if S ⇒*rm αAw ⇒rm αβw
• β is a phrase of the right sentential form γ if and only if S ⇒* γ = α1Aα2 ⇒+ α1βα2
• β is a simple phrase of the right sentential form γ if and only if S ⇒* γ = α1Aα2 ⇒ α1βα2
Example: Parse Tree of Sentential Form
• The only simple phrase is id
• The handle of a right sentential form is its leftmost simple phrase