Lexical and syntax analysis

• Language implementation systems must analyze source code, regardless of the specific implementation approach

• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)

Using BNF to Describe Syntax

• Provides a clear and concise syntax description

• The parser can be based directly on the BNF

• Parsers based on BNF are easy to maintain

Syntax Analysis

• The syntax analysis portion of a language processor nearly always consists of two parts:

– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)

– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Reasons to Separate Lexical and Syntax Analysis

• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser

• Efficiency - separation allows optimization of the lexical analyzer

• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable

Lexical Analysis

• A lexical analyzer is a pattern matcher for character strings

• A lexical analyzer is a “front-end” for the parser

• Identifies substrings of the source program that belong together - lexemes

– Lexemes match a character pattern, which is associated with a lexical category called a token

– sum is a lexeme; its token may be IDENT

Lexical Analysis (cont.)

• The lexical analyzer is usually a function that is called by the parser when it needs the next token

• The lexical analysis process also:

– Includes skipping comments, tabs, newlines, and blanks

– Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table

– Saves source locations (file, line, column) for error messages

– Detects and reports syntactic errors in tokens, such as ill-formed floating-point literals, to the user

Pragmas

• Provide directives or hints to the compiler

• Directives:

– Turn various kinds of run-time checks on or off

– Turn certain code improvements on or off (performance vs compilation speed)

– Turn performance profiling on or off

• Hints:

– Variable x is very heavily used (to keep it in a register)

– Subroutine S is not recursive (its storage may be statically allocated)

– 32 bits of precision (instead of 64) suffice for floating-point variable x

• Lexical analysis is (often) responsible for dealing with pragmas

Lexical Analysis (cont.)

• Three main approaches to building a scanner:

1. Write a formal description of the tokens and use a software tool that constructs lexical analyzers given such a description

2. Design a state diagram that describes the token patterns and write a program that implements the diagram*

3. Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram
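
Approach 3 can be sketched as follows. This is only a minimal illustration for two token patterns (identifiers and integer literals). The state names, token codes, the classify helper, and the scan_token function are assumptions made for this sketch, not part of the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Character classes (columns of the table). */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };

/* States (rows of the table). S_ERR is a dead state. */
enum { S_START, S_IDENT, S_INT, S_ERR, NUM_STATES };

/* Token kinds returned by scan_token (hypothetical names). */
enum { T_NONE, T_IDENT, T_INT_LIT };

/* next_state[state][class]: the whole state diagram stored as data. */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_INT,   S_ERR },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_ERR },
    /* S_INT   */ { S_ERR,   S_INT,   S_ERR },
    /* S_ERR   */ { S_ERR,   S_ERR,   S_ERR },
};

static int classify(char c) {
    if (isalpha((unsigned char)c)) return C_LETTER;
    if (isdigit((unsigned char)c)) return C_DIGIT;
    return C_OTHER;
}

/* Scans one token at the start of s. Stores the lexeme length in *len
   and returns the token kind (T_NONE if no token starts here). */
int scan_token(const char *s, size_t *len) {
    int state = S_START;
    size_t i = 0;
    assert(s != NULL && len != NULL);
    while (s[i] != '\0') {
        int next = next_state[state][classify(s[i])];
        if (next == S_ERR) break;   /* cannot extend the current token */
        state = next;
        i++;
    }
    *len = i;
    if (state == S_IDENT) return T_IDENT;
    if (state == S_INT) return T_INT_LIT;
    return T_NONE;
}
```

Adding a token pattern in this style means adding rows and columns to the table; the driver loop stays the same.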

The “longest possible token” rule

• The scanner returns to the parser only when the next character cannot be used to continue the current token

– The next character will generally need to be saved for the next token

• In some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed

– In Pascal, when you have a 3 and you see a ‘.’

   • do you proceed (in hopes of getting 3.14)? or

   • do you stop (in fear of getting 3..5)?
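
The Pascal case can be handled by peeking two characters ahead. The sketch below is one possible way to do it, assuming the input is a NUL-terminated string; scan_number and the NUM_* codes are names invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { NUM_NONE, NUM_INT, NUM_REAL };   /* hypothetical token kinds */

/* Scans a number at the start of s and stores the lexeme length in *len. */
int scan_number(const char *s, size_t *len) {
    size_t i = 0;
    int kind = NUM_NONE;
    assert(s != NULL && len != NULL);
    while (isdigit((unsigned char)s[i])) i++;
    if (i > 0) {
        kind = NUM_INT;
        /* Peek two characters: commit to the '.' only if a digit follows.
           If s[i] is '.', then s[i + 1] is at worst the terminating NUL. */
        if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
            i++;                                   /* consume '.' */
            while (isdigit((unsigned char)s[i])) i++;
            kind = NUM_REAL;
        }
    }
    *len = i;
    return kind;
}
```

With this check, 3.14 is scanned as one real literal, while in 3..5 the scanner stops after the 3 and leaves both dots for the next token.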


The rule …

• In messier cases, you may not be able to get by with any fixed amount of look-ahead. In Fortran, for example, we have

   DO 5 I = 1,25     (loop)

   DO 5 I = 1.25     (assignment)

• Here, we need to remember we were in a potentially final state, and save enough information that we can back up to it, if we get stuck later
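
The "remember the last final state and back up" idea can be sketched on a simpler case than Fortran's DO statement. The example below assumes a numeric pattern digits [ . digits ] [ e [ + | - ] digits ]; the state names and the longest_number function are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States of a small hand-written DFA for
   digits [ '.' digits ] [ 'e' [ '+' | '-' ] digits ] */
enum { ST_START, ST_INT, ST_DOT, ST_FRAC, ST_E, ST_SIGN, ST_EXP, ST_DEAD };

static int is_final(int st) {
    return st == ST_INT || st == ST_FRAC || st == ST_EXP;
}

static int step(int st, char c) {
    int d = isdigit((unsigned char)c);
    switch (st) {
    case ST_START: return d ? ST_INT : ST_DEAD;
    case ST_INT:   return d ? ST_INT : c == '.' ? ST_DOT : c == 'e' ? ST_E : ST_DEAD;
    case ST_DOT:   return d ? ST_FRAC : ST_DEAD;
    case ST_FRAC:  return d ? ST_FRAC : c == 'e' ? ST_E : ST_DEAD;
    case ST_E:     return d ? ST_EXP : (c == '+' || c == '-') ? ST_SIGN : ST_DEAD;
    case ST_SIGN:  return d ? ST_EXP : ST_DEAD;
    case ST_EXP:   return d ? ST_EXP : ST_DEAD;
    default:       return ST_DEAD;
    }
}

/* Returns the length of the longest numeric lexeme at the start of s
   (0 if none). The scanner runs ahead as far as it can, but reports
   the position of the last final state it passed through. */
size_t longest_number(const char *s) {
    int st = ST_START;
    size_t i = 0, last_final = 0;
    assert(s != NULL);
    while (s[i] != '\0') {
        st = step(st, s[i]);
        if (st == ST_DEAD) break;          /* stuck: back up */
        i++;
        if (is_final(st)) last_final = i;  /* remember where we could stop */
    }
    return last_final;
}
```

On the input 12e+x the scanner reads as far as the + before getting stuck, then backs up and reports only 12.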

State Diagram Design

• Suppose we need a lexical analyzer that only recognizes program names, reserved words, and integer literals

• A naïve state diagram would have a transition from every state on every character in the source language - such a diagram would be very large!

State Diagram Design (cont.)

• In many cases, transitions can be combined to simplify the state diagram

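– When recognizing an identifier, all uppercase and lowercase letters are equivalent - use a character class that includes all letters

– Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word)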

State Diagram Design (cont.)

• Convenient utility subprograms:

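– getChar - gets the next character of input, puts it in global variable nextChar, determines its class and puts the class in charClass

– addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme

– lookup - determines whether the string in lexeme is a reserved word (returns a code)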

State Diagram

Lexical Analysis - Implementation


    /* Parse integer literals */
    case DIGIT:
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
  } /* End of switch */
} /* End of function lex() */
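
Since only the tail of lex() survives above, here is a self-contained sketch of how the whole function might look. The names lex, getChar, addChar, lookup, nextChar, charClass, lexeme, DIGIT, INT_LIT and IDENT come from the slides; everything else (setInput, LETTER, UNKNOWN, END_OF_INPUT, RESERVED, OTHER, END_TOKEN, the token code values, the reserved-word list, and reading from a string) is assumed for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Character classes */
enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };
/* Token codes (the values are arbitrary assumptions) */
enum { INT_LIT = 10, IDENT = 11, RESERVED = 12, OTHER = 13, END_TOKEN = 14 };

static const char *input = "";   /* remaining source text */
static int charClass;
static char nextChar;
static char lexeme[100];
static int lexLen;

/* getChar - gets the next character of input, puts it in nextChar,
   and puts its class in charClass */
static void getChar(void) {
    nextChar = *input;
    if (nextChar == '\0') { charClass = END_OF_INPUT; return; }
    input++;
    if (isalpha((unsigned char)nextChar)) charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else charClass = UNKNOWN;
}

/* addChar - appends nextChar to lexeme (overlong lexemes are truncated) */
static void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup - returns RESERVED if lexeme is a reserved word, else IDENT */
static int lookup(void) {
    static const char *const words[] = { "if", "while", "for" };
    size_t i;
    for (i = 0; i < sizeof words / sizeof words[0]; i++)
        if (strcmp(lexeme, words[i]) == 0) return RESERVED;
    return IDENT;
}

/* setInput - points the scanner at src and primes the look-ahead */
void setInput(const char *src) {
    assert(src != NULL);
    input = src;
    getChar();
}

/* lex - returns the code of the next token and leaves its text in lexeme */
int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass != END_OF_INPUT && isspace((unsigned char)nextChar))
        getChar();
    switch (charClass) {
    case LETTER:   /* identifiers and reserved words */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) { addChar(); getChar(); }
        return lookup();
    case DIGIT:    /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) { addChar(); getChar(); }
        return INT_LIT;
    case END_OF_INPUT:
        return END_TOKEN;
    default:       /* any other single character */
        addChar(); getChar();
        return OTHER;
    }
}
```

A parser would call setInput once and then call lex() each time it needs the next token.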

A part of a Pascal scanner

• We read the characters one at a time with look-ahead

• If it is one of the one-character tokens { ( ) [ ] < > , ; = + - } we announce that token

• If it is a ‘.’, we look at the next character

– If that is a dot, we announce ‘..’

– Otherwise, we announce ‘.’ and reuse the look-ahead

A part of …

• If it is a ‘<’, we look at the next character

– if that is a ‘=’ we announce ‘<=’

– otherwise, we announce ‘<’ and reuse the look-ahead, etc.

• If it is a letter, we keep reading letters and digits and maybe underscores until we can't anymore

– then we check to see if it is a reserved word

• If it is a digit, we keep reading until we find a non-digit

– if that is not a ‘.’ we announce an integer

– otherwise, we keep looking for a real number

– if the character after the ‘.’ is not a digit, we announce an integer and reuse the ‘.’ and the look-ahead

State Diagram

we skip any initial white space (spaces, tabs, and newlines)
we read the next character
if it is a ( we look at the next character
    if that is a * we have a comment;
        we skip forward through the terminating *)
    otherwise
        we return a left parenthesis and reuse the look-ahead
if it is one of the one-character tokens ([ ] , ; = + - etc.)
    we return that token
if it is a . we look at the next character
    if that is a . we return ..
    otherwise we return . and reuse the look-ahead
if it is a < we look at the next character
    if that is a = we return <=
    otherwise we return < and reuse the look-ahead
etc.
if it is a letter we keep reading letters and digits
        and maybe underscores until we can’t anymore;
    then we check to see if it is a keyword
        if so we return the keyword
        otherwise we return an identifier
    in either case we reuse the character beyond the end of the token
if it is a digit we keep reading until we find a nondigit
    if that is not a .
        we return an integer and reuse the nondigit
    otherwise we keep looking for a real number
        if the character after the . is not a digit
            we return an integer and
            reuse the . and the look-ahead
etc.
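
The comment and operator branches of this outline could be coded roughly as below. It is a partial sketch only: identifiers and numbers are left out, the input is assumed to be a NUL-terminated string, and next_symbol and the TK_* codes are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TK_EOF, TK_LPAREN, TK_DOT, TK_DOTDOT, TK_LT, TK_LE, TK_OTHER };

/* Scans one token starting at *pp and advances *pp past it. */
int next_symbol(const char **pp) {
    const char *p;
    int tok;
    assert(pp != NULL && *pp != NULL);
    p = *pp;
    for (;;) {
        /* skip initial white space */
        while (*p == ' ' || *p == '\t' || *p == '\n') p++;
        /* a ( followed by * starts a comment: skip through the closing *) */
        if (p[0] == '(' && p[1] == '*') {
            const char *end = strstr(p + 2, "*)");
            p = end ? end + 2 : p + strlen(p);  /* unterminated: skip to end */
            continue;
        }
        break;
    }
    switch (*p) {
    case '\0': tok = TK_EOF; break;             /* do not move past the end */
    case '(':  p++; tok = TK_LPAREN; break;
    case '.':  p++;
               if (*p == '.') { p++; tok = TK_DOTDOT; }
               else tok = TK_DOT;               /* look-ahead is reused */
               break;
    case '<':  p++;
               if (*p == '=') { p++; tok = TK_LE; }
               else tok = TK_LT;                /* look-ahead is reused */
               break;
    default:   p++; tok = TK_OTHER; break;
    }
    *pp = p;
    return tok;
}
```

"Reusing the look-ahead" here simply means not advancing the pointer past a character that was only inspected.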


The Parsing Problem

• Goals of the parser, given an input program:

– Find all syntax errors; for each, produce an appropriate diagnostic message, and recover quickly

– Produce the parse tree, or at least a trace of the parse tree, for the program

The Parsing Problem (cont.)

• Two categories of parsers

– Top down - produce the parse tree, beginning at the root

   • Order is that of a leftmost derivation

   • Traces the parse tree in preorder

– Bottom up - produce the parse tree, beginning at the leaves

   • Order is that of the reverse of a rightmost derivation

• Useful parsers look only one token ahead in the input

The Set of Notational Conventions

• Terminal symbols - lowercase letters at the beginning of the alphabet (a, b, ...)

• Nonterminal symbols - uppercase letters at the beginning of the alphabet (A, B, ...)

• Terminals or nonterminals - uppercase letters at the end of the alphabet (W, X, Y, Z)

• Strings of terminals - lowercase letters at the end of the alphabet (w, x, y, z)

• Mixed strings (terminals and/or nonterminals) - lowercase Greek letters (α, β, δ, γ)

• Top-down Parsers

– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A

• The most common top-down parsing algorithms:

– Recursive descent - a coded implementation

– LL parsers - table-driven implementation (1st L stands for left-to-right, 2nd L stands for leftmost derivation)

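• LR parsers (bottom-up): L stands for left-to-right, R stands for rightmost derivation

Recursive-Descent Parsing

• There is a subprogram for each nonterminal in the grammar, which can parse sentences that can be generated by that nonterminal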

– EBNF is ideally suited for being the basis for a recursive-descent parser, because EBNF minimizes the number of nonterminals

• A grammar for simple expressions:

<expr> → <term> {(+ | -) <term>}

<term> → <factor> {(* | /) <factor>}

<factor> → id | ( <expr> )

Recursive-Descent Parsing (cont.)

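• Assume we have a lexical analyzer named lex, which puts the next token code in nextToken

• The coding process when there is only one RHS:

– For each terminal symbol in the RHS, compare it with the next input token; if they match, continue, else there is an error

– For each nonterminal symbol in the RHS, call its associated parsing subprogram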

Function expr()

/* Function expr()
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
  /* Parse the first term */
  term();

  /* As long as the next token is + or -, call lex() to get the
     next token, and parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  }
}

Recursive-Descent Parsing (cont.)

• A nonterminal that has more than one RHS requires an initial process to determine which RHS it is to parse

– The correct RHS is chosen on the basis of the next token of input

– The next token is compared with the first token that can be generated by each RHS until a match is found

– If no match is found, it is a syntax error

Function factor() (cont.)

/* If the RHS is (<expr>) - call lex() to pass over the left
   parenthesis, call expr(), and check for the right parenthesis */
      else error();
  }
  else error(); /* Neither RHS matches */
}
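
Because only fragments of the parsing functions survive above, the following self-contained sketch shows how a complete recursive-descent recognizer for the expression grammar might fit together. The names expr, term, factor, lex, error, nextToken, PLUS_CODE and MINUS_CODE follow the slides; the other token codes, the stand-in scanner, the error counter, and the parses wrapper are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src = "";   /* remaining input */
static int nextToken;
static int errorCount;

static void expr(void);

/* error - here it only counts errors; a real parser would report them */
static void error(void) { errorCount++; }

/* lex - a tiny stand-in scanner: puts the next token code in nextToken */
static void lex(void) {
    while (*src == ' ') src++;
    if (*src == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)*src)) {
        while (isalnum((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    switch (*src++) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

/* <factor> → id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();                      /* RHS is id */
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();                      /* pass over the left parenthesis */
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    } else {
        error();                    /* neither RHS matches */
    }
}

/* <term> → <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) { lex(); factor(); }
}

/* <expr> → <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) { lex(); term(); }
}

/* Returns 1 if text is a sentence of the grammar, 0 otherwise. */
int parses(const char *text) {
    assert(text != NULL);
    src = text;
    errorCount = 0;
    lex();
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Each function mirrors one grammar rule: the braces of EBNF become while loops, and the alternatives of <factor> become an if/else chain on nextToken.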


The LL Grammar Class

• The Left Recursion Problem: If a grammar has left recursion, either direct or indirect, it cannot be the basis for a top-down parser

– A grammar can be modified to remove left recursion

• Example: consider the following rule

A → A + B

– A recursive-descent parser subprogram for A immediately calls itself to parse the first symbol in its RHS …
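
The usual way to remove direct left recursion can be written out as below. The helper nonterminal A' is a name introduced here, and the second part assumes that A also has a non-left-recursive alternative A → B, which the slide does not show.

```latex
% General scheme for direct left recursion
\begin{align*}
A  &\to A\,\alpha_1 \mid \dots \mid A\,\alpha_m \mid \beta_1 \mid \dots \mid \beta_n
   && \text{(left-recursive rules)} \\
A  &\to \beta_1 A' \mid \dots \mid \beta_n A' \\
A' &\to \alpha_1 A' \mid \dots \mid \alpha_m A' \mid \varepsilon
\end{align*}

% Applied to A -> A + B, assuming the extra alternative A -> B
\begin{align*}
A  &\to B\,A' \\
A' &\to +\,B\,A' \mid \varepsilon
\end{align*}
```

In EBNF the same language is described by A → B {+ B}, which a recursive-descent subprogram can parse with a loop instead of calling itself first.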

Pairwise Disjointness Test

• The other characteristic of grammars that disallows top-down parsing is the lack of pairwise disjointness

– The inability to determine the correct RHS on the basis of one token of lookahead

– FIRST(α) = {a | α ⇒* aβ} (If α ⇒* ε, then ε ∈ FIRST(α))

• Pairwise Disjointness Test

– For each nonterminal, A, in the grammar that has more than one RHS, for each pair of rules, A → αi and A → αj, it must be true that:

FIRST(αi) ∩ FIRST(αj) = ∅

• Example 1: A → aB | aAb

– The FIRST sets for the RHSs in these rules are {a} and {a}, which are clearly not disjoint. So, these rules fail the pairwise disjointness test

• Example: consider the rules

<variable> → identifier | identifier [<expression>]

– The two rules can be replaced by

<variable> → identifier <new>

<new> → ε | [<expression>]

– or

<variable> → identifier [[<expression>]]

(the outer brackets are metasymbols of EBNF)
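
The test itself is easy to mechanize once the FIRST sets are known. The sketch below assumes the FIRST sets are supplied by hand as bit masks (it does not compute them from a grammar); the bit names and the pairwise_disjoint function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per terminal (an encoding chosen for this example);
   EPS stands for the empty string ε. */
enum { T_a = 1u << 0, T_b = 1u << 1, T_LBRACKET = 1u << 2, EPS = 1u << 3 };

/* first[i] is FIRST of the i-th RHS of one nonterminal.
   Returns 1 if every pair of RHSs has disjoint FIRST sets, else 0. */
int pairwise_disjoint(const unsigned *first, size_t n) {
    size_t i, j;
    assert(first != NULL);
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (first[i] & first[j]) return 0;   /* a shared symbol */
    return 1;
}
```

For Example 1 the two masks are both T_a, so the check fails; for the rewritten rule <new> the masks are EPS and T_LBRACKET, so it passes.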


Bottom-up Parsing

• The process of bottom-up parsing produces the reverse of a rightmost derivation

• A bottom-up parser starts with the input sentence and produces the sequence of sentential forms from there until all that remains is the start symbol

• In each step, the task of the bottom-up parser is finding the correct RHS in a right sentential form to reduce to get the previous right sentential form in the derivation

• A right sentential form such as E + T * id may contain more than one RHS; only one of them is the correct one to be rewritten

– If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id. But E * id is not a legal right sentential form for the given grammar
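
A worked derivation makes this concrete. The grammar is not shown in the surviving text, so the block below assumes the usual expression grammar E → E + T | T, T → T * F | F, F → ( E ) | id, and derives the sentence id + id * id.

```latex
% Assumed grammar:  E -> E + T | T,   T -> T * F | F,   F -> ( E ) | id
% Rightmost derivation of  id + id * id
\begin{align*}
E &\Rightarrow_{rm} E + T \\
  &\Rightarrow_{rm} E + T * F \\
  &\Rightarrow_{rm} E + T * \mathrm{id} \\
  &\Rightarrow_{rm} E + F * \mathrm{id} \\
  &\Rightarrow_{rm} E + \mathrm{id} * \mathrm{id} \\
  &\Rightarrow_{rm} T + \mathrm{id} * \mathrm{id} \\
  &\Rightarrow_{rm} F + \mathrm{id} * \mathrm{id} \\
  &\Rightarrow_{rm} \mathrm{id} + \mathrm{id} * \mathrm{id}
\end{align*}
```

A bottom-up parser walks this list from the bottom to the top. In E + T * id the step to undo is F → id, which gives back E + T * F; reducing E + T instead would produce E * id, which does not appear anywhere in the derivation.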


• β is the handle of the right sentential form γ = αβw if and only if S ⇒*rm αAw ⇒rm αβw

• β is a phrase of the right sentential form γ if and only if S ⇒* γ = α1Aα2 ⇒+ α1βα2

• β is a simple phrase of the right sentential form γ if and only if S ⇒* γ = α1Aα2 ⇒ α1βα2

Example: Parse Tree of Sentential Form

• The only simple phrase is id

• The handle of a right sentential form is the leftmost simple phrase
