Chapter: Lexical and Syntax Analysis
ISBN 0-321-33025-0

Chapter Topics
• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing

Copyright © 2006 Addison-Wesley. All rights reserved.

Introduction
• Language implementation systems must analyze source code, regardless of the implementation approach: compilation, pure interpretation, or a hybrid method
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (a context-free grammar, or BNF)

Using BNF to Describe Syntax
• Provides a clear and concise syntax description
• The parser can be based directly on the BNF
• Parsers based on BNF are easy to maintain

Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
  – A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
  – A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Reasons to Separate Lexical and Syntax Analysis
• Simplicity: less complex approaches can be used for lexical analysis; separating the two simplifies the parser
• Efficiency: separation allows optimization of the lexical analyzer
• Portability: parts of the lexical analyzer may not be portable, but the parser is always portable

Lexical Analysis
• A lexical analyzer is a pattern matcher for character strings
• A lexical analyzer is a "front end" for the parser
• It identifies substrings of the source program that belong together, called lexemes
  – Lexemes match a character pattern, which is associated with a lexical category called a token
  – sum is a lexeme; its token may be IDENT
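The pairing of lexemes with tokens described above can be sketched as a small regular-expression scanner. This is only an illustration under stated assumptions: the function name `tokenize` and the character patterns are hypothetical, and every token name except IDENT is chosen for this sketch rather than taken from the slides so far.

```python
import re

# Hypothetical token patterns (assumptions for this sketch).
# Each entry pairs a token name with the character pattern its lexemes match.
TOKEN_SPEC = [
    ("INT_LIT", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),  # blanks, tabs, newlines: matched but not reported
]

# One combined pattern with a named group per token.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No pattern matches at this position: a lexical error.
            raise ValueError(f"lexical error at position {pos}: {source[pos]!r}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token, lexeme in tokenize("sum = oldsum - value / 100;"):
        print(f"{token:12} {lexeme}")
```

In a real front end the parser would call the scanner once per token instead of collecting a list; the list form is used here only to keep the sketch short.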
Example
sum = oldsum - value / 100;

Token           Lexeme
IDENT           sum
ASSIGN_OP       =
IDENT           oldsum
SUBTRACT_OP     -
IDENT           value
DIVISION_OP     /
INT_LIT         100
SEMICOLON       ;

Lexical Analysis (cont.)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• The lexical analysis process also:
  – Skips comments, tabs, newlines, and blanks
  – Inserts lexemes for user-defined names (strings, identifiers, numbers) into the symbol table
  – Saves source locations (file, line, column) for error messages
  – Detects lexical errors in tokens, such as ill-formed floating-point literals, and reports them to the user

Lexical Analysis (cont.)
• Three main approaches to building a scanner:
  – Write a formal description of the tokens and use a software tool that constructs lexical analyzers from such a description
  – Design a state diagram that describes the token patterns and write a program that implements the diagram
  – Design a state diagram that describes the token patterns and hand-construct a table-driven implementation of the state diagram

Example
• Consider the following simple grammar:
    E → E + T | T
    T → T * F | F
    F → (E) | id
• The sentential form E + T * id includes three RHSs: E + T, T, and id. Only one of these is the correct one to be rewritten
  – If the RHS E + T were chosen to be rewritten in this sentential form, the resulting sentential form would be E * id. But E * id is not a legal right sentential form for the given grammar

Definitions
• \(\beta\) is the handle of the right sentential form \(\gamma = \alpha\beta w\) if and only if \(S \Rightarrow^{*}_{rm} \alpha A w \Rightarrow_{rm} \alpha\beta w\)
• \(\beta\) is a phrase of the right sentential form \(\gamma\) if and only if \(S \Rightarrow^{*} \gamma = \alpha_1 A \alpha_2 \Rightarrow^{+} \alpha_1 \beta \alpha_2\)
• \(\beta\) is a simple phrase of the right sentential form \(\gamma\) if and only if \(S \Rightarrow^{*} \gamma = \alpha_1 A \alpha_2 \Rightarrow \alpha_1 \beta \alpha_2\)

Example: Parse Tree of Sentential Form E + T * id
[Figure: parse tree for E + T * id. The root E has children E, +, T; that T has children T, *, F; F derives id.]
• The phrases of the sentential form E + T * id are E + T * id, T * id, and id
• The only simple phrase is id
• The handle of a rightmost sentential form is the leftmost simple phrase

Example: Consider the string id + id * id
[Figure: parse tree for id + id * id, with interior nodes numbered (1) to (8) in the order the reductions create them.]

Step    Right sentential form after the reduction
        id + id * id    (input)
(1)     F + id * id
(2)     T + id * id
(3)     E + id * id
(4)     E + F * id
(5)     E + T * id
(6)     E + T * F
(7)     E + T
(8)     E

Shift-Reduce Algorithms
• Reduce is the action of replacing the handle on the top of the parse stack with its corresponding LHS
• Shift is the action of moving the next token to the top of the parse stack

LR Parsers
• Many different bottom-up parsing algorithms have been devised. Most of them are variations of a process called LR parsing
  – L means the parser scans the input string left to right; R means it produces a rightmost derivation (in reverse)
• The original LR algorithm was designed by Donald Knuth (1965). This algorithm is sometimes called canonical LR

Advantages of LR Parsers
• They work for nearly all grammars that describe programming languages
• They work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser
• They can detect syntax errors as soon as it is possible in a left-to-right scan of the input
• The LR class of grammars is a superset of the class parsable by LL parsers

Structure of an LR Parser
[Figure: the LR parser reads the input a1 a2 … an $, keeps a parse stack S0 X1 S1 … Xm Sm with Sm on top, and consults the parsing table.]

Configurations
• The contents of the parse stack of an LR parser have the following form, with the top of the stack at the right: \(S_0 X_1 S_1 \ldots X_m S_m\), where the \(S_i\) are state symbols and the \(X_i\) are grammar symbols
• An LR parsing table has two parts:
  – The ACTION part has state symbols as its row labels and the terminal symbols as its column labels
  – The GOTO part has state symbols as its row labels and the nonterminal symbols as its column labels

Configurations (cont.)
• The input string has a '$' at its right end. It is used for normal termination of the parser
• An LR parser configuration is a pair of strings (stack, input), with the detailed form \((S_0 X_1 S_1 \ldots X_m S_m,\; a_i a_{i+1} \ldots a_n \$)\)
• The initial configuration of an LR parser is \((S_0,\; a_1 a_2 \ldots a_n \$)\)

The Parser Actions
• If \(\mathrm{ACTION}[S_m, a_i] = \text{shift } S\), the next configuration is \((S_0 X_1 S_1 X_2 S_2 \ldots X_m S_m\, a_i\, S,\; a_{i+1} \ldots a_n \$)\)
• If \(\mathrm{ACTION}[S_m, a_i] = \text{reduce by } A \rightarrow \beta\), where \(r = |\beta|\) and \(S = \mathrm{GOTO}[S_{m-r}, A]\), the next configuration is \((S_0 X_1 S_1 X_2 S_2 \ldots X_{m-r} S_{m-r}\, A\, S,\; a_i a_{i+1} \ldots a_n \$)\)
• If \(\mathrm{ACTION}[S_m, a_i] = \text{accept}\), the parse is complete and no errors were found
• If \(\mathrm{ACTION}[S_m, a_i] = \text{error}\), the parser calls an error-handling routine

Example: The Grammar for Arithmetic Expressions
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → id

LR Parsing Table
Sn means shift and go to state n; Rn means reduce by production n of the grammar above; blank entries are errors.

        ACTION                                          GOTO
State   id      +       *       (       )       $       E     T     F
0       S5                      S4                      1     2     3
1               S6                              accept
2               R2      S7              R2      R2
3               R4      R4              R4      R4
4       S5                      S4                      8     2     3
5               R6      R6              R6      R6
6       S5                      S4                            9     3
7       S5                      S4                                  10
8               S6                      S11
9               R1      S7              R1      R1
10              R3      R3              R3      R3
11              R5      R5              R5      R5

A Trace of a Parse of the String id * id + id

Stack           Input               Action
0               id * id + id $      Shift
0id5            * id + id $         Reduce by F → id
0F3             * id + id $         Reduce by T → F
0T2             * id + id $         Shift
0T2*7           id + id $           Shift
0T2*7id5        + id $              Reduce by F → id
0T2*7F10        + id $              Reduce by T → T * F
0T2             + id $              Reduce by E → T
0E1             + id $              Shift
0E1+6           id $                Shift
0E1+6id5        $                   Reduce by F → id
0E1+6F3         $                   Reduce by T → F
0E1+6T9         $                   Reduce by E → E + T
0E1             $                   Accept

Summary
• Syntax analysis is a common part of language implementation
• A lexical analyzer is a pattern matcher that isolates small-scale parts of a program
• A parser:
  – Detects syntax errors
  – Produces a parse tree
• A recursive-descent parser is an LL parser
• Parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced to its LHS (the handle)
• The LR family of shift-reduce parsers is the most common bottom-up parsing approach
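The shift, reduce, accept, and error actions can be sketched as a small table-driven driver. This is a minimal sketch, not a definitive implementation: the name `lr_parse`, the dictionary layout, and the choice to keep only states on the stack (the grammar symbols are implied by the states) are assumptions made for illustration. The table entries are the standard ACTION/GOTO entries for the six-production arithmetic expression grammar, numbered 1 to 6 in the order E → E + T, E → T, T → T * F, T → F, F → (E), F → id.

```python
# Production number -> (LHS, length of RHS).
PRODUCTIONS = {
    1: ("E", 3),  # E -> E + T
    2: ("E", 1),  # E -> T
    3: ("T", 3),  # T -> T * F
    4: ("T", 1),  # T -> F
    5: ("F", 3),  # F -> ( E )
    6: ("F", 1),  # F -> id
}

# ACTION[state][terminal]: "Sn" = shift to state n, "Rn" = reduce by production n.
# Missing entries are errors.
ACTION = {
    0: {"id": "S5", "(": "S4"},
    1: {"+": "S6", "$": "accept"},
    2: {"+": "R2", "*": "S7", ")": "R2", "$": "R2"},
    3: {"+": "R4", "*": "R4", ")": "R4", "$": "R4"},
    4: {"id": "S5", "(": "S4"},
    5: {"+": "R6", "*": "R6", ")": "R6", "$": "R6"},
    6: {"id": "S5", "(": "S4"},
    7: {"id": "S5", "(": "S4"},
    8: {"+": "S6", ")": "S11"},
    9: {"+": "R1", "*": "S7", ")": "R1", "$": "R1"},
    10: {"+": "R3", "*": "R3", ")": "R3", "$": "R3"},
    11: {"+": "R5", "*": "R5", ")": "R5", "$": "R5"},
}

# GOTO[state][nonterminal] -> next state.
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def lr_parse(tokens):
    """Return the production numbers in the order they are reduced.

    Raises SyntaxError when the ACTION table has no entry for the
    current state and input token.
    """
    stack = [0]                      # state stack, initial state 0
    tokens = list(tokens) + ["$"]    # '$' marks the right end of the input
    pos = 0
    reductions = []
    while True:
        action = ACTION[stack[-1]].get(tokens[pos])
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at token {pos}")
        if action == "accept":
            return reductions
        number = int(action[1:])
        if action[0] == "S":
            stack.append(number)     # shift: push the new state, consume the token
            pos += 1
        else:
            lhs, rhs_len = PRODUCTIONS[number]
            del stack[-rhs_len:]     # reduce: pop one state per RHS symbol
            stack.append(GOTO[stack[-1]][lhs])
            reductions.append(number)


if __name__ == "__main__":
    print(lr_parse(["id", "*", "id", "+", "id"]))
```

Read in reverse, the returned list of production numbers is a rightmost derivation of the input, which is what the R in LR refers to.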