LEXICAL ANALYSIS
Phung Hua Nguyen
University of Technology
2006
CSE - HCMUT · Lexical Analysis

Outline
• Introduction to Lexical Analysis
• Token specification
  – Language
  – Regular Expressions (REs)
• Token recognition
  – REs ⇒ NFA (Thompson's construction, Algorithm 3.3)
  – NFA ⇒ DFA (subset construction, Algorithm 3.2)
  – DFA ⇒ minimal DFA (Algorithm 3.6)
• Programming

Introduction
• Read the input characters
• Produce as output a sequence of tokens
• Eliminate white space and comments
[Figure: source program, lexical analyzer, parser and symbol table; the parser issues "get next token" and the lexical analyzer returns a token]

Why?
• Simplify design
• Improve compiler efficiency
• Enhance compiler portability

Tokens, Patterns, Lexemes
Token      Sample lexemes          Informal description of pattern
const      const                   const
if         if                      if
relation   <, <=, ==, !=, >, >=    < or <= or == or != or > or >=
id         pi, count, x2           letter followed by letters or digits
num        3.14, 25, 6.02E3        any numeric constant
literal    "core dumped"           any characters between " and " except "

Alphabet, Strings and Languages
• Alphabet ∑: any finite set of symbols
  – The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – The binary alphabet {0,1}
  – The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
  – Length |s| of a string s: the number of symbols in s
  – The empty string, denoted ε, |ε| = 0
• Language: any set of strings over ∑
  – its two special cases:
    • ∅: the empty set
    • {ε}: the set containing only the empty string

Examples of Languages
• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – Vietnamese language
• ∑ = {0,1}
  – A string is an instruction
  – The set of Pentium instructions
• ∑ = the ASCII set
  – A string is a program
  – The set of C programs

Terms (Fig. 3.7)
Term                                     Definition
prefix of s                              a string obtained by removing 0 or more trailing symbols of s; e.g. ban is a prefix of banana
suffix of s                              a string formed by deleting 0 or more leading symbols of s; e.g. na is a suffix of banana
substring of s                           a string obtained by deleting a prefix and a suffix from s; e.g. nan is a substring of banana
proper prefix, suffix or substring of s  any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x

String operations
• String concatenation
  – If x and y are strings, xy is the string formed by appending y to x. E.g.: x = hom, y = nay ⇒ xy = homnay
  – ε is the identity: εy = y; xε = x
• String exponentiation
  – s⁰ = ε
  – sⁱ = sⁱ⁻¹s
  E.g. s = 01, s⁰ = ε, s² = 0101, s³ = 010101

[...]

Transition table
State   Input symbol
        a        b
0       {0,1}    {0}
1       -        {2}
2       -        {3}

Acceptance
• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x
[Figure: transition graph over states A and B with edges labelled 0 and 1, with traces of the input strings 01010 and 01011]

Deterministic... is a special case of NFA in which
  1. no state has an ε-transition, and
  2. for each state s and input symbol a, there is at most one edge labeled a leaving s

Thompson's construction of NFA from REs
• guided by the syntactic structure of the RE r
• For ε: i --ε--> f
• For a in ∑: i --a--> f

Thompson's construction (cont'd)
• Suppose N(s) and N(t) are NFAs...
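As a minimal sketch of the acceptance condition on the Acceptance slide, the Python below tracks the set of states an NFA could be in after each input symbol and accepts if that set contains an accepting state at the end. The table mirrors the Transition table slide as far as it can be read; treating state 3 as the only accepting state, and the names `NFA_TABLE`, `START`, `ACCEPTING` and `nfa_accepts`, are assumptions made for illustration.

```python
# NFA_TABLE[state][symbol] -> set of possible next states (missing entry = no edge).
NFA_TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
ACCEPTING = {3}  # assumption: the slide residue does not say which states accept


def nfa_accepts(text):
    """Return True if some path from START spells out `text` and ends in ACCEPTING."""
    current = {START}
    for ch in text:
        # Follow every edge labelled `ch` from every state we might be in.
        current = {t for s in current for t in NFA_TABLE[s].get(ch, set())}
    return bool(current & ACCEPTING)


if __name__ == "__main__":
    for sample in ["abb", "aabb", "ab", "abba"]:
        print(sample, nfa_accepts(sample))
```

Tracking a set of states rather than a single state is what lets the simulation cover every possible path at once; an empty set means no path spells out the input.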
[...] strings of letters, including ε
[...] D)*: all strings of letters and digits beginning with a letter
[...] all strings of one or more digits

Regular Expressions (REs) over Alphabet ∑
• Inductive base:
  1. ε is a RE, denoting the RL {ε}
  2. a is a RE, denoting the RL {a}
• Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then
  3. (r)|(s) is a RE, denoting the RL L(r)...

[...]

Overview
[Figure: RE, NFA, DFA and mDFA connected by arrows labelled with the algorithm numbers 3.3, 3.5, 3.2 and 3.6]

Nondeterministic finite automata
• A nondeterministic finite automaton (NFA) is a mathematical model that consists...

[Figure: composite NFAs built from N(s) and N(t), with start state i and accepting state f]
  – For (s), use N(s) itself

Subset construction
Operation   Description
[...]       Set of NFA states reachable from...

Subset construction (cont'd)
  Let s0 be the start state of the NFA;
  initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
  while there is an unmarked state T in Dstates do begin
      mark T;
      for each input symbol a do begin
          U := ε-closure(move(T, a));
          if U is not in Dstates then
              add U as an unmarked state to Dstates;
          DTran[T, a] := U;
      end;
  end;

DFA
• Let (...
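The subset-construction pseudocode can be sketched in Python roughly as follows. The example NFA (states 0 to 10, intended to recognise (a|b)*abb), the use of the empty string as the label for ε-edges, and all function names are assumptions made for illustration; they do not come from the slides.

```python
EPS = ""  # assumption: ε-edges are stored under the empty-string label

# Hypothetical example NFA: NFA[state][label] -> set of next states.
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}}, 5: {EPS: {6}}, 6: {EPS: {1, 7}},
    7: {"a": {8}}, 8: {"b": {9}}, 9: {"b": {10}}, 10: {},
}


def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` using ε-edges only."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(nfa, states, symbol):
    """All NFA states reachable from `states` by one edge labelled `symbol`."""
    return {t for s in states for t in nfa.get(s, {}).get(symbol, ())}


def subset_construction(nfa, start, alphabet):
    """Return (DFA start state, list of DFA states, transition dict)."""
    s0 = epsilon_closure(nfa, {start})
    dstates, unmarked, dtran = [s0], [s0], {}
    while unmarked:
        T = unmarked.pop()  # popping T plays the role of "mark T"
        for a in alphabet:
            U = epsilon_closure(nfa, move(nfa, T, a))
            if U not in dstates:  # an empty U would become a dead state
                dstates.append(U)
                unmarked.append(U)
            dtran[T, a] = U
    return s0, dstates, dtran


if __name__ == "__main__":
    start, dstates, dtran = subset_construction(NFA, 0, "ab")
    for state in dstates:
        print(sorted(state), {a: sorted(dtran[state, a]) for a in "ab"})
```

Each DFA state is a `frozenset` of NFA states so that it can serve as a dictionary key in the transition table.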
[...] ε-closure(s0)

Minimise a DFA
Initially, create two states:
  1. one is the set of all final states: F
  2. the other is the set of all...

Example (figure omitted: transition graph of a DFA with states A, B, C, D, E over {a, b}):
  Step 1: {A,B,C,D} {E}
          For a, {B,B,B,B}
          For b, {C,D,C,E}   ⇒ Split
  Step 2: {A,B,C} {D} {E}
          For b, {C,D,C}     ⇒ Split
  Step 3: {A,C} {B} {D} {E}
          For a, {B,B}
          For b, {C,C}       ⇒ Terminate

[...] highest precedence
  – "|" has the lowest precedence
• Associativity:
  – all are left-associative
E.g.: (a)|((b)*(c)) can be written as a|b*c
⇒ Unnecessary parentheses can be removed

Example
• ∑ = {a, b}
  1. a|b denotes {a,b}
  2. (a|b)(a|b) denotes {aa,ab,ba,bb}
  3. a* denotes {ε,a,aa,aaa,aaaa,…}
  4. (a|b)* denotes ?
  5. a|a*b denotes ?

Notational Shorthands
• One or more instances...

[...]
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else terminate the analysis
}

Transition Diagrams
[Figure: transition diagram for relop, starting at state 0, with accepting states returning (relop,LE), (relop,NE) and (relop,LT); transition diagram for id → letter(letter|digit)*, looping on "letter or digit" and returning (id,lexeme) on "other"]
Transition diagram is a DFA in which there is no edge leaving...
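The partition-refinement steps of the Minimise a DFA example can be sketched in Python as follows. The transitions for A to D are read off the "For a" and "For b" rows of that example; the transitions out of E, and the names `DELTA` and `minimise`, are assumptions made for illustration (E sits alone in its group, so its outgoing edges do not affect the result).

```python
# DELTA[state, symbol] -> next state of a complete DFA.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",  # assumption: not recoverable from the slide
}


def minimise(states, alphabet, delta, finals):
    """Split groups until no input symbol distinguishes two states in a group."""
    # Initial partition: final states versus all other states.
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # Two states stay together only if every symbol leads
                # them into the same group of the current partition.
                signature = tuple(group_of[delta[s, a]] for a in alphabet)
                buckets.setdefault(signature, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):  # nothing was split: done
            return refined
        partition = refined


if __name__ == "__main__":
    groups = minimise("ABCDE", "ab", DELTA, {"E"})
    print(sorted(sorted(g) for g in groups))
```

Each returned group becomes one state of the minimal DFA; for this example the groups agree with Step 3 of the slide, {A,C} {B} {D} {E}.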