An introduction to compilers

D. Vermeir
Dept. of Computer Science
Free University of Brussels (VUB)
dvermeir@vub.ac.be

January 26, 2001

Contents

1 Introduction
  1.1 Compilers and languages
  1.2 Applications of compilers
  1.3 Overview of the compilation process
    1.3.1 Micro
    1.3.2 JVM code
    1.3.3 Lexical analysis
    1.3.4 Syntax analysis
    1.3.5 Semantic analysis
    1.3.6 Intermediate code generation
    1.3.7 Optimization
    1.3.8 Code generation
2 Lexical analysis
  2.1 Introduction
  2.2 Regular expressions
  2.3 Finite state automata
    2.3.1 Deterministic finite automata
    2.3.2 Nondeterministic finite automata
  2.4 Regular expressions vs finite state automata
  2.5 A scanner generator
3 Parsing
  3.1 Context-free grammars
  3.2 Top-down parsing
    3.2.1 Introduction
    3.2.2 Eliminating left recursion in a grammar
    3.2.3 Avoiding backtracking: LL(1) grammars
    3.2.4 Predictive parsers
    3.2.5 Construction of first and follow
  3.3 Bottom-up parsing
    3.3.1 Shift-reduce parsers
    3.3.2 LR(1) parsing
    3.3.3 LALR parsers and yacc/bison
4 Checking static semantics
  4.1 Attribute grammars and syntax-directed translation
  4.2 Symbol tables
    4.2.1 String pool
    4.2.2 Symbol tables and scope rules
  4.3 Type checking
5 Intermediate code generation
  5.1 Postfix notation
  5.2 Abstract syntax trees
  5.3 Three-address code
  5.4 Translating assignment statements
  5.5 Translating boolean expressions
  5.6 Translating control flow statements
  5.7 Translating procedure calls
  5.8 Translating array references
6 Optimization of intermediate code
  6.1 Introduction
  6.2 Local optimization of basic blocks
    6.2.1 DAG representation of basic blocks
    6.2.2 Code simplification
    6.2.3 Array and pointer assignments
    6.2.4 Algebraic identities
  6.3 Global flow graph information
    6.3.1 Reaching definitions
    6.3.2 Available expressions
    6.3.3 Live variable analysis
    6.3.4 Definition-use chaining
    6.3.5 Application: uninitialized variables
  6.4 Global optimization
    6.4.1 Elimination of global common subexpressions
    6.4.2 Copy propagation
    6.4.3 Constant folding and elimination of useless variables
    6.4.4 Loops
    6.4.5 Moving loop invariants
    6.4.6 Loop induction variables
  6.5 Aliasing: pointers and procedure calls
    6.5.1 Pointers
    6.5.2 Procedures
7 Code generation
  7.1 Run-time storage management
    7.1.1 Global data
    7.1.2 Stack-based local data
  7.2 Instruction selection
  7.3 Register allocation
  7.4 Peephole optimization
A Mc: the Micro-JVM Compiler
  A.1 Lexical analyzer
  A.2 Symbol table management
  A.3 Parser
  A.4 Driver script
  A.5 Makefile
B Minic parser and type checker
  B.1 Lexical analyzer
  B.2 String pool management
  B.3 Symbol table management
  B.4 Types library
  B.5 Type checking routines
  B.6 Parser with semantic actions
  B.7 Utilities
  B.8 Driver script
  B.9 Makefile
Index
Bibliography

Chapter 1

Introduction

1.1 Compilers and languages

A compiler is a program that translates a source language text into an equivalent target language text. E.g. for a C compiler, the source language is C while the target language may be Sparc assembly language. Of course, one expects a compiler to do a faithful translation, i.e. the meaning of the translated text should be the same as the meaning of the source text. One would not be pleased to see the C program in figure 1.1

    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
      int x = 34;
      x = x * 24;
      printf("%d\n", x);
    }

    Figure 1.1: A source text in the C language

translated to an assembler program that, when executed, printed "Goodbye world" on the standard output. So we want the translation performed by a compiler to be semantics preserving. This implies that the compiler is able to "understand" (compute the semantics of) the source text. The compiler must also "understand" the target language in order to be able to generate a semantically equivalent target text.

Thus, in order to develop a compiler, we need a precise definition of both the source and the target language. This means that both source and target language must be formal. A language has two aspects: a syntax and a semantics. The syntax prescribes which texts are grammatically correct and the semantics specifies how to derive the meaning from a syntactically correct text. For the C language, the syntax specifies e.g. that "the body of a function must be enclosed between matching braces ("{" and "}")". The semantics says that the meaning of the second statement in figure 1.1 is that "the value of the variable x is multiplied by 24 and the result becomes the new value of the variable x".

It turns out that there exist excellent formalisms and tools to describe the syntax of a formal language. For the description of the semantics, the situation is less clear in that existing semantics specification formalisms are not nearly as simple and easy to use as syntax specifications.

1.2 Applications of compilers

Traditionally, a compiler is thought of as translating a so-called "high level language" such as C (if you want to call C a high-level language) or Modula-2 into assembly language. Since assembly language cannot be directly executed, a further translation between assembly language and (relocatable) machine language is necessary. Such programs are usually called assemblers, but it is clear that an assembler is just a special (easier) case of a compiler.

Sometimes, a compiler translates between high level languages. E.g. the first C++ implementations used a compiler called "cfront" which translated C++ code to C code. Such a compiler is often called a "cross-compiler".

On the other hand, a compiler need not target a real assembly (or machine) language. E.g. Java compilers generate code for a virtual machine called the "Java Virtual Machine" (JVM). The JVM interpreter then interprets JVM instructions without any further translation.

In general, an interpreter needs to understand only the source language. Instead of translating the source text, an interpreter immediately executes the instructions in the source text. Many languages are usually "interpreted", either directly, or after a compilation to some virtual machine code: Lisp, Smalltalk, Prolog and SQL are among those.
The advantages of using an interpreter are that it is easy to port a language to a new machine: all one has to do is to implement the virtual machine on the new hardware. Also, since instructions are evaluated and examined at run-time, it becomes possible to implement very flexible languages. E.g. for an interpreter it is not a problem to support variables that have a dynamic type, something which is hard to do in a traditional compiler. Interpreters can even construct "programs" at run time and interpret those without difficulties, a capability that is available e.g. for Lisp or Prolog.

Finally, compilers (and interpreters) have wider applications than just translating programming languages. Conceivably any large and complex application might define its own "command language" which can be translated to a virtual machine associated with the application. Using compiler generating tools, defining and implementing such a language need not be difficult. SQL, for instance, can be regarded as such a language associated with a database management system. Other so-called "little languages" provide a convenient interface to specialized libraries. E.g. (n)awk is a language that is very convenient for doing powerful pattern matching and extraction operations on large text files.

1.3 Overview of the compilation process

In this section we will illustrate the main phases of the compilation process through a simple compiler for a toy programming language. The source for an implementation of this compiler can be found in appendix A and on the web site of the course.

    program : statement_list ;
    statement_list : statement | statement statement_list ;
    statement : declaration | assignment | read_statement | write_statement ;
    declaration : declare var ;
    assignment : var = expression ;
    read_statement : read var ;
    write_statement : write expression ;
    expression : term | term + term | term - term ;
    term : NUMBER | var | ( expression ) ;
    var : NAME ;

    Figure 1.2: The syntax of the Micro language

1.3.1 Micro

The source language "Micro" is very simple. It is based on the toy language described in [FL91]. The syntax of Micro is described by the rules in figure 1.2. We will see in chapter 3 that such rules can be formalized into what is called a grammar.

Note that NUMBER and NAME have not been further defined. The idea is, of course, that NUMBER represents a sequence of digits and that NAME represents a string of letters and digits, starting with a letter.

A simple Micro program is shown in figure 1.3.

    declare xyz;
    xyz = (33+3)-35;
    write xyz;

    Figure 1.3: A Micro program

The semantics of Micro should be clear: a Micro program consists of a sequence of read/write or assignment statements. There are integer-valued variables (which need to be declared before they are used) and expressions are restricted to addition and subtraction. (The output of the program in figure 1.3 is, of course, 1.)

1.3.2 JVM code

The target language will be code for the Java Virtual Machine. Figure 1.4 shows the output of the compiler for the program of figure 1.3. The JVM is a so-called "stack based" machine, which means that there are no registers. Instead, most instructions take their operands from the stack and place the result on the top of the stack. An item on the stack can be an integer, (a reference to) an object, etc.

line 1: The JVM is an object-oriented machine; JVM instructions are stored in so-called "class files". A class file contains the code for all methods of a class. Therefore we are forced to package Micro programs in classes. The name of the class here is t4, which is derived by the compiler from the name of the Micro source file.

line 3: Since the Micro language is not object-oriented, we choose to put the code for a Micro program in a so-called static method, essentially a method that can be called without an object. It so happens that the JVM interpreter (usually called "java" on Unix machines) takes a classname as argument and then executes a static method main(String[]) from this class. Therefore the code for a Micro program is placed in a static method main(String[]) of the generated class, so that it can be run directly by the JVM interpreter.
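Figure 1.4 itself is not included in this extract. As a rough, hand-written illustration of what stack-based JVM code for the program of figure 1.3 could look like, here is a sketch in Jasmin-style JVM assembly. Only the class name t4 and the use of a static main(String[]) method come from the text above; the instruction selection, the local-variable numbering and the stack/locals limits are assumptions of this sketch, not the literal contents of figure 1.4.

    .class public t4
    .super java/lang/Object

    .method public static main([Ljava/lang/String;)V
    .limit stack 2
    .limit locals 2
       ; xyz = (33+3)-35;
       bipush 33
       bipush 3
       iadd                ; stack now holds 36
       bipush 35
       isub                ; stack now holds 1
       istore 1            ; local variable 1 plays the role of xyz
       ; write xyz;
       getstatic java/lang/System/out Ljava/io/PrintStream;
       iload 1
       invokevirtual java/io/PrintStream/println(I)V
       return
    .end method

Assembled with a JVM assembler such as Jasmin and run as "java t4", this would print 1, in line with the expected output of the Micro program noted above.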
[...]

...separators to distinguish one meaningful string from another. The first job of a compiler is then to group sequences of raw characters into meaningful tokens. The lexical analyzer module is responsible for this. Conceptually, the lexical analyzer (often called scanner) transforms a sequence of characters into a sequence of tokens. In addition, a lexical analyzer will typically access the symbol table to store [...]

...can be efficiently implemented, e.g. by encoding the states as numbers and using an array to represent the transition function. This is illustrated in figure 2.5. The next state array can be automatically generated from the DFA description. What is not clear is how to translate regular expressions to DFA's. To show how this can be done, we need the more general concept of a nondeterministic finite automaton [...]

...regarded as an abstract machine to recognize strings described by regular expressions. Automatic translation of such an automaton to actual code will turn out to be straightforward.

2.3 Finite state automata

2.3.1 Deterministic finite automata

Definition 2 A deterministic finite automaton (DFA) [...]

Figure 2.2: The transition diagram for lex()

...of the token after reading an extra character that will not belong to the token. In such a case we must push the extra character back onto the input before returning. Such states have been marked with a * in figure 2.2. If we read a character that doesn't fit any transition diagram, we return a special ERROR token type. Clearly, writing a scanner by hand seems to be easy, [...]

Figure 1.5: Result of lexical analysis of the program in figure 1.3

...see in chapter 2 that regular expressions and finite automata provide a powerful and convenient method to automate this job.

1.3.4 Syntax analysis

Once lexical analysis is finished, the parser takes over to check whether the sequence of tokens is grammatically correct, according to the rules that define the syntax of the source language. Looking at the [...]

...job of the parser is to construct a parse tree that fits, according to the syntax specification, the token sequence that was generated by the lexical analyzer. In chapter 3, we'll see how context-free grammars can be used to specify the syntax of a programming language and how it is possible to automatically generate parser programs from such a context-free grammar.

1.3.5 Semantic analysis

Having established [...]

...is that the automaton can arbitrarily (nondeterministically) choose one of the possibilities. In addition, we will also allow ε-moves, where the automaton makes a state transition (labeled by ε) without reading an input symbol.

Definition 4 A nondeterministic finite automaton (NFA) is a tuple (Q, Σ, δ, q0, F) where Q is a finite set of states, Σ is a finite input alphabet, and δ : Q × (Σ ∪ {ε}) → 2^Q is a (total) transition function [...]
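The definition of an NFA above allows ε-moves, and the construction hinted at in these excerpts (building a DFA M' that simulates the NFA M) relies on computing, for a set of NFA states, all states reachable without reading any input. The following is a minimal sketch of that ε-closure computation in C; the representation of the NFA (a fixed-size matrix eps[][] of ε-moves) and the state numbering are assumptions made for this illustration only, not the data structures used in the course's own code.

    #include <stdio.h>

    #define MAX_STATES 32

    /* eps[p][q] != 0 iff the NFA has an epsilon-move from state p to state q
       (an assumed, simplistic representation). */
    static int eps[MAX_STATES][MAX_STATES];

    /* On entry, in_set[q] != 0 marks a set S of NFA states; on exit, in_set
       marks eps-closure(S): every state reachable from S without reading
       any input symbol. */
    static void eps_closure(int in_set[MAX_STATES], int n_states)
    {
      int changed = 1;
      while (changed) {            /* keep adding states until a fixpoint */
        changed = 0;
        for (int p = 0; p < n_states; p++)
          if (in_set[p])
            for (int q = 0; q < n_states; q++)
              if (eps[p][q] && !in_set[q]) {
                in_set[q] = 1;
                changed = 1;
              }
      }
    }

    int main(void)
    {
      int set[MAX_STATES] = { 1 };   /* start from the set { 0 } */
      eps[0][1] = 1;                 /* example: 0 -eps-> 1 and 1 -eps-> 2 */
      eps[1][2] = 1;
      eps_closure(set, 3);
      printf("%d %d %d\n", set[0], set[1], set[2]);   /* prints "1 1 1" */
      return 0;
    }

In the subset construction, each such closed set of NFA states becomes a single state of the DFA M'.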
[...]

...the set of states that M could move to from q0 without reading any input; F' = { s ∈ 2^Q | s ∩ F ≠ ∅ }, i.e. if M could end up in a final state, then M' will do so. It can then be shown that L(M') = L(M).

2.4 Regular expressions vs finite state automata

In this section we show how a regular expression can be translated to a nondeterministic finite automaton that defines the same language. Using theorem 1, we can then translate [...]

...generated scanner will:

1. Process input characters, trying to find a longest string that matches any of the regular expressions (if two expressions match the same longest string, the one that was declared first is chosen).

2. Execute the code associated with the selected regular expression. This code can, e.g., install something in the symbol table, return a token type, or whatever.

In the next section we will see how a regular expression can be converted to a so-called deterministic finite automaton that can be [...]

    ...
      c = getchar();
      if (isdigit(c))
        *pbuf++ = c;
      else
        state = 3;
      break;
    case 3:
      token.info.value = atoi(buf);
      token.type = NUMBER;
      ungetc(c, stdin);
      state = 0;
      return &token;
      break;
    case 5:
      token.type = LBRACE;
      state = 0;
      return &token;
      break;
    case 7:
      token.type = RBRACE;
      state = 0;
      return &token;
      break;
    case 9:
      token.type = LPAREN;
      state = 0;
      return &token;
      break;
    case 11:
      c = getchar();
      if (isspace(c))
    ...
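The hand-written scanner fragment above is exactly the style of code that the text contrasts with a generated, table-driven one built on the "next state array" mentioned in an earlier excerpt (figure 2.5 is not part of this extract). As a rough illustration of that table-driven idea, here is a small recognizer for unsigned integers in C; the two-state automaton, its state numbering and the table layout are invented for this sketch and do not come from the course text.

    #include <stdio.h>

    /* Assumed toy DFA: state 0 is the start state, state 1 means
       "inside a number" and is the only final state. */
    enum { N_STATES = 2, N_CHARS = 256 };

    static int next_state[N_STATES][N_CHARS];   /* -1 means "no transition" */

    static void init_table(void)
    {
      int s, c;
      for (s = 0; s < N_STATES; s++)
        for (c = 0; c < N_CHARS; c++)
          next_state[s][c] = -1;
      for (c = '0'; c <= '9'; c++) {
        next_state[0][c] = 1;   /* a digit starts a number */
        next_state[1][c] = 1;   /* further digits stay in the number */
      }
    }

    /* Returns 1 iff the whole string s is accepted by the toy DFA. */
    static int accepts(const char *s)
    {
      int state = 0;
      while (*s) {
        state = next_state[state][(unsigned char)*s++];
        if (state < 0)
          return 0;             /* the table says no move is possible */
      }
      return state == 1;
    }

    int main(void)
    {
      init_table();
      printf("%d %d\n", accepts("3435"), accepts("34a5"));   /* prints "1 0" */
      return 0;
    }

A scanner generator such as lex produces essentially this kind of table plus a driver loop, but for all token types at once and with the longest-match rule described above.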
