INTRODUCTION TO COMPUTER SCIENCE
HANDOUT #9. GRAMMARS

K5 & K6, Computer Science Department, Văn Lang University
Second semester, Feb 2002
Instructor: Trần Đức Quang

Major themes:
1. Context-Free Grammars
2. Languages from Grammars

Reading: Sections 11.2 and 11.3.

9.1 CONTEXT-FREE GRAMMARS

In the last two handouts, we met two equivalent ways to describe patterns. In this handout, we shall see another way, called context-free grammars (or "grammars"). Grammars are more powerful in the sense that they can describe more languages than the other two.

Suppose we want to define arithmetic expressions that involve
1. The four binary operators, +, −, ∗, and /,
2. Parentheses for grouping, and
3. Operands that are numbers.

The usual definition is of the form:

BASIS. A number is an expression.

INDUCTION. If E is an expression, then each of the following is also an expression.
1. ( E ). That is, we may place parentheses around an expression to get a new expression.
2. E + E. That is, two expressions connected by a plus sign form an expression.
3. E − E. This and the next two rules are analogous to (2) with the other operators.
4. E ∗ E.
5. E / E.

To be more concise, we can use a grammar to define our expressions:

(1) <Expression> → number
(2) <Expression> → ( <Expression> )
(3) <Expression> → <Expression> + <Expression>
(4) <Expression> → <Expression> − <Expression>
(5) <Expression> → <Expression> ∗ <Expression>
(6) <Expression> → <Expression> / <Expression>

The symbol <Expression> is called a syntactic category, or a variable; it stands for any arithmetic expression. The symbol → means "can be composed of". For example, rule (2) states that an expression can be composed of a left parenthesis, followed by any string that is an expression, followed by a right parenthesis.

There are three kinds of symbols that appear in grammars.

1.
The first are "metasymbols," symbols that play special roles and do not stand for themselves. The only example we have seen so far is the symbol →, which is used to separate the syntactic category being defined from a way in which strings of that syntactic category may be composed.
2. The second kind of symbol is a syntactic category, which, as we mentioned, represents a set of strings being defined.
3. The third kind of symbol is called a terminal. Terminals can be characters such as + or (, or they can be abstract symbols that are already known or do not need to be defined in the grammar. The symbol number in our grammar is of this kind.

A context-free grammar consists of one or more productions. Each line in our grammar is a production. In general, a production has three parts:
1. A head, which is the syntactic category on the left side of the arrow,
2. The metasymbol →, and
3. A body, consisting of zero or more syntactic categories and/or terminals on the right side of the arrow.

Our grammar for simple expressions has six productions, numbered 1 to 6.

We can augment the grammar for expressions by providing productions for number, a symbol that has so far been viewed as a terminal, and productions for a new syntactic category <Digit>. With number rewritten as the syntactic category <Number>, three more productions can be added to our working grammar.

(7) <Digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(8) <Number> → <Digit>
(9) <Number> → <Number> <Digit>

In fact, line (7) is shorthand for ten productions, one for each of the ten decimal digits; the vertical bar | separates the alternative bodies.

<Digit> → 0
<Digit> → 1
. . .
<Digit> → 9

A more complete grammar for expressions is then:

(1) <Expression> → <Number>
(2) <Expression> → ( <Expression> )
(3) <Expression> → <Expression> + <Expression>
(4) <Expression> → <Expression> − <Expression>
(5) <Expression> → <Expression> ∗ <Expression>
(6) <Expression> → <Expression> / <Expression>
(7) <Digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(8) <Number> → <Digit>
(9) <Number> → <Number> <Digit>

We can also describe the structure of control flow in a language like C grammatically. For a simple example, it helps to imagine that there are abstract terminals condition and simpleStat. The former stands for a conditional expression. We could replace this terminal by a syntactic category, say <Condition>. The productions for <Condition> would resemble those of our expression grammar above, but with logical operators like &&, comparison operators like <, and the arithmetic operators.

The terminal simpleStat stands for a statement that does not involve nested control structure, such as an assignment, a function call, break, continue, or return. Again, we could replace this terminal by a syntactic category and the productions to expand it.

In the grammar for statements below, we use keywords such as if, else, and while, and punctuators such as { and ;, as terminals.

<Statement> → while ( condition ) <Statement>
<Statement> → if ( condition ) <Statement>
<Statement> → if ( condition ) <Statement> else <Statement>
<Statement> → { <StatList> }
<Statement> → simpleStat ;
<StatList> → ε
<StatList> → <StatList> <Statement>

9.2 LANGUAGES FROM GRAMMARS

A grammar is essentially an inductive definition involving sets of strings. Thus, from a grammar for a syntactic category, we can produce the set of strings that belong to this syntactic category by repeatedly applying the productions, obtaining more and more strings.
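The process just described can be sketched in Python. This is only an illustration, not part of the handout's method: the names GRAMMAR, generate, and max_symbols are hypothetical, the terminal number is kept abstract as in the first expression grammar, and the ASCII characters - and * stand in for the operators − and ∗. The sketch also assumes that no production has an empty body (ε), so a string that has grown too long can be discarded safely.

```python
from collections import deque

# The six productions of the first expression grammar.
# Each body is a list of symbols; `number` is an abstract terminal.
GRAMMAR = {
    "<Expression>": [
        ["number"],
        ["(", "<Expression>", ")"],
        ["<Expression>", "+", "<Expression>"],
        ["<Expression>", "-", "<Expression>"],
        ["<Expression>", "*", "<Expression>"],
        ["<Expression>", "/", "<Expression>"],
    ],
}


def generate(grammar, start, max_symbols):
    """Return all terminal strings of at most max_symbols symbols
    that can be produced from the syntactic category `start`."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Locate the leftmost syntactic category in the current string.
        index = next((i for i, s in enumerate(form) if s in grammar), None)
        if index is None:
            # Only terminals remain: this string is in the language.
            results.add(" ".join(form))
            continue
        # Replace that category by each body of its productions.
        for body in grammar[form[index]]:
            new_form = form[:index] + tuple(body) + form[index + 1:]
            # Assumption: no body is empty, so strings never shrink
            # and anything longer than the bound can be dropped.
            if len(new_form) <= max_symbols and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results
```

For example, generate(GRAMMAR, "<Expression>", 3) yields six strings: number, ( number ), and number combined with number by each of the four operators. Raising the bound admits longer strings, which mirrors how further rounds of the inductive definition add more expressions.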
If a grammar has more than one syntactic category, then by convention the syntactic category whose strings we want is written first. In some compiler textbooks, this syntactic category is called the start symbol. For example, in our first grammar, the start symbol is <Expression>, whereas in the second, the start symbol is <Statement>.

9.3 GLOSSARY

Grammar: Văn phạm.
Context-free grammar: Văn phạm phi ngữ cảnh.
Syntax: Cú pháp.
Syntactic Category: Phạm trù cú pháp.
Plus sign: Dấu cộng.
Minus sign: Dấu trừ.
Metasymbol: Meta ký hiệu.
Terminal: Ký hiệu tận, tận.
Nonterminal: Ký hiệu chưa tận, chưa tận.
Production: Luật sinh.
Head: Đầu (luật sinh).
Body: Thân (luật sinh).
Decimal Digit: Ký số thập phân.
Control Structure: Cấu trúc điều khiển.
Start Symbol: Ký hiệu khởi đầu, khởi tự.