Basics of Compiler Design
Extended edition

Torben Ægidius Mogensen

Department of Computer Science
University of Copenhagen

Published through lulu.com.
© Torben Ægidius Mogensen 2000 - 2008
torbenm@diku.dk

Department of Computer Science
University of Copenhagen
Universitetsparken 1
DK-2100 Copenhagen
DENMARK

Book homepage: http://www.diku.dk/~torbenm/Basics

First published 2000
This edition: July 25, 2008

Contents

1 Introduction
  1.1 What is a compiler?
  1.2 The phases of a compiler
  1.3 Interpreters
  1.4 Why learn about compilers?
  1.5 The structure of this book
  1.6 To the lecturer
  1.7 Acknowledgements
  1.8 Permission to use

2 Lexical Analysis
  2.1 Introduction
  2.2 Regular expressions
    2.2.1 Shorthands
    2.2.2 Examples
  2.3 Nondeterministic finite automata
  2.4 Converting a regular expression to an NFA
    2.4.1 Optimisations
  2.5 Deterministic finite automata
  2.6 Converting an NFA to a DFA
    2.6.1 Solving set equations
    2.6.2 The subset construction
  2.7 Size versus speed
  2.8 Minimisation of DFAs
    2.8.1 Example
    2.8.2 Dead states
  2.9 Lexers and lexer generators
    2.9.1 Lexer generators
  2.10 Properties of regular languages
    2.10.1 Relative expressive power
    2.10.2 Limits to expressive power
    2.10.3 Closure properties
  2.11 Further reading
  Exercises

3 Syntax Analysis
  3.1 Introduction
  3.2 Context-free grammars
    3.2.1 How to write context free grammars
  3.3 Derivation
    3.3.1 Syntax trees and ambiguity
  3.4 Operator precedence
    3.4.1 Rewriting ambiguous expression grammars
  3.5 Other sources of ambiguity
  3.6 Syntax analysis
  3.7 Predictive parsing
  3.8 Nullable and FIRST
  3.9 Predictive parsing revisited
  3.10 FOLLOW
  3.11 LL(1) parsing
    3.11.1 Recursive descent
    3.11.2 Table-driven LL(1) parsing
    3.11.3 Conflicts
  3.12 Rewriting a grammar for LL(1) parsing
    3.12.1 Eliminating left-recursion
    3.12.2 Left-factorisation
    3.12.3 Construction of LL(1) parsers summarized
  3.13 SLR parsing
  3.14 Constructing SLR parse tables
    3.14.1 Conflicts in SLR parse-tables
  3.15 Using precedence rules in LR parse tables
  3.16 Using LR-parser generators
    3.16.1 Declarations and actions
    3.16.2 Abstract syntax
    3.16.3 Conflict handling in parser generators
  3.17 Properties of context-free languages
  3.18 Further reading
  Exercises

4 Symbol Tables
  4.1 Introduction
  4.2 Symbol tables
    4.2.1 Implementation of symbol tables
    4.2.2 Simple persistent symbol tables
    4.2.3 A simple imperative symbol table
    4.2.4 Efficiency issues
    4.2.5 Shared or separate name spaces
  4.3 Further reading
  Exercises

5 Type Checking
  5.1 Introduction
  5.2 Attributes
  5.3 A small example language
  5.4 Environments for type checking
  5.5 Type checking expressions
  5.6 Type checking of function declarations
  5.7 Type checking a program
  5.8 Advanced type checking
  5.9 Further reading
  Exercises

6 Intermediate-Code Generation
  6.1 Introduction
  6.2 Choosing an intermediate language
  6.3 The intermediate language
  6.4 Generating code from expressions
    6.4.1 Examples of translation
  6.5 Translating statements
  6.6 Logical operators
    6.6.1 Sequential logical operators
  6.7 Advanced control statements
  6.8 Translating structured data
    6.8.1 Floating-point values
    6.8.2 Arrays
    6.8.3 Strings
    6.8.4 Records/structs and unions
  6.9 Translating declarations
    6.9.1 Example: Simple local declarations
  6.10 Further reading
  Exercises

7 Machine-Code Generation
  7.1 Introduction
  7.2 Conditional jumps
  7.3 Constants
  7.4 Exploiting complex instructions
    7.4.1 Two-address instructions
  7.5 Optimisations
  7.6 Further reading
  Exercises

8 Register Allocation
  8.1 Introduction
  8.2 Liveness
  8.3 Liveness analysis
  8.4 Interference
  8.5 Register allocation by graph colouring
  8.6 Spilling
  8.7 Heuristics
    8.7.1 Removing redundant moves
  8.8 Further reading
  Exercises

9 Function calls
  9.1 Introduction
    9.1.1 The call stack
  9.2 Activation records
  9.3 Prologues, epilogues and call-sequences
  9.4 Caller-saves versus callee-saves
  9.5 Using registers to pass parameters
  9.6 Interaction with the register allocator
  9.7 Accessing non-local variables
    9.7.1 Global variables
    9.7.2 Call-by-reference parameters
    9.7.3 Nested scopes
  9.8 Variants
    9.8.1 Variable-sized frames
    9.8.2 Variable number of parameters
    9.8.3 Direction of stack-growth and position of FP
    9.8.4 Register stacks
  9.9 Further reading
  Exercises

10 Analysis and optimisation
  10.1 Data-flow analysis
  10.2 Common subexpression elimination
    10.2.1 Available assignments
    10.2.2 Example of available-assignments analysis
    10.2.3 Using available assignments for common subexpression elimination
  10.3 Jump-to-jump elimination
  10.4 Index-check elimination
  10.5 Limitations of data-flow analyses
  10.6 Loop optimisations
    10.6.1 Code hoisting
    10.6.2 Memory prefetching
  10.7 Optimisations for function calls
    10.7.1 Inlining
    10.7.2 Tail call optimisation
  10.8 Specialisation
  10.9 Further reading
  Exercises

11 Bootstrapping a compiler
  11.1 Introduction
  11.2 Notation
  11.3 Compiling compilers
    11.3.1 Full bootstrap
  11.4 Further reading
  Exercises

List of Figures

  2.1 Regular expressions
  2.2 Some algebraic properties of regular expressions
  2.3 Example of an NFA
  2.4 Constructing NFA fragments from regular expressions
  2.5 NFA for the regular expression (a|b)*ac
  2.6 Optimised NFA construction for regular expression shorthands
  2.7 Optimised NFA for [0-9]+
  2.8 DFA constructed from the NFA in figure 2.5
  2.9 Non-minimal DFA
  2.10 Minimal DFA
  2.11 Combined NFA for several tokens
  2.12 Combined DFA for several tokens
  3.1 From regular expressions to context free grammars
  3.2 Simple expression grammar
  3.3 Simple statement grammar
  3.4 Example grammar
  3.5 Derivation of the string aabbbcc using grammar 3.4
  3.6 Leftmost derivation of the string aabbbcc using grammar 3.4
  3.7 Syntax tree for the string aabbbcc using grammar 3.4
  3.8 Alternative syntax tree for the string aabbbcc using grammar 3.4
  3.9 Unambiguous version of grammar 3.4
  3.10 Preferred syntax tree for 2+3*4 using grammar 3.2
  3.11 Unambiguous expression grammar
  3.12 Syntax tree for 2+3*4 using grammar 3.11
  3.13 Unambiguous grammar for statements
  3.14 Fixed-point iteration for calculation of Nullable
  3.15 Fixed-point iteration for calculation of FIRST
  3.16 Recursive descent parser for grammar 3.9
  3.17 LL(1) table for grammar 3.9
  3.18 Program for table-driven LL(1) parsing
  3.19 Input and stack during table-driven LL(1) parsing
  3.20 Removing left-recursion from grammar 3.11
  3.21 Left-factorised grammar for conditionals
  3.22 SLR table for grammar 3.9
  3.23 Algorithm for SLR parsing
  3.24 Example SLR parsing
  3.25 Example grammar for SLR-table construction
  3.26 NFAs for the productions in grammar 3.25
  3.27 Epsilon-transitions added to figure 3.26
  3.28 SLR DFA for grammar 3.9
  3.29 Summary of SLR parse-table construction
  3.30 Textual representation of NFA states
  5.1 Example language for type checking
  5.2 Type checking of expressions
  5.3 Type checking a function declaration
  5.4 Type checking a program
  6.1 The intermediate language
  6.2 A simple expression language
  6.3 Translating an expression
  6.4 Statement language
  6.5 Translation of statements
  6.6 Translation of simple conditions
  6.7 Example language with logical operators
  6.8 Translation of sequential logical operators
  6.9 Translation for one-dimensional arrays
  6.10 A two-dimensional array
  6.11 Translation of multi-dimensional arrays
  6.12 Translation of simple declarations
  7.1 A subset of the MIPS instruction set
  8.1 Gen and kill sets
  8.2 Example program for liveness analysis
  8.3 succ, gen and kill for the program in figure 8.2
  8.4 Fixed-point iteration for liveness analysis
  8.5 Interference graph for the program in figure 8.2
  8.6 Algorithm 8.3 applied to the graph in figure 8.5
  8.7 Program from figure 8.2 after spilling variable a
  8.8 Interference graph for the program in figure 8.7
  8.9 Colouring of the graph in figure 8.8
  9.1 Simple activation record layout
  9.2 Prologue and epilogue for the frame layout shown in figure 9.1
  9.3 Call sequence for x := CALL f(a1,...,an) using the frame layout shown in figure 9.1
  9.4 Activation record layout for callee-saves
  9.5 Prologue and epilogue for callee-saves
  9.6 Call sequence for x := CALL f(a1,...,an) for callee-saves
  9.7 Possible division of registers for 16-register architecture
  9.8 Activation record layout for the register division shown in figure 9.7
  9.9 Prologue and epilogue for the register division shown in figure 9.7
  9.10 Call sequence for x := CALL f(a1,...,an) for the register division shown in figure 9.7
  9.11 Example of nested scopes in Pascal
  9.12 Adding an explicit frame-pointer to the program from figure 9.11
  9.13 Activation record with static link
  9.14 Activation records for f and g from figure 9.11
  10.1 Gen and kill sets for available assignments
  10.2 Example program for available-assignments analysis
  10.3 pred, gen and kill for the program in figure 10.2
  10.4 Fixed-point iteration for available-assignment analysis
  10.5 The program in figure 10.2 after common subexpression elimination
  10.6 Equations for index-check elimination
  10.7 Intermediate code for for-loop with index check

Chapter 1

Introduction

1.1 What is a compiler?

In order to reduce the complexity of designing and building computers, nearly all of these are made to execute relatively simple commands (but do so very quickly). A program for a computer must be built by combining these very simple commands into a program in what is called machine language. Since this is a tedious and error-prone process most programming is, instead, done using a high-level programming language. This language can be very different from the machine language that the computer can execute, so some means of bridging the gap is required. This is where the compiler comes in.

A compiler translates (or compiles) a program written in a high-level programming language that is suitable for human programmers into the low-level machine language that is required by computers. During this process, the compiler will also attempt to spot and report obvious programmer mistakes.

Using a high-level language for programming has a large impact on how fast programs can be developed. The main reasons for this are:

• Compared to machine language, the notation used by programming languages is closer to the way humans think about problems.

• The compiler can spot some obvious programming mistakes.

• Programs written in a high-level language tend to be shorter than equivalent programs written in machine language.

Another advantage of using a high-level language is that the same program can be compiled to many different machine languages and, hence, be brought to run on many different machines.

On the other hand, programs that are written in a high-level language and automatically translated to machine language may run somewhat slower than programs that are hand-coded in machine language. Hence, some time-critical programs are still written partly in machine language. A good compiler will, however, be able to get very close to the speed of hand-written machine code when translating well-structured programs.

1.2 The phases of a compiler

Since writing a compiler is a nontrivial task, it is a good idea to structure the work. A typical way of doing this is to split the compilation into several phases with well-defined interfaces. Conceptually, these phases operate in sequence (though in practice, they are often interleaved), each phase (except the first) taking the output from the previous phase as its input. It is common to let each phase be handled by a separate module. Some of these modules are written by hand, while others may be generated from specifications. Often, some of the modules can be shared between several compilers.

A common division into phases is described below.
In some compilers, the ordering of phases may differ slightly, some phases may be combined or split into several phases or some extra phases may be inserted between those mentioned below.

Lexical analysis: This is the initial part of reading and analysing the program text: The text is read and divided into tokens, each of which corresponds to a symbol in the programming language, e.g., a variable name, keyword or number.

Syntax analysis: This phase takes the list of tokens produced by the lexical analysis and arranges these in a tree-structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.

Type checking: This phase analyses the syntax tree to determine if the program violates certain consistency requirements, e.g., if a variable is used but not declared or if it is used in a context that doesn't make sense given the type of the variable, such as trying to use a boolean value as a function pointer.

Intermediate code generation: The program is translated to a simple machine-independent intermediate language.

Register allocation: The symbolic variable names used in the intermediate code are translated to numbers, each of which corresponds to a register in the target machine code.

Machine code generation: The intermediate language is translated to assembly language (a textual representation of machine code) for a specific machine architecture.

Assembly and linking: The assembly-language code is translated into binary representation and addresses of variables, functions, etc., are determined.

The first three phases are collectively called the frontend of the compiler and the last three phases are collectively called the backend. The middle part of the compiler is in this context only the intermediate code generation, but this often includes various optimisations and transformations on the intermediate code.

Each phase, through checking and transformation, establishes stronger invariants on the things it passes on to the next, so that writing each subsequent phase is easier than if these have to take all the preceding into account. For example, the type checker can assume absence of syntax errors and the code generation can assume absence of type errors.

Assembly and linking are typically done by programs supplied by the machine or operating system vendor, and are hence not part of the compiler itself, so we will not further discuss these phases in this book.

1.3 Interpreters

An interpreter is another way of implementing a programming language. Interpretation shares many aspects with compiling. Lexing, parsing and type-checking are done in an interpreter just as in a compiler. But instead of generating code from the syntax tree, the syntax tree is processed directly to evaluate expressions, execute statements, and so on. An interpreter may need to process the same piece of the syntax tree (for example, the body of a loop) many times and, hence, interpretation is typically slower than executing a compiled program. But writing an interpreter is often simpler than writing a compiler and the interpreter is easier to move to a different machine (see chapter 11), so for applications where speed is not of essence, interpreters are often used.

Compilation and interpretation may be combined to implement a programming language: The compiler may produce intermediate-level code which is then interpreted rather than compiled to machine code.
In some systems, there may even be parts of a program that are compiled to machine code, some parts that are compiled to intermediate code, which is interpreted at runtime while other parts may be kept as a syntax tree and interpreted directly. Each choice is a compromise between speed and space: Compiled code tends to be bigger than intermediate code, which tend to be bigger than syntax, but each step of translation improves running speed. Using an interpreter is also useful during program development, where it is more important to be able to test a program modication quickly rather than run the program eciently. And since interpreters do less work on the program before execution starts, they are able to start running the program more quickly. Furthermore, since an interpreter works on a representation that is closer to the source code than is compiled code, error messages can be more precise and informative. We will not discuss interpreters in any detail in this book, except in relation to bootstrapping in chapter 11. A good introduction to interpreters can be found in [2]. 1.4 Why learn about compilers? Few people will ever be required to write a compiler for a general-purpose language like C, Pascal or SML. So why do most computer science institutions oer compiler courses and often make these mandatory? Some typical reasons are: a) It is considered a topic that you should know in order to be wellcultured in computer science. b) A good craftsman should know his tools, and compilers are important tools for programmers and computer scientists. c) The techniques used for constructing a compiler are useful for other purposes as well. d) There is a good chance that a programmer or computer scientist will need to write a compiler or interpreter for a domain-specic language. 1.5. THE STRUCTURE OF THIS BOOK 5 The rst of these reasons is somewhat dubious, though something can be said for knowing your roots, even in such a hastily changing eld as computer science. Reason b is more convincing: Understanding how a compiler is built will allow programmers to get an intuition about what their highlevel programs will look like when compiled and use this intuition to tune programs for better eciency. Furthermore, the error reports that compilers provide are often easier to understand when one knows about and understands the dierent phases of compilation, such as knowing the dierence between lexical errors, syntax errors, type errors and so on. The third reason is also quite valid. In particular, the techniques used for reading (lexing and into a form (abstract parsing) the text of a program and converting this syntax) that is easily manipulated by a computer, can be used to read and manipulate any kind of structured text such as XML documents, address lists, etc.. Reason d is becoming more and more important as domain specic languages (DSLs) are gaining in popularity. A DSL is a (typically small) language designed for a narrow class of problems. Examples are data-base query languages, text-formatting languages, scene description languages for ray-tracers and languages for setting up economic simulations. The target language for a compiler for a DSL may be traditional machine code, but it can also be another high-level language for which compilers already exist, a sequence of control signals for a machine, or formatted text and graphics in some printer-control language (e.g. PostScript). Even so, all DSL compilers will share similar front-ends for reading and analysing the program text. 
Hence, the methods needed to make a compiler front-end are more widely applicable than the methods needed to make a compiler backend, but the latter is more important for understanding how a program is executed on a machine. 1.5 The structure of this book The rst part of the book describes the methods and tools required to read program text and convert it into a form suitable for computer manipulation. This process is made in two stages: A lexical analysis stage that basically divides the input text into a list of words. This is followed by a syntax analysis (or parsing) stage that analyses the way these words form structures and converts the text into a data structure CHAPTER 1. INTRODUCTION 6 that reects the textual structure. Lexical analysis is covered in chapter 2 and syntactical analysis in chapter 3. The second part of the book (chapters 4  9) covers the middle part and back-end of the compiler, where the program is converted into machine language. Chapter 4 covers how denitions and uses of names (identiers) are connected through symbol tables. In chapter 5, this is used to type-check the program. In chapter 6, it is shown how expressions and statements can be compiled into an intermediate language, a language that is close to machine language but hides machine-specic details. In chapter 7, it is discussed how the intermediate language can be converted into real machine code. Doing this well requires that the registers in the processor are used to store the values of variables, which is achieved by a register allocation process, as described in chapter 8. Up to this point, a program has been what corresponds to the body of a single procedure. Procedure calls and nested procedure declarations add some issues, which are discussed in chapter 9. Chapter 10 deals with analysis and optimisation. Finally, chapter 11 will discuss the process of piler, i.e., bootstrapping a com- using a compiler to compile itself. Chapter 10 (on analysis and optimisation) was not found in editions before April 2008, which is why the latest editions are called extended 1.6 To the lecturer This book was written for use in the introductory compiler course at DIKU, the department of computer science at the University of Copenhagen, Denmark. At DIKU, the compiler course was previously taught right after the 1 introductory programming course , which is earlier than in most other universities. Hence, existing textbooks tended either to be too advanced for the level of the course or be too simplistic in their approach, for example only describing a single very simple compiler without bothering too much with the general picture. This book was written as a response to this and aims at bridging the gap: It is intended to convey the general picture without going into extreme detail about such things as ecient implementation or the newest techniques. It should give the students an understanding of how compilers work and the ability to make simple (but not simplistic) compilers 1 It is now in the second year. 1.7. ACKNOWLEDGEMENTS 7 for simple languages. It will also lay a foundation that can be used for studying more advanced compilation techniques, as found e.g. in [30]. At times, standard techniques from compiler construction have been simplied for presentation in this book. In such cases references are made to books or articles where the full version of the techniques can be found. The book aims at being language neutral. 
This means two things:

• Little detail is given about how the methods in the book can be implemented in any specific language. Rather, the description of the methods is given in the form of algorithm sketches and textual suggestions of how these can be implemented in various types of languages, in particular imperative and functional languages.

• There is no single through-going example of a language to be compiled. Instead, different small (sub-)languages are used in various places to cover exactly the points that the text needs. This is done to avoid drowning in detail, hopefully allowing the readers to see the wood for the trees.

Each chapter has a set of exercises. Few of these require access to a computer; most can be solved on paper or blackboard. In fact, many of the exercises are based on exercises that have been used in written exams at DIKU.

Teaching with this book can be supplemented with project work, where students write simple compilers. Since the book is language neutral, no specific project is given. Instead the teacher must choose relevant tools and select a project that fits the level of the students and the time available. Suitable credit for a course that uses this book is from 5 to 10 ECTS points, depending on the amount of project work.

1.7 Acknowledgements

The author wishes to thank all people who have been helpful in making this book a reality. This includes the students who have been exposed to draft versions of the book at the compiler courses Dat 1E and Oversættere at DIKU, and who have found numerous typos and other errors in the earlier versions. I would also like to thank the instructors at Dat 1E and Oversættere, who have pointed out places where things were not as clear as they could be. I am extremely grateful to the people who in 2000 read parts of or all of the first draft and made helpful suggestions.

1.8 Permission to use

Permission to copy and print for personal use is granted. If you, as a lecturer, want to print the book and sell it to your students, you can do so if you only charge the printing cost. If you want to print the book and sell it at a profit, please contact the author at torbenm@diku.dk and we will find a suitable arrangement. In all cases, if you find any misprints or other errors, please contact the author at torbenm@diku.dk. See also the book homepage: http://www.diku.dk/~torbenm/Basics.

Chapter 2

Lexical Analysis

2.1 Introduction

The word lexical in the traditional sense means pertaining to words. In terms of programming languages, words are objects like variable names, numbers, keywords etc. Such words are traditionally called tokens. A lexical analyser, or lexer for short, will as its input take a string of individual letters and divide this string into tokens. Additionally, it will filter out whatever separates the tokens (the so-called white-space), i.e., lay-out characters (spaces, newlines etc.) and comments.

The main purpose of lexical analysis is to make life easier for the subsequent syntax analysis phase. In theory, the work that is done during lexical analysis can be made an integral part of syntax analysis, and in simple systems this is indeed often done. However, there are reasons for keeping the phases separate:

• Efficiency: A lexer may do the simple parts of the work faster than the more general parser can. Furthermore, the size of a system that is split in two may be smaller than a combined system.
This may seem paradoxical but, as we shall see, there is a non-linear factor involved which may make a separated system smaller than a combined system.

• Modularity: The syntactical description of the language need not be cluttered with small lexical details such as white-space and comments.

• Tradition: Languages are often designed with separate lexical and syntactical phases in mind, and the standard documents of such languages typically separate lexical and syntactical elements of the languages.

It is usually not terribly difficult to write a lexer by hand: You first read past initial white-space, then you, in sequence, test to see if the next token is a keyword, a number, a variable or whatnot. However, this is not a very good way of handling the problem: You may read the same part of the input repeatedly while testing each possible token, and in some cases it may not be clear where the next token ends. Furthermore, a handwritten lexer may be complex and difficult to maintain. Hence, lexers are normally constructed by lexer generators, which transform human-readable specifications of tokens and white-space into efficient programs.

We will see the same general strategy in the chapter about syntax analysis: Specifications in a well-defined human-readable notation are transformed into efficient programs.

For lexical analysis, specifications are traditionally written using regular expressions: An algebraic notation for describing sets of strings. The generated lexers are in a class of extremely simple programs called finite automata.

This chapter will describe regular expressions and finite automata, their properties and how regular expressions can be converted to finite automata. Finally, we discuss some practical aspects of lexer generators.

2.2 Regular expressions

The set of all integer constants or the set of all variable names are sets of strings, where the individual letters are taken from a particular alphabet. Such a set of strings is called a language. For integers, the alphabet consists of the digits 0-9 and for variable names the alphabet contains both letters and digits (and perhaps a few other characters, such as underscore).

Given an alphabet, we will describe sets of strings by regular expressions, an algebraic notation that is compact and easy for humans to use and understand. The idea is that regular expressions that describe simple sets of strings can be combined to form regular expressions that describe more complex sets of strings.

When talking about regular expressions, we will use the letters (r, s and t) in italics to denote unspecified regular expressions. When letters stand for themselves (i.e., in regular expressions that describe strings using these letters) we will use typewriter font, e.g., a or b.

  a    L(a) = {"a"}: the set consisting of the one-letter string "a".
  ε    L(ε) = {""}: the set containing the empty string.
  s|t  L(s|t) = L(s) ∪ L(t): strings from both languages.
  st   L(st) = {vw | v ∈ L(s), w ∈ L(t)}: strings constructed by concatenating a string from the first language with a string from the second language. (Note: in the set-formula, "|" is not part of a regular expression, but part of the set-builder notation and reads as "where".)
  s*   L(s*) = {""} ∪ {vw | v ∈ L(s), w ∈ L(s*)}: each string in the language is a concatenation of any number of strings in the language of s.

Figure 2.1: Regular expressions
Hence, when we say, e.g., "the regular expression s" (in typewriter font) we mean the regular expression that describes the single one-letter string "s", but when we say "the regular expression s" (in italics), we mean a regular expression of any form which we just happen to call s. We use the notation L(s) to denote the language (i.e., set of strings) described by the regular expression s. For example, L(a) is the set {"a"}.

Figure 2.1 shows the constructions used to build regular expressions and the languages they describe:

• A single letter describes the language that has the one-letter string consisting of that letter as its only element.

• The symbol ε (the Greek letter epsilon) describes the language that consists solely of the empty string. Note that this is not the empty set of strings (see exercise 2.10).

• s|t (pronounced "s or t") describes the union of the languages described by s and t.

• st (pronounced "s t") describes the concatenation of the languages L(s) and L(t), i.e., the sets of strings obtained by taking a string from L(s) and putting this in front of a string from L(t). For example, if L(s) is {"a", "b"} and L(t) is {"c", "d"}, then L(st) is the set {"ac", "ad", "bc", "bd"}.

• The language for s* (pronounced "s star") is described recursively: It consists of the empty string plus whatever can be obtained by concatenating a string from L(s) to a string from L(s*). This is equivalent to saying that L(s*) consists of strings that can be obtained by concatenating zero or more (possibly different) strings from L(s). If, for example, L(s) is {"a", "b"} then L(s*) is {"", "a", "b", "aa", "ab", "ba", "bb", "aaa", ...}, i.e., any string (including the empty) that consists entirely of "a"s and "b"s.

Note that while we use the same notation for concrete strings and regular expressions denoting one-string languages, the context will make it clear which is meant. We will often show strings and sets of strings without using quotation marks, e.g., write {a, bb} instead of {"a", "bb"}. When doing so, we will use ε to denote the empty string, so the example from L(s*) above is written as {ε, a, b, aa, ab, ba, bb, aaa, ...}. The letters u, v and w in italics will be used to denote unspecified single strings, i.e., members of some language. As an example, abw denotes any string starting with ab.

Precedence rules

When we combine different constructor symbols, e.g., in the regular expression a|ab*, it isn't a priori clear how the different subexpressions are grouped. We can use parentheses to make the grouping of symbols clear. Additionally, we use precedence rules, similar to the algebraic convention that 3+4*5 means 3 added to the product of 4 and 5 and not multiplying the sum of 3 and 4 by 5. For regular expressions, we use the following conventions: * binds tighter than concatenation, which binds tighter than alternative (|). The example a|ab* from above is, hence, equivalent to a|(a(b*)).

The | operator is associative and commutative (as it is based on set union, which has these properties). Concatenation is associative (but obviously not commutative) and distributes over |. Figure 2.2 shows these and other algebraic properties of regular expressions, including definitions of some shorthands introduced below.

2.2.1 Shorthands

While the constructions in figure 2.1 suffice to describe e.g., number strings and variable names, we will often use extra shorthands for convenience.
For example, if we want to describe non-negative integer constants, we can do so by saying that it is one or more digits, which is expressed by the regular expression (0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*. The large number of different digits makes this expression rather verbose. It gets even worse when we get to variable names, where we must enumerate all alphabetic letters (in both upper and lower case).

Hence, we introduce a shorthand for sets of letters. Sequences of letters within square brackets represent the set of these letters. For example, we use [ab01] as a shorthand for a|b|0|1. Additionally, we can use interval notation to abbreviate [0123456789] to [0-9]. We can combine several intervals within one bracket and for example write [a-zA-Z] to denote all alphabetic letters in both lower and upper case.

When using intervals, we must be aware of the ordering for the symbols involved. For the digits and letters used above, there is usually no confusion. However, if we write, e.g., [0-z] it is not immediately clear what is meant. When using such notation in lexer generators, standard ASCII or ISO 8859-1 character sets are usually used, with the hereby implied ordering of symbols. To avoid confusion, we will use the interval notation only for intervals of digits or alphabetic letters.

Getting back to the example of integer constants above, we can now write this much shorter as [0-9][0-9]*. Since s* denotes zero or more occurrences of s, we needed to write the digits twice to describe that one or more digits are allowed. Such non-zero repetition is quite common, so we introduce another shorthand, s+, to denote one or more occurrences of s. With this notation, we can abbreviate our description of integers to [0-9]+. On a similar note, it is common that we can have zero or one occurrence of something (e.g., an optional sign to a number). Hence we introduce the shorthand s? for s|ε. + and ? bind with the same precedence as *.

  (r|s)|t = r|s|t = r|(s|t)
  s|t = t|s
  s|s = s
  s? = s|ε
  (rs)t = rst = r(st)
  sε = s = εs
  r(s|t) = rs|rt
  (r|s)t = rt|st
  (s*)* = s*
  s*s* = s*
  ss* = s+ = s*s

Figure 2.2: Some algebraic properties of regular expressions

We must stress that these shorthands are just that. They don't add anything to the set of languages we can describe, they just make it possible to describe a language more compactly. In the case of s+, it can even make an exponential difference: If + is nested n deep, recursive expansion of s+ to ss* yields 2^n - 1 occurrences of * in the expanded regular expression.

2.2.2 Examples

We have already seen how we can describe non-negative integer constants using regular expressions. Here are a few examples of other typical programming language elements:

Keywords. A keyword like if is described by a regular expression that looks exactly like that keyword, e.g., the regular expression if (which is the concatenation of the two regular expressions i and f).

Variable names. In the programming language C, a variable name consists of letters, digits and the underscore symbol and it must begin with a letter or underscore. This can be described by the regular expression [a-zA-Z_][a-zA-Z_0-9]*.

Integers. An integer constant is an optional sign followed by a non-empty sequence of digits: [+-]?[0-9]+. In some languages, the sign is a separate symbol and not part of the constant itself. This will allow whitespace between the sign and the number, which is not possible with the above.
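The notation used above is close enough to the regular-expression syntax of most programming languages that these token descriptions can be tried out directly. The book itself stays language neutral; the following sketch is merely an illustration using Python's re module, where re.fullmatch is used because a token must match the whole string, not just a prefix of it.

```python
import re

# The token descriptions from this section, written in Python's regex syntax.
integer_re = re.compile(r"[+-]?[0-9]+")
c_name_re  = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

# fullmatch: the whole string must be a token, not just a prefix of it.
for s in ["42", "-7", "+007", "abc", ""]:
    print(s, "is an integer:", integer_re.fullmatch(s) is not None)

for s in ["x", "_tmp1", "2bad", "if"]:
    print(s, "is a C variable name:", c_name_re.fullmatch(s) is not None)
```

Note that the keyword if also matches the variable-name expression; how a lexer chooses between such overlapping token definitions is discussed in section 2.9.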
Floats. A floating-point constant can have an optional sign. After this, the mantissa part is described as a sequence of digits followed by a decimal point and then another sequence of digits. Either one (but not both) of the digit sequences can be empty. Finally, there is an optional exponent part, which is the letter e (in upper or lower case) followed by an (optionally signed) integer constant. If there is an exponent part to the constant, the mantissa part can be written as an integer constant (i.e., without the decimal point). Some examples: 3.14, -3., .23, 3e+4 and 11.22e-3.

This rather involved format can be described by the following regular expression:

  [+-]?((([0-9]+.[0-9]*|.[0-9]+)([eE][+-]?[0-9]+)?)|[0-9]+[eE][+-]?[0-9]+)

This regular expression is complicated by the fact that the exponent is optional if the mantissa contains a decimal point, but not if it doesn't (as that would make the number an integer constant). We can make the description simpler if we make the regular expression for floats also include integers, and instead use other means of distinguishing integers from floats (see section 2.9 for details). If we do this, the regular expression can be simplified to

  [+-]?(([0-9]+(.[0-9]*)?|.[0-9]+)([eE][+-]?[0-9]+)?)

String constants. A string constant starts with a quotation mark followed by a sequence of symbols and finally another quotation mark. There are usually some restrictions on the symbols allowed between the quotation marks. For example, line-feed characters or quotes are typically not allowed, though these may be represented by special escape sequences of other characters, such as "\n\n" for a string containing two line-feeds. As a (much simplified) example, we can by the following regular expression describe string constants where the allowed symbols are alphanumeric characters and sequences consisting of the backslash symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric symbol):

  "([a-zA-Z0-9]|\[a-zA-Z])*"

2.3 Nondeterministic finite automata

In our quest to transform regular expressions into efficient programs, we use a stepping stone: Nondeterministic finite automata. By their nondeterministic nature, these are not quite as close to real machines as we would like, so we will later see how these can be transformed into deterministic finite automata, which are easily and efficiently executable on normal hardware.

A finite automaton is, in the abstract sense, a machine that has a finite number of states and a finite number of transitions between these. A transition between states is usually labelled by a character from the input alphabet, but we will also use transitions marked with ε, the so-called epsilon transitions.

A finite automaton can be used to decide if an input string is a member in some particular set of strings. To do this, we select one of the states of the automaton as the starting state. We start in this state and in each step, we can do one of the following:

• Follow an epsilon transition to another state, or

• Read a character from the input and follow a transition labelled by that character.

When all characters from the input are read, we see if the current state is marked as being accepting. If so, the string we have read from the input is in the language defined by the automaton.
We may have a choice of several actions at each step: We can choose between either an epsilon transition or a transition on an alphabet character, and if there are several transitions with the same symbol, we can choose between these. This makes the automaton nondeterministic, as the choice of action is not determined solely by looking at the current state and input. It may be that some choices lead to an accepting state while others do not. This does, however, not mean that the string is sometimes in the language and sometimes not: We will include a string in the language if it is possible to make a sequence of choices that makes the string lead to an accepting state.

You can think of it as solving a maze with symbols written in the corridors. If you can find the exit while walking over the letters of the string in the correct order, the string is recognized by the maze.

We can formally define a nondeterministic finite automaton by:

Definition 2.1 A nondeterministic finite automaton consists of a set S of states. One of these states, s0 ∈ S, is called the starting state of the automaton and a subset F ⊆ S of the states are accepting states. Additionally, we have a set T of transitions. Each transition t connects a pair of states s1 and s2 and is labelled with a symbol, which is either a character c from the alphabet Σ, or the symbol ε, which indicates an epsilon-transition. A transition from state s to state t on the symbol c is written as s -c-> t.

Starting states are sometimes called initial states and accepting states can also be called final states (which is why we use the letter F to denote the set of accepting states). We use the abbreviations FA for finite automaton, NFA for nondeterministic finite automaton and (later in this chapter) DFA for deterministic finite automaton.

We will mostly use a graphical notation to describe finite automata. States are denoted by circles, possibly containing a number or name that identifies the state. This name or number has, however, no operational significance, it is solely used for identification purposes. Accepting states are denoted by using a double circle instead of a single circle. The initial state is marked by an arrow pointing to it from outside the automaton. A transition is denoted by an arrow connecting two states. Near its midpoint, the arrow is labelled by the symbol (possibly ε) that triggers the transition. Note that the arrow that marks the initial state is not a transition and is, hence, not marked by a symbol.

Repeating the maze analogue, the circles (states) are rooms and the arrows (transitions) are one-way corridors. The double circles (accepting states) are exits, while the unmarked arrow to the starting state is the entrance to the maze.

Figure 2.3 shows an example of a nondeterministic finite automaton having three states. State 1 is the starting state and state 3 is accepting. There is an epsilon-transition from state 1 to state 2, transitions on the symbol a from state 2 to states 1 and 3 and a transition on the symbol b from state 1 to state 3. This NFA recognises the language described by the regular expression a*(a|b). As an example, the string aab is recognised by the following sequence of transitions:

  from 1 to 2 by ε
  from 2 to 1 by a
  from 1 to 2 by ε
  from 2 to 1 by a
  from 1 to 3 by b

At the end of the input we are in state 3, which is accepting. Hence, the string is accepted by the NFA. You can check this by placing a coin at the starting state and following the transitions by moving the coin.
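The observation that a string is accepted if some sequence of choices leads to an accepting state can be turned directly into a (slow) recogniser that tries all choices. The following sketch is not from the book; it encodes the NFA of figure 2.3 in Python and explores every path by backtracking, using None as the label of epsilon-transitions. This simple version assumes the NFA has no cycles of epsilon-transitions, which holds for this automaton.

```python
# NFA from figure 2.3: states 1, 2 and 3, start state 1, accepting state 3.
# Each transition is (source, label, target); label None means an epsilon-transition.
transitions = [(1, None, 2), (2, "a", 1), (2, "a", 3), (1, "b", 3)]
start, accepting = 1, {3}

def accepts(state, rest):
    """Try every possible sequence of choices from `state` on the remaining input."""
    if rest == "" and state in accepting:
        return True
    for (src, label, dst) in transitions:
        if src != state:
            continue
        if label is None and accepts(dst, rest):                   # epsilon: consume nothing
            return True
        if rest and label == rest[0] and accepts(dst, rest[1:]):   # consume one symbol
            return True
    return False

print(accepts(start, "aab"))   # True:  aab matches a*(a|b)
print(accepts(start, "abb"))   # False
```

Trying all paths like this requires backtracking and can be slow, which is exactly the problem the rest of the chapter addresses.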
Note that we sometimes have a choice of several transitions. If we are in state 2 and the next symbol is an a, we can, when reading this, either go to state 1 or to state 3. Likewise, if we are in state 1 and the next symbol is a b, we can either read this and go to state 3 or we can use the epsilon transition to go directly to state 2 without reading anything. If we in the example above had chosen to follow the a-transition to state 3 instead of state 1, we would have been stuck: We would have no legal transition and yet we would not be at the end of the input. But, as previously stated, it is enough that there exists a path leading to acceptance, so the string aab is still accepted.

A program that decides if a string is accepted by a given NFA will have to check all possible paths to see if any of these accepts the string. This requires either backtracking until a successful path is found or simultaneously following all possible paths, both of which are too time-consuming to make NFAs suitable for efficient recognisers. We will, hence, use NFAs only as a stepping stone between regular expressions and the more efficient DFAs. We use this stepping stone because it makes the construction simpler than direct construction of a DFA from a regular expression.

Figure 2.3: Example of an NFA (diagram of the three states and transitions described above)

2.4 Converting a regular expression to an NFA

We will construct an NFA compositionally from a regular expression, i.e., we will construct the NFA for a composite regular expression from the NFAs constructed from its subexpressions.

To be precise, we will from each subexpression construct an NFA fragment and then combine these fragments into bigger fragments. A fragment is not a complete NFA, so we complete the construction by adding the necessary components to make a complete NFA.

An NFA fragment consists of a number of states with transitions between these and additionally two incomplete transitions: One pointing into the fragment and one pointing out of the fragment. The incoming half-transition is not labelled by a symbol, but the outgoing half-transition is labelled by either ε or an alphabet symbol. These half-transitions are the entry and exit to the fragment and are used to connect it to other fragments or additional glue states.

Construction of NFA fragments for regular expressions is shown in figure 2.4. The construction follows the structure of the regular expression by first making NFA fragments for the subexpressions and then joining these to form an NFA fragment for the whole regular expression. The NFA fragments for the subexpressions are shown as dotted ovals with the incoming half-transition on the left and the outgoing half-transition on the right.

When an NFA fragment has been constructed for the whole regular expression, the construction is completed by connecting the outgoing half-transition to an accepting state. The incoming half-transition serves to identify the starting state of the completed NFA. Note that even though we allow an NFA to have several accepting states, an NFA constructed using this method will have only one: the one added at the end of the construction.

An NFA constructed this way for the regular expression (a|b)*ac is shown in figure 2.5. We have numbered the states for future reference.
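Figure 2.4 itself cannot be reproduced here, but the compositional idea it describes can be sketched in code. The fragment below is an illustration rather than the book's own algorithm: it represents regular expressions as nested tuples and follows the common variant in which every fragment gets an explicit entry and exit state, which is slightly more wasteful than the half-transition trick used in figure 2.4 but easier to write down. Epsilon-transitions are labelled None.

```python
# Regular expressions as tuples:
#   ("char", c), ("eps",), ("alt", s, t), ("cat", s, t), ("star", s)
# The NFA is built as a list of (source, label, target) transitions.

class Builder:
    def __init__(self):
        self.transitions = []
        self.next_state = 0

    def new_state(self):
        self.next_state += 1
        return self.next_state

    def build(self, r):
        """Return the (entry, exit) states of an NFA fragment for r."""
        kind = r[0]
        if kind == "char":
            s, t = self.new_state(), self.new_state()
            self.transitions.append((s, r[1], t))
            return s, t
        if kind == "eps":
            s = self.new_state()
            return s, s
        if kind == "cat":
            s1, t1 = self.build(r[1])
            s2, t2 = self.build(r[2])
            self.transitions.append((t1, None, s2))   # glue the two fragments together
            return s1, t2
        if kind == "alt":
            s, t = self.new_state(), self.new_state()
            s1, t1 = self.build(r[1])
            s2, t2 = self.build(r[2])
            self.transitions += [(s, None, s1), (s, None, s2), (t1, None, t), (t2, None, t)]
            return s, t
        if kind == "star":
            s = self.new_state()
            s1, t1 = self.build(r[1])
            self.transitions += [(s, None, s1), (t1, None, s)]  # loop back for repetition
            return s, s

# (a|b)*ac, the expression used for figure 2.5.
b = Builder()
entry, accept_state = b.build(
    ("cat", ("cat", ("star", ("alt", ("char", "a"), ("char", "b"))), ("char", "a")), ("char", "c")))
# accept_state becomes the single accepting state, entry the starting state.
```

Applying this to (a|b)*ac gives an automaton for the same language as the one in figure 2.5, though with more states than the book's construction produces.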
2.4.1 Optimisations

We can use the construction in figure 2.4 for any regular expression by expanding out all shorthand, e.g. converting s+ to ss*, [0-9] to 0|1|2|···|9 and s? to s|ε, etc. However, this will result in very large NFAs for some expressions, so we use a few optimised constructions for the shorthands. Additionally, we show an alternative construction for the regular expression ε. This construction doesn't quite follow the formula used in figure 2.4, as it doesn't have two half-transitions. Rather, the line-segment notation is intended to indicate that the NFA fragment for ε just connects the half-transitions of the NFA fragments that it is combined with. In the construction for [0-9], the vertical ellipsis is meant to indicate that there is a transition for each of the digits in [0-9]. This construction generalises in the obvious way to other sets of characters, e.g., [a-zA-Z0-9]. We have not shown a special construction for s?, as s|ε will do fine if we use the optimised construction for ε.

The optimised constructions are shown in figure 2.6. As an example, an NFA for [0-9]+ is shown in figure 2.7. Note that while this is optimised, it is not optimal. You can make an NFA for this language using only two states.

Figure 2.6: Optimised NFA construction for regular expression shorthands (diagram)

Figure 2.7: Optimised NFA for [0-9]+ (diagram)

2.5 Deterministic finite automata

Nondeterministic automata are, as mentioned earlier, not quite as close to the machine as we would like. Hence, we now introduce a more restricted form of finite automaton: The deterministic finite automaton, or DFA for short. DFAs are NFAs, but obey a number of additional restrictions:

• There are no epsilon-transitions.

• There may not be two identically labelled transitions out of the same state.

This means that we never have a choice of several next-states: The state and the next input symbol uniquely determine the transition (or lack of same). This is why these automata are called deterministic. The transition relation is now a (partial) function, and we often write it as such: move(s, c) is the state (if any) that is reached from state s by a transition on the symbol c. If there is no such transition, move(s, c) is undefined.

It is very easy to implement a DFA: A two-dimensional table can be cross-indexed by state and symbol to yield the next state (or an indication that there is no transition), essentially implementing the move function by table lookup. Another (one-dimensional) table can indicate which states are accepting.

DFAs have the same expressive power as NFAs: A DFA is a special case of NFA and any NFA can (as we shall shortly see) be converted to an equivalent DFA. However, this comes at a cost: The resulting DFA can be exponentially larger than the NFA (see section 2.10). In practice (i.e., when describing tokens for a programming language) the increase in size is usually modest, which is why most lexical analysers are based on DFAs.
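The table-driven implementation described above is small enough to show concretely. The sketch below is not from the book; it uses a dictionary instead of a two-dimensional array, but implements the same move-by-lookup idea, here for the two-state DFA that recognises [0-9]+.

```python
# A DFA for [0-9]+ needs only two states: 1 (start) and 2 (accepting).
# move is a partial function from (state, symbol); a missing entry means "no transition".
digits = "0123456789"
move = {(1, d): 2 for d in digits}
move.update({(2, d): 2 for d in digits})
accepting = {2}

def run_dfa(start, string):
    state = start
    for c in string:
        if (state, c) not in move:   # undefined move: reject immediately
            return False
        state = move[(state, c)]
    return state in accepting

print(run_dfa(1, "2008"))   # True
print(run_dfa(1, ""))       # False: [0-9]+ requires at least one digit
print(run_dfa(1, "12a"))    # False
```

Because there is never a choice to explore, the running time is proportional to the length of the input, regardless of the size of the automaton.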
The conversion is done by simulating all possible paths in an NFA at once. This means that we operate with sets of NFA states: When we have several choices of a next state, we take all of the choices simultaneously and form a set of the possible next-states. The idea is that such a set of NFA states will become a single DFA state. For any given symbol we form the set of all possible next-states in the NFA, so we get a single transition (labelled by that symbol) going from one set of NFA states to another set. Hence, the transition becomes deterministic in the DFA that is formed from the sets of NFA states.

Epsilon-transitions complicate the construction a bit: Whenever we are in an NFA state we can always choose to follow an epsilon-transition without reading any symbol. Hence, given a symbol, a next-state can be found by either following a transition with that symbol or by first doing any number of epsilon-transitions and then a transition with the symbol. We handle this in the construction by first closing the set of NFA states under epsilon-transitions and then following transitions with input symbols.

We define the epsilon-closure of a set of states as the set extended with all states that can be reached from these using any number of epsilon-transitions. More formally:

Definition 2.2 Given a set M of NFA states, we define ε-closure(M) to be the least (in terms of the subset relation) solution to the set equation

ε-closure(M) = M ∪ {t | s ∈ ε-closure(M) and s →ε t ∈ T}

where T is the set of transitions in the NFA.

We will later on see several examples of set equations like the one above, so we use some time to discuss how such equations can be solved.

2.6.1 Solving set equations

In general, a set equation over a single set-valued variable X has the form

X = F(X)

where F is a function from sets to sets. Not all such equations are solvable, so we will restrict ourselves to special cases, which we will describe below. We will use the calculation of ε-closure(M) as the driving example.

In definition 2.2, ε-closure(M) is the value we have to find, so we replace this by X and get the equation

X = M ∪ {t | s ∈ X and s →ε t ∈ T}

and hence

F(X) = M ∪ {t | s ∈ X and s →ε t ∈ T}

This function has a property that is essential to our solution method: If X ⊆ Y then F(X) ⊆ F(Y). We say that F is monotonic. Note that F(X) is not ε-closure(X) and that F depends on M, so a new F is required for each M that we want to find the epsilon-closure of.

There may be several solutions to this equation. For example, if the NFA has a pair of states that connect to each other by epsilon-transitions, adding this pair to a solution that does not already include the pair will create a new solution. The epsilon-closure of M is the least solution to the equation (i.e., the smallest set that satisfies the equation).

When we have an equation of the form X = F(X) and F is monotonic, we can find the least solution to the equation in the following way: We first guess that the solution is the empty set and check to see if we are right: We compare ∅ with F(∅). If these are equal, we are done and ∅ is the solution. If not, we use the following properties:

• The least solution S to the equation satisfies S = F(S).

• ∅ ⊆ S implies that F(∅) ⊆ F(S).

to conclude that F(∅) ⊆ S. Hence, F(∅) is a new guess at S. We now form the chain

∅ ⊆ F(∅) ⊆ F(F(∅)) ⊆ ...
If at any point an element in the sequence is identical to the previous, we have a xed-point, i.e., a set S such that S = F (S). This xed-point of the sequence will be the least (in terms of set inclusion) solution to the equation. This isn't dicult to verify, but we will omit the details. Since we are iterating a function until we reach a xed-point, we call this process xed-point iteration. If we are working with sets over a nite domain (e.g., sets of NFA states), we will eventually reach a xed-point, as there can be no innite chain of strictly increasing sets. We can use this method for calculating the epsilon-closure of the set {1} with respect to the NFA shown in gure 2.5. We use a version of where M = {1}, so we start by calculating F (∅) = {1} ∪ {t | s ∈ ∅ = {1} As ∅= 6 {1}, and s t ∈ T } we continue. F ({1}) = {1} ∪ {t | s ∈ {1} and s t ∈ T } = {1} ∪ {2, 5} = {1, 2, 5} F ({1, 2, 5}) = {1} ∪ {t | s ∈ {1, 2, 5} and s t ∈ T } = {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7} F ({1, 2, 5, 6, 7}) = {1} ∪ {t | s ∈ {1, 2, 5, 6, 7} and s t ∈ T } = {1} ∪ {2, 5, 6, 7} = {1, 2, 5, 6, 7} F CHAPTER 2. LEXICAL ANALYSIS 26 We have now reached a xed-point and found our solution. Hence, we conclude that -closure({1}) = {1, 2, 5, 6, 7}. We have done a good deal of repeated calculation in the iteration above: We have calculated the epsilon-transitions from state 1 three times and those from state 2 and 5 twice each. We can make an optimised xed-point iteration by exploiting that the function is not only monotonic, but also distributive: F (X ∪Y ) = F (X)∪F (Y ). This means that, when we during the iteration add elements to our set, we in the next iteration need only calculate F for the new elements and add the result to the set. In the example above, we get F (∅) = = = = F ({1}) {1} ∪ {t | s ∈ ∅ and s t ∈ T } {1} {1} ∪ {t | s ∈ {1} and s t ∈ T } {1} ∪ {2, 5} = {1, 2, 5} = F ({1}) ∪ F ({2, 5}) = {1, 2, 5} ∪ ({1} ∪ {t | s ∈ {2, 5} and s t ∈ T }) = {1, 2, 5} ∪ ({1} ∪ {6, 7}) = {1, 2, 5, 6, 7} F ({1, 2, 5}) F ({1, 2, 5, 6, 7}) = F ({1, 2, 5}) ∪ F ({6, 7}) = {1, 2, 5, 6, 7} ∪ ({1} ∪ {t | s ∈ {6, 7} and s t ∈ T }) = {1, 2, 5, 6, 7} ∪ ({1} ∪ ∅) = {1, 2, 5, 6, 7} We can use this principle to formulate a work-list algorithm for nding the least xed-points for distributive functions. The idea is that we stepby-step build a set that eventually becomes our solution. step we calculate F (∅). The elements in this initial set are In each subsequent step, we take an unmarked element mark it and add F ({x}) x In the rst unmarked. from the set, (unmarked) to the set. Note that if an element already occurs in the set (marked or not), it is not added again. When, eventually, all elements in the set are marked, we are done. This is perhaps best illustrated by an example (the same as before). We start by calculating F (∅) = {1}. The element 1 is unmarked, so we F ({1}) and add the new elements 2 and pick this, mark it and calculate 5 to the set. As we continue, we get this sequence of sets: 2.6. CONVERTING AN NFA TO A DFA 27 {1} √ { 1√ , 2,√ 5} { 1√ , 2√ , 5} √ { 1√ , 2√ , 5√ , 6,√ 7} { 1√ , 2√ , 5√ , 6√ , 7} √ {1, 2, 5, 6, 7} We will later also need to solve simultaneous equations over sets, i.e., several equations over several sets. These can also be solved by xedpoint iteration in the same way as single equations, though the work-list version of the algorithm becomes a bit more complicated. 
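The work-list computation of an epsilon-closure is short enough to sketch directly in Python. The epsilon-transitions used below are an assumption, chosen so that the closure of {1} comes out as {1, 2, 5, 6, 7} as in the example above; they are not read off figure 2.5 literally.

# A sketch of the work-list computation of epsilon-closures.
eps = {          # assumed epsilon-transitions: state -> set of states
    1: {2, 5},
    5: {6, 7},
}

def epsilon_closure(states):
    closure = set(states)          # start from M itself
    worklist = list(states)        # the unmarked elements
    while worklist:
        s = worklist.pop()         # pick and "mark" an element
        for t in eps.get(s, ()):   # follow its epsilon-transitions
            if t not in closure:   # add only genuinely new states
                closure.add(t)
                worklist.append(t)
    return closure

if __name__ == "__main__":
    print(sorted(epsilon_closure({1})))   # [1, 2, 5, 6, 7]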
2.6.2 The subset construction After this brief detour into the realm of set equations, we are now ready to continue with our construction of DFAs from NFAs. The construction is called the subset construction, as each state in the DFA is a subset of the states from the NFA. Algorithm 2.3 (The subset construction) Given an NFA N with states S , starting state s0 ∈ S , accepting states F ⊆ S , transitions T and alphabet Σ, we construct an equivalent DFA D with states S 0 , starting state s00 , accepting states F 0 and a transition function move by: s00 move(s0 , c) S0 F0 = = = = -closure({s0 }) -closure({t | s ∈ s0 and sc t ∈ T }) {s00 } ∪ {move(s0 , c) | s0 ∈ S 0 , c ∈ Σ} {s0 ∈ S 0 | s0 ∩ F 6= ∅} The DFA uses the same alphabet as the NFA. A little explanation: • The starting state of the DFA is the epsilon-closure of the set containing just the starting state of the NFA, i.e., the states that are reachable from the starting state by epsilon-transitions. CHAPTER 2. LEXICAL ANALYSIS 28 • A transition in the DFA is done by nding the set of NFA states that comprise the DFA state, following all transitions (on the same symbol) in the NFA from all these NFA states and nally combining the resulting sets of states and closing this under epsilon transitions. • The set S0 of states in the DFA is the set of DFA states that can be reached from s00 using the move function. S0 is dened as a set equation which can be solved as described in section 2.6.1. • A state in the DFA is an accepting state if at least one of the NFA states it contains is accepting. As an example, we will convert the NFA in gure 2.5 to a DFA. The initial state in the DFA is -closure({1}), which we have already 0 calculated to be s0 = {1, 2, 5, 6, 7}. This is now entered into the set 0 S of DFA states as unmarked (following the work-list algorithm from section 2.6.1). We now pick an unmarked element from the uncompleted S0. We 0 have only one choice: s0 . We now mark this and calculate the transitions for it. We get move(s00 , a) = = = = -closure({t | s ∈ {1, 2, 5, 6, 7} -closure({3, 8}) {3, 8, 1, 2, 5, 6, 7} s01 move(s00 , b) = = = = -closure({t | s ∈ {1, 2, 5, 6, 7} -closure({8}) {8, 1, 2, 5, 6, 7} s02 move(s00 , c) = -closure({t | s ∈ {1, 2, 5, 6, 7} = -closure({}) = {} and sa t ∈ T }) and sb t ∈ T }) and sc t ∈ T }) Note that the empy set of NFA states is not an DFA state, so there will be no transition from s00 on c. 0 0 0 We now add s1 and s2 to our incomplete S , which now is 0 We now pick s1 , mark it and calculate its transitions: √ {s00 , s01 , s02 }. 2.6. CONVERTING AN NFA TO A DFA 29 move(s01 , a) = = = = -closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} -closure({3, 8}) {3, 8, 1, 2, 5, 6, 7} s01 move(s01 , b) = = = = -closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} -closure({8}) {8, 1, 2, 5, 6, 7} s02 move(s01 , c) = = = = -closure({t | s ∈ {3, 8, 1, 2, 5, 6, 7} -closure({4}) {4} s03 and sa t ∈ T }) and sb t ∈ T }) and sc t ∈ T }) √ We have seen 0 next pick s2 : s01 and s02 before, so only s03 is added: √ {s00 , s01 , s02 , s03 }. move(s02 , a) = = = = -closure({t | s ∈ {8, 1, 2, 5, 6, 7} -closure({3, 8}) {3, 8, 1, 2, 5, 6, 7} s01 move(s02 , b) = = = = -closure({t | s ∈ {8, 1, 2, 5, 6, 7} -closure({8}) {8, 1, 2, 5, 6, 7} s02 move(s02 , c) = -closure({t | s ∈ {8, 1, 2, 5, 6, 7} = -closure({}) = {} and sa t ∈ T }) and sb t ∈ T }) and sc t ∈ T }) We No new elements are added, so we pick the remaining unmarked element s03 : CHAPTER 2. 
LEXICAL ANALYSIS 30 a  U a * s01 c    PP   q  P  - s0 a b s03 0      H HH b j s0 2  M b Figure 2.8: DFA constructed from the NFA in gure 2.5 move(s03 , a) = -closure({t | s ∈ {4} = -closure({}) = {} move(s03 , b) = -closure({t | s ∈ {4} = -closure({}) = {} move(s03 , c) = -closure({t | s ∈ {4} = -closure({}) = {} Which now completes the construction of and sa t ∈ T }) and sb t ∈ T }) and sc t ∈ T }) S 0 = {s00 , s01 , s02 , s03 }. Only s03 contains the accepting NFA state 4, so this is the only accepting state of our DFA. Figure 2.8 shows the completed DFA. 2.7 Size versus speed In the above example, we get a DFA with 4 states from an NFA with 8 states. However, as the states in the constructed DFA are (nonempty) sets of states from the NFA there may potentially be DFA constructed from an 2n − 1 states in a n-state NFA. It is not too dicult to construct 2.8. MINIMISATION OF DFAS 31 classes of NFAs that expand exponentially in this way when converted to DFAs, as we shall see in section 2.10.1. Since we are mainly interested in NFAs that are constructed from regular expressions as in section 2.4, we might ask ourselves if these might not be in a suitably simple class that do not risk exponential-sized DFAs. Alas, this is not the case. Just as we can construct a class of NFAs that expand exponentially, we can construct a class of regular expressions where the smallest equivalent DFAs are exponentially larger. This happens rarely when we use regular expressions or NFAs to describe tokens in programming languages, though. It is possible to avoid the blow-up in size by operating directly on regular expressions or NFAs when testing strings for inclusion in the languages these dene. However, there is a speed penalty for doing so. A DFA can be run in time string v and 1 k k ∗ |v|, where |v| is the length of the input is a small constant that is independent of the size of the DFA . Regular expressions and NFAs can be run in time close to c ∗ |N | ∗ |v|, where and the constant c |N | is the size of the NFA (or regular expression) typically is larger than k. All in all, DFAs are a lot faster to use than NFAs or regular expressions, so it is only when the size of the DFA is a real problem that one should consider using NFAs or regular expressions directly. 2.8 Minimisation of DFAs Even though the DFA in gure 2.8 has only four states, it is not minimal. It is easy to see that states s00 and s02 are equivalent: Neither are accepting and they have identical transitions. We can hence collapse these states into a single state and get a three-state DFA. DFAs constructed from regular expressions through NFAs are often non-minimal, though they are rarely very far from being minimal. Nevertheless, minimising a DFA is not terribly dicult and can be done fairly fast, so many lexer generators perform minimisation. An interesting property of DFAs is that any regular language (a language that can be expressed by a regular expression, NFA or DFA) has a unique minimal DFA. Hence, we can decide equivalence of regular expressions (or NFAs or DFAs) by converting both to minimal DFAs and compare the results. As hinted above, minimisation of DFAs is done by collapsing equivalent states. However, deciding whether two states are equivalent is not 1 If we don't consider the eects of cache-misses etc. CHAPTER 2. LEXICAL ANALYSIS 32 just done by testing if their immediate transitions are identical, since transitions to dierent states may be equivalent if the target states turn out to be equivalent. 
Hence, we use a strategy where we rst assume all states to be equivalent and then separate them only if we can prove them dierent. We use the following rules for this: • • An accepting state is not equivalent to a non-accepting state. s1 and s2 have transitions on the same symbol c to t1 and t2 that we have already proven to be dierent, then s1 and s2 are dierent. This also applies if only one of s1 or s2 have a dened transition on c. If two states states This leads to the following algorithm. Algorithm 2.4 (DFA minimisation) Given a DFA D over the al- phabet Σ with states S where F ⊆ S is the set of the accepting states, we construct a minimal DFA D0 where each state is a group of states from D. The groups in the minimal DFA are consistent: For any pair of states s1 , s2 in the same group G and any symbol c, move(s1 , c) is in the same group G0 as move(s2 , c) or both are undened. 1) We start with two groups: F and S \ F . These are unmarked. 2) We pick any unmarked group G and check if it is consistent. If it is, we mark it. If G is not consistent, we split it into maximal consistent subgroups and replace G by these. All groups are then unmarked. 3) If there are no unmarked groups left, we are done and the remaining groups are the states of the minimal DFA. Otherwise, we go back to step 2. The starting state of the minimal DFA is the group that contains the original starting state and any group of accepting states is an accepting state in the minimal DFA. The time needed for minimisation using algorithm 2.4 depends on the strategy used for picking groups in step 2. With random choices, the worst case is quadratic in the size of the DFA, but there exist strategies for choosing groups and data structures for representing these that guarantee a worst-case time that is O(n ∗ log(n)), where n is the number of states in the (non-minimal) DFA. In other words, the method can be 2.8. MINIMISATION OF DFAS 33 b start -     + b  a a 0 1 2 3       6 a b b ?   b - 5 4    a a a b a ?    6 7    Figure 2.9: Non-minimal DFA implemented so it uses little more than linear time to do minimisation. We will not here go into further detail but just refer to [3] for the optimal algorithm. We will, however, note that we can make a slight optimisation to algorithm 2.4: A group that consists of a single state need never be split, so we need never select such in step 2, and we can stop when all unmarked groups are singletons. 2.8.1 Example As an example of minimisation, take the DFA in gure 2.9. We now make the initial division into two groups: The accepting and the non-accepting states. G1 = {0, 6} G2 = {1, 2, 3, 4, 5, 7} These are both unmarked. We next pick any unmarked group, say G1 . To check if this is consistent, we make a table of its transitions: G1 a b 0 G2 − 6 G2 − This is consistent, so we just mark it and select the remaining unmarked group G2 and make a table for this CHAPTER 2. LEXICAL ANALYSIS 34 G2 1 2 3 4 5 7 G2 is evidently not a G2 G2 − G1 G2 G1 b G2 G2 G2 G2 G2 G2 consistent, so we split it into maximal consistent subgroups and erase all marks (including the one on G1 G3 G4 G5 We now pick G3 = = = = G1 ): {0, 6} {1, 2, 5} {3} {4, 7} for consideration: G3 a b 1 G5 G3 2 G4 G3 5 G5 G3 This isn't consistent either, so we split again and get G1 G4 G5 G6 G7 We now pick G5 = = = = = {0, 6} {3} {4, 7} {1, 5} {2} and check this: G5 a b 4 G1 G6 7 G1 G6 This is consistent, so we mark it and pick another group, say, b G6 a 1 G5 G7 5 G5 G7 G6 : 2.8. 
MINIMISATION OF DFAS 35 b start     + b  a q - G1 a - G6 G G4 7 i    b   I @ M @ a b a @ N G5  Figure 2.10: Minimal DFA This, also, is consistent, so we have only one unmarked non-singleton group left: G1 . G1 a b 0 G6 − 6 G6 − As we mark this, we see that there are no unmarked groups left (except the singletons). Hence, the groups form a minimal DFA equivalent to the one in gure 2.9. The minimised DFA is shown in gure 2.10. 2.8.2 Dead states Algorithm 2.4 works under some, as yet, unstated assumptions: • The move function is total, i.e., there are transitions on all symbols or from all states, • There are no dead states in the DFA. A dead state is a state from which no accepting state can be reached. Such do not occur in DFAs constructed from NFAs without dead states, and NFAs with dead states can not be constructed from regular expressions by the method shown in section 2.4. Hence, as long as we use minimisation only on DFAs constructed by this process, we are safe. CHAPTER 2. LEXICAL ANALYSIS 36 However, if we get a DFA of unknown origin, we risk that it may contain both dead states and undened transitions. A transition to a dead state should rightly be equivalent to an undened transition, as neither can yield future acceptance. The only dierence is that we discover this earlier on an undened transition than when we make a transition to a dead state. However, algorithm 2.4 will treat these dierently and may hence decree a group to be inconsistent even though it is not. This will make the algorithm split a group that doesn't need to be split, hence producing a non-minimal DFA. Consider, for example, the following DFA: a start    s b- 1 2 3      k a States 1 and 2 are, in fact, equivalent, as starting from either one, any sequence of a's (and no other sequences) will lead to an accepting state. A minimal equivalent DFA has only one accepting state with a transition to itself on a. But algorithm 2.4 will see a transition on b out of state 2 but no transition on b out of state 1, so it will not keep states 1 and 2 in the same group. As a result, no reduction in the DFA is made. There are two solutions to this problem: 1) Make sure there are no dead states. This can be ensured by invariant, as is the case for DFAs constructed by the methods shown in this chapter, or by explicitly removing dead states before minimisation. Dead states can be found by a simple reachability analysis for directed graphs. In the above example, state 3 is dead and can be removed (including the transition to it). This makes states 1 and 2 stay in the same group. 2) Make sure there are no undened transitions. This can be achieved by adding a new dead state (which has transitions to itself on all symbols) and replacing all undened transitions by transitions to this dead state. After minimisation, the group that contains the dead state will contain all dead states from the original DFA. This group can now be removed from the minimal DFA (which will once more have undened transitions). In the above example, a new (non-accepting) state 4 has to be added. State 1 has a transition to state 4 on b and state 3 has a transition to state 4 on a. State 2.9. LEXERS AND LEXER GENERATORS 37 4 has transitions to itself on both a and b. After minimisation, state 1 and 2 will be joined, as will state 3 and 4. Since state 4 is dead, all states joined with it are also dead, so we can remove the combined state 3 and 4 from the resulting minimized automata. 
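The group refinement of algorithm 2.4 can be sketched as follows. The representation (a move dictionary, groups held as sets of states) is one possible choice, not the book's notation, and an undefined transition is given the signature None, so that a state with a transition on a symbol is never grouped with a state lacking one, as the splitting rules require. The example run uses the DFA from figure 2.8, where states s0 and s2 turn out to be equivalent.

def minimise(states, alphabet, move, accepting):
    groups = [set(accepting), set(states) - set(accepting)]
    groups = [g for g in groups if g]        # drop empty initial groups

    def group_of(s):
        for i, g in enumerate(groups):
            if s in g:
                return i
        return None                          # s is None: undefined transition

    changed = True
    while changed:
        changed = False
        for g in list(groups):
            # split g into maximal consistent subgroups: states whose
            # transitions lead to the same groups stay together
            split = {}
            for s in g:
                key = tuple(group_of(move.get((s, c))) for c in alphabet)
                split.setdefault(key, set()).add(s)
            if len(split) > 1:
                groups.remove(g)
                groups.extend(split.values())
                changed = True               # all groups are then "unmarked"
                break
    return groups

if __name__ == "__main__":
    # the DFA from figure 2.8; s0 and s2 end up in the same group
    move = {("s0", "a"): "s1", ("s0", "b"): "s2",
            ("s1", "a"): "s1", ("s1", "b"): "s2", ("s1", "c"): "s3",
            ("s2", "a"): "s1", ("s2", "b"): "s2"}
    print(minimise(["s0", "s1", "s2", "s3"], "abc", move, ["s3"]))
    # three groups: {s3}, {s0, s2}, {s1}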
2.9 Lexers and lexer generators

We have, in the previous sections, seen how we can convert a language description written as a regular expression into an efficiently executable representation (a DFA). This is the heart of a lexer generator, but not the full story. There are several additional issues, which we address below:

• A lexer has to distinguish between several different types of tokens, e.g., numbers, variables and keywords. Each of these is described by its own regular expression.

• A lexer does not check if its entire input is included in the languages defined by the regular expressions. Instead, it has to cut the input into pieces (tokens), each of which is included in one of the languages.

• If there are several ways to split the input into legal tokens, the lexer has to decide which of these it should use.

We do not wish to scan the input repeatedly, once for every type of token, as this can be quite slow. Hence, we wish to generate a DFA that tests for all the token types simultaneously. This isn't too difficult: If the tokens are defined by the regular expressions r1, r2, ..., rn, then the regular expression r1 | r2 | ... | rn describes the union of the languages, and the DFA constructed from it will scan for all token types at the same time.

However, we also wish to distinguish between different token types, so we must be able to know which of the many tokens was recognised by the DFA. The easiest way to do this is:

1) Construct NFAs N1, N2, ..., Nn for each of r1, r2, ..., rn.

2) Mark the accepting states of the NFAs by the name of the tokens they accept.

3) Combine the NFAs to a single NFA by adding a new starting state which has epsilon-transitions to each of the starting states of the NFAs.

4) Convert the combined NFA to a DFA.

5) Each accepting state of the DFA consists of a set of NFA states, some of which are accepting states that we marked by token type in step 2. These marks are used to mark the accepting states of the DFA, so each of these will indicate the token types it accepts.

If the same accepting state in the DFA can accept several different token types, it is because these overlap. This is not unusual, as keywords usually overlap with variable names and a description of floating-point constants may include integer constants as well. In such cases, we can do one of two things:

• Let the lexer generator generate an error and require the user to make sure the tokens are disjoint.

• Let the user of the lexer generator choose which of the tokens is preferred.

It can be quite difficult (though always possible) with regular expressions to define, e.g., the set of names that are not keywords. Hence, it is common to let the lexer choose according to a prioritised list. Normally, the order in which tokens are defined in the input to the lexer generator indicates priority (earlier defined tokens take precedence over later defined tokens). Hence, keywords are usually defined before variable names, which means that, for example, the string "if" is recognised as a keyword and not a variable name. When an accepting state in a DFA contains accepting NFA states with different marks, the mark corresponding to the highest-priority (earliest defined) token is used. Hence, we can simply erase all but one mark from each accepting state. This is a very simple and effective solution to the problem.

When we described minimisation of DFAs, we used two initial groups: One for the accepting states and one for the non-accepting states.
As there are now several kinds of accepting states (one for each token), we must use one group for each token, so we will have a total of groups when we have n n + 1 initial dierent tokens. To illustrate the precedence rule, gure 2.11 shows an NFA made by combining NFAs for variable names, the keyword if, integers and 2.9. LEXERS AND LEXER GENERATORS  2   i - 3  39 IF f - l 4  [a-zA-Z 0-9]    - 1   5   [a-zA-Z ]  - 6l  ID   - [+-]- [0-9] - 9l 8 7   *Y NUM   [0-9]  [0-9]   j [+-]- [0-9] - 12l . - 13l 10 11    * FLOAT FLOAT  . [eE] [eE] ?[0-9]  [eE]  N ?    - 15l - 16 [+-]- 17 [0-9] - 18l 14 Y    *Y FLOAT FLOAT    Figure 2.11: Combined NFA for several tokens oats, as described by the regular expressions in section 2.2.2. The individual NFAs are (simplied versions of ) what you get from the method described in section 2.4. When a transition is labelled by a set of characters, it is a shorthand for a set of transitions each labelled by a single character. The accepting states are labelled with token names as described above. ure 2.12. The corresponding minimised DFA is shown in g- Note that state G is a combination of states 9 and 12 from NUM and FLOAT, but since integers take priority over oats, we have marked G with NUM only. the NFA, so it can accept both Splitting the input stream As mentioned, the lexer must cut the input into tokens. This may be done in several ways. For example, the string dierent ways: if17 can be split in many CHAPTER 2. LEXICAL ANALYSIS 40 IF l C [a-zA-Z 0-9]  6 [a-zA-Z 0-9] f U ID l[a-eg-zA-Z 0-9] - Dl B  *ID 6 [a-hj-zA-Z ] i   . - E - A   @  . @[+-] @  R @ [0-9] [0-9] F  [0-9] [0-9] ? NUM l . G   @ [eE] @ @  [0-9] R @ I  @ [+-]  J  ?  - Hl  [eE] FLOAT @[0-9] @  R @ [0-9] - Kl  [0-9] FLOAT Figure 2.12: Combined DFA for several tokens 2.9. LEXERS AND LEXER GENERATORS 41 if17. • As one token, which is the variable name • As the variable name • As the keyword if followed by the number • As the keyword if followed by the numbers • As the variable name • And several more. if1 i followed by the number 7. 17. 1 and 7. followed by the variable name f17. A common convention is that it is the longest prex of the input that matches any token which will be chosen. Hence, the rst of the above possible splittings of if17 will be chosen. Note that the principle of the longest match takes precedence over the order of denition of tokens, so even though the string starts with the keyword if, which has higher priority than variable names, the variable name is chosen because it is longer. Modern languages like C, Java or SML follow this convention, and so do most lexer generators, but some (mostly older) languages like FORTRAN do not. When other conventions are used, lexers must either be written by hand to handle these conventions or the conventions used by the lexer generator must be side-stepped. Some lexer generators allow the user to have some control over the conventions used. The principle of the longest matching prex is handled by letting the DFA read as far as it can, until it either reaches the end of the input or no transition is dened on the next input symbol. If the current state at this point is accepting, we are in luck and can simply output the corresponding token. If not, we must go back to the last time we were in an accepting state and output the token indicated by this. The characters read since then are put back in the input stream. 
The lexer must hence retain the symbols it has read since the last accepting state so it can re-insert these in the input in such situations. If we are not at the end of the input stream, we restart the DFA (in it sinitial state) on the remaining input to nd the next tokens. As an example, consider lexing of the string 3e-y with the DFA in gure 2.12. We get to the accepting state G after reading the digit However, we can continue making legal transitions to state I on then to state J on - e 3. and (as these could be the start of the exponent part of a real number). It is only when we, in state J, nd that there is no transition on y that we realise that this isn't the case. We must now go back to the last accepting state (G) and output the number 3 as the CHAPTER 2. LEXICAL ANALYSIS 42 rst token and re-insert with e-y - and e in the input stream, so we can continue when we look for the subsequent tokens. Lexical errors If no prex of the input string forms a valid token, a lexical error has occurred. When this happens, the lexer will usually report an error. At this point, it may stop reading the input or it may attempt continued lexical analysis by skipping characters until a valid prex is found. The purpose of the latter approach is to try nding further lexical errors in the same input, so several of these can be corrected by the user before re-running the lexer. Some of these subsequent errors may, however, not be real errors but may be caused by the lexer not skipping enough characters (or skipping too many) after the rst error is found. If, for example, the start of a comment is ill-formed, the lexer may try to interpret the contents of the comment as individual tokens, and if the end of a comment is ill-formed, the lexer will read until the end of the next comment (if any) before continuig, hence skipping too much text. When the lexer nds an error, the consumer of the tokens that the lexer produces (e.g., the rest of the compiler) can not usually itself produce a valid result. However, the compiler may try to nd other errors in the remaining input, again allowing the user to nd several errors in one edit-compile cycle. Again, some of the subsequent errors may really be spurious errors caused by lexical error(s), so the user will have to guess at the validity of every error message except the rst, as only the rst error message is guaranteed to be a real error. Nevertheless, such recovery error has, when the input is so large that restarting the lexer from the start of input incurs a considerable time overhead, proven to be an aid in productivity by locating more errors in less time. Less commonly, the lexer may work interactively with a text editor and restart from the point at which an error was spotted after the user has tried to x the error. 2.9.1 Lexer generators A lexer generator will typically use a notation for regular expressions similar to the one described in section 2.1, but may require alphabetcharacters to be quoted to distinguish them from the symbols used to * intended to match a multi* used to denote repetition by quoting the * symbol, e.g. as `*`. Additionally, some lexer build regular expressions. For example, an plication symbol in the input is distinguished from an 2.9. LEXERS AND LEXER GENERATORS 43 generators extend regular expressions in various ways, e.g., allowing a set of characters to be specied by listing the characters that are not in the set. This is useful, for example, to specify the symbols inside a comment up to the terminating character(s). 
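The longest-prefix rule described above can be sketched as a simple driver loop around the move table: the DFA is run as far as possible while the position of the most recent accepting state is remembered, and the lexer backs up to that position when it gets stuck. The representation below (a move dictionary and a token_of table giving the token reported by each accepting state, after priorities between overlapping tokens have been resolved) is an assumption made for this sketch, not the interface of any particular lexer generator.

class LexicalError(Exception):
    pass

def lex(text, move, start, token_of):
    pos = 0
    tokens = []
    while pos < len(text):
        state = start
        last_accept = None                 # (token type, end position)
        i = pos
        while i < len(text) and (state, text[i]) in move:
            state = move[(state, text[i])]
            i += 1
            if state in token_of:          # remember the last accepting state
                last_accept = (token_of[state], i)
        if last_accept is None:
            raise LexicalError("no token at position %d" % pos)
        kind, end = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end                          # text read past the last accepting
                                           # state is effectively put back
    return tokens

if __name__ == "__main__":
    # a tiny illustrative DFA for integers (NUM) and identifiers (ID)
    move = {}
    for d in "0123456789":
        move[("start", d)] = "num"
        move[("num", d)] = "num"
    for c in "abcdefghijklmnopqrstuvwxyz":
        move[("start", c)] = "id"
        move[("id", c)] = "id"
    print(lex("x42y", move, "start", {"num": "NUM", "id": "ID"}))
    # [('ID', 'x'), ('NUM', '42'), ('ID', 'y')]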
The input to the lexer generator will normally contain a list of regular expressions that each denote a token. Each of these regular expressions has an associated action. The action describes what is passed on to the consumer (e.g., the parser), typically an element from a token data NUM, ID, etc.) type, which describes the type of token ( and sometimes additional information such as the value of a number token, the name of an identier token and, perhaps, the position of the token in the input le. The information needed to construct such values is typically provided by the lexer generator through library functions or variables that can be used in the actions. Normally, the lexer generator requires white-space and comments to be dened by regular expressions. The actions for these regular expressions are typically empty, meaning that white-space and comments are just ignored. An action can be more than just returning a token. If, for example, a language has a large number of keywords, then a DFA that recognises all of these individually can be fairly large. In such cases, the keywords are not described as separate regular expressions in the lexer denition but instead treated as special cases of the identier token. The action for identiers will then look the name up in a table of keywords and return the appropriate token type (or an identier token if the name is not a keyword). A similar strategy can be used if the language allows identiers to shadow keywords. Another use of non-trivial lexer actions is for nested comments. In principle, a regular expression (or nite automaton) cannot recognise arbitrarily nested comments (see section 2.10), but by using a global counter, the actions for comment tokens can keep track of the nesting level. If escape sequences (for dening, e.g., control characters) are allowed in string constants, the actions for string tokens will, typically, translate the string containing these sequences into a string where they have been substituted by the characters they represent. Sometimes lexer generators allow several dierent starting points. In the example in gures 2.11 and 2.12, all regular expressions share the same starting state. However, a single lexer may be used, e.g., for both tokens in the programming language and for tokens in the input to that language. Often, there will be a good deal of sharing between these token sets (the tokens allowed in the input may, for example, be CHAPTER 2. LEXICAL ANALYSIS 44 a subset of the tokens allowed in programs). Hence, it is useful to allow these to share a NFA, as this will save space. The resulting DFA will have several starting states. An accepting state may now have more than one token name attached, as long as these come from dierent token sets (corresponding to dierent starting points). In addition to using this feature for several sources of text (program and input), it can be used locally within a single text to read very complex tokens. For example, nested comments and complex-format strings (with nontrivial escape sequences) can be easier to handle if this feature is used. 2.10 Properties of regular languages We have talked about regular languages as the class of languages that can be described by regular expressions or nite automata, but this in itself may not give a clear understanding of what is possible and what is not possible to describe by a regular language. 
Hence, we will now state a few properties of regular languages and give some examples of some regular and non-regular languages and give informal rules of thumb that can (sometimes) be used to decide if a language is regular. 2.10.1 Relative expressive power First, we repeat that regular expressions, NFAs and DFAs have exactly the same expressive power: They all can describe all regular languages and only these. Some languages may, however, have much shorter descriptions in one of these forms than in others. We have already argued that we from a regular expression can construct an NFA whose size is linear in the size of the regular expression, and that converting an NFA to a DFA can potentially give an exponential increase in size (see below for a concrete example of this). Since DFAs are also NFAs, NFAs are clearly at least as compact as (and sometimes much more compact than) DFAs. Similarly, we can see that NFAs are at least as compact (up to a small constant factor) as regular expressions. But we have not yet considered if the converse is true: Can an NFA be converted to a regular expression of proportional size. The answer is, unfortunately, no: There exist classes of NFAs (and even DFAs) that need regular expressions that are exponentially larger to describe them. This is, however, mainly of academic interest as we rarely have to make conversions in this direction. 2.10. PROPERTIES OF REGULAR LANGUAGES If we are only interested in 45 if a language is regular rather than the size of its description, however, it doesn't matter which of the formalisms we choose, so we can in each case choose the formalism that suits us best. Sometimes it is easier to describe a regular language using a DFA or NFA instead of a regular expression. For example, the set of binary number strings that represent numbers that divide evenly by 5 can be described by a 6-state DFA (see exercise 2.9), but it requires a very complex regular expression to do so. For programming language tokens, regular expression are typically quite suitable. The subset construction (algorithm 2.3) maps sets of NFA states to DFA states. Since there are 2n − 1 n non-empty sets of NFA states, the resulting DFA can potentially have exponentially more states than the NFA. But can this potential ever be realised? isn't enough to nd one n-state To answer this, it NFA that yields a DFA with 2n − 1 states. We need to nd a family of ever bigger NFAs, all of which yield exponentially-sized DFAs. We also need to argue that the resulting DFAs are minimal. One construction that has these properties is the following: For each integer 1. State 2. If 0 n > 1, construct an n-state is the starting state and state 0 ≤ i < n − 1, a. state i NFA in the following way: n−1 is accepting. has a transition to state i+1 on the 0 on the symbol 3. All states have transitions to themselves symbol b. n-bit We can represent a set of these states by an the number if and only if state and i is in the set. to state number: Bit i is 1 in The set that contains only the initial NFA state is, hence, represented by the number 1. We shall see that the way a transition maps a set of states to a new set of states can be expressed as an operation on the number: • A transition on a • A transition on b maps the number maps the number x to x (2x to (x mod or (2n )). 1), using bit-wise or. This isn't hard to verify, so we leave this to the interested reader. It is also easy to see that these two operations can generate any from the number 1. 
n-bit number Hence, any subset can be reached by a sequence of transitions, which means that the subset-construction will generate a DFA state for every subset. CHAPTER 2. LEXICAL ANALYSIS 46 But is the DFA minimal? If we look at the NFA, we can see that a i + 1 (if i < n − 1), so for each NFA state i there is exactly one sequence of as that leads to the accepting state, and that sequence has n−1−i as. Hence, a DFA state whose subset contains the NFA state i will lead to acceptance on a string of n−1−i as, while a DFA state whose subset does not contain i will not. Hence, for any two dierent DFA states, we can nd an NFA state i that is in one of an leads from state i to the sets but not the other and use that to construct a string that will distinguish the DFA states. Hence, all the DFA states are distinct, so the DFA is minimal. 2.10.2 Limits to expressive power The most basic property of a DFA is that it is nite: It has a nite number of states and nowhere else to store information. This means, for example, that any language that requires unbounded counting cannot {an bn | n ≥ 0}, that is, the same number of bs. If be regular. An example of this is the language any sequence of as followed by a sequence of we must decide membership in this language by a DFA that reads the as, know how many there were, so we can compare this to the number of bs. input from left to right, we must, at the time we have read all the But since a nite automaton cannot count arbitrarily high, the language isn't regular. A similar non-regular language is the language of matching parentheses. However, if we limit the nesting depth of parentheses to a constant (0 to n), n, we can recognise this language by a DFA that has n+1 states i corresponds to i unmatched opening parentheses. where state State 0 is both the starting state and the only accepting state. Some surprisingly complex languages are regular. As all nite sets of strings are regular languages, the set of all legal Pascal programs of less than a million pages is a regular language, though it is by no means a simple one. While it can be argued that it would be an acceptable limitation for a language to allow only programs of less than a million pages, it isn't practical to describe a programming language as a regular language: The description would be far too large. Even if we ignore such absurdities, we can sometimes be surprised by the expressive power of regular languages. As an example, given any integer constant n, the set of numbers (written in binary or decimal notation) that divide evenly by n is a regular language (see exercise 2.9). 2.10. PROPERTIES OF REGULAR LANGUAGES 47 2.10.3 Closure properties We can also look at closure properties of regular languages. It is clear that regular languages are closed under set union: If we have regular expressions s and t for two languages, the regular expression s|t describes the union of these languages. Similarly, regular languages are closed under concatenation and unbounded repetition, as these correspond to basic operators of regular expressions. Less obviously, regular languages are also closed under set dierence and set intersection. To see this, we rst look at set complement: Given L is the set of all Σ, except the strings found in L. We write the complement of L as L. 
To get the complement of a regular language L, we rst construct a DFA for the language L and make sure a xed alphabet Σ, the complement of the language strings built from the alphabet that all states have transitions on all characters from the alphabet (as described in section 2.8.2). Now, we simply change every accepting state to non-accepting and vice versa, L. and thus get a DFA for We can now (by using the set-theoretic equivalent of De Morgan's law) construct L1 ∩ L2 as L1 ∪ L2 . Given this intersection L1 \ L2 = L1 ∩ L2 . construction, we can now get set dierence by Regular sets are also closed under a number of common string operations, such as prex, sux, subsequence and reversal. The precise meaning of these words in the present context is dened below. Prex. A prex of a string empty string and all of and Sux. abc. w w. is any initial part of The prexes of w, including the abc are hence , a, ab A sux of a string is what remains of the string after a prex has been taken o. The suxes of abc are hence abc, bc, c and . Subsequence. A subsequence of a string is obtained by deleting any number of symbols from anywhere in the string. The subsequences of abc Reversal. are hence abc, bc, ac, ab, c, b, a and . The reversal of a string is the string read backwards. The reversal of abc is hence cba. As with complement, these can be obtained by simple transformations of the DFAs for the language. CHAPTER 2. LEXICAL ANALYSIS 48 2.11 Further reading There are many variants of the method shown in section 2.4. The version presented here has been devised for use in this book in an attempt to make the method easy to understand and manageable to do by hand. Other variants can be found in [5] and [9]. It is possible to convert a regular expression to a DFA directly without going through an NFA. One such method [26] [5] actually at one stage during the calculation computes information equivalent to an NFA (without epsilon-transitions), but more direct methods based on algebraic properties of regular expressions also exist [12]. These, unlike NFA-based methods, generalise fairly easily to handle regular expressions extended with explicit set-intersection and set-dierence operators. A good deal of theoretic information about regular expressions and nite automata can be found in [18]. An ecient DFA minimization algorithm can be found in [22]. Lexer generators can be found for most programming languages. For C, the most common are Lex [24] and Flex [34]. The latter generates the states of the DFA as program code instead of using table-lookup. This makes the generated lexers fast, but can use much more space than a table-driven program. Finite automata and notation reminiscent of regular expressions are also used to describe behaviour of concurrent systems [28]. In this setting, a state represents the current state of a process and a transition corresponds to an event to which the process reacts by changing state. Exercises Exercise 2.1 In the following, a digits, i.e., sion [0-9] . + number-string is a non-empty sequence of decimal something in the language dened by the regular expresThe value of a number-string is the usual interpretation of a number-string as an integer number. Note that leading zeroes are allowed. Make for each of the following languages a regular expression that describes that language. a) All number-strings that have the value 42. b) All number-strings that do not have the value 42. 2.11. FURTHER READING 49 c) All number-strings that have a value that is strictly greater than 42. 
Exercise 2.2 Given the regular expression a∗ (a|b)aa: a) Construct an equivalent NFA using the method in section 2.4. b) convert this NFA to a DFA using algorithm 2.3. Exercise 2.3 Given the regular expression ((a|b)(a|bb))∗ : a) Construct an equivalent NFA using the method in section 2.4. b) convert this NFA to a DFA using algorithm 2.3. Exercise 2.4 Make a DFA equivalent to the following NFA:   - 0na- 1 a- 2 a- 3   Ia Ib b start Exercise 2.5 Minimise the following DFA:  ? 0n 3  Q a b   C QQ    C Q s  b  Cb 1 4   a  C  C a   CW  a a - 3  2  CHAPTER 2. LEXICAL ANALYSIS 50 Exercise 2.6 Minimise the following DFA: a  - 0 b- 1 a- 2   K I a b a b b   ? R R b - 4nb- 5  3  a Exercise 2.7 Construct DFAs for each of the following regular languages. In all cases the alphabet is {a, b}. a) The set of strings that has exactly 3 bs b) The set of strings where the number of there can be any number of as). (and any number of bs as). is a multiple of 3 (and c) The set of strings where the dierence between the number of and the number of bs as is a multiple of 3. Exercise 2.8 Construct a DFA that recognises balanced sequences of parenthesis with a maximal nesting depth of 3, e.g., , ()(), (()(())) or (()())()() but not (((()))) or (()(()(()))). Exercise 2.9 Given that binary number strings are read with the most signicant bit rst and may have leading zeroes, construct DFAs for each of the following languages: a) Binary number strings that represent numbers that are multiples of 4, e.g., 0, 100 and 10100. b) Binary number strings that represent numbers that are multiples of 5, e.g., 0, 101, 10100 and 11001. Hint: Make a state for each possible remainder after division by 5 and then add a state to avoid accepting the empty string. 2.11. FURTHER READING c) Given a number n, 51 what is the minimal number of states needed in a DFA that recognises binary numbers that are multiples of Hint: write n a∗ as 2b , where a n? is odd. Exercise 2.10 The empty language, i.e., the language that contains no strings can be recognised by a DFA (any DFA with no accepting states will accept this language), but it can not be dened by any regular expression using the constructions in section 2.2. Hence, the equivalence between DFAs and regular expressions is not complete. To remedy this, a new regular expression φ is introduced such that L(φ) = ∅. a) Argue why each of the following algebraic rules, where s is an arbitrary regular expression, is true: φ|s φs sφ φ∗ = = = = s φ φ  b) Extend the construction of NFAs from regular expressions to include a case for φ. c) What consequence will this extension have for converting the NFA to a minimal DFA? Hint: dead states. Exercise 2.11 Show that regular languages are closed under prex, sux, subsequence and reversal, as postulated in section 2.10. Hint: show how an NFA for a regular language L can be transformed to an NFA of prexes of strings from L, Np N for the set and similarly for the other operations. Exercise 2.12 Which of the following statements are true? Argue each answer informally. a) Any subset of a regular language is itself a regular language. b) Any superset of a regular language is itself a regular language. CHAPTER 2. LEXICAL ANALYSIS 52 c) The set of anagrams of strings from a regular language forms a regular language. (An anagram of a string is obtained by rear- ranging the order of characters in the string, but without adding or deleting any. The anagrams of the string acb, bac, bca, cab and cba). 
abc are hence abc, Exercise 2.13 In gures 2.11 and 2.12 we used character sets on transitions as shorthands for sets of transitions, each with one character. We can, instead, extend the denition of NFAs and DFAs such that such character sets are allowed on a single transition. For a DFA (to be deterministic), we must require that transitions out of the same state have disjoint character sets. a) Sketch how algorithm 2.3 must be modied to handle transitions with sets in such a way that the disjointedness requirement for DFAs are ensured. b) Sketch how algorithm 2.4 must be modied to handle character sets. A new requirement for DFA minimality is that the number of transitions as well as the number of states is minimal. How can this be ensured? Exercise 2.14 As mentioned in section 2.5, DFAs are often implemented by tables where the current state is cross-indexed by the next symbol to nd the next state. If the alphabet is large, such a table can take up quite a lot of room. If, for example, 16-bit UNI-code is used as the alphabet, there are 216 = 65536 entries in each row of the table. Even if each entry in the table is only one byte, each row will take up 64KB of memory, which may be a problem. c into c1 and c2 . In the regular expressions, each occurc is hence replaced by the regular expression c1 c2 . A possible solution is to split each 16-bit UNI-code character two 8-bit characters rence of a character This regular expression is then converted to an NFA and then to a DFA in the usual way. The DFA may (and probably will) have more states than the DFA using 16-bit characters, but each state in the new DFA use only 1/256th of the space used by the original DFA. a) How much larger is the new NFA compared to the old? 2.11. FURTHER READING 53 b) Estimate what the expected size (measured as number of states) of the new DFA is compared to the old. Hint: Some states in the NFA can be reached only after an even number of 8-bit characters are read and the rest only after an odd number of 8-bit characters are read. What does this imply for the sets constructed during the subset construction? c) Roughly, how much time does the new DFA require to analyse a string compared to the old? d) If space is a problem for a DFA over an 8-bit alphabet, do you expect that a similar trick (splitting each 8-bit character into two 4-bit characters) will help reduce the space requirements? Justify your answer. Exercise 2.15 If L is a regular language, so is L \ {}, i.e., the set of all nonempty L. strings in So we should be able to transform a regular expression for a regular expression for nonempty for L, i.e., φ L into We want to do this with a function that is recursive over the structure of the regular expression of the form: nonempty() nonempty(a) nonempty(s|t) nonempty(s t) nonempty(s?) nonempty(s∗ ) nonempty(s+ ) where L \ {}. = = = = = = = φ ... where a is an nonempty(s) | nonempty(t) ... ... ... ... alphabet symbol is the regular expression for the empty language (see exer- cise 2.10). a) Complete the denition of nonempty by replacing the occurrences of  . . . in the rules above by expressions similar to those shown in the rules for  and s|t. b) Use this denition to nd nonempty(a∗ b∗ ). CHAPTER 2. LEXICAL ANALYSIS 54 Exercise 2.16 If L is a regular language, so is the set of all prexes of strings in L (see section 2.10.3). So we should be able to transform a regular expression for regular expression for the set of all prexes of strings in L. 
L into a We want to prexes that is recursive over the structure of the L, i.e., of the form: do this with a function regular expression for prexes() prexes(a) prexes(s|t) prexes(s t) prexes(s∗ ) prexes(s+ ) = = = = = =  a? where a prexes(s) | prexes(t) ... ... ... a) Complete the denition of prexes is an alphabet symbol by replacing the occurrences of  . . . in the rules above by expressions similar to those shown in the rules for  and s|t. b) Use this denition to nd prexes(ab∗ c). Chapter 3 Syntax Analysis 3.1 Introduction Where lexical analysis splits the input into tokens, the purpose of syntax analysis (also known as parsing) is to recombine these tokens. Not back into a list of characters, but into something that reects the structure of the text. syntax tree This something is typically a data structure called the of the text. As the name indicates, this is a tree structure. The leaves of this tree are the tokens found by the lexical analysis, and if the leaves are read from left to right, the sequence is the same as in the input text. Hence, what is important in the syntax tree is how these leaves are combined to form the structure of the tree and how the interior nodes of the tree are labelled. In addition to nding the structure of the input text, the syntax analysis must also reject invalid texts by reporting syntax errors. As syntax analysis is less local in nature than lexical analysis, more advanced methods are required. We, however, use the same basic strategy: A notation suitable for human understanding is transformed into a machine-like low-level notation suitable for ecient execution. process is called parser generation. The notation we use for human manipulation is mars1 , This context-free gram- which is a recursive notation for describing sets of strings and imposing a structure on each such string. This notation can in some cases be translated almost directly into recursive programs, but it is often more convenient to generate stack automata. These are similar to the nite automata used for lexical analysis but they can additionally use a stack, which allows counting and non-local matching of symbols. 1 The name refers to the fact that derivation is independent of context. 55 CHAPTER 3. SYNTAX ANALYSIS 56 We shall see two ways of generating such automata. The rst of these, LL(1), is relatively simple, but works only for a somewhat restricted class of grammars. The SLR construction, which we present later, is more complex but accepts a wider class of grammars. Sadly, neither of these work for all context-free grammars. Tools that handle all contextfree grammars exist, but they can incur a severe speed penalty, which is why most parser generators restrict the class of input grammars. 3.2 Context-free grammars Like regular expressions, context-free grammars describe sets of strings, i.e., languages. Additionally, a context-free grammar also denes struc- ture on the strings in the language it denes. A language is dened over some alphabet, for example the set of tokens produced by a lexer or the set of alphanumeric characters. The symbols in the alphabet are called terminals. A context-free grammar recursively denes several sets of strings. Each set is denoted by a name, which is called a nonterminal. of nonterminals is disjoint from the set of terminals. The set One of the non- terminals are chosen to denote the language described by the grammar. start symbol of the grammar. The sets are described productions. 
Each production describes some of the pos- This is called the by a number of sible strings that are contained in the set denoted by a nonterminal. A production has the form N → X1 . . . X n where N is a nonterminal and X1 . . . X n are zero or more symbols, each of which is either a terminal or a nonterminal. The intended meaning of this notation is to say that the set denoted by N contains strings that are obtained by concatenating strings from the sets denoted by X1 . . . X n . In this setting, a terminal denotes a singleton set, just like alphabet characters in regular expressions. We will, when no confusion is likely, equate a nonterminal with the set of strings it denotes Some examples: A→a says that the set denoted by the nonterminal string a. A → aA A contains the one-character 3.2. CONTEXT-FREE GRAMMARS says that the set denoted by a an A contains all strings formed by putting in front of a string taken from the set denoted by these two productions indicate that of as 57 A A. Together, contains all non-empty sequences and is hence (in the absence of other productions) equivalent to the regular expression a+ . We can dene a grammar equivalent to the regular expression a∗ by the two productions B → B → aB where the rst production indicates that the empty string is part of the set B. Compare this grammar with the denition of s∗ in gure 2.1. Productions with empty right-hand sides are called tions. These are sometimes written with an  empty produc- on the right hand side instead of leaving it empty. So far, we have not described any set that could not just as well have been described using regular expressions. Context-free grammars are, however, capable of expressing much more complex languages. In section 2.10, we noted that the language {an bn | n ≥ 0} is not regular. It is, however, easily described by the grammar S → S → aS b The second production ensures that the as and bs are paired symmetri- cally around the middle of the string, ensuring that they occur in equal number. The examples above have used only one nonterminal per grammar. When several nonterminals are used, we must make it clear which of these is the start symbol. By convention (if nothing else is stated), the nonterminal on the left-hand side of the rst production is the start symbol. As an example, the grammar T T R R has T → → → → R aT a b bR as start symbol and denotes the set of strings that start with any number of number of as followed by a non-zero as with which it started. number of bs and then the same CHAPTER 3. SYNTAX ANALYSIS 58 Form of si  a sj sk sj |sk Productions for Ni Ni Ni Ni Ni Ni Ni Ni Ni Ni Ni sj ∗ sj + sj ? Ni → →a → Nj Nk → Nj → Nk → Nj Ni → → Nj Ni → Nj → Nj → Each subexpression of the regular expression is numbered and subexpression ductions for Ni si is assigned a nonterminal depend on the shape of si Ni . The pro- as shown in the table above. Figure 3.1: From regular expressions to context free grammars Sometimes, a shorthand notation is used where all the productions of the same nonterminal are combined to a single rule, using the alternative symbol (|) from regular expressions to separate the right-hand sides. In this notation, the above grammar would read T → R | aT a R → b | bR There are still four productions in the grammar, even though the arrow symbol → is only used twice. 
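When grammars are manipulated by programs, as a parser generator must do, the productions have to be held as data. One simple, purely illustrative representation in Python maps each nonterminal to the list of right-hand sides of its productions; the names used here are not taken from the book.

# The grammar  T -> R | aTa,  R -> b | bR  from the text:
grammar = {
    "T": [["R"], ["a", "T", "a"]],
    "R": [["b"], ["b", "R"]],
}
start_symbol = "T"

def is_nonterminal(symbol):
    return symbol in grammar         # terminals never appear as keys

def productions(grammar):
    # expand the shorthand notation into individual productions
    return [(n, rhs) for n, rhss in grammar.items() for rhs in rhss]

if __name__ == "__main__":
    for n, rhs in productions(grammar):
        print(n, "->", " ".join(rhs))
    # prints the four productions: T -> R, T -> a T a, R -> b, R -> b R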
3.2.1 How to write context free grammars As hinted above, a regular expression can systematically be rewritten as a context free grammar by using a nonterminal for every subexpression in the regular expression and using one or two productions for each nonterminal. The construction is shown in gure 3.1. So, if we can think of a way of expressing a language as a regular expression, it is easy to make a grammar for it. However, we will also want to use grammars to describe non-regular languages. An example is the kind of arithmetic expressions 3.2. CONTEXT-FREE GRAMMARS Exp Exp Exp Exp Exp Exp → → → → → → 59 Exp + Exp Exp - Exp Exp * Exp Exp / Exp num ( Exp ) Grammar 3.2: Simple expression grammar that are part of most programming languages (and also found on electronic calculators). Such expressions can be described by grammar 3.2. Note that, as mentioned in section 2.10, the matching parentheses can't be described by regular expressions, as these can't count the number of unmatched opening parentheses at a particular point in the string. However, if we didn't have parentheses in the language, it could be described by the regular expression num((+|-|*|/)num)∗ Even so, the regular description isn't useful if you want operators to have dierent precedence, as it treats the expression as a at string rather than as having structure. We will look at structure in sections 3.3.1 and 3.4. Most constructions from programming languages are easily expressed by context free grammars. In fact, most modern languages are designed this way. When writing a grammar for a programming language, one normally starts by dividing the constructs of the language into dierent tic categories. syntac- A syntactic category is a sub-language that embodies a particular concept. Examples of common syntactic categories in pro- gramming languages are: Expressions Statements are used to express calculation of values. express actions that occur in a particular sequence. Declarations express properties of names used in other parts of the program. Each syntactic category is denoted by a main nonterminal, e.g., Exp from grammar 3.2. More nonterminals might be needed to describe a CHAPTER 3. SYNTAX ANALYSIS 60 Stat Stat Stat Stat → → → → id := Exp Stat ; Stat if Exp then Stat else Stat if Exp then Stat Grammar 3.3: Simple statement grammar syntactic category or provide structure to it, as we shall see, and productions for one syntactic category can refer to nonterminals for other syntactic categories. For example, statements may contain expressions, so some of the productions for statements use the main nonterminal for expressions. A simple grammar for statements might look like gram- mar 3.3, which refers to the Exp nonterminal from grammar 3.2. 3.3 Derivation So far, we have just appealed to intuitive notions of recursion when we describe the set of strings that a grammar produces. Since the pro- ductions are similar to recursive set equations, we might expect to use the techniques from section 2.6.1 to nd the set of strings denoted by a grammar. However, though these methods in theory apply to innite sets by considering limits of chains of sets, they are only practically useful when the sets are nite. Instead, we below introduce the concept of derivation. An added advantage of this approach is, as we will later see, that syntax analysis is closely related to derivation. 
The basic idea of derivation is to consider productions as rewrite rules: Whenever we have a nonterminal, we can replace this by the right-hand side of any production in which the nonterminal appears on the left-hand side. We can do this anywhere in a sequence of symbols (terminals and nonterminals) and repeat doing so until we have only terminals left. The resulting sequence of terminals is a string in the language dened by the grammar. relation ⇒ by the three rules 1. αN β ⇒ αγβ 2. α ⇒ α 3. α ⇒ γ where α, β Formally, we dene the derivation and γ if there is a production if there is a β such that N →γ α⇒β and β⇒γ are (possibly empty) sequences of grammar symbols (terminals and nonterminals). The rst rule states that using a pro- 3.3. DERIVATION 61 T T R R → R → aT c → → RbR Grammar 3.4: Example grammar duction as a rewrite rule (anywhere in a sequence of grammar symbols) is a derivation step. reexive, i.e., that a i.e., that transitivity, The second states that the derivation relation is sequence derives itself. The third rule describes 2 a sequence of derivations is in itself a derivation . We can use derivation to formally dene the language that a contextfree grammar generates: Denition 3.1 Given a context-free grammar G with start symbol S , terminal symbols T and productions P , the language L(G) that G generates is dened to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions P , i.e., the set {w ∈ T ∗ | S ⇒ w}. As an example, we see that grammar 3.4 generates the string by the derivation shown in gure 3.5. aabbbcc We have, for clarity, in each sequence of symbols underlined the nonterminal that is rewritten in the following step. In this derivation, we have applied derivation steps sometimes to the leftmost nonterminal, sometimes to the rightmost and sometimes to a nonterminal that was neither. However, since derivation steps are local, the order doesn't matter. So, we might as well decide to always rewrite the leftmost nonterminal, as shown in gure 3.6. A derivation that always rewrites the leftmost nonterminal is called a leftmost derivation. Similarly, a derivation that always rewrites the rightmost nonterminal is called a rightmost derivation. 3.3.1 Syntax trees and ambiguity We can draw a derivation as a tree: The root of the tree is the start symbol of the grammar, and whenever we rewrite a nonterminal we add 2 The mathematically inclined will recognise that derivation is a preorder on sequences of grammar symbols. CHAPTER 3. SYNTAX ANALYSIS 62 ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ T aT c aaT cc aaRcc aaRbRcc aaRbcc aaRbRbcc aaRbRbRbcc aaRbbRbcc aabbRbcc aabbbcc Figure 3.5: Derivation of the string ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ aabbbcc using grammar 3.4 T aT c aaT cc aaRcc aaRbRcc aaRbRbRcc aabRbRcc aabRbRbRcc aabbRbRcc aabbbRcc aabbbcc Figure 3.6: Leftmost derivation of the string aabbbcc using grammar 3.4 3.3. DERIVATION 63 T a T a T @ @ @ @ c c R b R R  b @ @ R  R b R @ @ @ @  R  Figure 3.7: Syntax tree for the string aabbbcc using grammar 3.4 T a a T T @ @ @ @ c c R R  b @ @ R R b  R @ @ R b @ @ R  Figure 3.8: Alternative syntax tree for the string mar 3.4  aabbbcc using gram- CHAPTER 3. SYNTAX ANALYSIS 64 T T R R → R → aT c → → bR Grammar 3.9: Unambiguous version of grammar 3.4 as its children the symbols on the right-hand side of the production that was used. The leaves of the tree are terminals which, when read from left to right, form the derived string. If a nonterminal is rewritten using an empty production, an  is shown as its child. 
This is also a leaf node, but is ignored when reading the string from the leaves of the tree. When we write such a syntax tree, the order of derivation is irrele- vant: We get the same tree for left derivation, right derivation or any other derivation order. Only the choice of production for rewriting each nonterminal matters. As an example, the derivations in gures 3.5 and 3.6 yield the same syntax tree, which is shown in gure 3.7. The syntax tree adds structure to the string that it derives. It is this structure that we exploit in the later phases of the compiler. For compilation, we do the derivation backwards: We start with a string and want to produce a syntax tree. This process is called analysis or parsing. syntax order of derivation doesn't matter when constructthe choice of production for that nonterminal does. Even though the ing a syntax tree, Obviously, dierent choices can lead to dierent strings being derived, but it may also happen that several dierent syntax trees can be built for the same string. As an example, gure 3.8 shows an alternative syntax tree for the same string that was derived in gure 3.7. When a grammar permits several dierent syntax trees for some strings we call the grammar ambiguous. If our only use of grammar is to describe sets of strings, ambiguity isn't a problem. However, when we want to use the grammar to impose structure on strings, the structure had better be the same every time. Hence, it is a desireable feature for a grammar to be unambiguous. In most (but not all) cases, an ambiguous grammar can be rewritten to an unambiguous grammar that generates the same set of strings, or external rules can be applied to decide which of the many possible syntax trees is the right one. An unambiguous version of grammar 3.4 is shown in gure 3.9. 3.4. OPERATOR PRECEDENCE 65 How do we know if a grammar is ambiguous? If we can nd a string and show two alternative syntax trees for it, this is a proof of ambiguity. It may, however, be hard to nd such a string and, when the grammar is unambiguous, even harder to show that this is the case. In fact, the problem is formally undecidable, i.e., there is no method that for all grammars can answer the question Is this grammar ambiguous?. But in many cases it is not dicult to detect and prove ambiguity. For example, any grammar that has a production of the form N → N αN where α is any sequence of grammar symbols, is ambiguous. This is, for example, the case with grammars 3.2 and 3.4. We will, in sections 3.11 and 3.13, see methods for constructing parsers from grammars. These methods have the property that they only work on unambiguous grammars, so successful construction of a parser is a proof of unambiguity. However, the methods may also fail on certain unambiguous grammars, so they can not be used to prove ambiguity. In the next section, we will see ways of rewriting a grammar to get rid of some sources of ambiguity. These transformations preserve the language that the grammar generates. By using such transformations (and others, which we will see later), we can create a large set of equiva- lent grammars, i.e., grammars that generate the same language (though they may impose dierent structures on the strings of the language). Given two grammars, it would be nice to be able to tell if they are equivalent. Unfortunately, no known method is able to decide this in all cases, but, unlike ambiguity, it is not (at the time of writing) known if such a method may or may not theoretically exist. 
Sometimes, equivalence can be proven e.g. by induction over the set of strings that the grammars produce. The converse can be proven by nding an example of a string that one grammar can generate but the other not. But in some cases, we just have to take claims of equivalence on faith or give up on deciding the issue. 3.4 Operator precedence As mentioned in section 3.2.1, we can describe traditional arithmetic expressions by grammar 3.2. Note that num is a terminal that denotes all integer constants and that, here, the parentheses are terminal symbols CHAPTER 3. SYNTAX ANALYSIS 66 Exp @ @ Exp + 2 Exp Exp * @ @ Exp 3 4 Figure 3.10: Preferred syntax tree for 2+3*4 using grammar 3.2 (unlike in regular expressions, where they are used to impose structure on the regular expressions). This grammar is ambiguous, as evidenced by, e.g., the production Exp → Exp + Exp which has the form that in section 3.3.1 was claimed to imply ambiguity. This ambiguity is not surprising, as we are used to the fact that an expression like 2+3*4 can be read in two ways: Either as multiplying the sum of 2 and 3 by 4 or as adding 2 to the product of 3 and 4. Simple electronic calculators will choose the rst of these interpretations (as they always calculate from left to right), whereas scientic calculators and most programming languages will choose the second, as they use a hierarchy of operator precedences which dictate that the product must be calculated before the sum. The hierarchy can be overridden by explicit parenthesisation, e.g., (2+3)*4. Most programming languages use the same convention as scientic calculators, so we want to make this explicit in the grammar. Ideally, we would like the expression 2+3*4 to generate the syntax tree shown in gure 3.10, which reects the operator precedences by grouping of subexpressions: When evaluating an expression, the subexpressions represented by subtrees of the syntax tree are evaluated before the topmost operator is applied. A possible way of resolving the ambiguity is to use precedence rules during syntax analysis to select among the possible syntax trees. Many parser generators allow this approach, as we shall see in section 3.15. However, some parsing methods require the grammars to be unambiguous, so we have to express the operator hierarchy in the grammar itself. To clarify this, we rst dene some concepts: • An operator be evaluated left-associative if the expression a ⊕ b ⊕ c from left to right, i.e., as (a ⊕ b) ⊕ c. ⊕ is must 3.4. OPERATOR PRECEDENCE • An operator ⊕ 67 right-associative if the expression a ⊕ b ⊕ c i.e., as a ⊕ (b ⊕ c). is must be evaluated from right to left, • An operator ⊕ is non-associative if expressions of the form a⊕b⊕c are illegal. By the usual convention, calculated as (2-3)-4. - and / are + and * are left-associative, as e.g., 2-3-4 is associative in the mathematical sense, meaning that it doesn't matter if we calculate from left to right or from right to left. one of these. However, to avoid ambiguity we have to choose By convention (and similarity to - and /) we choose to - and 2-3+4, as let these be left-associative as well. Also, having a left-associative right-associative + would not help resolving the ambiguity of the operators so-to-speak pull in dierent directions. List construction operators in functional languages, e.g., :: and @ in SML, are typically right-associative, as are function arrows in types: a -> b -> c is read as is also right-associative: a -> (b -> c). The assignment a=b=c is read as a=(b=c). 
operator in C < 2 < 3 < 4. In some languages (like Pascal), comparison operators (like are non-associative, i.e., you are not allowed to write or >) 3.4.1 Rewriting ambiguous expression grammars If we have an ambiguous grammar E → E ⊕E E → num we can rewrite this to an unambiguous grammar that generates the correct structure. As this depends on the associativity of ⊕, we use dierent rewrite rules for dierent associativities. If ⊕ is left-associative, we make the grammar left-recursive by having a recursive reference to the left only of the operator symbol: E → E ⊕ E0 E → E0 E 0 → num Now, the expression 2⊕3⊕4 can only be parsed as CHAPTER 3. SYNTAX ANALYSIS 68 E ⊕ E ⊕ E @ @ @ @ E0 E0 4 3 E0 2 We get a slightly more complex syntax tree than in gure 3.10, but not enormously so. We handle right-associativity in a similar fashion: We make the offending production right-recursive: E → E0 ⊕ E E → E0 E 0 → num Non-associative operators are handled by non-recursive productions: E → E0 ⊕ E0 E → E0 E 0 → num Note that the latter transformation actually changes the language that the grammar generates, as it makes expressions of the form num ⊕ num illegal. num ⊕ So far, we have handled only cases where an operator interacts with itself. This is easily extended to the case where several operators with the same precedence and associativity interact with each other, as for example + and -: E E E E0 → → → → E + E0 E - E0 E0 num Operators with the same precedence must have the same associativity for this to work, as mixing left-recursive and right-recursive productions for the same nonterminal makes the grammar ambiguous. As an example, the grammar 3.4. OPERATOR PRECEDENCE Exp Exp Exp Exp2 Exp2 Exp2 Exp3 Exp3 → → → → → → → → 69 Exp + Exp2 Exp - Exp2 Exp2 Exp2 * Exp3 Exp2 / Exp3 Exp3 num ( Exp ) Grammar 3.11: Unambiguous expression grammar E E E E0 → → → → E + E0 E0 ⊕ E E0 num seems like an obvious generalisation of the principles used above, giving + and ⊕ the same precedence and dierent associativity. But not only is the grammar ambiguous, it doesn't even accept the intended language. For example, the string num+num⊕num is not derivable by this grammar. In general, there is no obvious way to resolve ambiguity in an expres- 1+2⊕3, where + is left-associative and ⊕ is right-associative (or vice-versa). Hence, most programming languages (and most parser generators) require operators at the same precedence level to have identical sion like associativity. We also need to handle operators with dierent precedences. This is done by using a nonterminal for each precedence level. The idea is that if an expression uses an operator of a certain precedence level, then its subexpressions cannot use operators of lower precedence (unless these are inside parentheses). Hence, the productions for a nonterminal corresponding to a particular precedence level refers only to nonterminals that correspond to the same or higher precedence levels, unless parentheses or similar bracketing constructs disambiguate the use of these. Grammar 3.11 shows how these rules are used to make an unambiguous version of grammar 3.2. using this grammar. Figure 3.12 show the syntax tree for 2+3*4 CHAPTER 3. SYNTAX ANALYSIS 70 Exp Exp + @ @ Exp2 Exp2 Exp2 * @ @ Exp3 Exp3 2 Exp3 4 3 Figure 3.12: Syntax tree for 2+3*4 using grammar 3.11 3.5 Other sources of ambiguity Most of the potential ambiguity in grammars for programming languages comes from expression syntax and can be handled by exploiting precedence rules as shown in section 3.4. 
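Before looking at other sources of ambiguity, a small sketch (not one of the book's figures) may help make the point of the preceding sections concrete: once 2+3*4 has been given the tree structure of figure 3.10 (equivalently, figure 3.12), evaluating it is just a matter of following the tree, so choosing the tree is choosing the result. The sketch is written in Python and all names in it are our own choices.

import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(tree):
    # A syntax tree is either a number (a leaf) or a tuple (op, left, right).
    # Subtrees are evaluated before the operator at the root is applied.
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

# The tree that grammar 3.11 yields for 2+3*4: the product is a subtree of the sum.
print(evaluate(("+", 2, ("*", 3, 4))))   # 14
# The other reading, corresponding to (2+3)*4, is a different tree:
print(evaluate(("*", ("+", 2, 3), 4)))   # 20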
Another classical example of ambiguity is the dangling-else problem. Imperative languages like Pascal or C often let the else-part of a conditional be optional, like shown in grammar 3.3. The problem is that it isn't clear how to parse, for example, if p then if q then s1 else s2 else can equally well match either if. else matches the closest not previously matched if, which, in the example, will make the else match the second if. How do we make this clear in the grammar? We can treat if, then and else as a kind of right-associative operators, as this would make them group to the right, making an if-then match the closest else. According to the grammar, the The usual convention is that an However, the grammar transformations shown in section 3.4 can't directly be applied to grammar 3.3, as the productions for conditionals don't have the right form. Instead we use the following observation: When an match, all ifs if else elses. and an that occur between these must have matching This can easily be proven by assuming otherwise and concluding that this leads to a contradiction. elseelse-part) con- Hence, we make two nonterminals: One for matched (i.e. with part) conditionals and one for unmatched (i.e. without ditionals. The result is shown in grammar 3.13. This grammar also 3.6. SYNTAX ANALYSIS Stat Stat Stat2 Stat2 M atched M atched U nmatched U nmatched → → → → → → → → 71 Stat2 ; Stat Stat2 M atched U nmatched if Exp then M atched else M atched id := Exp if Exp then M atched else U nmatched if Exp then Stat2 Grammar 3.13: Unambiguous grammar for statements resolves the associativity of semicolon (right) and the precedence of if over semicolon. An alternative to rewriting grammars to resolve ambiguity is to use an ambiguous grammar and resolve conicts by using precedence rules during parsing. We shall look into this in section 3.15. All cases of ambiguity must be treated carefully: It is not enough that we eliminate ambiguity, we must do so in a way that results in the desired structure: The structure of arithmetic expressions is signicant, and it makes a dierence to which if an else is matched. 3.6 Syntax analysis The syntax analysis phase of a compiler will take a string of tokens produced by the lexer, and from this construct a syntax tree for the string by nding a derivation of the string from the start symbol of the grammar. This can be done by guessing derivations until the right one is found, but random guessing is hardly an eective method. Even so, some parsing techniques are based on guessing derivations. However, these make sure, by looking at the string, that they will always guess right. These are called predictive parsing methods. Predictive parsers always build the syntax tree from the root down to the leaves and are hence also called (deterministic) top-down parsers. Other parsers go the other way: They search for parts of the input string that matches right-hand sides of productions and rewrite these to the left-hand nonterminals, at the same time building pieces of the syntax tree. The syntax tree is eventually completed when the string has CHAPTER 3. SYNTAX ANALYSIS 72 been rewritten (by inverse derivation) to the start symbol. Also here, we wish to make sure that we always pick the right rewrites, so we get deterministic parsing. Such methods are called bottom-up parsing methods. We will in the next sections rst look at predictive parsing and later at a bottom-up parsing method called SLR parsing. 
3.7 Predictive parsing If we look at the left-derivation in gure 3.6, we see that, to the left of the rewritten nonterminals, there are only terminals. These terminals correspond to a prex of the string that is being parsed. In a parsing situation, this prex will be the part of the input that has already been read. The job of the parser is now to choose the production by which the leftmost unexpanded nonterminal should be rewritten. Our aim is to be able to make this choice deterministically based on the next unmatched input symbol. If we look at the third line in gure 3.6, we have already read two as and (if the input string is the one shown in the bottom line) the next symbol is a b. Since the right-hand side of the production T → aT c starts with an T a, we obviously can't use this. Hence, we can only rewrite using the production T →R We are not quite as lucky in the next step. for R None of the productions start with a terminal symbol, so we can't immediately choose a production based on this. As the grammar (grammar 3.4) is ambiguous, it should not be a surprise that we can't always choose uniquely. If we instead use the unambiguous grammar (grammar 3.9) we can immediately choose the second production for we are at the following c, R. When all the bs are read and we choose the empty production for R and match the remaining input with the rest of the derived string. If we can always choose a unique production based on the next input symbol, we are able to do this kind of predictive parsing. 3.8. NULLABLE AND FIRST 73 3.8 Nullable and FIRST In simple cases, like the above, all but one of the productions for a nonterminal start with distinct terminals and the remaining production does not start with a terminal. However, the method can be applied also for grammers that don't have this property: Even if several productions start with nonterminals, we can choose among these if the strings these productions can derive begin with symbols from known disjoint sets. Hence, we dene the function FIRST, which given a sequence of grammar symbols (e.g. the right-hand side of a production) returns the set of symbols with which strings derived from that sequence can begin: Denition 3.2 A symbol c is in FIRST(α) if and only if α ⇒ cβ for some sequence β of grammar symbols. To calculate sequence α FIRST, we need an auxiliary function Nullable, which for a of grammar symbols indicates whether or not that sequence can derive the empty string: Denition 3.3 A sequence α of grammar symbols is Nullable (we write this as Nullable(α)) if and only if α ⇒ . N → α is called nullable if Nullable(α). We describe calculation of Nullable by case analysis over the possible forms of sequences A production of grammar symbols: Algorithm 3.4 Nullable() Nullable(a) Nullable(α β) Nullable(N ) = = = = true false Nullable(α) ∧ Nullable(β) Nullable(α1 ) ∨ . . . ∨ Nullable(αn ), where the productions for N are N → α1 , . . . , N → αn where a is a terminal, N is a nonterminal, α and β are sequences of grammar symbols and  represents the empty sequence of grammar symbols. The equations are quite natural: Any occurrence of a terminal on a right-hand side makes Nullable false for that right-hand side, but a nonterminal is nullable if any production has a nullable. CHAPTER 3. SYNTAX ANALYSIS 74 Note that this is a recursive denition since minal is dened in terms of Nullable Nullable for a nonter- for its right-hand sides, which may contain that same nonterminal. 
We can solve this in much the same way that we solved set equations in section 2.6.1. We have, however, now booleans instead of sets and several equations instead of one. Still, the method is essentially the same: We have a set of boolean equations: X1 = F1 (X1 , . . . , Xn ) . . . Xn = Fn (X1 , . . . , Xn ) We initially assume X1 , . . . , X n to be all false. We then, in any order, calculate the right-hand sides of the equations and update the variable on the left-hand side by the calculated value. We continue until all equations are satised. In section 2.6.1, we required the functions to be monotonic with respect to subset. Correspondingly, we now require the boolean functions to be monotonic with respect to truth: If we make more arguments true, the result will also be more true (i.e., it may stay unchanged, change from false). false to true, but never change from true to If we look at grammar 3.9, we get these equations for nonterminals and right-hand sides: Nullable(T ) Nullable(R) = Nullable(R) ∨ Nullable(aT c) = Nullable() ∨ Nullable(bR) Nullable(R) Nullable(aT c) Nullable() Nullable(bR) = = = = Nullable(R) Nullable(a) ∧ Nullable(T ) ∧ Nullable(c) true Nullable(b) ∧ Nullable(R) In a xed-point calculation, we initially assume that Nullable is false for Nullable for rst all nonterminals and use this as a basis for calculating the right-hand sides and then the nonterminals. We repeat recalculating these until there is no change between two iterations. Figure 3.14 shows the xed-point iteration for the above equations. In each iteration, we rst evaluate the formulae for the right-hand sides and then use the results of this to evaluate the nonterminals. The right-most column shows the nal result. We can calculate FIRST in a similar fashion to Nullable: 3.8. NULLABLE AND FIRST 75 Right-hand side Initialisation Iteration 1 Iteration 2 Iteration 3 R aT c  bR false false false false false false true false true false true false true false true false false false false true true true true true Nonterminal T R Figure 3.14: Fixed-point iteration for calculation of Nullable Algorithm 3.5 FIRST() FIRST(a) = ∅ = {a}  FIRST(α) ∪ FIRST(β) if Nullable(α) FIRST(α β) = FIRST(α) if not Nullable(α) FIRST(N ) = FIRST(α1 ) ∪ . . . ∪ FIRST(αn ) where the productions for N are N → α1 , . . . , N → αn where a is a terminal, N is a nonterminal, α and β are sequences of grammar symbols and  represents the empty sequence of grammar symbols. The only nontrivial equation is that for can start a string derivable from αβ . α αβ . Obviously, anything that can also start a string derivable from α is nullable, a derivation may proceed as αβ ⇒ β ⇒ · · ·, FIRST(β) is also in FIRST(αβ). However, if so anything in The set-equations are solved in the same general way as the boolean equations for Nullable, but since we work with sets, we initailly assume every set to be empty. For grammar 3.9, we get the following equations: FIRST(T ) FIRST(R) = FIRST(R) ∪ FIRST(aT c) = FIRST() ∪ FIRST(bR) FIRST(R) FIRST(aT c) FIRST() FIRST(bR) = = = = FIRST(R) FIRST(a) ∅ FIRST(b) CHAPTER 3. SYNTAX ANALYSIS 76 Right-hand side Initialisation Iteration 1 Iteration 2 Iteration 3 R aT c  bR ∅ ∅ ∅ ∅ ∅ {a} ∅ {b} {b} {a} ∅ {b} {b} {a} ∅ {b} ∅ ∅ {a} {b} {a, b} {b} {a, b} {b} Nonterminal T R Figure 3.15: Fixed-point iteration for calculation of FIRST The xed-point iteration is shown in gure 3.15. When working with grammars by hand, it is usually quite easy to see for most productions if they are nullable and what their FIRST sets are. 
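When the calculation is done by a program rather than by hand, the fixed-point iteration described above is straightforward to implement. The following is a minimal sketch (not one of the book's figures) of the computation of Nullable and FIRST for grammar 3.9, written in Python; the grammar representation and all names are our own choices.

grammar = {                       # nonterminal -> list of right-hand sides
    "T": [["R"], ["a", "T", "c"]],
    "R": [[], ["b", "R"]],        # [] is the empty right-hand side
}

def is_terminal(sym):
    return sym not in grammar

def compute_nullable(grammar):
    nullable = {n: False for n in grammar}      # start by assuming all false
    changed = True
    while changed:                               # iterate until nothing changes
        changed = False
        for n, rhss in grammar.items():
            value = any(all(not is_terminal(s) and nullable[s] for s in rhs)
                        for rhs in rhss)
            if value != nullable[n]:
                nullable[n], changed = value, True
    return nullable

def compute_first(grammar, nullable):
    first = {n: set() for n in grammar}          # start with empty sets
    changed = True
    while changed:
        changed = False
        for n, rhss in grammar.items():
            for rhs in rhss:
                for s in rhs:                    # add FIRST of each symbol for as
                    add = {s} if is_terminal(s) else first[s]   # long as the prefix
                    if not add <= first[n]:                     # before it is nullable
                        first[n] |= add
                        changed = True
                    if is_terminal(s) or not nullable[s]:
                        break
    return first

nullable = compute_nullable(grammar)
print(nullable)                          # {'T': True, 'R': True}, cf. figure 3.14
print(compute_first(grammar, nullable))  # T: {'a', 'b'}, R: {'b'}, cf. figure 3.15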
For example, a production is not nullable if its right-hand side has a terminal anywhere, and if the right-hand side starts with a terminal, the FIRST set consists of only that symbol. Sometimes, however, it is necessary to go through the motions of solving the equations. When working by hand, it is often useful to simplify the equations before the xed-point iteration, e.g., reduce FIRST(aT c) to {a}. 3.9 Predictive parsing revisited We are now ready to construct predictive parsers for a wider class of grammars: If the right-hand sides of the productions for a nonterminal have disjoint FIRST sets, we can use the next input symbol to choose among the productions. In section 3.7, we picked the empty production (if any) on any symbol that was not in the same nonterminal. is Nullable. FIRST sets of the non-empty productions for the We must actually do this for any production that Hence, at most one production for a nonterminal may be nullable, as otherwise we would not be able to choose deterministically between the two. We said in section 3.3.1 that our syntax analysis methods will detect ambiguous grammars. However, this isn't true with the method as stated above: We will get unique choice of production even for some ambiguous grammars, including grammar 3.4. The syntax analysis will in this case just choose one of several possible syntax trees for a given input string. 3.10. FOLLOW 77 In many cases, we do not consider such behaviour acceptable. In fact, we would very much like our parser construction method to tell us if we by mistake write an ambiguous grammar. Even worse, the rules for predictive parsing as presented here might for some unambiguous grammars give deterministic choice of production, but reject strings that actually belong to the language described by the grammar. If we, for example, change the second production in grammar 3.9 to T → aT b this will not change the choices made by the predictive parser for nonterminal R. However, always choosing the last production for will lead to erroneous rejection of many strings, including ab. R on a b Hence, we add to our construction of predictive parsers a test that will reject ambiguous grammars and those unambiguous grammars that can cause the parser to fail erroneously. We have so far simply chosen a nullable production if and only if no other choice is possible. However, we should extend this to say that we choose a production N → α on symbol c if one of the two conditions below are satised: 1) c ∈ FIRST(α). 2) Nullable(α) and c can validly follow N in a derivation. This makes us choose nullable productions more often than before. This, in turn, leads to more cases where we can not choose uniquely, including the example above with the modied grammar 3.9 (since b can follow R in valid derivations) and all ambiguous grammars that are not caught by the original method. 3.10 FOLLOW For the purpose of rejecting grammars that are problematical for predictive parsing, we introduce FOLLOW sets for nonterminals. Denition 3.6 A terminal symbol a is in FOLLOW(N ) if and only if there is a derivation from the start symbol S of the grammar such that S ⇒ αN aβ , where α and β are (possibly empty) sequences of grammar symbols. CHAPTER 3. SYNTAX ANALYSIS 78 In other words, a terminal c is in FOLLOW(N ) if c may follow N at some point in a derivation. To correctly handle end-of-string conditions, we want to detect if S ⇒ αN , i.e., if there are derivations where end of input. 
It turns out to be easy to do this by adding an extra N can be followed by the production to the grammar: S 0 → S$ where S0 is a new nonterminal that replaces S as start symbol and $ is a new terminal symbol representing the end of input. Hence, in the new FOLLOW(N ) exactly if S 0 ⇒ αN $ which is the case exactly when S ⇒ αN . The easiest way to calculate FOLLOW is to generate a collection of set constraints, which are subsequently solved. A production grammar, $ will be in M → αN β generates the constraint can follow N. FIRST(β) ⊆ FOLLOW(N ), since β , obviously, Nullable(β ) the production also generates Furthermore, if the constraint FOLLOW(M ) ⊆ FOLLOW(N ) (note the direction of the inclusion). c is in FOLLOW(M ), then there (by 0 derivation S ⇒ γM cδ . But since M → αN β and β can continue this by γM cδ ⇒ γαN cδ , so c is also in The reason is that, if a symbol denition) is a is nullable, we FOLLOW(N ). If a right-hand side contains several occurrences of nonterminals, we i.e., splitting the right-hand side β s. For example, the production A → BcB generates the constraint {c} ⊆ FOLLOW (B) by splitting after the rst B and the constraint FOLLOW (A) ⊆ FOLLOW (B) by splitting after the last B . add constraints for all occurrences, into dierent αs, N s and We solve the constraints in the following fashion: FOLLOW sets for all nonterminals. We FIRST (β) ⊆ FOLLOW (N ): We compute FIRST(β ) and add this to FOLLOW(N). Thereafter, we handle the second type of constraints: For each constraint FOLLOW(M ) ⊆ FOLLOW (N ), we add FOLLOW(M ) to FOLLOW(N ). We iterate these last steps until no further changes hapWe start by assuming empty then handle the constraints of the form pen. The steps taken to calculate the follow sets of a grammar are, hence: 3.10. FOLLOW 79 S 0 → S$, where S is the 0 grammar. S is the start symbol for 1. Add a new nonterminal for the original start symbol the extended grammar. 2. For each nonterminal N, locate all occurrences of N on the right- hand sides of productions. For each occurrence do the following: β N. 2.1 Let of 2.2 Let be the rest of the right-hand side after the occurrence Note that β may be empty. m = FIRST(β). m ⊆ FOLLOW(N ) Add the constraint to the set of constraints. If β is empty, you can omit the constraint, as it doesn't add anything. 2.3 If Nullable(β), nd the nonterminal M at the left-hand side of the production and add the constraint FOLLOW(N ). If M = N, FOLLOW(M ) ⊆ you can omit the constraint, as it doesn't add anything. Note that if β is empty, Nullable(β) is true. 3. Solve the constraints using the following steps: 3.1 Start with empty sets for N (not including S 0 ). FOLLOW(N ) 3.2 For each constraint of the form for all nonterminals m ⊆ FOLLOW(N ) conm to FOLLOW(N ). structed in step 2.1, add the contents of 3.3 Iterating until a xed-point is reached, for each constraint of the form FOLLOW(M ) ⊆ FOLLOW(N ), add the contents of FOLLOW(M ) to FOLLOW(N ). We can take grammar 3.4 as an example of this. We rst add the production T0 → T$ to the grammar to handle end-of-text conditions. The table below shows the constraints generated by each production Production Constraints T0 → T$ T →R T → aT c R→ R → RbR {$} ⊆ FOLLOW(T ) FOLLOW(T ) ⊆ FOLLOW(R) {c} ⊆ FOLLOW(T ) {b} ⊆ FOLLOW(R), FOLLOW(R) ⊆ FOLLOW(R) CHAPTER 3. SYNTAX ANALYSIS 80 FIRST sets, FOLFIRST sets: In the above table, we have already calculated the required so they are shown as explicit lists of terminals. 
To initialise the LOW sets, we use the constraints that involve these FOLLOW(T ) = {$, c} FOLLOW(R) = {b} and then iterate calculation of the subset constraints. The only nontrivial constraint is FOLLOW(T ) ⊆ FOLLOW(R), so we get FOLLOW(T ) = {$, c} FOLLOW(R) = {$, c, b} Which is the nal values for the FOLLOW sets. If we return to the question of predictive parsing of grammar 3.4, we R we should choose the empty production FOLLOW(R), i.e., {$, c, b} and choose the non-empty production on the symbols in FIRST(RbR), i.e., {b}. Since these sets overlap (on the symbol b), we can not uniquely choose a production for R based on the next input symbol. Hence, the revised construction see that for the nonterminal on the symbols in of predictive parsers (see below) will reject this grammar as possibly ambiguous. 3.11 LL(1) parsing We have, in the previous sections, looked at how we can choose productions based on FIRST and FOLLOW sets, i.e. using N → α on input symbol c if the rule that we choose a production • c ∈ FIRST (α), • Nullable(α) or and c ∈ FOLLOW (N ). If we can always choose a production uniquely by using these rules, this is called called LL(1) parsing  the rst L indicates the reading direction (left-to-right), the second L indicates the derivation order (left) and the 1 indicates that there is a one-symbol lookahead. A grammar that can be parsed using LL(1) parsing is called an LL(1) grammar. In the rest of this section, we shall see how we can implement LL(1) parsers as programs. We look at two implementation methods: Recursive descent, where grammar structure is directly translated into the structure of a program, and a table-based approach that encodes the decision process in a table. 3.11. LL(1) PARSING 81 3.11.1 Recursive descent As the name indicates, recursive descent uses recursive functions to im- plement predictive parsing. The central idea is that each nonterminal in the grammar is implemented by a function in the program. Each such function looks at the next input symbol in order to choose one of the productions for the nonterminal, using the criteria shown in the beginning of section 3.11. The right-hand side of the chosen production is then used for parsing in the following way: A terminal on the right-hand side is matched against the next input symbol. If they match, we move on to the following input symbol and the next symbol on the right hand side, otherwise an error is reported. A nonterminal on the right-hand side is handled by calling the corresponding function and, after this call returns, continuing with the next symbol on the right-hand side. When there are no more symbols on the right-hand side, the function returns. As an example, gure 3.16 shows pseudo-code for a recursive descent parser for grammar 3.9. We have constructed this program by the following process: We have rst added a production FOLLOW for all productions. T 0 has only one production, so T0 → T$ and calculated FIRST and the choice is trivial. However, we have added a check on the next input symbol anyway, so we can report This is shown in the function parseT'. parseT function, we look at the productions for T . As FIRST(R) = {b}, the production T → R is chosen on the symbol b. Since R is also Nullable, we must choose this production also on symbols in FOLLOW(T), i.e., c or $. FIRST(aTc) = {a}, so we select T → aT c on an a. On all other symbols we report an error. For parseR, we must choose the empty production on symbols in FOLLOW(R) (c or $). The production R → bR is chosen on input b. 
an error if it isn't in FIRST(T 0 ). For the Again, all other symbols produce an error. The function match takes as argument a symbol, which it tests for equality with the next input symbol. the rst input symbol before If they are equal, the following next. We assume next parseT' is called. symbol is read into the variable is initialised to The program in gure 3.16 only checks if the input is valid. It can easily be extended to construct a syntax tree by letting the parse functions return the sub-trees for the parts of input that they parse. CHAPTER 3. SYNTAX ANALYSIS 82 function parseT'() = if next = 'a' or next = 'b' or next = '$' then parseT() ; match('$') else reportError() function parseT() = if next = 'b' or next = 'c' or next = '$' then parseR() else if next = 'a' then match('a') ; parseT() ; match('c') else reportError() function parseR() = if next = 'c' or next = '$' then (* do nothing *) else if next = 'b' then match('b') ; parseR() else reportError() Figure 3.16: Recursive descent parser for grammar 3.9 3.11.2 Table-driven LL(1) parsing In table-driven LL(1) parsing, we encode the selection of productions into a table instead of in the program text. A simple non-recursive program uses this table and a stack to perform the parsing. The table is cross-indexed by nonterminal and terminal and contains for each such pair the production (if any) that is chosen for that nonterminal when that terminal is the next input symbol. This decision is N →α Nullable(α) and a made just as for recursive descent parsing: The production the table at (N ,a) FOLLOW(N). if a is in FIRST(α) or if both is in is in For grammar 3.9 we get the table shown in gure 3.17. The program that uses this table is shown in gure 3.18. It uses a stack, which at any time (read from top to bottom) contains the part of the current derivation that has not yet been matched to the input. When this eventually becomes empty, the parse is nished. If the stack is non-empty, and the top of the stack contains a terminal, that terminal is matched against the input and popped from the stack. Otherwise, the top of the stack must be a nonterminal, which we cross-index in the table 3.11. LL(1) PARSING 83 a b c $ 0 0 → T$ T → T$ T → T$ T T → aT c T → R T → R T → R R R → bR R → R→ T0 T0 Figure 3.17: LL(1) table for grammar 3.9 stack := empty ; push(T',stack) while stack <> empty do if top(stack) is a terminal then match(top(stack)) ; pop(stack) else if table(top(stack),next) = empty then reportError else rhs := rightHandSide(table(top(stack),next)) ; pop(stack) ; pushList(rhs,stack) Figure 3.18: Program for table-driven LL(1) parsing with the next input symbol. If the table-entry is empty, we report an error. If not, we pop the nonterminal from the stack and replace this by the right-hand side of the production in the table entry. The list of symbols on the right-hand side are pushed such that the rst of these will be at the top of the stack. As an example, gure 3.19 shows the input and stack at each step during parsing of the string aabbbcc$ using the table in gure 3.17. The top of the stack is to the left. The program in gure 3.18, like the one in gure 3.16, only checks if the input is valid. It, too, can be extended to build a syntax tree. This can be done by letting each nonterminal on the stack point to its node in the partially built syntax tree. When the nonterminal is replaced by one of its right-hand sides, nodes for the symbols on the right-hand side are added as children to the node. 
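As an illustration (not one of the book's figures), the following Python sketch implements the algorithm of figure 3.18 with the table of figure 3.17 written out as a dictionary. Like figures 3.16 and 3.18, it only checks whether the input is valid; the encoding and all names are our own choices.

RHS = {   # (nonterminal, next input symbol) -> right-hand side of the chosen production
    ("T'", "a"): ["T", "$"], ("T'", "b"): ["T", "$"], ("T'", "$"): ["T", "$"],
    ("T", "a"): ["a", "T", "c"],
    ("T", "b"): ["R"], ("T", "c"): ["R"], ("T", "$"): ["R"],
    ("R", "b"): ["b", "R"], ("R", "c"): [], ("R", "$"): [],
}
NONTERMINALS = {"T'", "T", "R"}

def parse(string):
    # Return True if the string (without the final $) is generated by grammar 3.9.
    input_ = list(string) + ["$"]
    pos = 0
    stack = ["T'"]                       # the top of the stack is the end of the list
    while stack:
        top = stack.pop()
        if top not in NONTERMINALS:      # terminal: match it against the input
            if input_[pos] != top:
                return False             # mismatch: syntax error
            pos += 1
        else:                            # nonterminal: consult the table
            rhs = RHS.get((top, input_[pos]))
            if rhs is None:
                return False             # empty table entry: syntax error
            stack.extend(reversed(rhs))  # first right-hand-side symbol ends up on top
    return pos == len(input_)            # stack empty and all input consumed

print(parse("aabbbcc"))   # True, cf. the trace in figure 3.19
print(parse("ab"))        # False (the closing c is missing)

Keeping the table separate from the driver loop is what makes the method attractive for parser generators: when the grammar changes, only the table has to be regenerated.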
3.11.3 Conicts When a symbol N a allows several choices of production for nonterminal we say that there is a conict on that symbol for that nonterminal. 84 CHAPTER 3. SYNTAX ANALYSIS input stack aabbbcc$ aabbbcc$ aabbbcc$ abbbcc$ abbbcc$ bbbcc$ bbbcc$ bbbcc$ bbcc$ bbcc$ bcc$ bcc$ cc$ cc$ c$ T0 T$ aT c$ T c$ aT cc$ T cc$ Rcc$ bRcc$ Rcc$ bRcc$ Rcc$ bRcc$ Rcc$ cc$ c$ $ $ Figure 3.19: Input and stack during table-driven LL(1) parsing 3.12. REWRITING A GRAMMAR FOR LL(1) PARSING 85 Conicts may be caused by ambiguous grammars (indeed all ambiguous grammars will cause conicts) but there are also unambiguous grammars that cause conicts. An example of this is the unambiguous expression grammar (grammar 3.11). We will in the next section see how we can rewrite this grammar to avoid conicts, but it must be noted that this is not always possible: There are languages for which there exist unambiguous context-free grammars but where no grammar for the language generates a conict-free LL(1) table. Such languages are said to be non-LL(1). It is, however, important to note the dierence between a non-LL(1) language and a non-LL(1) grammar: A language may well be LL(1) even though the grammar used to describe it isn't. 3.12 Rewriting a grammar for LL(1) parsing In this section we will look at methods for rewriting grammars such that they are more palatable for LL(1) parsing. In particular, we will look at elimination of left-recursion and at left factorisation. It must, however, be noted that not all grammars can be rewritten to allow LL(1) parsing. In these cases stronger parsing techniques must be used. 3.12.1 Eliminating left-recursion As mentioned above, the unambiguous expression grammar (grammar 3.11) is not LL(1). The reason is that all productions in the same FIRST Exp and Exp2 have sets. Overlap like this will always happen when there FIRST set of a leftFIRST set of the nonterminal itself FIRST sets of all the other productions are left-recursive productions in the grammar, as the recursive production will include the and hence be a superset of the for that nonterminal. To solve this problem, we must avoid left-recursion in the grammar. We start by looking at direct left-recursion. When we have a nonterminal with some left-recursive and some nonleft-recursive productions, i.e., N → N α1 . . . N N → N αm → β1 . . . N → βn CHAPTER 3. SYNTAX ANALYSIS 86 where the βi N , we observe that this is equivalent to (β1 | . . . | βn )(α1 | . . . | αm )∗ . We can generate the do not start with the regular expression same set of strings by the grammar N → β1 N 0 . . . N → βn N 0 N 0 → α1 N 0 . . . N 0 → αm N 0 N0 → where N0 is a new nonterminal. Note that, since the βi do not start with recursion in this grammar. N, there is no direct left- There may, however, still be instances of indirect left-recursion. We will briey look at indirect left-recursion in section 3.12.1. Rewriting the grammar like above will change the syntax trees that are built from the strings that are parsed. Hence, after parsing, the syntax tree must be re-structured to obtain the structure that the original grammar describe. We will return to this in section 3.16. As an example of left-recursion removal, we take the unambiguous expression grammar 3.11. This has left recursion in both Exp and Exp2, so we apply the transformation to both of these to obtain grammar 3.20. The resulting grammar 3.20 is now LL(1). 
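As a concrete illustration (not one of the book's figures), the following Python sketch performs this transformation on the left-recursive nonterminals of grammar 3.11 and prints the productions of grammar 3.20. The grammar representation and all names are our own choices.

def eliminate_direct_left_recursion(grammar, n):
    # Replace  N -> N a1 | ... | N am | b1 | ... | bn  by
    #          N -> b1 N' | ... | bn N'  and  N' -> a1 N' | ... | am N' | (empty).
    recursive = [rhs[1:] for rhs in grammar[n] if rhs and rhs[0] == n]
    others    = [rhs     for rhs in grammar[n] if not rhs or rhs[0] != n]
    if not recursive:
        return grammar                    # nothing to do for this nonterminal
    new = n + "'"                         # the fresh nonterminal N'
    result = dict(grammar)
    result[n] = [beta + [new] for beta in others]
    result[new] = [alpha + [new] for alpha in recursive] + [[]]
    return result

# Grammar 3.11, with each right-hand side given as a list of symbols.
g = {"Exp":  [["Exp", "+", "Exp2"], ["Exp", "-", "Exp2"], ["Exp2"]],
     "Exp2": [["Exp2", "*", "Exp3"], ["Exp2", "/", "Exp3"], ["Exp3"]],
     "Exp3": [["num"], ["(", "Exp", ")"]]}

g = eliminate_direct_left_recursion(g, "Exp")
g = eliminate_direct_left_recursion(g, "Exp2")
for n, rhss in g.items():                 # prints the productions of grammar 3.20
    for rhs in rhss:
        print(n, "->", " ".join(rhs) if rhs else "(empty)")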
Indirect left-recursion The transformation shown in section 3.12.1 only serves in the simple case where there is no indirect left-recursion. Indirect left-recursion can have several faces: 1. There are productions N1 N2 → N2 α1 → N3 α2 . . . Nk−1 → Nk αk−1 Nk → N1 αk 3.12. REWRITING A GRAMMAR FOR LL(1) PARSING Exp Exp0 Exp0 Exp0 Exp2 Exp20 Exp20 Exp20 Exp3 Exp3 → → → → → → → → → → 87 Exp2 Exp0 + Exp2 Exp0 - Exp2 Exp0 Exp3 Exp20 * Exp3 Exp20 / Exp3 Exp20 num ( Exp ) Grammar 3.20: Removing left-recursion from grammar 3.11 2. There is a production N → αN β where α is Nullable. or any combination of the two. More precisely, a grammar is (directly or indirectly) left-recursive if there is a non-empty derivation sequence N ⇒ N α, i.e., if a nonterminal derives a sequence of grammar symbols that start by that same nonterminal. If there is indirect left-recursion, we must rst rewrite the grammar to make the left-recursion direct and then use the transformation above. Rewriting a grammar to turn indirect left-recursion into direct leftrecursion can be done systematically, but the process is a bit complicated. We will not go into this here, as in practise most cases of leftrecursion are direct left-recursion. Details can be found in [4]. 3.12.2 Left-factorisation If two productions for the same nonterminal begin with the same se- FIRST sets. As an if have overlapping quence of symbols, they obviously have overlapping example, in grammar 3.3 the two productions for prexes. We rewrite this in such a way that the overlapping productions are made into a single production that contains the common prex of the productions and uses a new auxiliary nonterminal for the dierent 3 suxes. Seeleft-recursion-elimination grammar 3.21. In this grammar , we can uniquely choose one of the productions for Stat based on one input token. 3 We have omitted the production for semicolon, as that would only muddle the issue by introducing more ambiguity. CHAPTER 3. SYNTAX ANALYSIS 88 → id := Exp → if Exp then Stat Elsepart Stat Stat Elsepart → else Stat Elsepart → Grammar 3.21: Left-factorised grammar for conditionals For most grammars, combining productions with common prex will solve the problem. However, in this particular example the grammar still isn't LL(1): nonterminal the FIRST We can't uniquely choose a production for the auxiliary Elsepart, else is in FOLLOW(Elsepart) as well as in production for Elsepart. This shouldn't be a since set of the rst surprise to us, since, after all, the grammar is ambiguous and ambiguous grammars can't be LL(1). The equivalent unambiguous grammar (grammar 3.13) can't easily be rewritten to a form suitable for LL(1), so in practice grammar 3.21 is used anyway and the conict is handled by choosing the non-empty production for else else Elsepart whenever the symbol is encountered, as this gives the desired behaviour of letting an match the nearest if. Very few LL(1) conicts caused by ambi- guity can be removed in this way, however, without also changing the language recognized by the grammar. For example, operator precedence ambiguity can not be resolved by deleting conicting entries in the LL(1) table. 3.12.3 Construction of LL(1) parsers summarized 1. Eliminate ambiguity 2. Eliminate left-recursion 3. Perform left factorisation where required 4. Add an extra start production 5. Calculate FIRST S 0 → S$ to the grammar. for every production and FOLLOW for every nonterminal. 6. For nonterminal when: N and input symbol c, choose production N → α 3.13. 
SLR PARSING 89 • c ∈ FIRST (α), or • Nullable(α) and c ∈ FOLLOW (N ). This choice is encoded either in a table or a recursive-descent program. 3.13 SLR parsing A problem with LL(1) parsing is that most grammars need extensive rewriting to get them into a form that allows unique choice of production. Even though this rewriting can, to a large extent, be automated, there are still a large number of grammars that can not be automatically transformed into LL(1) grammars. A class of bottom-up methods for parsing called LR parsers exist which accept a much larger class of grammars (though still not all grammars). The main advantage of LR parsing is that less rewriting is required to get a grammar in acceptable form, but there are also languages for which there exist LR-acceptable grammars but no LL(1) grammars. Furthermore, as we shall see in section 3.15, LR parsers allow external declaration of operator precedences for resolving ambiguity instead of requiring the grammar itself to be unambiguous. We will look at a simple form of LR-parsing called SLR parsing. While most parser generators use a somewhat more complex method called LALR(1) parsing, we limit the discussion to SLR for the following reasons: • It is simpler. • In practice, LALR(1) handles few grammars that are not also handled by SLR. • When a grammar is in the SLR class, the parse-tables produced by SLR are identical to those produced by LALR(1). • Understanding of SLR principles is sucient to know how a grammar should be rewritten when a LALR(1) parser generator rejects it. The letters SLR stand for Simple, Left and Right. Left indicates that the input is read from left to right and the Right indicates that a right-derivation is built. LR parsers are table-driven bottom-up parsers and use two kinds of actions involving the input stream and a stack: CHAPTER 3. SYNTAX ANALYSIS 90 shift: A symbol is read from the input and pushed on the stack. reduce: On the stack, a number of symbols that are identical to the right-hand side of a production are replaced by the left-hand side of that production. Contrary to LL parsers, the stack holds the right-hand-side symbols such that the last symbol on the right- hand side is at the top of the stack. When all of the input is read, the stack will have a single element, which will be the start symbol of the grammar. LR parsers are also called shift-reduce parsers. As with LL(1), our aim is to make the choice of action depend only on the next input symbol and the symbol on top of the stack. To achieve this, we construct a DFA. Conceptually, this DFA reads the contents of the stack, starting from the bottom. If the DFA is in an accepting state when it reaches the top of the stack, it will cause reduction by a production that is determined by the state and the next input symbol. If the DFA is not in an accepting state, it will cause a shift. Hence, at every step, the action can be determined by letting the DFA read the stack from bottom to top. Letting the DFA read the entire stack at every action is not very ecient, so, instead, we keep track of the DFA state every time we push an element on the stack, storing the state as part of the stack element. When the DFA has indicated a shift, the course of action is easy: We get the state from the top of the stack and follow the transition marked with the next input symbol to nd the next DFA state. If the DFA indicated a reduce, we pop the right-hand side of the production o the stack. stack top. 
We then read the DFA state from the new When we push the nonterminal that is the left-hand side of the production, we make a transition from this DFA state on the nonterminal. With these optimisations, the DFA only has to inspect a terminal or nonterminal at the time it is pushed on the stack. At all other times, it just need to read the DFA state that is stored with the stack element. Hence, we can forget about what the actual symbols are as soon as the DFA has made the transition. There is, thus, no reason to keep the symbols on the stack, so we let a stack element just contain the DFA state. We still use the DFA to determine the next action, but it now only needs to look at the current state (stored at the top of the stack) and the next input symbol (at a shift action) or nonterminal (at a reduce action). We represent the DFA as a table, where we cross-index a DFA state 3.14. CONSTRUCTING SLR PARSE TABLES 91 with a symbol (terminal or nonterminal) and nd one of the following actions: shift n: go n: reduce p: accept: error: Read next input symbol, push state Push state n n on the stack. on the stack. Reduce with the production numbered p. Parsing has completed successfully. A syntax error has been detected. Note that the current state is always found at the top of the stack. Shift and reduce actions are found when a state is cross-indexed with Go actions are found when a state is cross-indexed Go actions are only used immediately after a reduce, but we can't put them next to the reduce actions in the table, as the destination state of a go depends on the state on top of the stack after the right-hand side of the reduced production is popped o: A reduce in the current state is immediately followed by a go in the state that is a terminal symbol. with a nonterminal. found when the stack is popped. An example SLR table is shown in gure 3.22. The table has been produced from grammar 3.9 by the method shown below in section 3.14. The actions have been abbreviated to their rst letters and error is shown as a blank entry. The algorithm for parsing a string using the table is shown in gure 3.23. As written, the algorithm just determines if a string is in the language generated by the grammar. It can, however, easily be extended to build a syntax tree: Each stack element holds (in addition to the state number) a portion of a syntax tree. When doing a reduce action, a new (partial) syntax tree is built by using the nonterminal from the reduced production as root and the syntax trees attached to the popped-o stack elements as children. The new tree is then attached to the stack element that is pushed. Figure 3.24 shows an example of parsing the string aabbbcc using the table in gure 3.22. The stack grows from left to right. 3.14 Constructing SLR parse tables An SLR parse table has as its core a DFA. Constructing this DFA from the grammar is not much dierent from constructing a DFA from a regular expression as shown in chapter 2: We rst construct an NFA using techniques similar to those in section 2.4 and then convert this into a DFA using the construction shown in section 2.5. CHAPTER 3. 
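For reference while reading the construction that follows, here is a sketch (not one of the book's figures) of the shift-reduce driver described above: the algorithm of figure 3.23 written in Python, with the SLR table of figure 3.22 for grammar 3.9 written out as a dictionary. The production numbering follows grammar 3.25, and the encoding of actions and all names are our own choices.

PRODUCTIONS = {          # production number -> (left-hand side, length of right-hand side)
    1: ("T", 1),         # T -> R
    2: ("T", 3),         # T -> a T c
    3: ("R", 0),         # R -> (empty)
    4: ("R", 2),         # R -> b R
}

TABLE = {                # (state, symbol) -> action; missing entries mean error
    (0, "a"): ("shift", 3), (0, "b"): ("shift", 4),
    (0, "c"): ("reduce", 3), (0, "$"): ("reduce", 3),
    (0, "T"): ("go", 1), (0, "R"): ("go", 2),
    (1, "$"): ("accept",),
    (2, "c"): ("reduce", 1), (2, "$"): ("reduce", 1),
    (3, "a"): ("shift", 3), (3, "b"): ("shift", 4),
    (3, "c"): ("reduce", 3), (3, "$"): ("reduce", 3),
    (3, "T"): ("go", 5), (3, "R"): ("go", 2),
    (4, "b"): ("shift", 4), (4, "c"): ("reduce", 3), (4, "$"): ("reduce", 3),
    (4, "R"): ("go", 6),
    (5, "c"): ("shift", 7),
    (6, "c"): ("reduce", 4), (6, "$"): ("reduce", 4),
    (7, "c"): ("reduce", 2), (7, "$"): ("reduce", 2),
}

def parse(string):
    # Return True if the string is generated by grammar 3.9, following figure 3.23.
    input_ = list(string) + ["$"]
    pos = 0
    stack = [0]                                    # the stack holds DFA states only
    while True:
        action = TABLE.get((stack[-1], input_[pos]))
        if action is None:
            return False                           # blank entry in the table: error
        if action[0] == "accept":
            return True
        if action[0] == "shift":
            stack.append(action[1])                # push the new state, read a symbol
            pos += 1
        else:                                      # reduce by production action[1]
            lhs, rhs_len = PRODUCTIONS[action[1]]
            del stack[len(stack) - rhs_len:]       # pop the right-hand side
            stack.append(TABLE[(stack[-1], lhs)][1])   # then follow the go action

print(parse("aabbbcc"))   # True, cf. the trace in figure 3.24
print(parse("abcc"))      # False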
SYNTAX ANALYSIS 92 0 a b c $ T R s3 s4 r3 r3 g1 g2 r1 r1 r3 r3 g5 g2 r3 r3 1 a 2 3 4 s3 s4 s4 5 s7 6 r4 r4 7 r2 r2 g6 Figure 3.22: SLR table for grammar 3.9 stack := empty ; push(0,stack) ; read(next) loop case table[top(stack),next] of shift s: push(s,stack) ; read(next) reduce p: n := the left-hand side of production p ; r := the number of symbols on the right-hand side of p ; pop r elements from the stack ; push(s,stack) where table[top(stack),n] = go s accept: error: endloop terminate with success reportError Figure 3.23: Algorithm for SLR parsing 3.14. CONSTRUCTING SLR PARSE TABLES input aabbbcc$ abbbcc$ bbbcc$ bbcc$ bcc$ cc$ cc$ cc$ cc$ cc$ cc$ c$ c$ $ $ stack action 0 s3 03 s3 033 s4 0334 s4 03344 s4 033444 r3 (R 0334446 r4 033446 r4 03346 r4 0332 r1 0335 s7 03357 r2 (T 035 s7 0357 r2 (T 01 accept 93 →) ; g6 (R → bR) ; g6 (R → bR) ; g6 (R → bR) ; g2 (T → R) ; g5 → aT c) ; g5 → aT c) ; g1 Figure 3.24: Example SLR parsing 0: 1: 2: 3: 4: T0 T T R R → → → → → T R aT c bR Grammar 3.25: Example grammar for SLR-table construction However, before we do this, we extend the grammar with a new starting production. Doing this to grammar 3.9 yields grammar 3.25. The next step is to make an NFA for each production. This is done exactly like in section 2.4, treating both terminals and nonterminals as alphabet symbols. The accepting state of each NFA is labelled with the number of the corresponding production. gure 3.26. The result is shown in Note that we have used the optimised construction for  (the empty production) as shown in gure 2.6. The NFAs in gure 3.26 make transitions both on terminals and non- shift actions and trango actions. A go action happens terminals. Transitions by terminal corresponds to sitions on nonterminals correspond to CHAPTER 3. SYNTAX ANALYSIS 94 Production T0 NFA  0 - A T- Bn   1 - C R- Dn   2 - E a- F T- G c- Hn   3 - In   4 - J b- K R- Ln  →T T →R T → aT c R→ R → bR Figure 3.26: NFAs for the productions in grammar 3.25 state epsilon-transitions A C, E C I, J F C, E K I, J Figure 3.27: Epsilon-transitions added to gure 3.26 after a reduction, whereby some elements of the stack (corresponding to the right-hand side of a production) are replaced by a nonterminal (corresponding to the left-hand side of that production). However, before we can do this, the symbols that form the right-hand side must be on the stack. To achieve this we must, whenever a transition by a nonterminal is possible, also allow transitions on the symbols on the right-hand side of a production for that nonterminal so these eventually can be reduced to the nonterminal. We do this by adding epsilon-transitions to the NFAs in gure 3.26: Whenever there is a transition from state a nonterminal N, we add epsilon-transitions from of all the NFAs for productions with N s s to state t on to the initial states on the left-hand side. Adding these graphically to gure 3.26 would make a very cluttered picture, so instead we simply note the transitions in a table, shown in gure 3.27. Together with these epsilon-transitions, the NFAs in gure 3.26 form 3.14. CONSTRUCTING SLR PARSE TABLES DFA NFA state states a 0 A, C, E, I, J s3 1 B 2 D 3 F, C, E, I, J 4 K, I, J 5 G 6 L 7 H s3 95 Transitions b c T R s4 g1 g2 s4 g5 s4 g2 g6 s7 Figure 3.28: SLR DFA for grammar 3.9 a single, combined NFA. This NFA has the starting state A (the starting state of the NFA for the added start production) and an accepting state for each production in the grammar. 
We must now convert this NFA into a DFA using the subset construction shown in section 2.5. Instead of showing the resulting DFA graphically, we construct a table where transitions on terminals are shown as on nonterminals as go actions. to gure 3.22, except that no shift actions and transitions This will make the table look similar reduce or accept actions are present yet. Figure 3.28 shows the DFA constructed from the NFA made by adding epsilon-transitions in 3.27 to gure 3.26. The set of NFA states that forms each DFA state is shown in the second column of the table in gure 3.28. We will need these below for adding reduce and accept actions, but once this is done we will not need then anymore, and we can remove then from the nal table. To add LOW reduce and accept actions, we rst need to compute the FOL- sets for each nonterminal, as described in section 3.10. For pur- pose of calculating T 00 → T 0 $, FOLLOW, we add yet another extra start production: to handle end-of-text conditions as described in section 3.10. This gives us the following result: FOLLOW(T 0 ) = {$} FOLLOW(T ) = {c, $} FOLLOW(R) = {c, $} We then add reduce If a DFA state s p : N → α, we FOLLOW(N ). Reduc- actions by the following rule: contains the accepting NFA state for a production add reduce p as action to s on all symbols in CHAPTER 3. SYNTAX ANALYSIS 96 tion on production 0 (the extra start production that was added before constructing the NFA) is written as accept. In gure 3.28, state 0 contains NFA state I, which accepts production 3. Hence, we add r3 as actions at the symbols FOLLOW(R)). 0. We add this at the symbol $ (FOLLOW(T written as c and $ (as these are in State 1 contains NFA state B, which accepts production accept 0 )). As noted above, this is (abbreviated to a). In the same way, we add reduce actions to state 3, 4, 6 and 7. The result is shown in gure 3.22. Figure 3.29 summarises the SLR construction. 1. Add the production S0 → S, where S is the start symbol of the grammar. 2. Make an NFA for the right-hand side of each production. s that has an outgoing transition on a nonterminal N , add epsilon-transitions from s to the starting states of the NFAs for the right-hand sides of the productions for N . 3. For each state 4. Convert the combined NFA to a DFA. Use the starting state of the NFA for the production added in step 1 as the starting state for the combined NFA. 5. Build a table cross-indexed by the DFA states and grammar symbols (terminals including $ and nonterminals). Add at transitions on terminals and go shift actions actions on transitions on non- terminals. 6. Calculate FOLLOW for each nonterminal. add one more start production: For this purpose, we S 00 → S 0 $. 7. When a DFA state contains an NFA state that accepts the right- p, add reduce p at all symFOLLOW(N ), where N is the nonterminal on the left of production p. If production p is the production added in step 1, the action is accept instead of reduce p. hand side of the production numbered bols in Figure 3.29: Summary of SLR parse-table construction 3.15. USING PRECEDENCE RULES IN LR PARSE TABLES 97 3.14.1 Conicts in SLR parse-tables When reduce actions are added to SLR parse-tables, we might add one to a place where there is already a shift action, or we may add reduce actions for several dierent productions to the same place. When either i.e., we shift-reduce conict and of this happens, we no longer have a unique choice of action, have a conict. the other case a The rst situation is called a reduce-reduce conict. 
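Returning to step 7 of figure 3.29, the addition of reduce and accept actions can be sketched as below (Python, not from the book; the data representations and names are assumptions made for the example). Note that the last lines detect exactly the conflicts just defined: an entry that is about to be overwritten signals a shift-reduce or reduce-reduce conflict.

def add_reduce_actions(table, dfa_states, accepting, lhs_of, follow):
    # dfa_states maps a DFA state number to its set of NFA states,
    # accepting maps an accepting NFA state to its production number,
    # lhs_of maps a production number to its left-hand-side nonterminal.
    for state, nfa_states in dfa_states.items():
        for nfa_state in nfa_states:
            if nfa_state in accepting:
                p = accepting[nfa_state]
                action = ("accept",) if p == 0 else ("reduce", p)
                for symbol in follow[lhs_of[p]]:
                    if (state, symbol) in table:
                        print("conflict in state", state, "on symbol", symbol)
                    table[(state, symbol)] = action
    return table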
Both may occur in the same place. Conicts are often caused by ambiguous grammars, but (as is the case for LL-parsers) even some non-ambiguous grammars may generate conicts. If a conict is caused by an ambiguous grammar, it is usually (but not always) possible to nd an equivalent unambiguous grammar. Methods for eliminating ambiguity were discussed in sections 3.4 and 3.5. Alternatively, operator precedence declarations may be used to disambiguate an ambiguous grammar, as we shall see in section 3.15. But even unambiguous grammars may in some cases generate conicts in SLR-tables. In some cases, it is still possible to rewrite the grammar to get around the problem, but in a few cases the language simply isn't SLR. Rewriting an unambiguous grammar to eliminate conicts is somewhat of an art. Investigation of the NFA states that form the problematic DFA state will often help identifying the exact nature of the problem, which is the rst step towards solving it. Sometimes, changing a production from left-recursive to right-recursive may help, even though left-recursion in general isn't a problem for SLR-parsers, as it is for LL(1)-parsers. 3.15 Using precedence rules in LR parse tables We saw in section 3.12.2, that the conict arising from the danglingelse ambiguity could be removed by removing one of the entries in the LL(1) parse table. Resolving ambiguity by deleting conicting actions can also be done in SLR-tables. In general, there are more cases where this can be done successfully for SLR-parsers than for LL(1)-parsers. In particular, ambiguity in expression grammars like grammar 3.2 can be eliminated this way in an SLR table, but not in an LL(1) table. Most LR-parser generators allow declarations of precedence and associativity for tokens used as inx-operators. These declarations are then used to eliminate conicts in the parse tables. There are several advantages to this approach: CHAPTER 3. SYNTAX ANALYSIS 98 • Ambiguous expression grammars are more compact and easier to read than unambiguous grammars in the style of section 3.4.1. • The parse tables constructed from ambiguous grammars are often smaller than tables produced from equivalent unambiguous grammars. • Parsing using ambiguous grammars is (slightly) faster, as fewer reductions of the form Exp2 → Exp3 etc. are required. Using precedence rules to eliminate conicts is very simple. Grammar 3.2 will generate several conicts: 1) A conict between shifting on + and reducing by the production 2) A conict between shifting on + and reducing by the production 3) A conict between shifting on * and reducing by the production 4) A conict between shifting on * and reducing by the production Exp → Exp + Exp. Exp → Exp * Exp. Exp → Exp + Exp. Exp → Exp * Exp. And several more of similar nature involving conicts. - and /, for a total of 16 Let us take each of the four conicts above in turn and see how precedence rules can be used to eliminate them. 1) This conict arises from expressions like a+b, the next input symbol is a reduce a+b, +. a+b+c. After having read We can now either choose to grouping around the rst addition before the second, or shift on the plus, which will later lead to b+c being reduced and hence grouping around the second addition before the rst. Since + is left-associative, we prefer the rst of these options and hence eliminate the shift-action from the table and keep the reduceaction. 2) The oending expressions here have the form a*b+c. 
Since we want multiplication to bind stronger than addition, we, again, prefer reduction over shifting. 3) In expressions of the form a+b*c, we, as before, want multiplication + to group stronger, so we do a shift to avoid grouping around the operator and, hence, eliminate the reduce-action from the table. 3.15. USING PRECEDENCE RULES IN LR PARSE TABLES 99 4) This case is identical to case 1, where a left-associative operator conicts with itself and is likewise handled by eliminating the shift. In general, elimination of conicts by operator precedence declarations can be summarised into the following rules: a) If the conict is between two operators of dierent priority, eliminate the action with the lowest priority operator in favour of the action with the highest priority. The operator associated with a reduce-action is the operator used in the production that is reduced. b) If the conict is between operators of the same priority, the associativity (which must be the same, as noted in section 3.4.1) of the operators is used: If the operators are left-associative, the shiftaction is eliminated and the reduce-action retained. If the operators are right-associative, the reduce-action is eliminated and the shift-action retained. If the operators are non-associative, both actions are eliminated. c) If there are several operators with declared precedence in the production that is used in a reduce-action, the last of these is used to determine the precedence of the reduce-action. 4 Prex and postx operators can be handled similarly. Associativity only applies to inx operators, so only the precedence of prex and postx operators matters. Note that only shift-reduce conicts are eliminated by the above rules. Some parser generators allow also reduce-reduce conicts to be eliminated by precedence rules (in which case the production with the highest-precedence operator is preferred), but this is not as obviously useful as the above. The dangling-else ambiguity (section 3.5) can also be eliminated using precedence rules: Giving else a higher precedence than then or giv- ing them the same precedence and making them right-associative will handle the problem, as either of these will make the parser shift on instead of reducing else. Stat → if Exp then Stat else when this is followed by Not all conicts should be eliminated by precedence rules. Excessive use of precedence rules may cause the parser to accept only a subset 4 Using several operators with declared priorities in the same production should be done with care. CHAPTER 3. SYNTAX ANALYSIS 100 of the intended language (i.e., if a necessary action is eliminated by a precedence rule). So, unless you know what you are doing, you should limit the use of precedence declarations to operators in expressions. 3.16 Using LR-parser generators Most LR-parser generators use an extended version of the SLR construction called LALR(1). In practice, however, there is little dierence between these, so a LALR(1) parser generator can be used with knowledge of SLR only. Most LR-parser generators organise their input in several sections: • Declarations of the terminals and nonterminals used. • Declaration of the start symbol of the grammar. • Declarations of operator precedence. • The productions of the grammar. • Declaration of various auxiliary functions and data-types used in the actions (see below). 3.16.1 Declarations and actions Each nonterminal and terminal is declared and associated with a datatype. 
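Returning to the precedence rules a)-c) above, they can be summarised in a small decision procedure. The sketch below (Python, not from the book) handles rules a) and b) for a shift-reduce conflict between two infix operators; the names and the encoding of priorities are assumptions made for the example.

def resolve_shift_reduce(shift_op, reduce_op, prec, assoc):
    # prec maps an operator to a numeric priority, assoc maps it to
    # "left", "right" or "none"; reduce_op is the (last) declared operator
    # in the production to be reduced, shift_op the operator to be shifted.
    if prec[shift_op] > prec[reduce_op]:
        return "shift"                 # rule a): keep the higher-priority action
    if prec[shift_op] < prec[reduce_op]:
        return "reduce"
    if assoc[reduce_op] == "left":     # rule b): same priority, use associativity
        return "reduce"
    if assoc[reduce_op] == "right":
        return "shift"
    return "error"                     # non-associative: eliminate both actions

With prec = {"+": 1, "*": 2} and both operators declared left-associative, the four conflicts listed for grammar 3.2 above are resolved to reduce, reduce, shift and reduce, respectively.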
For a terminal, the data-type is used to hold the values that are associated with the tokens that come from the lexer, numbers or names of identiers. e.g., the values of For a nonterminal, the type is used for the values that are built for the nonterminals during parsing (at reduce-actions). While, conceptually, parsing a string produces a syntax tree for that string, parser generators usually allow more control over what is actually produced. This is done by assigning an action to each production. The action is a piece of program text that is used to calculate the value of a reduced production by using the values associated with the symbols on the right-hand side. For example, by putting appropriate actions on each production, the numerical value of an expression may be calculated as the result of parsing the expression. Indeed, compilers can be made such that the value produced during parsing is the compiled code of a program. For all but the simplest compilers it is, however, better to build some kind of syntax representation during parsing and then later operate on this representation. 3.16. USING LR-PARSER GENERATORS 101 3.16.2 Abstract syntax The syntax trees described in section 3.3.1 are not always optimally suitable for compilation. They contain a lot of redundant information: Parentheses, keywords used for grouping purposes only, and so on. They also reect structures in the grammar that are only introduced to eliminate ambiguity or to get the grammar accepted by a parser generator (such as left-factorisation or elimination of left-recursion). stract syntax Hence, ab- is commonly used. Abstract syntax keeps the essence of the structure of the text but omits the irrelevant details. An abstract syntax tree is a tree structure where each node corresponds to one or more nodes in the (concrete) syntax tree. For example, the concrete syntax tree shown in gure 3.12 may be represented by the following abstract syntax tree: PlusExp NumExp(2) @ @ MulExp @ @ NumExp(3) NumExp(4) Here the names PlusExp, MulExp and NumExp may be constructors in a data-type or they may be elements from an enumerated type used as tags in a union-type. The names indicate which production is chosen, so there is no need to keep the subtrees that are implied by the choice of production, such as the subtree from gure refexpression-tree2 that +. Likewise, the sequence of nodes Exp, Exp2, Exp3, 2 at the left of gure refexpression-tree2 are combined to a single node NumExp(2) that includes both the choice of productions for Exp, Exp2 and Exp3 and the value of the terminal node. holds the symbol A compiler designer has much freedom in the choice of abstract syntax. Some use abstract syntax that retain alls of the structure of the concrete syntax trees plus additional positioning information used for error-reporting. Others prefer abstract syntax that contains only the information necessary for compilation, skipping parenthesis and other (for this purpose) irrelevant structure. Exactly how the abstract syntax tree is represented and built depends on the parser generator used. Normally, the action assigned to a production can access the values of the terminals and nonterminals on the right-hand side of a production through specially named variables CHAPTER 3. SYNTAX ANALYSIS 102 (often called $1, $2, etc.) and produces the value for the node corre- sponding to the left-hand-side either by assigning it to a special variable ($0) or letting it be the return value of the action. 
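As an illustration (not from the book), the abstract syntax discussed above could be declared as follows in a language with record-like class definitions; the class names follow the constructors NumExp, PlusExp and MulExp used in the text.

from dataclasses import dataclass

@dataclass
class NumExp:
    value: int

@dataclass
class PlusExp:
    left: object
    right: object

@dataclass
class MulExp:
    left: object
    right: object

# The abstract syntax tree shown above would then be built, e.g. by the
# actions of the productions involved, as:
tree = PlusExp(NumExp(2), MulExp(NumExp(3), NumExp(4)))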
The data structures used for building abstract syntax trees depend on the language. Most statically typed functional languages support tree-structured datatypes with named constructors. In such languages, it is natural to represent abstract syntax by one datatype per syntactic category (e.g., Exp above) and one constructor for each instance of the syntactic category (e.g., PlusExp, NumExp and MulExp above). In Pas- cal, each syntactic category can be represented by a variant record type and each instance as a variant of that. In C, a syntactic category can be represented by a union of structs, each struct representing an instance of the syntactic category and the union covering all possible instances. In object-oriented languages such as Java, a syntactic category can be represented as an abstract class or interface where each instance in a syntactic category is a concrete class that implements the abstract class or interface. In most cases, it is fairly simple to build abstract syntax using the actions for the productions in the grammar. It becomes complex only when the abstract syntax tree must have a structure that diers nontrivially from the concrete syntax tree. One example of this is if left-recursion has been eliminated for the purpose of making an LL(1) parser. The preferred abstract syntax tree will in most cases be similar to the concrete syntax tree of the original left-recursive grammar rather than that of the transformed grammar. As an example, the left-recursive grammar E → E + num E → num gets transformed by left-recursion elimination into E → num E 0 E 0 → + num E 0 E0 → Which yields a completely dierent syntax tree. We can use the actions assigned to the productions in the transformed grammar to build an abstract syntax tree that reects the structure in the original grammar. In the transformed grammar, tree with a hole. E0 should return an abstract syntax The intention is that this hole will eventually be lled by another abstract syntax tree: 3.16. USING LR-PARSER GENERATORS • The second production for E0 • In the rst production for E0, 103 returns just a hole. the + num terminals are used and to produce a tree for a plus-expression (i.e., a a hole in place of the rst subtree. PlusExp node) with This tree is used to ll the hole in the tree returned by the recursive use of E 0 , so the abstract syntax tree is essentially built outside-in. The result is a new tree with a hole. • In the production for E, the hole in the tree returned by the nonterminal is lled by a the value of the NumExp num terminal. E0 node with the number that is The best way of building trees with holes depends on the type of language used to implement the actions. Let us rst look at the case where a functional language is used. The actions shown below for the original grammar will build an abstract syntax tree similar to the one shown in the beginning of this section. E → E + num { PlusExp($1,NumExp($3)) } E → num { NumExp($1) } We now want to make actions for the transformed grammar that will produce the same abstract syntax trees as this will. In functional languages, an abstract syntax tree with a hole can be represented by a function. The function takes as argument what should be put into the hole and returns a syntax tree where the hole is lled with this argument. The hole is represented by the argument variable of the function. 
We can write this as actions to the transformed grammar: E → num E 0 { $2(NumExp($1)) } E 0 → + num E 0 { λx.$3(PlusExp(x,NumExp($2))) } E0 → { λx.x } where λx.e is a nameless function that takes the value of the expression e. x as argument and returns The empty production returns the identity function, which works like a top-level hole. The non-empty production for E0 applies the function $3 returned by the to a subtree, hence lling the hole in itself has a hole x, $3 E0 on the right-hand side by this subtree. The subtree which is lled when applying the function returned by the right-hand side. The production for E applies the function $2 CHAPTER 3. SYNTAX ANALYSIS 104 returned by E0 to a subtree that has no holes and, hence, returns a tree with no holes. λx.e is written as fn x => e, (lambda (x) e). In SML, Scheme as in Haskell as \x -> e and in The imperative version of the actions in the original grammar is E → E + num { $0 = PlusExp($1,NumExp($3)) } E → num { $0 = NumExp($1) } In this setting, NumExp and PlusExp aren't constructors but functions that allocate and build node and return pointers to these. Unnamed functions of the kind used in the above solution for functional languages can not be built in most imperative languages, so holes must be an explicit part of the data-type that is used to represent abstract syntax. These holes will be overwritten when the values are supplied. E0 will, hence, return a record holding both an abstract syntax tree (in a eld named tree) and a pointer to the hole that should be overwritten (in hole). As actions (using C-style notation), this becomes a eld named E → num E 0 E 0 → + num E 0 E0 → { $2->hole = NumExp($1); $0 = $2.tree } { $0.hole = makeHole(); $3->hole = PlusExp($0.hole,NumExp($2)); $0.tree = $3.tree } { $0.hole = makeHole(); $0.tree = $0.hole } This may look bad, but when using LR-parser generators, left-recursion removal is rarely needed, and parser generators based on LL(1) often do left-recursion removal automatically and transform the actions appropriately. An alternative approach is to let the parser build an intermediate (semi-abstract) syntax tree from the transformed grammar, and then let a separate pass restructure the intermediate syntax tree to produce the intended abstract syntax. 3.16.3 Conict handling in parser generators For all but the simplest grammars, the user of a parser generator should expect conicts to be reported when the grammar is rst presented to the parser generator. These conicts can be caused by ambiguity or by 3.16. USING LR-PARSER GENERATORS NFA-state A B C D E F G H I J K L 105 Textual representation T' -> . T T' -> T . T -> . R T -> R . T -> . aTc T -> a . Tc T -> aT . c T -> aTc . R -> . R -> . bR R -> b . R R -> bR . Figure 3.30: Textual representation of NFA states the limitations of the parsing method. In any case, the conicts can normally be eliminated by rewriting the grammar or by adding precedence declarations. Most parser generators can provide information that is useful to locate where in the grammar the problems are. When a parser generator reports conicts, it will tell in which state in the table these occur. This state can be written out in a (barely) human-readable form as a set of NFA-states. Since most parser generators rely on pure ASCII, they can not actually draw the NFAs as diagrams. Instead, they rely on the fact that each state in the NFA corresponds to a position in a production in the grammar. 
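A position of this kind is often called an item in the literature on LR parsing. A minimal sketch (Python, not from the book) of how such a position can be represented and printed is shown below; the representation of a production as a left-hand side plus a list of right-hand-side symbols is an assumption made for the example.

def show_item(lhs, rhs, dot):
    # e.g. show_item("T", ["a", "T", "c"], 1) returns "T -> a . T c",
    # and show_item("R", [], 0) returns "R -> ."
    symbols = rhs[:dot] + ["."] + rhs[dot:]
    return lhs + " -> " + " ".join(symbols)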
If we, for example, look at the NFA states in gure 3.26, these would be written as shown in gure 3.30. Note that a `.' is used to indicate the position of the state in the production. State 4 of the table in gure 3.28 will hence be written as R -> b . R R -> . R -> . bR The set of NFA states, combined with information about on which symbols a conict occurs, can be used to nd a remedy, e.g. by adding precedence declarations. If all eorts to get a grammar through a parser generator fails, a practical solution may be to change the grammar so it accepts a larger CHAPTER 3. SYNTAX ANALYSIS 106 language than the intended language and then post-process the syntax tree to reject false positives. This elimination can be done at the same time as type-checking (which, too, may reject programs). Some languages allow programs to declare precedence and associativity for user-dened operators. This can make it dicult to handle precedence during parsing, as the precedences are not known when the parser is generated. A typical solution is to parse all operators using the same precedence and then restructure the syntax tree afterwards, but see also exercise 3.20. 3.17 Properties of context-free languages In section 2.10, we described some properties of regular languages. Context-free languages share some, but not all, of these. For regular languages, deterministic (nite) automata cover exactly the same class of languages as nondeterministic automata. This is not the case for context-free languages: Nondeterministic stack automata do indeed cover all context-free languages, but deterministic stack automata cover only a strict subset. The subset of context-free languages that can be recognised by deterministic stack automata are called deterministic context-free languages. Deterministic context-free languages can be recognised by LR parsers. We have noted that the basic limitation of regular languages is niteness: A nite automaton can not count unboundedly and hence can not keep track of matching parentheses or similar properties. Context-free languages are capable of such counting, essentially using the stack for this purpose. Even so, there are limitations: A context-free language can only keep count of one thing at a time, so while it is possible (even {an bn | n ≥ 0} by a context-free gram{an bn cn | n ≥ 0} is not a context-free language. The trivial) to describe the language mar, the language information kept on the stack follows a strict LIFO order, which further restricts the languages that can be described. It is, for example, trivial to represent the language of palindromes (strings that read the same forwards and backwards) by a context-free grammar, but the language of strings that can be constructed by repeating a string twice is not context-free. Context-free languages are, as regular languages, closed under union: It is easy to construct a grammar for the union of two languages given grammars for each of these. Context-free languages are also closed under prex, sux, subsequence and reversal. Indeed, the language con- 3.18. FURTHER READING 107 sisting of all subsequences of a context-free language is actually regular. However, context-free languages are not closed under intersec- tion or complement. For example, the languages {an bn cm | m, n ≥ 0} m n n and {a b c | m, n ≥ 0} are both context-free while their intersection n n {a b cn | n ≥ 0} is not. 
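As a small illustration of the closure under union mentioned above (this example is not from the book): if S1 and S2 are the start symbols of grammars for languages L1 and L2, and the nonterminals of the two grammars are renamed so they do not clash, then adding the two productions

S → S1
S → S2

to the combined set of productions gives a grammar whose start symbol S generates exactly the union of L1 and L2.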
3.18 Further reading Context-free grammars were rst proposed as a notation for describing natural languages (e.g., English or French) by the linguist Noam Chomsky [13], who dened this as one of three grammar notations for this purpose. The qualier context-free distinguishes this notation from the other two grammar notations, which were called context-sensitive and unconstrained. In context-free grammars, derivation of a nonterminal is independent of the context in which the terminal occurs, whereas the context can restrict the set of derivations in a context-sensitive grammar. Unrestricted grammars can use the full power of a universal computer, so these represent all computable languages. Context-free grammars are actually too weak to describe natural languages, but were adopted for dening the Algol60 programming language [15]. Since then, variants of this notation has been used for dening or describing almost all programming languages. Some languages have been designed with specic parsing methods in mind: Pascal [19] has been designed for LL(1) parsing while C [23] was originally designed to t LALR(1) parsing, but this property was lost in subsequent versions of the language. Most parser generators are based on LALR(1) parsing, but a few use LL(1) parsing. An example of this is ANTLR (http://www.antlr.org/). The Dragon Book [4] tells more about parsing methods than the present book. Several textbooks exist that describe properties of context-free languages, e.g., [18]. The methods presented here for rewriting grammars based on operator precedence uses only inx operators. If prex or postx operators have higher precedence than all inx operators, the method presented here will work (with trivial modications), but if there are inx operators that have higher precedence than some prex or postx operators, it breaks down. A method for handling arbitrary precedences of inx, prex and postx operators is presented in [1]. CHAPTER 3. SYNTAX ANALYSIS 108 Exercises Exercise 3.1 Figures 3.7 and 3.8 show two dierent syntax trees for the string aabbbcc using grammar 3.4. Draw a third, dierent syntax tree for aabbbcc using the same grammar and show the left-derivation that corresponds to this syntax tree. Exercise 3.2 Draw the syntax tree for the string aabbbcc using grammar 3.9. Exercise 3.3 Write an unambiguous grammar for the language of balanced parentheses, i.e. the language that contains (among other) the sequences (i.e. the empty string)  () (()) ()() (()(())) but none of the following ( ) )( (() ()()) Exercise 3.4 Write grammars for each of the following languages: a) All sequences of bs as and bs that contain the same number of as and (in any order). b) All sequences of as c) All sequences of as and bs. and and bs bs that contain strictly more as than bs. that contain a dierent number of as 3.18. FURTHER READING d) All sequences of as and 109 bs that contain twice as many as as bs. Exercise 3.5 We extend the language of balanced parentheses from exercise 3.3 with [ two symbols: parentheses and and ]. [ corresponds to exactly two normal opening ] corresponds to exactly two normal closing parentheses. A string of mixed parentheses is legal if and only if the string produced by replacing [ by (( and ] by )) is a balanced parentheses sequence. Examples of legal strings are  ()() ((] [] [)(] [(]) a) Write a grammar that recognises this language. b) Draw the syntax trees for [)(] and [(]). 
Exercise 3.6 Show that the grammar A → −A A → A − id A → id is ambiguous by nding a string that has two dierent syntax trees. Now make two dierent unambiguous grammars for the same language: a) One where prex minus binds stronger than inx minus. b) One where inx minus binds stronger than prex minus. Show the syntax trees using the new grammars for the string you used to prove the original grammar ambiguous. CHAPTER 3. SYNTAX ANALYSIS 110 Exercise 3.7 In grammar 3.2, replace the operators − and / by < and :. These have the following precedence rules: < is non-associative and binds less tightly than + but more tightly than :. : is right-associative and binds less tightly than any other operator. Write an unambiguous grammar for this modied grammar using the method shown in section 3.4.1. Show the syntax tree for 6∗7 2:3<4+5: using the unambiguous grammar. Exercise 3.8 Extend grammar 3.13 with the productions Exp → id M atched → then calculate Nullable and FIRST for every production in the grammar. Add an extra start production as described in section 3.10 and calculate FOLLOW for every nonterminal in the grammar. Exercise 3.9 Calculate Nullable, FIRST and FOLLOW for the nonterminals A and B in the grammar A A B B → BAa → → bBc → AA Remember to extend the grammar with an extra start production when calculating FOLLOW. Exercise 3.10 Eliminate left-recursion from grammar 3.2. Exercise 3.11 Calculate Nullable and FIRST for every production in grammar 3.20. 3.18. FURTHER READING 111 Exercise 3.12 Exp0 → Exp $ to the grammar produced in exercise 3.10 and calculate FOLLOW for all nonterminals in the resulting Add a new start production grammar. Exercise 3.13 Make a LL(1) parser-table for the grammar produced in exercise 3.12. Exercise 3.14 Consider the following grammar for postx expressions: E → EE+ E → EE∗ E → num a) Eliminate left-recursion in the grammar. b) Do left-factorisation of the grammar produced in question a. c) Calculate Nullable, FIRST for every production and FOLLOW for every nonterminal in the grammar produced in question b. d) Make a LL(1) parse-table for the grammar produced in question b. Exercise 3.15 Extend grammar 3.11 with a new start production as shown in section 3.14 and calculate FOLLOW for every nonterminal. Remember to add an extra start production for the purpose of calculating FOLLOW as described in section 3.10. Exercise 3.16 Make NFAs (as in gure 3.26) for the productions in grammar 3.11 (after extending it as shown in section 3.14) and show the epsilon-transitions as in gure 3.27. Convert the combined NFA into an SLR DFA like the one in gure 3.28. Finally, add reduce and accept actions based on the FOLLOW sets calculated in exercise 3.15. CHAPTER 3. SYNTAX ANALYSIS 112 Exercise 3.17 Extend grammar 3.2 with a new start production as shown in section 3.14 and calculate FOLLOW for every nonterminal. Remember to add an extra start production for the purpose of calculating FOLLOW as described in section 3.10. Exercise 3.18 Make NFAs (as in gure 3.26) for the productions in grammar 3.2 (after extending it as shown in section 3.14) and show the epsilon-transitions as in gure 3.27. Convert the combined NFA into an SLR DFA like the one in gure 3.28. Add reduce actions based on the FOLLOW sets calculated in exercise 3.17. Eliminate the conicts in the table by using operator precedence rules as described in section 3.15. Compare the size of the table to that from exercise 3.16. 
Exercise 3.19 Consider the grammar T T T where -> → T -> T → T *T → int is considered a single terminal symbol. a) Add a new start production as shown in section 3.14. b) Calculate FOLLOW(T). Remember to add an extra start produc- tion. c) Construct an SLR parser-table for the grammar. d) Eliminate conicts using the following precedence rules:  * binds tighter than ->.  * is left-associative.  -> is right-associative. 3.18. FURTHER READING 113 Exercise 3.20 In section 3.16.3 it is mentioned that user-dened operator precedences in programming languages can be handled by parsing all operators with a single xed precedence and associativity and then using a separate pass to restructure the syntax tree to reect the declared precedences. Below are two other methods that have been used for this purpose: a) An ambiguous grammar is used and conicts exist in the SLR table. Whenever a conict arises during parsing, the parser consults a table of precedences to resolve this conict. The precedence table is extended whenever a precedence declaration is read. b) A terminal symbol is made for every possible precedence and associativity combination. A conict-free parse table is made either by writing an unambiguous grammar or by eliminating conicts in the usual way. The lexical analyser uses a table of precedences to assign the correct terminal symbol to each operator it reads. Compare all three methods. What are the advantages and disadvantages of each method?. Exercise 3.21 Consider the grammar A → aAa A → bAb A → a) Describe the language that the grammar denes. b) Is the grammar ambiguous? Justify your answer. c) Construct a SLR parse table for the grammar. d) Can the conicts in the table be eliminated? Exercise 3.22 The following ambiguous grammar describes boolean expressions: CHAPTER 3. SYNTAX ANALYSIS 114 B B B B B → → → → → true false B ∨ B B ∧ B ¬B a) Given that negation (¬) binds tighter than conjunction (∧) which binds tighter than disjunction (∨) and that conjunction and disjunction are both right-associative, rewrite the grammar to be unambiguous. b) Write a grammar that generates only true boolean expressions. Hint: Use the answer from question a) and add an additional nonterminal for false boolean expressions. Chapter 4 Symbol Tables 4.1 Introduction An important concept in programming languages is the ability to name objects such as variables, functions and types. Each such named object will have a declaration, where the name is dened as a synonym for the binding. Each name will also have a number of object. This is called uses, where the name is used as a reference to the object to which it is bound. Often, the declaration of a name has a limited scope: a portion of the local declarations, whereas a declaration that makes the declared name visible in the entire program is called global. It may happen that the same name program where the name will be visible. Such declarations are called is declared in several nested scopes. In this case, it is normal that the declaration closest to a use of the name will be the one that denes that particular use. In this context closest is related to the syntax tree of the program: The scope of a declaration will be a sub-tree of the syntax tree and nested declarations will give rise to scopes that are nested sub-trees. The closest declaration of a name is hence the declaration corresponding to the smallest sub-tree that encloses the use of the name. 
Scoping based in this way on the structure of the syntax tree is called static or lexical binding and is the most common scoping rule in modern programming languages. We will in the rest of this chapter (indeed, the rest of this book) assume that static binding is used. A few languages have dynamic binding, where the declaration that was most recently encountered during execution of the program denes the current use of the name. By its nature, dynamic binding can not be resolved at compile-time, so the techniques that in the rest of this chapter are 115 CHAPTER 4. SYMBOL TABLES 116 described as being used in a compiler will have to be used at run-time if the language uses dynamic binding. A compiler will need to keep track of names and the objects these are bound to, so that any use of a name will be attributed correctly to its declaration. This is typically done using a symbol table (or environment, as it is sometimes called). 4.2 Symbol tables A symbol table is a table that binds names to objects. We need a number of operations on symbol tables to accomplish this: empty • We need an • We need to be able to symbol table, in which no name is dened. bind a name to an object. In case the name is already dened in the symbol table, the new binding takes precedence over the old. • We need to be able to look up a name in a symbol table to nd the object the name is bound to. If the name is not dened in the symbol table, we need to be told that. • We need to be able to • We need to be able to enter a new scope. exit a scope, reestablishing the symbol table to what it was before the scope was entered. 4.2.1 Implementation of symbol tables There are many ways to implement symbol tables, but the most important distinction between these is how scopes are handled. This may be persistent (or functional) data structure, or it may be done imperative (or destructively-updated) data structure. done using a using an A persistent data structure has the property that no operation on the structure will destroy it. Conceptually, a new copy is made of the data structure whenever an operation updates it, hence preserving the old structure unchanged. This means that it is trivial to reestablish the old symbol table when exiting a scope, as it has been preserved by the persistent nature of the data structure. In practice, only a small portion of the data structure is copied, most is shared with the previous version. In the imperative approach, only one copy of the symbol table exists, so explicit actions are required to store the information needed to restore 4.2. SYMBOL TABLES 117 the symbol table to a previous state. This can be done by using a stack. When an update is made, the old binding of a name that is overwritten is recorded (pushed) on the stack. When a new scope is entered, a marker is pushed on the stack. When the scope is exited, the bindings on the stack (down to the marker) are used to reestablish the old symbol table. The bindings and the marker are popped o the stack in the process, returning the stack to the state it was in before the scope was entered. Below, we will look at simple implementations of both approaches and discuss how more advanced approaches can overcome some of the eciency problems with the simple approaches. 4.2.2 Simple persistent symbol tables In functional languages like SML, Scheme or Haskell, persistent data structures are the norm rather than the exception (which is why persistent data structures are sometimes called functional). 
For example, when a new element is added to a list or an element is taken o the head of the list, the old list still exists and can be used elsewhere. A list is a natural way to implement a symbol table in a functional language: A binding is a pair of a name and its associated object, and a symbol table is a list of such pairs. The operations are implemented in the following way: empty: An empty symbol table is an empty list. binding: A new binding (name/object pair) is added (cons'ed) to the front of the list. lookup: The list is searched until a matching name is found. The object paired with the name is then returned. If the end of the list is reached, an indication that this happened is returned instead. This indication can be made by raising an exception or by letting the lookup function return a type that can hold both objects and errorindications, enter: exit: i.e., a sum-type. The old list is remembered, The old list is recalled, i.e., i.e., a reference is made to it. the above reference is used. The latter two operations are not really explicit operations. Entering and exiting a scope is done by binding a symbol table to a name before entering a new scope and then referring to this name again after the scope is exited. CHAPTER 4. SYMBOL TABLES 118 As new bindings are added to the front of the list, these will automatically take precedence over old bindings as the list is searched from the front to the back. Another functional approach to symbol tables is using functions: A symbol table is quite naturally seen as a function from names to objects. The operations are: empty: An empty symbol table is a function that returns an error in- dication (or raises an exception) no matter what its argument is. binding: Adding a binding of the name table t n to the object o in a symbol is done by dening a new symbol-table function t0 in terms t and the new binding. When t0 is called with a name n1 as argu0 ment, it compares n1 to n. If they are equal, t returns the object o. Otherwise, t0 calls t with n1 as argument and returns the result that this call yields. lookup: The symbol-table function is called with the name as argu- ment. enter: exit: The old function is remembered (referenced). The old function is recalled (by using a reference). Again, the latter two operations are mostly implicit. 4.2.3 A simple imperative symbol table Imperative symbol tables are natural to use if the compiler is written in an imperative language. A simple imperative symbol table can be implemented as a stack, which works in a way similar to the list-based functional implementation: empty: An empty symbol table is an empty stack. binding: A new binding (name/object pair) is pushed on top of the stack. lookup: The stack is searched top-to-bottom until a matching name is found. The object paired with the name is then returned. If the bottom of the stack is reached, we instead return an errorindication. enter: The top-of-stack pointer is remembered. 4.2. SYMBOL TABLES exit: 119 The old top-of-stack pointer is recalled and becomes the current. This is not quite a persistent data structure, as leaving a scope will destroy its symbol table. For most languages, this won't matter, as a scope isn't needed again after it is exited. If this is not the case, a real persistent symbol table must be used, or the needed parts of the symbol table must be stored for later retrieval before exiting a scope. 
4.2.4 Eciency issues While all of the above implementations are simple, they all share the same eciency problem: Lookup is done by linear search, so the worstcase time for lookup is proportional to the size of the symbol table. This is mostly a problem in relation to libraries: It is quite common for a program to use libraries that dene literally hundreds of names. A common solution to this problem is hashing: Names are hashed (processed) into integers, which are used to index an array. Each array element is then a linear list of the bindings of names that share the same hash code. Given a large enough hash table, these lists will typically be very short, so lookup time is basically constant. Using hash tables complicates entering and exiting scopes somewhat. While each element of the hash table is a list that can be handled like in the simple cases, doing this for all the array-elements at every entry and exit imposes a major overhead. Instead, it is typical for imperative implementations to use a single stack to record all updates to the table such that they can be undone in time proportional to the number of updates that were done in the local scope. Functional implementations typically use persistent hash-tables, which eliminates the problem. 4.2.5 Shared or separate name spaces In some languages (like C) a variable and a function in the same scope may have the same name, as the context of use will make it clear whether a variable or a function is used. have separate name spaces, We say that functions and variables which means that dening a name in one space doesn't aect the other. In other languages (e.g. Pascal or SML) the context can not (easily) distinguish variables from functions. Hence, declaring a local variable might hide a function declared in an outer scope or vice versa. These languages have a shared name space for variables and functions. Name spaces may be shared or separate for all the kinds of names that can appear in a program, e.g., variables, functions, types, excep- CHAPTER 4. SYMBOL TABLES 120 tions, constructors, classes, eld selectors etc. Sharing can exist between any subsets of these name spaces, and which name spaces are shared is language-dependent. Separate name spaces are easily implemented using a symbol table per name space, whereas shared name spaces naturally share a single symbol table. However, it is sometimes convenient to use a single symbol table even if there are separate name spaces. This can be done fairly easily by adding name-space indicators to the names. A name-space indicator can be a textual prex to the name or it may be a tag that is paired with the name. In either case, a lookup in the symbol table must match both name and name-space indicator of the symbol that is looked up with the entry in the table. 4.3 Further reading Most algorithms-and-data-structures textbooks include descriptions of methods for hashing strings and implementing hash tables. A description of ecient persistent data structures for functional languages can be found in [32]. Exercises Exercise 4.1 Pick some programming language that you know well and determine which of the following objects share name spaces: Variables, functions, procedures and types. If there are more kinds of named objects (labels, data constructors, modules, etc.) in the language, include these in the investigation. Exercise 4.2 Implement, in a programming language of your choice, data structures and operations (empty, binding, lookup, enter and exit) for simple symbol tables. 
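A minimal sketch (Python, not from the book) of the list-based persistent symbol table described in section 4.2.2, which may serve as a starting point for this exercise; it uses nested pairs as immutable cons-cells, so older versions of the table are never destroyed.

empty = None                                # the empty symbol table

def bind(table, name, obj):
    return ((name, obj), table)             # new binding in front of the old table

def lookup(table, name):
    while table is not None:
        (n, obj), table = table
        if n == name:
            return obj                      # the newest binding takes precedence
    return "unbound"

# enter/exit are implicit: keep a reference to the table as it was before
# the scope was entered and simply use that reference again afterwards.
inner = bind(bind(empty, "x", "int"), "x", "bool")
assert lookup(inner, "x") == "bool" and lookup(inner, "y") == "unbound"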
Exercise 4.3 In some programming languages, identiers are case-insensitive so, size and SiZe e.g., refer to the same identier. Describe how symbol tables can be made case-insensitive. Chapter 5 Type Checking 5.1 Introduction Lexing and parsing will reject many texts as not being correct programs. However, many languages have well-formedness requirements that can not be handled exclusively by the techniques seen so far. These requirements can, for example, be static type correctness or a requirement that pattern-matching or case-statements are exhaustive. These properties are most often not context-free, i.e., they can not be checked by membership of a context-free language. Consequently, they are checked by a phase that (conceptually) comes after syntax analysis (though it may be interleaved with it). These checks may happen in a phase that does nothing else, or they may be combined with the actual translation. Often, the translator may exploit or depend on type information, which makes it natural to combine calculation of types with the actual translation. We will here, for the sake of exposition, assume that a separate phase is used for type checking and related checks, and similarly assume that any information gathered by this phase is available in subsequent phases. 5.2 Attributes The checking phase operates on the abstract syntax tree of the program and may make several passes over this. Typically, each pass is a recursive walk over the syntax tree, gathering information or using information gathered in earlier passes. Such information is often called attributes of the syntax tree. Typically, we distinguish between two types of attributes: Synthesised attributes are passed upwards in the syntax 121 CHAPTER 5. TYPE CHECKING 122 tree, from the leaves up to the root. Inherited attributes are, conversely, passed downwards in the syntax tree. Note, however, that information that is synthesised in one subtree may be inherited by another subtree or, in a later pass, by the same subtree. An example of this is a symbol table: This is synthesised by a declaration and inherited by the scope of the declaration. When declarations are recursive, the scope may be the same syntax tree as the declaration itself, in which case one pass over this tree will build the symbol table as a synthesised attribute while a second pass will use it as an inherited attribute. Typically, each syntactical category (represented by a type in the data structure for the abstract syntax tree or by a group of related nonterminals in the grammar) will have its own set of attributes. When we write a checker as a set of mutually recursive functions, there will be one or more such functions for each syntactical category. Each of these functions will take inherited attributes (including the syntax tree itself ) as arguments and return synthesised attributes as results. We will, in this chapter, focus on type checking, and only briey mention other properties that can be checked. The methods used for type checking can in most cases easily be modied to handle such other checks. 5.3 A small example language We will use a small (somewhat contrived) language to show the principles of type checking. The language is a rst-order functional language with recursive denitions. The syntax is given in grammar 5.1. The shown grammar is clearly ambiguous, but that doesn't matter since we operate on the abstract syntax, where such ambiguities have been resolved. In the example language, a program is a list of function declarations. 
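For illustration (this program is not part of the book), a small well-typed program in the example language could look like this:

int addthree(int x) = let y = 3 in x + y
bool iszero(int n) = n = 0
int choose(bool b, int x, int y) = if b then x else y
int compute(int a) = if iszero(a) then addthree(a) else choose(a = 3, a, a + a)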
The functions are all mutually recursive, and no function may be declared more than once. Each function declares its result type and the types and names of its arguments. There may not be repetitions in the list of parameters for a function. Functions and variables have separate name spaces. The body of a function is an expression, which may be an integer constant, a variable name, a sum-expression, a comparison, a conditional, a function call or an expression with a local declaration. Comparison is dened both on booleans and integers, but addition only on integers. 5.3. A SMALL EXAMPLE LANGUAGE P rogram → F uns F uns F uns → F un → F un F uns F un → T ypeId ( T ypeIds ) = Exp T ypeId T ypeId → int id → bool id T ypeIds T ypeIds → T ypeId → T ypeId , T ypeIds Exp Exp Exp Exp Exp Exp Exp → → → → → → → Exps Exps → Exp → Exp , Exps num id Exp + Exp Exp = Exp if Exp then Exp else Exp id ( Exps ) let id = Exp in Exp Grammar 5.1: Example language for type checking 123 CHAPTER 5. TYPE CHECKING 124 5.4 Environments for type checking In order to type check the program, we need symbol tables that bind variables and functions to their types. Since there are separate name spaces for variables and functions, we will use two symbol tables, one for variables and one for functions. A variable is bound to one of the two types int or bool. A function is bound to its type, which consists of the types of its arguments and the type of its result. Function types are written as a parenthesised list of the argument types, an arrow and e.g., (int,bool) → int for a function taking two patype int and bool, respectively) and returning an integer. the result type, rameters (of 5.5 Type checking expressions When we type check expressions, the symbol tables for variables and functions are inherited attributes. The type (int or sion is returned as a synthesised attribute. bool) of the expres- To make the presentation independent of any specic data structure for abstract syntax, we will let the type checker function use a notation similar to the concrete syntax for pattern-matching purposes. But you should still think of it as abstract syntax, so all issues of ambiguity etc. have been resolved. For terminals (variable names and numeric constants) with attributes, we assume that there are predened functions for extracting these. Hence, id has an associated function name, that extracts the name of the identier. Similarly, num has a function value, that returns the value of the number. The latter is not required for type checking, though, but we will use it in chapter 6. For each nonterminal, we dene one or more functions that take an abstract syntax subtree and inherited attributes as arguments and return the synthesised attributes. In gure 5.2, we show the type-checking function for expressions. The CheckExp . The symbol table for variables is given by the parameter vtable, and the symbol table for functions by the parameter ftable. The function error reports a type function for type checking expressions is called error. To allow the type checker to continue and report more than one error, we let the error-reporting function return. After reporting a type error, the type checker can make a guess at what the type should have been and return this guess, allowing the type checking to continue. This guess might, however, be wrong, which can cause spurious type errors to be reported later on. Hence, all but the rst type error message should 5.5. 
TYPE CHECKING EXPRESSIONS CheckExp (Exp, vtable, f table) = case Exp of num int id t = lookup(vtable, name(id)) if t = unbound then error(); int else t Exp1 + Exp2 t1 = CheckExp (Exp1 , vtable, f table) t2 = CheckExp (Exp2 , vtable, f table) if t1 = int and t2 = int then int else error(); int Exp1 = Exp2 t1 = CheckExp (Exp1 , vtable, f table) t2 = CheckExp (Exp2 , vtable, f table) if t1 = t2 then bool else error(); bool if Exp1 t1 = CheckExp (Exp1 , vtable, f table) then Exp2 t2 = CheckExp (Exp2 , vtable, f table) else Exp3 t3 = CheckExp (Exp3 , vtable, f table) if t1 = bool and t2 = t3 then t2 else error(); t2 id ( Exps ) t = lookup(f table, name(id)) if t = unbound then error(); int else ((t1 , . . . , tn ) → t0 ) = t [t01 , . . . , t0m ] = CheckExps (Exps, vtable, f table) if m = n and t1 = t01 , . . . , tn = t0n then t0 else error(); t0 let id = Exp1 t1 = CheckExp (Exp1 , vtable, f table) in Exp2 vtable0 = bind(vtable, name(id), t1 ) CheckExp (Exp2 , vtable0 , f table) CheckExps (Exps, vtable, f table) = case Exps of Exp [CheckExp (Exp, vtable, f table)] Exp , Exps CheckExp (Exp, vtable, f table) :: CheckExps (Exps, vtable, f table) Figure 5.2: Type checking of expressions 125 CHAPTER 5. TYPE CHECKING 126 be taken with a grain of salt. We will briey explain each of the cases handled by CheckExp . int. • A number has type • The type of a variable is found by looking its name up in the symbol table for variables. If the variable is not found in the symbol table, the lookup-function returns the special value unbound. When this happens, an error is reported and the type checker arbitrarily guesses that the type is type returned by • lookup. int. Otherwise, it returns the A plus-expression requires both arguments to be integers and has an integer result. • Comparison requires that the arguments have the same type. In either case, the result is a boolean. • In a conditional expression, the condition must be of type bool and the two branches must have identical types. The result of a condition is the value of one of the branches, so it has the same type as these. If the branches have dierent types, the type checker reports an error and arbitrarily chooses the type of the then-branch as its guess for the type of the whole expression. • At a function call, the function name is looked up in the function environment to nd the number and types of the arguments as well as the return type. The number of arguments to the call must coincide with the expected number and their types must match the declared types. function. The resulting type is the return-type of the If the function name isn't found in ftable, an error is reported and the type checker arbitrarily guesses the result type to be • A int. let-expression declares a new variable, the type of which is that of the expression that denes the value of the variable. The symbol table for variables is extended using the function bind, and the extended table is used for checking the body-expression and nding its type, which in turn is the type of the whole expression. A let-expression can not in itself be the cause of a type error (though its parts may), so no testing is done. Since CheckExp Exps and its related typeincluded CheckExps in gure 5.2. mentions the nonterminal checking function CheckExps , we have 5.6. TYPE CHECKING OF FUNCTION DECLARATIONS CheckExps 127 builds a list of the types of the expressions in the expres- sion list. The notation is taken from SML: A list is written in square brackets with commas between the elements. 
The operator :: adds an element to the front of a list. 5.6 Type checking of function declarations A function declaration explicitly declares the types of the arguments. This information is used to build a symbol table for variables, which is used when type checking the body of the function. The type of the body must match the declared result type of the function. function for functions, CheckF un , The type check has as inherited attribute the symbol table for functions, which is passed down to the type check function for CheckF un returns no information, it just checks for internal errors. CheckF un is shown in gure 5.3, along with the functions for T ypeId and T ypeIds, which it uses. The function GetT ypeId just returns a pair of the declared name and type, and CheckT ypeIds builds a symbol table from such pairs. CheckT ypeIds also checks if all parameters have dierent names. emptytable is an empty symbol table. Looking any name up in the empty symbol table returns unbound. expressions. 5.7 Type checking a program A program is a list of functions and is deemed type correct if all the functions are type correct, and there are no two function denitions dening the same function name. Since all functions are mutually recursive, each of these must be type checked using a symbol table where all functions are bound to their type. This requires two passes over the list of functions: One to build the symbol table and one to check the function denitions using this table. Hence, we need two functions operating over F uns and two functions operating over F un. We have already seen CheckF un . The other, GetF un , returns the pair of one of the latter, the function's declared name and type, which consists of the types of the arguments and the type of the result. It uses an auxiliary function GetT ypes to nd the types of the arguments. The two functions for the F uns are GetF uns , which builds the symbol table and checks for duplicate denitions, and CheckF uns , which calls CheckF un for all functions. These functions and the main function CheckP rogram , syntactic category which ties the loose ends, are shown in gure 5.4. This completes type checking of our small example language. 128 CHAPTER 5. TYPE CHECKING CheckF un (F un, f table) = case F un of T ypeId ( T ypeIds ) = Exp (x, t0 ) = GetT ypeId (T ypeId) vtable = CheckT ypeIds (T ypeIds) t1 = CheckExp (Exp, vtable, f table) if t0 6= t1 then error() GetT ypeId (T ypeId) = case T ypeId of int id (name(id), int) bool id (name(id), bool) CheckT ypeIds (T ypeIds) = case T ypeIds of T ypeId (x, t) = GetT ypeId (T ypeId) bind(emptytable, x, t) T ypeId , T ypeIds (x, t) = GetT ypeId (T ypeId) vtable = CheckT ypeIds (T ypeIds) if lookup(vtable, x) = unbound then bind(vtable, x, t) else error(); vtable Figure 5.3: Type checking a function declaration 5.7. TYPE CHECKING A PROGRAM CheckP rogram (P rogram) = case P rogram of F uns f table = GetF uns (F uns) CheckF uns (F uns, f table) GetF uns (F uns) = case F uns of F un (f, t) = GetF un (F un) bind(emptytable, f, t) F un F uns (f, t) = GetF un (F un) f table = GetF uns (F uns) if lookup(f table, f ) = unbound then bind(f table, f, t) else error(); f table GetF un (F un) = case F un of T ypeId ( T ypeIds ) = Exp (f, t0 ) = GetT ypeId (T ypeId) [t1 , . . . , tn ] = GetT ypes (T ypeIds) (f, (t1 , . . . , tn ) → t0 ) GetT ypes (T ypeIds) = case T ypeIds of T ypeId (x, t) = GetT ypeId (T ypeId) [t] T ypeId T ypeIds (x1 , t1 ) = GetT ypeId (T ypeId) [t2 , . . . , tn ] = GetT ypes (T ypeIds) [t1 , t2 , . . . 
, tn ] CheckF uns (F uns, f table) = case F uns of CheckF un (F un, f table) F un F un F uns CheckF un (F un, f table) CheckF uns (F uns, f table) Figure 5.4: Type checking a program 129 CHAPTER 5. TYPE CHECKING 130 5.8 Advanced type checking Our example language is very simple and obviously doesn't cover all aspects of type checking. A few examples of other features and brief explanations of how they can be handled are listed below. Assignments. When a variable is given a value by an assignment, it must be veried that the type of the value is the same as the declared type of the variable. Some compilers may check if a variable is potentially used before it is given a value, or if a variable is not used after its assignment. While not exactly type errors, such behaviour is likely to be undesirable. Testing for such behaviour does, however, require somewhat more complicated analysis than the simple type checking presented in this chapter, as it relies on non-structural information. Data structures. components (e.g., a A data structure may dene a value with several struct, tuple or record), or a value that may be of union, variant or sum). To type dierent types at dierent times (e.g., a check such structures, the type checker must be able to represent their types. Hence, the type checker may need a data structure that describes complex types. This may be similar to the data structure used for the abstract syntax trees of declarations. Operations that build or take apart structured data need to be tested for correctness. If each operation on structured data has well-dened types for its arguments and a type for its result, this can be done in a way similar to how function calls are tested. Overloading. Overloading means that the same name is used for sev- eral dierent operations over several dierent types. We saw a simple example of this in the example language, where = was used both for com- paring integers and booleans. In many languages, arithmetic operators like + and − are dened both over integers and oating point numbers, and possibly other types as well. If these operators are predened, and there is only a nite number of cases they cover, all the possible cases may be tried in turn, just like in our example. This, however, requires that the dierent instances of the operator have disjoint argument types. If, for example, there is a function read that reads a value from a text stream and this is dened to read either integers or oating point numbers, the argument (the text stream) alone can not be used to select the right operator. Hence, the type checker 5.8. ADVANCED TYPE CHECKING 131 must pass the expected type of each expression down as an inherited attribute, so this (possibly in combination with the types of the arguments) can be used to pick the correct instance of the overloaded operator. It may not always be possible to send down an expected type due to lack of information. In our example language, this is the case for = (as these may be either int or bool) and the rst let-expression ( since the variable bound in the let- the arguments to expression in a expression is not declared to be a specic type). If the type checker for this or some other reason is unable to pick a unique operator, it may report unresolved overloading as a type error, or it may pick a default instance. Type conversion. A language may have operators for converting a value of one type to a value of another type, point number. e.g. 
an integer to a oating Sometimes these operators are explicit in the program and hence easy to check. However, many languages allow implicit conversion of integers to oats, such that, for example, 3+3.12 is well-typed with the implicit assumption that the integer 3 is converted to a oat before the addition. This can be handled as follows: If the type checker discovers that the arguments to an operator do not have the correct type, it can try to convert one or both arguments to see if this helps. If there is a small number of predened legal conversions, this is no major problem. However, a combination of user-dened overloaded operators and user-dened types with conversions can make the type-checking process quite dicult, as the information needed to choose correctly may not be available at compile-time. This is typically the case in object-oriented languages, where method selection is often done at run-time. We will not go into details of how this can be done. Polymorphism / Generic types. polymorphic or generic, similar types, e.g. over all Some languages allow a function to be that is, to be dened over a large class of arrays no matter what the types of the elements are. A function can explicitly declare which parts of the type is generic/polymorphic or this can be implicit (see below). The type checker can insert the actual types at every use of the generic/polymorphic function to create instances of the generic/polymorphic type. This mech- anism is dierent from overloading as the instances will be related by a common generic type and because a polymorphic/generic function can be instantiated by any type, not just by a limited list of declared alternatives as is the case with overloading. CHAPTER 5. TYPE CHECKING 132 Implicit types. Some languages (like Standard ML and Haskell) re- quire programs to be well-typed, but do not require explicit type declarations for variables or functions. For such to work, a type inference algorithm is used. A type inference algorithm gathers information about uses of functions and variables and uses this information to infer the types of these. If there are inconsistent uses of a variable, a type error is reported. 5.9 Further reading Overloading of operators and functions is described in section 6.5 of [5]. Section 6.7 of same describes how polymorphism can be handled. Some theory and a more detailed algorithm for inferring types in a language with implicit types and polymorphism can be found in [27]. Exercises Exercise 5.1 Add the productions Exp → floatconst T ypeId → float id to grammar 5.1. This introduces oating-point numbers to the language. The operator + is overloaded so it can do integer addition or oating- point addition, and = is extended so it can also compare oating point numbers for equality. a) Extend the type checking functions in gures 5.2-5.4 to handle these extensions. b) We now add implicit conversion of integers to oats to the language, using the rules: Whenever an operator has one integer argument and one oating-point argument, the integer is converted to a oat. Similarly, if an if-then-else expression has one in- teger branch and one oating-point branch, the integer branch is converted to oating-point. Extend the type checking functions from question a) above to handle this. 5.9. FURTHER READING 133 Exercise 5.2 The type check function in gure 5.2 tries to guess the correct type when there is a type error. In some cases, the guess is arbitrarily chosen to be int, which may lead to spurious type errors later on. 
A way around this is to have an extra type: unknown, which is only used during type checking. If there is a type error and there is no basis for guessing a correct type, unknown is returned (the error is still reported, though). If an argument to an operator is of type unknown, the type checker should not report this as a type error but continue as if the type is correct. The use of an unknown argument to an operator may make the result unknown as well, so these can be propagated arbitrarily far. Change figure 5.2 to use the unknown type as described above.

Exercise 5.3

We look at a simple language with an exception mechanism:

S → throw id
S → S catch id ⇒ S
S → S or S
S → other

A throw statement throws a named exception. This is caught by the nearest enclosing catch statement (i.e., where the throw statement is in the left sub-statement of the catch statement) using the same name, whereby the statement after the arrow in the catch statement is executed. An or statement is a non-deterministic choice between the two statements, so either one can be executed. other is a statement that does not throw any exceptions.

We want the type checker to ensure that all possible exceptions are caught and that no catch statement is superfluous, i.e., that the exception it catches can, in fact, be thrown by its left sub-statement. Write type-check functions that implement these checks. Hint: Let the type of a statement be the set of possible exceptions it can throw.

Chapter 6

Intermediate-Code Generation

6.1 Introduction

The final goal of a compiler is to get programs written in a high-level language to run on a computer. This means that, eventually, the program will have to be expressed as machine code which can run on the computer. This doesn't mean that we need to translate directly from the high-level abstract syntax to machine code. Many compilers use a medium-level language as a stepping-stone between the high-level language and the very low-level machine code. Such stepping-stone languages are called intermediate code.

Apart from structuring the compiler into smaller jobs, using an intermediate language has other advantages:

• If the compiler needs to generate code for several different machine architectures, only one translation to intermediate code is needed. Only the translation from intermediate code to machine language (i.e., the back-end) needs to be written in several versions.

• If several high-level languages need to be compiled, only the translation to intermediate code needs to be written for each language. They can all share the back-end, i.e., the translation from intermediate code to machine code.

• Instead of translating the intermediate language to machine code, it can be interpreted by a small program written in machine code or a language for which a compiler already exists.

The advantage of using an intermediate language is most obvious if many languages are to be compiled to many machines. If translation is done directly, the number of compilers is equal to the product of the number of languages and the number of machines. If a common intermediate language is used, one front-end (i.e., a compiler to intermediate code) is needed for every language and one back-end is needed for each machine, making the total equal to the sum of the number of languages and the number of machines.
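For example, compiling five source languages to four machine architectures directly requires 5 × 4 = 20 compilers, whereas with a shared intermediate language only 5 front-ends and 4 back-ends, i.e., 9 components in total, need to be written.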
If an interpreter for the intermediate language is written in a language for which there already exist compilers on the target machines, the interpreter can be compiled on each of these. This way, there is no need to write a separate back-end for each machine. The advantages of this approach are: • No actual back-end needs to be written for each new machine. • A compiled program can be distributed in a single intermediate form for all machines, as opposed to shipping separate binaries for each machine. • The intermediate form may be more compact than machine code. This saves space both in distribution and on the machine that executes the programs (though the latter is somewhat oset by requiring the interpreter to be kept in memory during execution). The disadvantage is speed: Interpreting the intermediate form will in most cases be a lot slower than executing translated code directly. Nevertheless, the approach has seen some success, e.g., with Java. Some of the speed penalty can be eliminated by translating the intermediate code to machine code immediately before or during execution of the program. This hybrid form is called just-in-time compilation and is often used for executing the intermediate code for Java. We will in this book, however, focus mainly on using the intermediate code for traditional compilation, where the intermediate form will be translated to machine code by a the back-end of the compiler. 6.2 Choosing an intermediate language An intermediate language should, ideally, have the following properties: • It should be easy to translate from a high-level language to the intermediate language. This should be the case for a wide range of dierent source languages. 6.2. CHOOSING AN INTERMEDIATE LANGUAGE • 137 It should be easy to translate from the intermediate language to machine code. This should be true for a wide range of dierent target architectures. • The intermediate format should be suitable for optimisations. The rst two of these properties can be somewhat hard to reconcile. A language that is intended as target for translation from a high-level language should be fairly close to this. However, this may be hard to achieve for more than a small number of similar languages. Furthermore, a high-level intermediate language puts more burden on the back-ends. A low-level intermediate language may make it easy to write back-ends, but puts more burden on the front-ends. A low-level intermediate language, also, may not t all machines equally well, though this is usually less of a problem than the similar problem for front-ends, as machines typically are more similar than high-level languages. A solution that may reduce the translation burden, though it doesn't address the other problems, is to have two intermediate levels: One, which is fairly high-level, is used for the front-ends and the other, which is fairly low-level, is used for the back-ends. A single shared translator is then used to translate between these two intermediate formats. When the intermediate format is shared between many compilers, it makes sense to do as many optimisations as possible on the intermediate format. This way, the (often substantial) eort of writing good optimisations is done only once instead of in every compiler. Another thing to consider when choosing an intermediate language is the granularity: Should an operation in the intermediate language correspond to a large amount of work or to a small amount of work? 
The rst of these approaches is often used when the intermediate language is interpreted, as the overhead of decoding instructions is amortised over more actual work, but it can also be used for compiling. In this case, each intermediate-code operation is typically translated into a sequence of machine-code instructions. When coarse-grained intermediate code is used, there is typically a fairly large number of dierent intermediate-code operations. The opposite approach is to let each intermediate-code operation be as small as possible. This means that each intermediate-code operation is typically translated into a single machine-code instruction or that several intermediate-code operations can be combined into one machinecode operation. The latter can, to some degree, be automated as each machine-code instruction can be described as a sequence of intermediate- CHAPTER 6. INTERMEDIATE-CODE GENERATION 138 P rogram → [ Instructions ] Instructions → Instruction Instructions → Instruction , Instructions Instruction Instruction Instruction Instruction Instruction Instruction Instruction Instruction Instruction → → → → → → → → → Atom Atom → id → num Args Args → id → id , Args LABEL labelid id := Atom id := unop Atom id := id binop Atom id := M [Atom] M [Atom] := id GOTO labelid IF id relop Atom THEN labelid ELSE labelid id := CALL functionid(Args) Grammar 6.1: The intermediate language code instructions. When intermediate-code is translated to machine- code, the code generator can look for sequences that match machinecode operations. By assigning cost to each machine-code operation, this can be turned into a combinatorial optimisation problem, where the least-cost solution is found. We will return to this in chapter 7. 6.3 The intermediate language In this chapter we have chosen a fairly low-level ne-grained intermediate language, as it is best suited to convey the techniques we want to cover. We will not treat translation of function calls until chapter 9, so a program in our intermediate language will, for the time being, correspond to the body of a function or procedure in a real program. For the same reason, function calls are initially treated as primitive operations in the intermediate language. 6.4. GENERATING CODE FROM EXPRESSIONS 139 The grammar for the intermediate language is shown in grammar 6.1. A program is a sequence of instructions. The instructions are: • A label. This has no eect but serves only to mark the position in the program as a target for jumps. • An assignment of an atomic expression (constant or variable) to a variable. • A unary operator applied to an atomic expression, with the result stored in a variable. • A binary operator applied to a variable and an atomic expression, with the result stored in a variable. • A transfer from memory to a variable. The memory location is an atomic expression. • A transfer from a variable to memory. The memory location is an atomic expression. • A jump to a label. • A conditional selection between jumps to two labels. The condition is found by comparing a variable with an atomic expression by using a relational operator (=, • 6=, <, >, ≤ or ≥). A function call. The arguments to the function call are variables and the result is assigned to a variable. This instruction is used even if there is no actual result (i.e, if a procedure is called instead of a function), in which case the result variable is a dummy variable. An atomic expression is either a variable or a constant. 
We have not specied the set of unary and binary operations, but we expect these to include normal integer arithmetic and bitwise logical operations. We assume that all values are integers. Adding oating-point numbers and other primitive types isn't dicult, though. CHAPTER 6. INTERMEDIATE-CODE GENERATION 140 Exp Exp Exp Exp Exp → → → → → num id unop Exp Exp binop Exp id(Exps) Exps → Exp Exps → Exp , Exps Grammar 6.2: A simple expression language 6.4 Generating code from expressions Grammar 6.2 shows a simple language of expressions, which we will use as our initial example for translation. Again, we have let the set of unary and binary operators be unspecied but assume that the intermediate language includes all those used by the expression language. We assume that there is a function transop that translates the name of an operator in the expression language into the name of the corresponding operator in the intermediate language. The tokens unop and binop have the names of the actual operators as attributes, accessed by the function opname. When writing a compiler, we must decide what needs to be done at compile-time and what needs to be done at run-time. Ideally, as much as possible should be done at compile-time, but some things need to be postponed until run-time, as they need the actual values of variables, etc., which aren't known at compile-time. When we, below, explain the workings of the translation functions, we might use phrasing like the expression is evaluated and the result stored in the variable. This describes actions that are performed at run-time by the code that is generated at compile-time. At times, the textual description may not be 100% clear as to what happens at which time, but the notation used in the translation functions make this clear: The code that is written between the square brackets is executed at run-time, the rest is done at compile-time. When we want to translate the expression language to the intermediate language, the main complication is that the expression language is tree-structured while the intermediate language is at, requiring the result of every operation to be stored in a variable and every (non- 6.4. GENERATING CODE FROM EXPRESSIONS constant) argument to be in one. We use a function 141 newvar to generate newvar is called, new variables in the intermediate language. Whenever it returns a previously unused variable name. We will describe translation of expressions by a translation function using a notation similar to the notation we used for type-checking functions in chapter 5. Some attributes for the translation function are obvious: It must return the code as a synthesised attribute. Furthermore, it must translate variables and functions used in the expression language to the names these correspond to in the intermediate language. This can be done by symbol tables vtable and f table that bind variable and function names in the expression language into the corresponding names in the intermediate language. The symbol tables are passed as inherited attributes to the translation function. In addition to these attributes, the translation function must use attributes to decide where to put the values of sub-expressions. This can be done in two ways: 1) The location of the values of a sub-expression can be passed up as a synthesised attribute to the parent expression, which decides on a position for its own value. 
2) The parent expression can decide where it wants to nd the values of its sub-expressions and pass this information down to these as inherited attributes. Neither of these is obviously superior to the other. Method 1 has a slight advantage when generating code for a variable access, as it doesn't have to generate any code, but can simply return the name of the variable that holds the value. This, however, only works under the assumption that the variable isn't updated before the value is used by the parent expression. If expressions can have side eects, this isn't always the case, as the C expression  x+(x=3) shows. Our expression language doesn't have assignment, but it does have function calls, which may have side eects. Method 2 doesn't have this problem: Since the value of the expression is created immediately before the assignment is executed, there is no risk of other side eects between these two points in time. Method 2 also has a slight advantage when we later extend the language to have assignment statements, as we can then generate code that calculates the expression result directly into the desired variable instead of having to copy it from a temporary variable. CHAPTER 6. INTERMEDIATE-CODE GENERATION 142 Hence, we will choose method 2 for our translation function T ransExp , which is shown in gure 6.3. The inherited attribute place is the intermediate-language variable that the result of the expression must be stored in. If the expression is just a number, the value of that number is stored in the place. If the expression is a variable, the intermediate-language equivalent of this variable is found in intended vtable and an assignment copies it into the place. A unary operation is translated by rst generating a new intermediatelanguage variable to hold the value of the argument of the operation. Then the argument is translated using the newly generated variable for the place attribute. We then use an unop operation in the intermediate language to assign the result to the inherited place. The operator ++ concatenates two lists of instructions. A binary operation is translated in a similar way. Two new intermediate-language variables are generated to hold the values of the arguments, then the arguments are translated and nally a binary operation in the intermediate language assigns the nal result to the inherited place. A function call is translated by rst translating the arguments, using the auxiliary function T ransExps . Then a function call is generated using the argument variables returned by to the inherited place. T ransExps , with the result assigned The name of the function is looked-up in f table to nd the corresponding intermediate-language name. T ransExps generates code for each argument expression, storing the results into new variables. These variables are returned along with the code, so they can be put into the argument list of the call instruction. 6.4.1 Examples of translation Translation of expressions is always relative to symbol tables and a place for storing the result. In the examples below, we assume a variable x, y and z to v0, v1 and v2, respectively and a function table that binds f to _f. The place for the result is t0 and we assume that calls to newvar() return, in sequence, the variables t1, t2, t3, . . . . We start by the simple expression x-3. This is a binop-expression, so the rst we do is to call newvar() twice, giving place1 the value t1 and place2 the value t2. We then call T ransExp recursively with the expression x. 
When translating this, we rst look up x in the variable symbol table, yielding v0, and then return the code [t1 := v0]. Back symbol table that binds 6.4. GENERATING CODE FROM EXPRESSIONS 143 T ransExp (Exp, vtable, f table, place) = case Exp of v = value(num) [place := v] id x = lookup(vtable, name(id)) [place := x] unop Exp1 place1 = newvar() code1 = T ransExp (Exp1 , vtable, f table, place1 ) op = transop(opname(unop)) code1 ++[place := op place1 ] Exp1 binop Exp2 place1 = newvar() place2 = newvar() code1 = T ransExp (Exp1 , vtable, f table, place1 ) code2 = T ransExp (Exp2 , vtable, f table, place2 ) op = transop(opname(binop)) code1 ++code2 ++[place := place1 op place2 ] id(Exps) (code1 , [a1 , . . . , an ]) = T ransExps (Exps, vtable, f table) f name = lookup(f table, name(id)) code1 ++[place := CALL f name(a1 , . . . , an )] num T ransExps (Exps, vtable, f table) = case Exps of Exp place = newvar() code1 = T ransExp (Exp, vtable, f table, place) (code1 , [place]) Exp , Exps place = newvar() code1 = T ransExp (Exp, vtable, f table, place) (code2 , args) = T ransExps (Exps, vtable, f table) code3 = code1 ++code2 args1 = place :: args (code3 , args1 ) Figure 6.3: Translating an expression CHAPTER 6. INTERMEDIATE-CODE GENERATION 144 in the translation of the subtraction expression, we assign this code to code1 and once more call 3. T ransExp recursively, this time with the [t2 := 3], which we assign code1 ++code2 ++[t0 := t1 − t2] which yields [t1 := v0, t2 := 3, t0 := t1 − t2]. We have translated the source-language operator - to the intermediate-language operator -. expression to code2 . This is translated to the code The nal result is produced by The resulting code looks quite suboptimal, and could, indeed, be shortened to [t0 := v0 − 3]. When we generate intermediate code, we want, for simplicity, to treat each subexpression independently of its context. This may lead to superuous assignments. We will look at ways of getting rid of these when we treat machine code generation and register allocation in chapters 7 and 8. A more complex expression is 3+f(x-y,z). Using the same assump- tions as above, this yields the code t1 := 3 t4 := v0 t5 := v1 t3 := t4 − t5 t6 := v2 t2 := CALL _f(t3, t6) t0 := t1 + t2 We have, for readability, laid the code out on separate lines rather than using a comma-separated list. calls to T ransExp The indentation indicates the depth of that produced the code in each line. 6.5 Translating statements We now extend the expression language in gure 6.2 with statements. The extensions are shown in grammar 6.4. When translating statements, we will need the symbol table for variables (for translating assignment), and since statements contain expres- f table so we can pass it on to T ransExp . newvar to generate new unused variables, we use a newlabel to generate new unused labels. The transla- sions, we also need Just like we use similar function tion function for statements is shown in gure 6.5. It uses an auxiliary translation function for conditions shown in gure 6.6. A sequence of two statements are translated by putting the code for these in sequence. 6.5. TRANSLATING STATEMENTS Stat Stat Stat Stat Stat Stat → → → → → → 145 Stat ; Stat id := Exp if Cond then Stat if Cond then Stat else Stat while Cond do Stat repeat Stat until Cond Cond → Exp relop Exp Grammar 6.4: Statement language An assignment is translated by translating the right-hand-side expression using the left-hand-side variable as target location (place). 
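To make the flow of attributes concrete, the following is a small, hypothetical Python rendering of the cases of TransExp and TransStat discussed so far. The book's figures use mathematical notation; the tuple representation of syntax trees, the string form of intermediate instructions and the helper names used here are assumptions made only for this sketch, and transop is assumed to be the identity.

```python
# Hypothetical executable sketch of figures 6.3 and 6.5 (assignment and
# sequencing only); syntax trees are nested tuples, instructions are strings.

counter = 0
def newvar():
    # Return a previously unused intermediate-language variable name.
    global counter
    counter += 1
    return f"t{counter}"

def trans_exp(exp, vtable, ftable, place):
    kind = exp[0]
    if kind == "num":                       # Exp -> num
        return [f"{place} := {exp[1]}"]
    if kind == "id":                        # Exp -> id
        return [f"{place} := {vtable[exp[1]]}"]
    if kind == "binop":                     # Exp -> Exp binop Exp
        _, op, e1, e2 = exp                 # transop assumed to be identity
        p1, p2 = newvar(), newvar()
        code1 = trans_exp(e1, vtable, ftable, p1)
        code2 = trans_exp(e2, vtable, ftable, p2)
        return code1 + code2 + [f"{place} := {p1} {op} {p2}"]
    raise NotImplementedError(kind)

def trans_stat(stat, vtable, ftable):
    kind = stat[0]
    if kind == "assign":                    # Stat -> id := Exp
        _, name, e = stat
        return trans_exp(e, vtable, ftable, vtable[name])
    if kind == "seq":                       # Stat -> Stat ; Stat
        _, s1, s2 = stat
        return trans_stat(s1, vtable, ftable) + trans_stat(s2, vtable, ftable)
    raise NotImplementedError(kind)

vtable = {"x": "v0", "y": "v1", "z": "v2"}
print(trans_exp(("binop", "-", ("id", "x"), ("num", 3)), vtable, {}, "t0"))
# prints ['t1 := v0', 't2 := 3', 't0 := t1 - t2']
```

Running it on the expression x-3 with place t0 reproduces the code [t1 := v0, t2 := 3, t0 := t1 − t2] derived above; unary operators, function calls, conditions and loops would be added in the same style.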
When translating statements that use conditions, we use an aux- T ransCond . T ransCond translates the arguments to the condition and generates an IF-THEN-ELSE instruction using the same re- iliary function lational operator as the condition. The target labels of this instruction are inherited attributes to T ransCond . if-then statement is translated by rst generating two labels: then-branch and one for the code following the if-then statement. The condition is translated by T ransCond , which is given An One for the the two labels as attributes. When (at run-time) the condition is true, the rst of these are selected, and when false, the second is chosen. Hence, when the condition is true, the by the code after the if-then then-branch is executed followed statement. When the condition is false, we jump directly to the code following the then-branch. if-then-else statement if-then statement, hence bypassing the An is treated similarly, but now the con- dition must choose between jumping to the branch. At the end of the the else-branch then-branch, then-branch or the else- a jump bypasses the code for by jumping to the label at the end. Hence, there is then-branch, one for the else-branch if-then-else statement. while-do loop is true, the body must be ex- need for three labels: One for the and one for the code following the If the condition in a ecuted, otherwise the body is by-passed and the code after the loop is executed. Hence, the condition is translated with attributes that provide the label for the start of the body and the label for the code after the loop. When the body of the loop has been executed, the condition must be re-tested for further passes through the loop. Hence, a jump is 146 CHAPTER 6. INTERMEDIATE-CODE GENERATION T ransStat (Stat, vtable, f table) = case Stat of Stat1 ; Stat2 code1 = T ransStat (Stat1 , vtable, f table) code2 = T ransStat (Stat2 , vtable, f table) code1 ++code2 id := Exp place = lookup(vtable, name(id)) T ransExp (Exp, vtable, f table, place) if Cond label1 = newlabel() then Stat1 label2 = newlabel() code1 = T ransCond (Cond, label1 , label2 , vtable, f table) code2 = T ransStat (Stat1 , vtable, f table) code1 ++[LABEL label1 ]++code2 ++[LABEL label2 ] if Cond label1 = newlabel() then Stat1 label2 = newlabel() else Stat2 label3 = newlabel() code1 = T ransCond (Cond, label1 , label2 , vtable, f table) code2 = T ransStat (Stat1 , vtable, f table) code3 = T ransStat (Stat2 , vtable, f table) code1 ++[LABEL label1 ]++code2 ++[GOTO label3 , LABEL label2 ] ++code3 ++[LABEL label3 ] while Cond label1 = newlabel() do Stat1 label2 = newlabel() label3 = newlabel() code1 = T ransCond (Cond, label2 , label3 , vtable, f table) code2 = T ransStat (Stat1 , vtable, f table) [LABEL label1 ]++code1 ++[LABEL label2 ]++code2 ++[GOTO label1 , LABEL label3 ] repeat Stat1 label1 = newlabel() until Cond label2 = newlabel() code1 = T ransStat (Stat1 , vtable, f table) code2 = T ransCond (Cond, label2 , label1 , vtable, f table) [LABEL label1 ]++code1 ++code2 ++[LABEL label2 ] Figure 6.5: Translation of statements 6.6. LOGICAL OPERATORS 147 T ransCond (Cond, labelt , labelf , vtable, f table) = case Cond of Exp1 relop Exp2 t1 = newvar() t2 = newvar() code1 = T ransExp (Exp1 , vtable, f table, t1 ) code2 = T ransExp (Exp2 , vtable, f table, t2 ) op = transop(opname(relop)) code1 ++code2 ++[IF t1 op t2 THEN labelt ELSE labelf ] Figure 6.6: Translation of simple conditions made to the start of the code for the condition. 
A total of three labels are thus required: One for the start of the loop, one for the loop body and one for the end of the loop. A repeat-until loop is slightly simpler. The body precedes the condition, so there is always at least one pass through the loop. If the condition is true, the loop is terminated and we continue with the code after the loop. If the condition is false, we jump to the start of the loop. Hence, only two labels are needed: One for the start of the loop and one for the code after the loop. 6.6 Logical operators Logical conjunction, disjunction and negation are often available for conditions, so we can write, e.g., x = y or y = z , where or is a logical disjunction operator. There are typically two ways to treat logical operators in programming languages: 1) Logical operators are similar to arithmetic operators: The arguments are evaluated and the operator is applied to nd the result. 2) The second operand of a logical operator is not evaluated if the rst operand is sucient to determine the result. This means and will not evaluate its second operand if the rst false, and a logical or will not evaluate the second operand if the rst is true. that a logical evaluates to The rst variant is typically implemented by using bitwise logical operators and uses −1) to represent integer 1 0 1 to represent true. false and a nonzero value (typically In C, there is no separate boolean type. is used for logical truth 1 and 0 1 or The for falsehood. Bitwise logical Actually, any non-zero value is treated as logical truth. CHAPTER 6. INTERMEDIATE-CODE GENERATION 148 operators & (bitwise and) and | (bitwise or) are used to implement the corresponding logical operations. Logical negation is bitwise negation, as the bitwise negation of logical negation operator ! 1 isn't 0. not handled by Instead, a special is used. This maps any non-zero value to 0 and 0 to 1. The second variant is called are called && (logical and) and sequential logical operators. || (logical or). In C, these Adding non-sequential logical operators to our language isn't too difcult. Since we haven't said exactly which binary and unary operators exist in the intermediate language, we can simply assume these include relational operators, bitwise logical operations and logical negation. We can now simply allow any expression 2 as a condition by adding the pro- duction Cond → Exp to grammar 6.4. We then extend the translation function for conditions as follows: T ransCond (Cond, labelt , labelf , vtable, f table) = case Cond of Exp1 relop Exp2 t1 = newvar() t2 = newvar() code1 = T ransExp (Exp1 , vtable, f table, t1 ) code2 = T ransExp (Exp2 , vtable, f table, t2 ) op = transop(opname(relop)) code1 ++code2 ++[IF t1 op t2 THEN labelt ELSE labelf ] t = newvar() Exp code1 = T ransExp (Exp, vtable, f table, t) code1 ++[IF t 6= 0 THEN labelt ELSE labelf ] We need to convert the numerical value returned by choice between two labels, so we generate an IF T ransExp into a instruction that does just that. The rule for relational operators is now actually superuous, as the case it handles is covered by the second rule (since relational operators are assumed to be included in the set of binary arithmetic operators). However, we can consider it an optimisation, as the code it generates is shorter than the equivalent code generated by the second rule. It will also be natural to keep it separate when we add sequential logical operators. 2 If it of boolean type, which we assume has been checked by the type checker. 6.6. 
LOGICAL OPERATORS Exp Exp Exp Exp Exp Exp Exp Exp → → → → → → → → 149 num id unop Exp Exp binop Exp id(Exps) true false Cond Exps → Exp Exps → Exp , Exps Cond Cond Cond Cond Cond Cond Cond → → → → → → → Exp relop Exp true false ! Cond Cond && Cond Cond || Cond Exp Grammar 6.7: Example language with logical operators 6.6.1 Sequential logical operators We will use the same names for sequential logical operators as C, for logical and, || for logical or and ! for logical negation. i.e., && The extended language is shown in gure 6.7. Note that we allow an expression to be a condition as well as a condition to be an expression. This grammar is highly ambiguous (not least because binop overlaps relop). As before, we assume such ambiguity to be resolved by the parser before code generation. We also assume that the last productions of Exp and Cond are used as little as possible, as this will yield the best code. The revised translation functions for Exp and Cond are shown in Exp are shown. true and false are the numbers 1 and 0. Cond is translated into code that chooses between gure 6.8. Only the new cases for As expressions, A condition labels. two When we want to use a condition as an expression, we must convert this choice into a number. We do this by rst assuming that the condition is false and hence assign 0 to the target location. We 150 CHAPTER 6. INTERMEDIATE-CODE GENERATION T ransExp (Exp, vtable, f table, place) = case Exp of . . . true false Cond [place := 1] [place := 0] label1 = newlabel() label2 = newlabel() code1 = T ransCond (Cond, label1 , label2 , vtable, f table) [place := 0]++code1 ++[LABEL label1 , place := 1] ++[LABEL label2 ] T ransCond (Cond, labelt , labelf , vtable, f table) = case Cond of Exp1 relop Exp2 t1 = newvar() t2 = newvar() code1 = T ransExp (Exp1 , vtable, f table, t1 ) code2 = T ransExp (Exp2 , vtable, f table, t2 ) op = transop(opname(relop)) code1 ++code2 ++[IF t1 op t2 THEN labelt ELSE labelf ] true [GOTO labelt ] false [GOTO labelf ] ! Cond1 T ransCond (Cond1 , labelf , labelt , vtable, f table) Cond1 && Cond2 arg2 = newlabel() code1 =T ransCond (Cond1 , arg2 , labelf , vtable, f table) code2 =T ransCond (Cond2 , labelt , labelf , vtable, f table) code1 ++[LABEL arg2 ]++code2 Cond1 || Cond2 arg2 = newlabel() code1 =T ransCond (Cond1 , labelt , arg2 , vtable, f table) code2 =T ransCond (Cond2 , labelt , labelf , vtable, f table) code1 ++[LABEL arg2 ]++code2 Exp t = newvar() code1 = T ransExp (Exp, vtable, f table, t) code1 ++[IF t 6= 0 THEN labelt ELSE labelf ] Figure 6.8: Translation of sequential logical operators 6.6. LOGICAL OPERATORS 151 then, if the condition is true, jump to code that assigns location. 0. value remains around, 1 to the target If the condition is false, we jump around this code, so the i.e., We could equally well have done things the other way rst assign 1 to the target location and modify this to 0 when the condition is false. It gets a bit more interesting in ditions. T ransCond , where we translate con- We have already seen how comparisons and expressions are translated, so we move directly to the new cases. The constant true condition just generates a jump to the label for true conditions, and, similarly, false generates a jump to the label for false conditions. Logical negation generates no code by itself, it just swaps the attributelabels for true and false when translating its argument. This negates the eect of the argument condition. 
Sequential logical and is translated as follows: The code for the rst operand is translated such that if it is false, the second condition is not tested. This is done by jumping straight to the label for false conditions when the rst operand is false. If the rst operand is true, a jump to the code for the second operand is made. This is handled by using the appropriate labels as arguments to the call to T ransCond T ransCond . The call to for the second operand uses the original labels for true and false. Hence, both conditions have to be true for the combined condition to be true. Sequential or is similar: If the rst operand is true, we jump directly to the label for true conditions without testing the second operand, but if it is false, we jump to the code for the second operand. Again, the second operand uses the original labels for true and false. Note that the translation functions now work even if binop and unop do not contain relational operators or logical negation, as we can just choose the last rule for expressions whenever the binop rules don't match. However, we can not in the same way omit non-sequential (e.g., bitwise) and and or, as these have a dierent eect (i.e., they always evaluate both arguments). We have, in the above, used two dierent nonterminals for conditions and expressions, with some overlap between these and consequently ambiguity in the grammar. It is possible to resolve this ambiguity by rewriting the grammar and get two non-overlapping syntactic categories in the abstract syntax. nals into one, e.g., Exp Another solution is to join the two nontermiand use two dierent translation functions for this: Whenever an expression is translated, the translation function most appropriate for the context is chosen. For example, if-then-else will CHAPTER 6. INTERMEDIATE-CODE GENERATION 152 T ransCond T ransExp . choose a translation function similar to choose a one similar to the current while assignment will 6.7 Advanced control statements We have, so far, shown translation of simple conditional statements and loops, but some languages have more advanced control features. We will briey discuss how such can be implemented. Goto and labels. Labels are stored in a symbol table that binds each to a corresponding label in the intermediate language. label will generate a language label. GOTO A jump to a statement to the corresponding intermediate- Unless labels are declared before use, an extra pass may be needed to build the symbol table before the actual translation. Alternatively, an intermediate-language label can be chosen and an entry in the symbol table be created at the rst occurrence of the label even if it is in a jump rather than a declaration. Subsequent jumps or declarations of that label will use the intermediate-language label that was chosen at the rst occurrence. By setting a mark in the symbol-table entry when the label is declared, it can be checked that all labels are declared exactly once. The scope of labels can be controlled by the symbol table, so labels can be local to a procedure or block. Break/exit. Some languages allow exiting loops from the middle of the loop-body by a break or exit statement. To handle these, the trans- lation function for statements must have an extra inherited parameter which is the label that a break or exit statement must jump to. This attribute is changed whenever a new loop is entered. loop is entered, this attribute is undened. 
Before the rst The translation function should check for this, so it can report an error if a break or exit occurs outside loops. This should, rightly, be done during type-checking (see chapter 5), though. C's continue statement, which jumps to the start of the current loop, can be handled similarly. Case-statements. A case-statement evaluates an expression and sel- ects one of several branches (statements) based on the value of the expression. In most languages, the case-statement end of each of these statements. In this case, the case-statement can will be exited at the 6.8. TRANSLATING STRUCTURED DATA 153 be translated as an assignment that stores the value of the expression if-then-else statement, where each branch of the case-statement becomes a then-branch of one of the if-then-else statements (or, in case of the default branch, the nal else-branch). In C, the default is that all case-branches following the selected branch are executed unless the case-expression (called switch in C) is explicitly terminated with a break statement (see above) at the end of the branch. In this case, the case-statement can still be translated to a nested if-then-else, but the branches of these are now GOTO's to the code for each case-branch. The code for the branches is placed in sequence after the nested if-then-else, with break handled by GOTO's followed by a nested as described above. Hence, if no explicit jump is made, one branch will fall through to the next. 6.8 Translating structured data So far, the only values we have used are integers and booleans. However, most programming languages provide oating-point numbers and structured values like arrays, records (structs), unions, lists or tree-structures. We will now look at how these can be translated. We will rst look at oats, then at one-dimensional arrays, multi-dimensional arrays and nally other data structures. 6.8.1 Floating-point values Floating-point values are, in a computer, typically stored in a dierent set of registers than integers. Apart from this, they are treated the same way we treat integer values: We use temporary variables to store intermediate expression results and assume the intermediate language has binary operators for oating-point numbers. The register allocator will have to make sure that the temporary variables used for oating-point values are mapped to oating-point registers. For this reason, it may be a good idea to let the intermediate code indicate which temporary variables hold oats. This can be done by giving them special names or by using a symbol table to hold type information. 6.8.2 Arrays We extend our example language with one-dimensional arrays by adding the following productions: CHAPTER 6. 
INTERMEDIATE-CODE GENERATION 154 T ransExp (Exp, vtable, f table, place) = case Exp of Index (code1 , address) = T ransIndex (Index, vtable, f table) code1 ++[place := M [address]] T ransStat (Stat, vtable, f table) = case Stat of Index := Exp (code1 , address) = T ransIndex (Index, vtable, f table) t = newvar() code2 = T ransExp (Exp, vtable, f table, t) code1 ++code2 ++[M [address] := t] T ransIndex (Index, vtable, f table) = case Index of id[Exp] base = lookup(vtable, name(id)) t = newvar() code1 = T ransExp (Exp, vtable, f table, t) code2 = code1 ++[t := t ∗ 4, t := t + base] (code2 , t) Figure 6.9: Translation for one-dimensional arrays Exp → Index Stat → Index := Exp Index → id[Exp] Index is an array element, which can be used the same way as a variable, either as an expression or as the left part of an assignment statement. We will initially assume that arrays are zero-based (i.e.. the lowest index is 0). Arrays can be allocated statically, ically, i.e., i.e., at compile-time, or at run-time. In the rst case, the base address dynam- of the array (the address at which index 0 is stored) is a compile-time constant. In the latter case, a variable will contain the base address of the array. In either case, we assume that the symbol table for variables binds an array name to the constant or variable that holds its base address. Most modern computers are byte-addressed, while integers typically are 32 or 64 bits long. This means that the index used to access array elements must be multiplied by the size of the elements (measured in bytes), e.g., 4 or 8, to nd the actual oset from the base address. In the translation shown in gure 6.9, we use 4 for the size of integers. We show only the new parts of the translation functions for Exp and Stat. 6.8. TRANSLATING STRUCTURED DATA We use a translation function T ransIndex 155 for array elements. This returns a pair consisting of the code that evaluates the address of the array element and the variable that holds this address. When an array element is used in an expression, the contents of the address is transferred to the target variable using a memory-load instruction. When an array element is used on the left-hand side of an assignment, the right-hand side is evaluated, and the value of this is stored at the address using a memory-store instruction. The address of an array element is calculated by multiplying the index by the size of the elements and adding this to the base address of the array. Note that base can be either a variable or a constant (depending on how the array is allocated, see below), but since both are allowed as the second operator to a binop in the intermediate language, this is no problem. Allocating arrays So far, we have only hinted at how arrays are allocated. As mentioned, one possibility is static allocation, where the base-address and the size of the array are known at compile-time. The compiler, typically, has a large address space where it can allocate statically allocated objects. When it does so, the new object is simply allocated after the end of the previously allocated objects. Dynamic allocation can be done in several ways. One is allocation local to a procedure or function, such that the array is allocated when the function is entered and deallocated when it is exited. This typically means that the array is allocated on a stack and popped from the stack when the procedure is exited. 
If the sizes of locally allocated arrays are xed at compile-time, their base addresses are constant osets from the stack top (or from the frame pointer, see chapter 9) and can be calculated from this at every array-lookup. However, this doesn't work if the sizes of these arrays are given at run-time. In this case, we need to use a variable to hold the base address of each array. The address is calculated when the array is allocated and then stored in the corresponding variable. This can subsequently be used as described in T ransIndex above. At compile-time, the array-name will in the symbol table be bound to the variable that at runtime will hold the base-address. Dynamic allocation can also be done globally, so the array will survive until the end of the program or until it is explicitly deallocated. In this case, there must be a global address space available for run-time allocation. Often, this is handled by the operating system which han- CHAPTER 6. INTERMEDIATE-CODE GENERATION 156 1st column 2nd column 3rd column 3rd row a[0][0] a[1][0] a[2][0] a[0][1] a[1][1] a[2][1] a[0][2] a[1][2] a[2][2] . . . . . . . . . . . . 1st row 2nd row ··· ··· ··· ··· .. . Figure 6.10: A two-dimensional array dles memory-allocation requests from all programs that are running at any given time. Such allocation may fail due to lack of memory, in which case the program must terminate with an error or release memory enough elsewhere to make room. The allocation can also be controlled by the program itself, which initially asks the operating system for a large amount of memory and then administrates this itself. This can make allocation of arrays faster than if an operating system call is needed every time an array is allocated. Furthermore, it can allow the program to use garbage collection to automatically reclaim arrays that are no longer in use. Garbage collection is, however, beyond the scope of this book. Multi-dimensional arrays Multi-dimensional arrays can be laid out in memory in two ways: major and column-major. dimensional arrays, as shown i Figure 6.10. is addressed by two indices, The rst index, j, i, e.g., indicates the A two-dimensional array (using C-style notation) as row a[i][j]. of the element and the second column. The rst row of the array a[0][0], a[0][1], a[0][2], . . . and the rst a[0][0], a[1][0], a[2][0], . . . .3 index, row- The dierence is best illustrated by two- indicates the the elements is, hence, column is In row-major form, the array is laid out one row at a time and in column-major form it is laid out one column at a time. In a 3×2 array, the ordering for row-major is a[0][0], a[0][1], a[1][0], a[1][1], a[2][0], a[2][1] For column-major the ordering is a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], a[2][1] 3 Note that the coordinate system, following computer-science tradition, is rotated 90 clockwise compared to mathematical tradition. ◦ 6.8. TRANSLATING STRUCTURED DATA 157 size and the sizes of the dimensions in an dim0 , dim1 , . . . , dimn−2 , dimn−1 , then in rowelement at index [i0 ][i1 ] . . . [in−2 ][in−1 ] has the address If the size of an element is n-dimensional array are major format an base + ((. . . (i0 ∗ dim1 + i1 ) ∗ dim2 . . . + in−2 ) ∗ dimn−1 + in−1 ) ∗ size In column-major format the address is base + ((. . . (in−1 ∗ dimn−2 + in−2 ) ∗ dimn−3 . . . + i1 ) ∗ dim0 + i0 ) ∗ size Note that column-major format corresponds to reversing the order of the indices of a row-major array. dimn−1 , i1 and dim1 by in−2 i.e., replacing i0 and dim0 dimn−2 , and so on. 
by in−1 and and We extend the grammar for array-elements to accommodate multidimensional arrays: Index → id[Exp] Index → Index[Exp] and extend the translation functions as shown in gure 6.11. This translation is for row-major arrays. We leave column-major arrays as an exercise. With these extensions, the symbol table must return both the baseaddress of the array and a list of the sizes of the dimensions. Like the base-address, the dimension sizes can either be compile-time constants or variables that at run-time will hold the sizes. translation function In T ransIndex CalcIndex We use an auxiliary to calculate the position of an element. we multiply this position by the element size and add the base address. As before, we assume the size of elements is 4. In some cases, the sizes of the dimensions of an array are not stored in separate variables, but in memory next to the space allocated for the elements of the array. This uses fewer variables (which may be an issue when these need to be allocated to registers, see chapter 8) and makes it easier to return an array as the result of an expression or function, as only the base-address needs to be returned. The size information is normally stored just before the base-address so, for example, the size of the rst dimension can be at address dimension at base − 8 base − 4, the size of the second and so on. Hence, the base-address will always point to the rst element of the array no matter how many dimensions the array has. If this strategy is used, the necessary dimension-sizes must be loaded into variables when an index is calculated. Since this adds several extra (somewhat costly) loads, optimising compilers often 158 CHAPTER 6. INTERMEDIATE-CODE GENERATION T ransExp (Exp, vtable, f table, place) = case Exp of Index (code1 , address) = T ransIndex (Index, vtable, f table) code1 ++[place := M [address]] T ransStat (Stat, vtable, f table) = case Stat of Index := Exp (code1 , address) = T ransIndex (Index, vtable, f table) t = newvar() code2 = T ransExp (Exp2 , vtable, f table, t) code1 ++code2 ++[M [address] := t] T ransIndex (Index, vtable, f table) = (code1 , t, base, []) = CalcIndex (Index, vtable, f table) code2 = code1 ++[t := t ∗ 4, t := t + base] (code2 , t) CalcIndex (Index, vtable, f table) = case Index of id[Exp] (base, dims) = lookup(vtable, name(id)) t = newvar() code = T ransExp (Exp, vtable, f table, t) (code, t, base, tail(dims)) Index[Exp] (code1 , t1 , base, dims) = CalcIndex (Index, vtable, f table) dim1 = head(dims) t2 = newvar() code2 = T ransExp (Exp, vtable, f table, t2 ) code3 = code1 ++code2 ++[t1 := t1 ∗ dim1 , t1 := t1 + t2 ] (code3 , t1 , base, tail(dims)) Figure 6.11: Translation of multi-dimensional arrays 6.8. TRANSLATING STRUCTURED DATA try to re-use the values of previous loads, 159 e.g., by doing the loading once outside a loop and referring to variables holding the values inside the loop. Index checks The translations shown so far do not test if an index is within the bounds of the array. Index checks are fairly easy to generate: Each index must be compared to the size of (the dimension of ) the array and if the index is too big, a jump to some error-producing code is made. Hence, a single conditional jump is inserted at every index calculation. This is still fairly expensive, but various methods can be used to eliminate some of these tests. For example, if the array-lookup occurs within a for-loop, the bounds of the loop-counter may guarantee that array accesses using this variable will be within bounds. 
In general, it is possible to make an analysis that nds cases where the index-check condition is subsumed by previous tests, such as the exit test for a loop, the test in an if-then-else statement or previous index checks. Non-zero-based arrays We have assumed our arrays to be zero-based, start from 0. e.g., −10 to i.e., that the indices Some languages allow indices to be arbitrary intervals, 10 or 10 to 20. If such are used, the starting-index must be subtracted from each index when the address is calculated. In a one-dimensional array with known size and base-address, the startingindex can be subtracted (at compile-time) from base-address instead. In a multi-dimensional array with known dimensions, the starting-indices can be multiplied by the sizes of the dimensions and added together to form a single constant that is subtracted from the base-address instead of subtracting each starting-index from each index. 6.8.3 Strings Strings are usually implemented in a fashion similar to one-dimensional arrays. In some languages (e.g. C or pre-ISO-standard Pascal), strings are just arrays of characters. However, strings often dier from arrays in that the length is not static, but can vary at run-time. This leads to an implementation similar to the kind of arrays where the length is stored in memory, as explained in section 6.8.2. Another dierence is that the size of a character is CHAPTER 6. INTERMEDIATE-CODE GENERATION 160 typically one byte (unless 16-bit Unicode characters are used), so the index calculation does not multiply the index by the size (as this is 1). Operations on strings, e.g., concatenation and substring extraction, are typically implemented by calling library functions. 6.8.4 Records/structs and unions Records (structs) have many properties in common with arrays. They are typically allocated in a similar way (with a similar choice of possible allocation strategies), and the elds of a record are typically accessed by adding an oset to the base-address of the record. The dierences are: • The types (and hence sizes) of the elds may be dierent. • The eld-selector is known at compile-time, so the oset from the base address can be calculated at this time. The oset for a eld is simply the sum of the sizes of all elds that occur before it. For a record-variable, the symbol table for variables must hold the base-address and the osets for each eld in the record. The symbol table for types must hold the osets for every record type, such that these can be inserted into the symbol table for variables when a record of this type is declared. In a union (sum) type, the elds are not consecutive, but are stored at the same address, i.e., the base-address of the union. The size of an union is the maximum of the sizes of its elds. In some languages, union types include a tag, variant of the union is stored in the variable. separate eld before the union-elds. which identies which This tag is stored as a Some languages (e.g. Standard ML) enforce that the tag is tested when the union is accessed, others (e.g. Pascal) leave this as an option to the programmer. 6.9 Translating declarations In the translation functions used in this chapter, we have several times required that The symbol table must contain . . . . It is the job of the compiler to ensure that the symbol tables contain the information necessary for translation. When a name (variable, label, type, etc.) is declared, the compiler must keep in the symbol-table entry for that name the information necessary for compiling any use of that name. 
For scalar 6.10. FURTHER READING 161 variables (e.g., integers), the required information is the intermediatelanguage variable that holds the value of the variable. For array variables, the information includes the base-address and dimensions of the array. For records, it is the osets for each eld and the total size. If a type is given a name, the symbol table must for that name provide a description of the type, such that variables that are declared to be that type can be given the information they need for their own symbol-table entries. The exact nature of the information that is put into the symbol tables will depend on the translation functions that use these tables, so it is usually a good idea to write rst the translation functions for uses of names and then translation functions for their declarations. Translation of function declarations will be treated in chapter 9. 6.9.1 Example: Simple local declarations We extend the statement language by the following productions: Stat → Decl ; Stat Decl → int id Decl → int id[num] We can, hence, declare integer variables and one-dimensional integer arrays for use in the following statement. An integer variable should be bound to a location in the symbol table, so this declaration should add such a binding to vtable. An array should be bound to a variable containing its base address. Furthermore, code must be generated for allocating space for the array. We assume arrays are heap allocated and that the intermediate-code variable of the (upwards growing) heap. HP points to the rst free element Figure 6.12 shows the translation of these declarations. When allocating arrays, no check for heap overow is done. 6.10 Further reading A comprehensive discussion about intermediate languages can be found in [30]. Functional and logic languages often use high-level intermediate languages, which are in many cases translated to lower-level intermediate code before emitting actual machine code. Examples of such intermediate languages can be found in [21], [8] and [6]. CHAPTER 6. INTERMEDIATE-CODE GENERATION 162 T ransStat (Stat, vtable, f table) = case Stat of Decl ; Stat1 (code1 , vtable1 ) = T ransDecl (Decl, vtable) code2 = T ransStat (Stat1 , vtable1 , f table) code1 ++code2 T ransDecl (Decl, vtable) = case Decl of int id t1 = newvar() vtable1 = bind(vtable, name(id), t1 ) ([], vtable1 ) int id[num] t1 = newvar() vtable1 = bind(vtable, name(id), t1 ) ([t1 := HP, HP := HP + (4 ∗ value(num))], vtable1 ) Figure 6.12: Translation of simple declarations Another high-level intermediate language is the Java Virtual Machine [25]. This language has single instructions for such complex things as calling virtual methods and creating new objects. The high-level nature of JVM was chosen for several reasons: • By letting common complex operations be done by single instructions, the code is smaller, which reduces transmission time when sending the code over the Internet. • JVM was originally intended for interpretation, and the complex operations also helped reduce the overhead of interpretation. • A program in JVM is validated (essentially type-checked) before interpretation or further translation. This is easier when the code is high-level. Exercises Exercise 6.1 Use the translation functions in gure 6.3 to generate code for the ex- 2+g(x+y,x*y). Use a vtable that binds x to t0 and y to t1 and f table that binds g to _g. The result of the expression should be put in the intermediate-code variable r (so the place attribute in the initial call to T ransExp is r). 
pression an 6.10. FURTHER READING 163 Exercise 6.2 Use the translation functions in gures 6.5 and 6.6 to generate code for the statement x:=2+y; if x10 do x:=x/2 until x0 do { x := x-1; if x>10 then x := x/2 } The curly braces are used as disambiguators, though they are not part of grammar 6.4. Use the same vtable as exercise 6.1 and use endlab as the endlabel for the whole statement. Exercise 6.10 In gure 6.5, while statements are translated in such a way that every iteration of the loop executes an unconditional jump (GOTO in addition to the conditional jumps in the loop condition. Modify the translation so each iteration only executes the conditional jumps in the loop condition, every iteration. i.e., so an unconditional jump is saved in You may have to add an unconditional jump outside the loop. Exercise 6.11 Logical conjunction is associative: && p ∧ (q ∧ r) ⇔ (p ∧ q) ∧ r. Show that this also applies to the sequential conjunction operator when translated as in gure 6.8, i.e., same code (up to renaming of labels) as p && (q && r) (p && q) && r. that generates the CHAPTER 6. INTERMEDIATE-CODE GENERATION 166 Exercise 6.12 Figure 6.11 shows translation of multi-dimensional arrays in row-major layout, where the address of each element is found through multiplication and addition. On machines with fast memory access but slow multiplication, an alternative implementation of multi-dimensional ar- dim0 , dim1 , . . . , dimn dim0 with pointers to dim0 dierent arrays each of dimension dim1 , . . . , dimn , which again are rays is sometimes used: An array with dimensions is implemented as a one-dimensional array of size implemented in the same way (until the last dimension, which is implemented as a normal one-dimensional array of values). This takes up more room, as the pointer arrays need to be stored as well as the elements. But array-lookup can be done using only addition and memory accesses. a) Assuming pointers and array elements both need four bytes each, what is the total number of bytes required to store an array of dimensions dim0 , dim1 , . . . , dimn ? b) Write translation functions for array-access in the style of gure 6.11 using this representation of arrays. Use addition to multiply numbers by 4 for scaling indices by the size of pointers and array elements. Chapter 7 Machine-Code Generation 7.1 Introduction The intermediate language we have used in chapter 6 is quite low-level and not unlike the type of machine code you can nd on modern RISC processors, with a few exceptions: • We have used an unbounded number of variables, where a processor will have a bounded number of registers. CALL • We have used a complex • In the intermediate language, the instruction for function calls. IF-THEN-ELSE instruction has two target labels, where, on most processors, the conditional jump instruction has only one target label, and simply falls through to the next instruction when the condition is false. • We have assumed any constant can be an operand to an arithmetic instruction. Typically, RISC processors allow only small constants as operands. The problem of mapping a large set of variables to a small number of registers is handled by register allocation, as explained in chapter 8. Function calls are treated in chapter 9. We will look at the remaining two problems below. The simplest solution for generating machine code from intermediate code is to translate each intermediate-language instruction into one or more machine-code instructions. 
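As an illustration of this simple approach, a code generator might contain one case per intermediate-language instruction, each emitting a fixed machine instruction. The following C sketch uses our own data types and MIPS-like mnemonics; it is an assumption for illustration, not the book's code.

    #include <stdio.h>

    /* A tiny, hypothetical intermediate instruction set. */
    enum op { ADD, LOAD, STORE };
    struct il_instr { enum op op; int dst, src1, src2; };

    /* One-to-one instruction selection: each intermediate instruction
       becomes exactly one machine instruction. */
    static void emit(const struct il_instr *i)
    {
        switch (i->op) {
        case ADD:    /* dst := src1 + src2 */
            printf("add r%d, r%d, r%d\n", i->dst, i->src1, i->src2);
            break;
        case LOAD:   /* dst := M[src1] */
            printf("lw  r%d, 0(r%d)\n", i->dst, i->src1);
            break;
        case STORE:  /* M[src1] := src2 */
            printf("sw  r%d, 0(r%d)\n", i->src2, i->src1);
            break;
        }
    }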
However, it is often possible to nd a machine-code instruction that covers two or more intermediate-language instructions. We will see how we can exploit the instruction set this way. Additionally, we will briey discuss other optimisations. 167 CHAPTER 7. MACHINE-CODE GENERATION 168 7.2 Conditional jumps Conditional jumps come in many shapes on dierent machines. Some conditional jump instructions embody a relational comparison between two registers (or a register and a constant) and are, hence, similar to the IF-THEN-ELSE instruction in our intermediate language. Other types of conditional jump instruction require the condition to be already resolved and stored in special condition registers or ags. However, it is almost universal that conditional jump instructions specify only one target label (or address), typically used when the condition is true. When the condition is false, execution simply continues with the instructions immediately following the conditional jump instruction. This isn't terribly dicult to handle: IF c THEN lt ELSE lf can be translated to branch_if_c jump where branch_if_c lt lf is a conditional jump instruction on the condition c. It will, however, often be the case that an IF-THEN-ELSE instruction is immediately followed by one of its target labels. In fact, this will always be the case if the intermediate code is generated by the translation functions shown in chapter 6 (see exercise 6.5). to be lf unconditional is lt , If this label happens (the label taken for false conditions), we can simply omit the jump from the code shown above. If the following label we can negate the condition of the conditional jump and make it jump to lf , i.e., as branch_if_not_c lf Hence, the code generator should test which (if any) of the target labels follow an IF-THEN-ELSE instruction and translate it accordingly. Alter- natively, a pass can be made over the generated machine-code to remove superuous jumps. It is possible to extend the translation function for conditionals to use extra inherited attributes that tell which of the target labels (if any) immediately follow the condition code and use this to generate code such that the false-labels of these (inserting GOTO IF-THEN-ELSE instructions immediately follow instructions if necessary). If the conditional jump instructions in the target machine do not 7.3. CONSTANTS 169 allow conditions as complex as those used in the intermediate language, code must be generated to calculate the condition and put the result somewhere where it can be tested by the conditional jump instruction. In some machine architectures (e.g., MIPS and Alpha), this somewhere can be a general-purpose register. Other machines (e.g. PowerPC or In- tel's IA-64) use special condition registers, while yet others (e.g. IA-32, Sparc, PA-RISC and ARM) use a single set of arithmetic ags that can be set by comparison or arithmetic instructions. A conditional jump may test various combinations of the ags, so the same comparison instruction can, in combination with dierent conditions, be used for testing equality, signed or unsigned less-than, overow and several other properties. Usually, an IF-THEN-ELSE instruction can be translated to two instructions: One that does the comparison and one that does the conditional jump. 7.3 Constants The intermediate language allows arbitrary constants as operands to binary or unary operators. This is not always the case in machine code. 
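As a concrete illustration (anticipating the MIPS example below), the code generator must test whether a constant fits the immediate field of the instruction and, if it does not, split it so it can be built in a register by a short instruction sequence. The following C sketch assumes a signed 16-bit immediate field and a split into two 16-bit halves; both are assumptions chosen to match the MIPS case.

    #include <stdint.h>

    /* Does the constant fit in a signed 16-bit immediate field? */
    static int fits_imm16(int32_t c)
    {
        return c >= -32768 && c <= 32767;
    }

    /* Otherwise, split it into the two halves that an instruction pair
       (load-upper-half followed by a bitwise or) can rebuild. */
    static void split_const(int32_t c, uint16_t *hi, uint16_t *lo)
    {
        *hi = (uint16_t)((uint32_t)c >> 16);
        *lo = (uint16_t)((uint32_t)c & 0xffff);
    }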
For example, MIPS allows only 16-bit constants in operands even though integers are 32 bits (64 bits in some versions of the MIPS architecture). To build larger constants, MIPS includes instructions to load 16-bit constants into the upper portions (most signicant bits) of a register. With help of these, an arbitrary 32-bit integer can be entered into a register using two instructions. On the ARM, a constant can be any 8-bit number positioned at any even bit-boundary. It may take up to four instructions to build a 32-bit number using these. When an intermediate-language instruction uses a constant, the code generator must check if it ts into the constant eld (if any) of the equivalent machine-code instruction. If it does, a single machine-code instruction is generated. If not, a sequence of instructions are generated that builds the constant in a register, followed by an instruction that uses this register in place of the constant. If a complex constant is used inside a loop, it may be a good idea to move the code for generating this outside the loop and keep it in a register inside the loop. This can be done as part of a general optimisation to move code out of loops, see section 7.5. CHAPTER 7. MACHINE-CODE GENERATION 170 7.4 Exploiting complex instructions Most instructions in our intermediate language are atomic, in the sense that they each correspond to a single operation and can not sensibly be split into several smaller steps. IF-THEN-ELSE, The exceptions to this rule are which is described above, and CALL, which will be de- tailed in chapter 9. While the philosophy behind RISC (Reduced Instruction Set Computer) processors advocates that machine-code instructions should be atomic, most RISC processors include a few non-atomic instructions. CISC (Complex Instruction Set Computer) processors have composite (i.e., non-atomic) instructions in abundance. To exploit these composite instructions, several intermediate-language instructions must be grouped together and translated into a single machinecode instruction. For example, the instruction sequence [t2 := t1 + 116, t3 := M [t2 ]] can be translated into the single MIPS instruction lw r3, 116(r1) where r1 and r3 are the registers chosen for However, this is only possible if the value of t2 t1 and t3 , respectively. isn't required later, as the combined instruction doesn't store this intermediate value anywhere. We will, hence, need to know if the contents of a variable is required for later use, or if it is dead after a particular use. When generating intermediate code, most of the temporary variables introduced by the compiler will be single-use and can be marked as such. Any use of a single-use variable will, by denition, be the last use. Alternatively, last-use information can be obtained by analysing the intermediate code, as we shall see in chapter 8. For now, we will just assume that the last use of any variable is marked in the intermediate code. Our next step is to describe each machine-code instruction in terms of one or more intermediate-language instructions. MIPS instruction lw rt , k(rs ) For example, the can be described by the pattern [t := rs + k, rt := M [tlast ]] where tlast indicates that t can't be used afterwards. be used to replace a piece of intermediate code if all the pattern are matched by last A pattern can only last annotations in annotations in the intermediate code. 7.4. 
EXPLOITING COMPLEX INSTRUCTIONS 171 The converse, however, isn't true: It is not harmful to store a value even last annotation in the a last annotation in the if it isn't used later, so a intermediate language need not be matched by pattern. The list of patterns that describe the machine-code instruction set must cover all of the intermediate language. In particular, each single intermediate-language instruction must be covered by a pattern. This lw rt , 0(rs ) to cover [rt := M [rs ]], even though we have form for lw. If there are intermediate- means that we must include the MIPS instruction the intermediate-code sequence already listed a more general language instructions for which there are no equivalent machine-code instruction, a sequence of machine-code instructions must be given for these. Hence, an instruction-set description is a list of pairs, where each pattern (a sequence of intermediate-language instrucreplacement (a sequence of machine-code instructions). pair consists of a tions) and a When translating a sequence of intermediate-code instructions, the code generator can look at the patterns and pick the replacement that covers the largest prex of the intermediate code. A simple way of achieving this is to list the pairs in order of preference (e.g., longest pattern rst) and pick the rst pattern that matches a prex of the intermediate code. This kind of algorithm is called greedy, because it always picks the choice that is best for immediate prot. It will, however, not always yield the optimal solution for the total sequence of intermediate-language instructions. If costs are given for each machine-code instruction sequence in the code-pairs, optimal (i.e., least-cost) solutions can be found for straight-line (i.e., jump-free) code sequences. The least-cost sequence that covers the intermediate code can be found, e.g., using a dynamic- programming algorithm. We will not go into detail about optimal solutions, though. For RISC processors, a greedy algorithm will typically get close to optimal solutions, so the gain by using a better algorithm is small. As an example, gure 7.1 describes a subset of the instructions for the MIPS microprocessor architecture in terms of the intermediate language. Note that we exploit the fact that register 0 is hard-wired to be the value 0 to, e.g., get the addi instruction to generate a constant. We assume we, at this point, have already handled the problem of too-large constants, so any constant remaining in the intermediate code can be used as an immediate constant in an instruction. special cases for IF-THEN-ELSE Note that we make when one of the labels follow the test. Note, also, that we need (at least) two instructions from our MIPS subset to implement an IF-THEN-ELSE instruction that uses less-than as the CHAPTER 7. MACHINE-CODE GENERATION 172 relational operator, while we need only one for comparison by equal. Figure 7.1 does not cover all of the intermediate language, nor does it cover the full MIPS instruction set, but it can be extended to do either or both. The instructions in gure 7.1 are listed in order of priority. This is only important when the pattern for one instruction sequence is a prex of a pattern for another instruction sequence, as is the case with and lw/sw and for the dierent instances of beq/bne and slt. 
addi We can try to use gure 7.1 to select instructions for the following code sequence: a := a + blast , d := c + 8, M [dlast ] := a, IF a = c THEN label1 ELSE label2 , LABEL label2 of this code, add instruction) in gure 7.1 matches a prex so we generate an add instruction for the rst intermediate instruction. We now have two matches for prexes of the remaining Only one pattern (for the code: sw and addi. Since sw is listed rst, we choose this to replace the next two intermediate-language instructions. Finally, beq match the last two instructions. Hence, we generate the code add a, a, b sw a, 8(c) beq a, c, label1 label2 : Note that we retain label2 even though the resulting sequence doesn't refer to it, as some other part of the code might jump to it. We could include single-use annotations for labels like we use for variables, but it is hardly worth the eort, as labels don't generate actual code and hence 1 cost nothing . 7.4.1 Two-address instructions In the above we have assumed that the machine code is three-address code, i.e., that the destination register of an instruction can be distinct from the two operand registers. 1 It is, however, not uncommon that This is, strictly speaking, not entirely true, as superuous labels might inhibit later optimisations. 7.4. EXPLOITING COMPLEX INSTRUCTIONS labelf : labelt : labelf : labelt : label: 173 lw rt , k(rs ) lw lw sw rt , 0(rs ) rt , k (R0) rt , k(rs ) sw sw add add addi addi j beq rt , 0(rs ) rt , k (R0) rd , rs , rt rd , R0, rt rd , rs , k rd , R0, k label rs , rt , labelt bne rs , rt , labelf beq j slt bne rs , rt , labelt labelf rd , rs , rt rd , R0, labelt slt beq rd , rs , rt rd , R0, labelf IF rs < rt THEN labelt ELSE labelf , LABEL labelt slt bne j rd , rs , rt rd , R0, labelt labelf IF rs < rt THEN labelt ELSE labelf t := rs + k, rt := M [tlast ] rt := M [rs ] rt := M [k] t := rs + k, M [tlast ] := rt M [rs ] := rt M [k] := rt rd := rs + rt rd := rt rd := rs + k rd := k GOTO label IF rs = rt THEN labelt ELSE labelf , LABEL labelf IF rs = rt THEN labelt ELSE labelf , LABEL labelt IF rs = rt THEN labelt ELSE labelf IF rs < rt THEN labelt ELSE labelf , LABEL labelf LABEL label Figure 7.1: A subset of the MIPS instruction set CHAPTER 7. MACHINE-CODE GENERATION 174 processors use two-address code, where the destination register is the same as the rst operand register. To handle this, we use patterns like mov add move add rt , rs rt , r s rd , rs rd , r t rt := rs rt := rt + rs rd := rs + rt that use an extra copy-instruction in the case where the destination register is not the same as the rst operand. As we will see in chapter 8, the register allocator will often be able to remove the extra copy-instruction by allocating rd and rs in the same register. 7.5 Optimisations Optimisations can be done by a compiler in three places: In the source code (i.e., on the abstract syntax), in the intermediate code and in the machine code. Some optimisations can be specic to the source language or the machine language, but it makes sense to perform optimisations mainly in the intermediate language, as the optimisations hence can be shared among all the compilers that use the same intermediate language. Also, the intermediate language is typically simpler than both the source language and the machine language, making the eort of doing optimisations smaller. Optimising compilers have a wide array of optimisations that they can employ, but we will mention only a few and just hint at how they can be implemented. Common subexpression elimination. 
a[i]+2, the address for a[i] In the statement a[i] := is calculated twice. This double calcula- tion can be eliminated by storing the address in a temporary variable when the address is rst calculated, and then use this variable instead of calculating the address again. Simple methods for common subexpression elimination work on basic blocks, i.e., straight-line code without jumps or labels, but more advanced methods can eliminate duplicated calculations even across jumps. Code hoisting. If part of the computation inside a loop is independent of the variables that change inside the loop, it can be moved outside the loop and only calculated once. For example, in the loop 7.5. OPTIMISATIONS 175 while (j 10 THEN error ELSE ok2 LABEL ok2 t := i ∗ 4 M [t] := 0 i := i + 1 GOTO f or1 LABEL f or3 Figure 10.7: Intermediate code for for-loop with index check We initialise all in sets to Q, except the in set for the rst instruction, which is initialised to the empty set. After the data-ow analysis reaches a xed-point, the inequalities in in[i] are guaranteed to hold at instruction i. So, if we have an instruction i of the form IF c THEN lt ELSE lf and c is implied by the inequalities in in[i], we can replace the instruction by GOTO lt . If the negation of c is implied by the inequalities in in[i], we can replace the instruction by GOTO lf . We illustrate the analysis by an example. for-loop and assume that the array a Consider the following is declared to go from 0 to 10. for i:=0 to 9 do a[i] := 0; This loop can be translated (with index check) into the intermediate code shown in gure 10.7. Q of possible inequalities in the program are derived from the IF-THEN-ELSE instructions and their negations, i.e., Q = {i ≤ 9, i > 9, i < 0, i ≥ 0, i > 10, i ≤ 10}. The set conditions in the three We leave the xed-point iteration and check elimination as an exer- i := 0 in instruction 1 {i ≤ 9, i ≥ 0, i ≤ 10} and that the assignment instruction 11 preserves i ≥ 0 but invalidates i ≤ 9 and cise to the reader, but note that the assignment implies the inequalities i := i + 1 i ≤ 10. in 10.5. LIMITATIONS OF DATA-FLOW ANALYSES 235 10.5 Limitations of data-ow analyses All of the data-ow analyses we have seen above are approximations: They will not always accurately reect what happens at runtime: The index-check analysis may fail to remove a redundant index check, and the available assignment analysis may say an assignment is unavailable when, in fact, it is available. In all cases, the approximations err on the safe side: It is better to miss an opportunity for optimisation than to make an incorrect optimisation. For liveness analysis, this means that if you are in doubt about a variable being live, you had better say that it is, as assuming it dead might cause its value to be overwritten. When available assignment analysis is used for common subexpression elimination, saying that an assigning is available when it isn't may make the optimisation replace an expression by a variable that does not always hold the same value as the expression, so it is better to leave an assignment out of the set if you are in doubt. It can be shown that no compile-time analysis that seeks to uncover nontrivial information about the run-time behaviour of programs can ever be completely exact. You can make more and more complex analyses that get closer and closer to the exact result, but there will always be programs where the analysis isn't precise. 
So a compiler writer will have to be satised with analyses that nd most cases where an optimisation can be applied, but misses some. 10.6 Loop optimisations Since many programs spend most of their time in loops, it is worthwhile to study optimisations specic for loops. 10.6.1 code hoisting One such optimisation is recognising computations that are repeated in every iteration of the loop without changing the values involved, i.e., loop-invariant computations. We want to lift such computations outside the loop, so they are performed only once. This is called code hoisting. We saw an example of this in section 10.2.3, where calculation of n∗3 was done once before the loop and subsequent re-computations were replaced by a reference to the variable n∗3 a that holds the value of computed before the loop. However, it is only because there was an explicit computation of n∗3 before the loop, that we could avoid CHAPTER 10. ANALYSIS AND OPTIMISATION 236 re-computation inside the loop: Otherwise, the occurrence of n ∗ 3 inside the loop would not have any available assignment that can replace the calculation. So our aim is to move or copy loop-invariant assignments to before the loop, so their result can be reused inside the loop. Moving a computation to before the loop may, however, cause it to be computed even when the loop is not entered. In addition to causing unnecessary computation (which goes against the wish for optimisation), such computations can cause errors when the precondition (the loop condition) is not satised. For example, if the invariant computation is a memory access, the address may be valid only if the loop is entered. A common solution to this problem is to unroll the loop once: A loop of the form (using C-like syntax): while (cond) { body } is transformed to if (cond) then { body while (cond) { body } } Similarly, a test-at-bottom loop of the form do body while (cond) can be unrolled to body while (cond) { body } Now, we can safely calculate the invariant parts in the rst copy of the body and reuse the results in the loop. If the compiler does common subexpression elimination, this unrolling is all that is required to do code 10.6. LOOP OPTIMISATIONS 237 hoisting  assuming the unrolling is done before common-subexpression elimination. Unrolling of loops is most easily done at source-code level (i.e., on the abstract syntax tree), so this is no problem. This unrolling will, of course, increase the size of the compiled program, so it should be done with care if the loop body is large. 10.6.2 Memory prefetching If a loop goes through a large array, it is likely that parts of the array will not be in the cache of the processor. memory is much Since access to non-cached slower than access to cached memory, we would like to avoid this. Many modern processors have memory prefetch instructions that tell the processor to load the contents of an address into cache, but unlike a normal load, a memory prefetch doesn't cause errors if the address is invalid, and it returns immediately without waiting for the load to complete. So a way to ensure that an array element is in the cache is to issue a prefetch of the array element well in advance of its use, but not so well in advance that it is likely that it will be evicted from the cache between the prefetch and the use. 
Given modern cache sizes and timings, 25 to 10000 cycles ahead of the use is a reasonable time for prefetching  less than 25 increases the risk that the prefects is not completed before the use, and more than 10000 increases the chance that the value will be evicted from the cache before use. A prefetch instruction usually loads an entire cache line, which is typically four or eight words, so we don't have to explicitly prefetch every array element  every fourth element is enough. So, assume we have a loop that adds the elements of an array: sum = 0; for (i=0; i<100000; i++) sum += a[i]; } we can rewrite this to sum = 0; for (i=0; i<100000; i++) { if (i&3 == 0) prefetch a[i+32]; sum += a[i]; } CHAPTER 10. ANALYSIS AND OPTIMISATION 238 where prefetch a[i+32] prefetches the element of a that is 32 places after the current element. The number 32 is rather arbitrary, but makes the number of cycles between prefetch and use lie in the interval mentioned above. Note that we used the test i%4==0, i&3==0, which is equivalent to but somewhat faster. We don't have to worry about prefetching past the end of the array  prefetching will never cause runtime errors, so at worst we prefetch something that we don't need. While this transformation adds a test (that takes time), the potential savings by having all array elements in cache before use are much larger. The overhead of testing can be reduced by unrolling the loop body: sum = 0; for (i=0; i<100000; i++) { prefetch a[i+32]; sum += a[i]; i++; sum += a[i]; i++; sum += a[i]; i++; sum += a[i]; } This should, of course, only be done if the loop body is small. We have exploited that the number of iterations is a multiple of 4, so the exit test is not needed at every increment of i. If we don't know this, the exit test must be replicated after each increase of i, like shown here: sum = 0; for (i=0; i0) if (n%2 == 0) { x = x*x; n = n/2; } else { p = p*x; n = n-1; } return(p); } If we have a call power(x,5), we can replace this by a call power5(x) to a specialised function. We now need to add a denition of this specialised function to the program. The most obvious idea would be to take the power function and replace all occurrences of n by 5, but this won't work, as n changes value inside the body of power. What above code for the we do instead is to observe the following: 1. The loop condition depends only on 2. Every change to n depends only on n. n. So it is safe to unroll the loop at compile-time, doing all the computations on n, but leaving the computations on p and x for run-time. the following specialised denition: double power5(double x) { double p=1.0; p = p*x; x = x*x; x = x*x; p = p*x; return(p); } This yields 10.9. FURTHER READING 245 power5(x) is, obviously, a lot faster than executing power(x,5). power5 is fairly small, we can additionally inline the call, as de- Executing Since scribed in section 10.7.1. This kind of specialisation may not always be applicable, even if a function call has constant parameters, for example if the call was power(3.14159,p), where p is not a constant, but when the method is applicable, the speedup can be dramatic. Similar specialisation techniques are used in C++ compilers for compiling templates: When a call species template parameters, the denition of the template is specialised with respect to the actual template parameters. Since templates in C++ can be recursively dened, an innite number of specialised versions might be required. 
Most C++ compilers put a limit on the recursion level of template instantiations and stop with an error message when this limit is exceeded. 10.9 Further reading We have covered only a small portion of the optimisations that are found in optimising compilers. More examples of optimisations (including value numbering) can be found in advanced compiler textbooks, such as [4, 7, 9, 30]. A detailed treatment of program analysis can be found in [31]. Specialisation techniques like those mentioned in section 10.8 are covered in more detail in [16, 20]. The book [29] has good articles on both program analysis and transformation. Additionally, the conferences Compiler Construction (CC), Programming Language Design and Implementation (PLDI) and other programming-language-oriented conferences often present new optimisation techniques, so past proceedings from these is a good source for advanced optimisation methods. Exercises Exercise 10.1 In the program in gure 10.2, replace instructions 13 and 14 by 13: 14: h := n ∗ 3 IF i < h THEN loop ELSE end CHAPTER 10. ANALYSIS AND OPTIMISATION 246 a) Repeat common subexpression elimination on this modied program. b) Repeat, again, common subexpression elimination on the modied program, but, prior to the xed-point iteration, initialise all sets to the empty set instead of the set of all assignments. What dierences does this make to the nal result of xed-point iteration, and what consequences do these dierences have for the optimisation? Exercise 10.2 Write a program that has jumps to jumps and perform jump-to-jump optimisation of it as described in section 10.3. Try to make the program cover all the three optimisation cases described at the end of section 10.3. Exercise 10.3 a) As described in the beginning of section 10.4, add extra labels and gotos for each IF-THEN-ELSE in the program in gure 10.7. b) Do the xed-point iteration for index-check elimination on the result. c) Eliminate the redundant tests. d) Do jump-to-jump elimination as described in section 10.3 on the result to remove the extra labels and gotos introduced in question a. Exercise 10.4 Section 10.7.2 describes tail-call optimisation for a pure caller-saves strategy. Things become somewhat more complicated when you use a mixed caller-saves/callee-saves strategy. Using the call sequence from gure 9.10 and the epilogue from gure 9.9, describe how the combined sequence can be rewritten to get some degree of tail-call optimisation. Your main focus should be on reusing stack space, and secondarily on saving time. Exercise 10.5 Specialise the power function in section 10.8 to n = 12. Chapter 11 Bootstrapping a compiler 11.1 Introduction When writing a compiler, one will usually prefer to write it in a highlevel language. A possible choice is to use a language that is already available on the machine where the compiler should eventually run. It is, however, quite common to be in the following situation: You have a completely new processor for which no compilers exist yet. Nevertheless, you want to have a compiler that not only targets this processor, but also runs on it. In other words, you want to write a compiler for a language A, targeting language B (the machine language) and written in language B. The most obvious approach is to write the compiler in language B. But if B is machine language, it is a horrible job to write any non-trivial compiler in this language. 
Instead, it is customary to use a process called bootstrapping, referring to the seemingly impossible task of pulling oneself up by the bootstraps. The idea of bootstrapping is simple: You write your compiler in language A (but still let it target B) and then let it compile itself. The result is a compiler from A to B written in B. It may sound a bit paradoxical to let the compiler compile itself: In order to use the compiler to compile a program, we must already have compiled it, and to do this we must use the compiler. In a way, it is a bit like the chicken-and-egg paradox. We shall shortly see how this apparent paradox is resolved, but rst we will introduce some useful notation. 247 CHAPTER 11. BOOTSTRAPPING A COMPILER 248 11.2 Notation We will use a notation designed by H. Bratman [10]. The notation is hence called Bratman diagrams or, because of their shape, Tdiagrams . In this notation, a compiler written in language C, compiling from the language A and targeting the language B is represented by the diagram A B C In order to use this compiler, it must stand on a solid foundation, i.e., something capable of executing programs written in the language C. This something can be a machine that executes C as machine-code or an interpreter for C running on some other machine or interpreter. Any number of interpreters can be put on top of each other, but at the bottom of it all, we need a real machine. An interpreter written in the language D and interpreting the language C, is represented by the diagram C D A machine that directly executes language D is written as JJD The pointed bottom indicates that a machine need not stand on anything; it is itself the foundation that other things must stand on. When we want to represent an unspecied program (which can be a compiler, an interpreter or something else entirely) written in language D, we write it as D These gures can be combined to represent executions of programs. For example, running a program on a machine D is written as 11.2. NOTATION 249 D JJD Note that the languages must match: The program must be written in the language that the machine executes. We can insert an interpreter into this picture: C C D JJD Note that, also here, the languages must match: The interpreter can only interpret programs written in the language that it interprets. We can run a compiler and use this to compile a program: A A B B C JJC The input to the compiler (i.e., the source program) is shown at the left of the compiler, and the resulting output (i.e., the target program) is shown on the right. Note that the languages match at every connection and that the source and target program aren't standing on anything, as they aren't executed in this diagram. We can insert an interpreter in the above diagram: CHAPTER 11. BOOTSTRAPPING A COMPILER 250 A A B B C C D JD J 11.3 Compiling compilers The basic idea in bootstrapping is to use compilers to compile themselves or other compilers. We do, however, need a solid foundation in form of a machine to run the compilers on. It often happens that a compiler does exist for the desired source language, it just doesn't run on the desired machine. Let us, for example, assume we want a compiler for ML to Pentium machine code and want this to run on a Pentium. We have access to an ML-compiler that generates HP PA-RISC machine code and runs on an HP machine, which we also have access to. 
One way of obtaining the desired compiler would be to do binary translation, i.e., to write a compiler from HP machine code to Pentium machine code. This will allow the translated compiler to run on a Pentium, but it will still generate HP code. We can use the HP-to-Pentium compiler to translate this into Pentium code afterwards, but this introduces several problems: • Adding an extra pass makes the compilation process take longer. • Some eciency will be lost in the translation. • We still need to make the HP-to-Pentium compiler run on the Pentium machine. A better solution is to write an ML-to-Pentium compiler in ML. We can compile this using the ML compiler on the HP: 11.3. COMPILING COMPILERS ML P ML ML ML 251 P HP HP HP HP JJ where P is short for Pentium. Now, we can run the ML-to-Pentium compiler on the HP and let it 1 compile itself : ML P ML ML ML P P P HP HP JJ We have now obtained the desired compiler. Note that the compiler can now be used to compile itself directly on the Pentium platform. This can be useful if the compiler is later extended or, simply, as a partial test of correctness: If the compiler, when compiling itself, yields a dierent object code than the one obtained with the above process, it must contain an error. The converse isn't true: Even if the same target is obtained, there may still be errors in the compiler. It is possible to combine the two above diagrams to a single diagram that covers both executions: ML ML P ML P ML ML ML ML P P P HP HP HP HP JJ HP JJ In this diagram, the ML-to-Pentium compiler written in HP has two roles: It is the output of the rst compilation and the compiler that runs the second compilation. Such combinations can, however, be a bit confusing: The compiler that is the input to the second compilation step looks like it is also the output of the leftmost compiler. In this case, 1 When a program is compiled and hence, strictly speaking, isn't textually the same, we still regard it as the same program. CHAPTER 11. BOOTSTRAPPING A COMPILER 252 the confusion is avoided because the leftmost compiler isn't running and because the languages doesn't match. Still, diagrams that combine several executions should be used with care. 11.3.1 Full bootstrap The above bootstrapping process relies on an existing compiler for the desired language, albeit running on a dierent machine. It is, hence, often called half bootstrapping. When no existing compiler is available, e.g., when a new language has been designed, we need to use a more complicated process called full bootstrapping. A common method is to write a QAD (quick and dirty) compiler using an existing language. This compiler needs not generate code for the desired target machine (as long as the generated code can be made to run on some existing platform), nor does it have to generate good code. The important thing is that it allows programs in the new language to be executed. Additionally, the real compiler is written in the new language and will be bootstrapped using the QAD compiler. As an example, let us assume we design a new language M+. We, initially, write a compiler from M+ to ML in ML. The rst step is to compile this, so it can run on some machine: M+ M+ ML ML ML ML HP HP HP HP JJ The QAD compiler can now be used to compile the real compiler: M+ M+ HP M+ M+ HP ML ML HP HP JJ The result is an ML program, which we need to compile: 11.3. COMPILING COMPILERS M+ M+ HP ML ML 253 HP HP HP HP HP JJ The result of this is a compiler with the desired functionality, but it will probably run slowly. 
The reason is that it has been compiled by using the QAD compiler (in combination with the ML compiler). A better result can be obtained by letting the generated compiler compile itself: M+ M+ HP M+ M+ HP HP HP HP HP JJ This yields a compiler with the same functionality as the above, i.e., it will generate the same code, but, since the real compiler has been used to compile it, it will run faster. The need for this extra step might be a bit clearer if we had let the real compiler generate Pentium code instead, as it would then be obvious that the last step is required to get the compiler to run on the same machine that it targets. We chose the target language to make a point: Bootstrapping might not be complete even if a compiler with the right functionality has been obtained. Using an interpreter Instead of writing a QAD compiler, we can write a QAD interpreter. In our example, we could write an M+ interpreter in ML. We would rst need to compile this: M+ M+ ML ML HP HP HP HP JJ We can then use this to run the M+ compiler directly: CHAPTER 11. BOOTSTRAPPING A COMPILER 254 M+ M+ HP M+ M+ HP HP HP M+ M+ HP HP JJ Since the real compiler has been used to do the compilation, nothing will be gained by using the generated compiler to compile itself, though this step can still be used as a test and for extensions. Though using an interpreter requires fewer steps, this shouldn't really be a consideration, as the computer(s) will do all the work in these steps. What is important is the amount of code that needs to be written by hand. For some languages, a QAD compiler will be easier to write than an interpreter, and for other languages an interpreter is easier. The relative ease/diculty may also depend on the language used to implement the QAD interpreter/compiler. Incremental bootstrapping It is also possible to build the new language and its compiler incrementally. The rst step is to write a compiler for a small subset of the language, using that same subset to write it. This rst compiler must be bootstrapped in one of the ways described earlier, but thereafter the following process is done repeatedly: 1) Extend the language subset slightly. 2) Extend the compiler so it compiles the extended subset, but without using the new features. 3) Use the previous compiler to compile the new. In each step, the features introduced in the previous step can be used in the compiler. Even when the full language is compiled, the process can be continued to improve the quality of the compiler. 11.4. FURTHER READING 255 11.4 Further reading Bratman's original article, [10], only describes the T-shaped diagrams. The notation for interpreters, machines and unspecied programs was added later in [14]. The rst Pascal compiler [36] was made using incremental bootstrapping. Though we in section 11.3 dismissed binary translation as unsuitable for porting a compiler to a new machine, it is occasionally used. The advantage of this approach is that a single binary translator can port any number of programs, not just compilers. It was used by Digital Equipment Corporation in their FX!32 software [17] to enable programs compiled for Windows on a Pentium-platform to run on their Alpha RISC processor. Exercises Exercise 11.1 You have a machine that can execute Alpha machine code and the fol- lowing programs: 1: A compiler from C to Alpha machine code written in Alpha machine code. 2: An interpreter for ML written in C. 3: A compiler from ML to C written in ML. 
Now do the following: a) Describe the above programs and machine as diagrams. b) Show how a compiler from ML to C written in Alpha machine code can be generated from the above components. The generated program must be stand-alone, i.e., it may not consist of an interpreter and an interpreted program. c) Show how the compiler generated in question b can be used in a process that compiles ML programs to Alpha machine code. CHAPTER 11. BOOTSTRAPPING A COMPILER 256 Exercise 11.2 A source-code optimiser is a program that can optimise programs at source-code level, another program i.e., a program O that reads a program P P 0 , which is equivalent to P , but may be and outputs faster. A source-code optimiser is like a compiler, except the source and target languages are the same. Hence, we can describe a source-code optimizer for C written in C with the diagram C C C Assume that you additionally have the following components: • A compiler, written in ARM code, from C to ARM code. • A machine that can execute ARM code. • Some unspecied program P written in C. Now do the following: a) Describe the above components as diagrams. b) Show, using Bratman diagrams, the steps required to optimise to P0 and then execute P 0. P Bibliography [1] A. Aasa. Precedences in specication and implementations of programming languages. In J. Maluszy«ski and M. Wirsing, editors, Proceedings of the Third International Symposium on Programming Language Implementation and Logic Programming, number 528 in LNCS, pages 183194. Springer Verlag, 1991. [2] Harold Abelson, Gerald Jay Sussman, and Julie Sussman. and Interpretation of Computer Programs. Structure MIT Press, 1996. Also downloadable from http://mitpress.mit.edu/sicp/full-text/sicp/book/. [3] Alfred V. Aho, John E. Hopcroft, and Jerey D. Ullman. sign and Analysis of Computer Algorithms. The De- Addison-Wesley, 1974. [4] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jerey D. Ullman. Compilers; Principles, Techniques and Tools. Addison-Wesley, 2007. Newer edition of [5]. [5] Alfred V. Aho, Ravi Sethi, and Jerey D. Ullman. ciples, Techniques and Tools. Compilers; Prin- Addison-Wesley, 1986. Rereleased in extended form as [4]. [6] Hassan Aït-Kaci. struction. Warren's Abstract Machine  A Tutorial Recon- MIT Press, 1991. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann, [7] John R. Allen and Ken Kennedy. 2001. [8] Andrew W. Appel. Compiling with Continuations. Cambridge Uni- versity Press, 1992. [9] Andrew W. Appel. Modern Compiler Implementation in ML. bridge University Press, 1998. 257 Cam- BIBLIOGRAPHY 258 [10] H. Bratman. An alternative form of the `uncol' diagram. nications of the ACM, Commu- 4(3):142, 1961. Register Allocation via Graph Coloring, Tech. Rept. CPC-TR94517-S. PhD thesis, Rice University, Center for Research [11] Preston Briggs. on Parallel Computation, Apr. 1992. [12] J. A. Brzozowski. Derivatives of regular expressions. ACM, Journal of the 1(4):481494, 1964. [13] Noam Chomsky. Three models for the description of language. Transactions on Information Theory, IRE IT-2(3):113124, 1956. [14] J. Earley and H. Sturgis. A formalism for translator interactions. Communications of the ACM, 13:607617, 1970. [15] Peter Naur (ed.). Revised report on the algorithmic language algol 60. Communications of the ACM, 6(1):117, 1963. Partial Evaluation: Practice and Theory, volume 1706 of Lecture Notes in Computer Science. Springer Verlag, 1999. [16] John Hatcli, Torben Mogensen, and Peter Thiemann (Eds.). [17] Raymond J. 
Hookway and Mark A. Herdeg. Digital FX!32: Combining emulation and binary translation. http://www.cs.tufts.edu/comp/150PAT/optimization/DTJP01PF.pdf, 1997.
[18] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation, 2nd ed. Addison-Wesley, 2001.
[19] Kathleen Jensen and Niklaus Wirth. Pascal User Manual and Report (2nd ed.). Springer-Verlag, 1975.
[20] Neil D. Jones, Carsten Gomard, and Peter Sestoft. Partial Evaluation and Automatic Program Generation. Prentice Hall, 1993.
[21] Simon L. Peyton Jones and David Lester. Implementing Functional Languages - A Tutorial. Prentice Hall, 1992.
[22] J. P. Keller and R. Paige. Program derivation with verified transformations - a case study. Communications in Pure and Applied Mathematics, 48(9-10), 1996.
[23] B. W. Kernighan and D. M. Ritchie. The C Programming Language. Prentice-Hall, 1978.
[24] M. E. Lesk. Lex: a Lexical Analyzer Generator. Technical Report 39, AT&T Bell Laboratories, Murray Hill, N. J., 1975.
[25] T. Lindholm and F. Yellin. The Java Virtual Machine Specification, 2nd ed. Addison-Wesley, Reading, Massachusetts, 1999.
[26] R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. IEEE Transactions on Electronic Computers, 9(1):39-47, 1960.
[27] Robin Milner. A theory of type polymorphism in programming. Journal of Computational Systems Science, 17(3):348-375, 1978.
[28] Robin Milner. Communication and Concurrency. Prentice-Hall, 1989.
[29] Torben Æ. Mogensen, David A. Schmidt, and I. Hal Sudborough, editors. The essence of computation: complexity, analysis, transformation. Springer-Verlag New York, Inc., New York, NY, USA, 2002.
[30] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[31] Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999.
[32] Chris Okasaki. Purely Functional Data Structures. Cambridge University Press, 1998.
[33] David A. Patterson and John L. Hennessy. Computer Organization & Design, the Hardware/Software Interface. Morgan Kaufmann, 1998.
[34] Vern Paxson. Flex, version 2.5, a fast scanner generator. http://www.gnu.org/software/flex/manual/html_mono/flex.html, 1995.
[35] Mikkel Thorup. All structured programs have small tree-width and good register allocation. Information and Computation, 142(2):159-181, 1998.
[36] Niklaus Wirth. The design of a Pascal compiler. Software - Practice and Experience, 1(4):309-333, 1971.
Software - Practice Index abstract syntax, 101, 122 C++, 245 accept, 91, 95, 96 cache, 237 action, 43, 100, 101 cache line, 237 activation record, 198 call sequence, 239 alias, 211, 212 call stack, 197 allocation, 155, 216 call-by-reference, 211 Alpha, 169, 255 call-by-value, 197 alphabet, 10 call-sequence, 199 ARM, 169 callee-saves, 202, 204 array, 237 caller-saves, 202, 204 assembly, 3 caller/callee, 197 assignment, 139 calling convention, 199 associative, 67, 68 CISC, 170 attribute, 121 coalescing, 193 inherited, 122 code generator, 168, 171 synthesised, 121 code hoisting, 174, available assignments, 223 235 column-major, 156 comments back-end, 135 nested, 43 biased colouring, 192 common subexpression elimination, binary translation, 255 174, 223, binding 228, 236 compile-time, 140 dynamic, 115 compiling compilers, 250 static, 115 conict, 83, 88, 97, 99, 104 bootstrapping, 247, 250 reduce-reduce, 97, 99 full, 252 shift-reduce, 97, 99 half, 252 consistent, 32 incremental, 254 constant in operand, 169 Bratman diagram, 248 constant propagation, 175 C, 4, 41, 67, 70, 102, 104, 107, 119, context-free, 121 141, 147, 149, 152, 156, 210, grammar, 55, 212, 217, 236, 244 language, 106 261 56, 61 INDEX 262 dangling-else, 70, 97, 99 FORTRAN, 41 data-ow analysis, 222, 235 frame, 198 dead code elimination, 231 frame pointer, 199 dead variable, 170, 180 front-end, 136 declaration, 115 function call, 139, 239 function calls, 167, 197 global, 115 local, 115 derivation, 60, 60, 61, 71, 82 left, 64, 80 global variable, 210 right, 64, 89 go, 91, 93 rightmost, 61 21, 46, 90, 91 combined, 39 converting NFA to, 23, 27 equivalence of, 31 minimisation, gen and kill sets, 181 generic types, 131 leftmost, 61 DFA, 17, functional, 116 31, 32, 38 unique minimal, 31 grammar, 71 ambiguous, 6466, 68, 72, 76, 85, 97 equivalent, 65 graph colouring, 186, 187 greedy algorithm, 171 Digital Vax, 218 hashing, 119 distributive, 26 Haskell, 104, 117 domain specic language, 5 heuristics, 186, dynamic programming, 171 190 IA-32, 169 environment, 116, 124 IA-64, 169 epilogue, 199, 239 IBM System/370, 218 epsilon transition, 16 imperative, 116 epsilon-closure, implicit types, 132 24 in and out sets, 181 FA, 17 index check, 159 nite automaton translation of, 159 graphical notation, 17 nite automaton, 10, deterministic, 21 nondeterministic, FIRST, 16 17 73, 76 index-check elimination, 175 index-check elimination, 231 inlining, 239 instruction set description, 171 xed-point, 25, 74, 76, 182, 184 integer, 15, 139 ag, 168 interference, 185 arithmetic, 169 interference graph, 185 oating-point constant, 15 intermediate code, 3, 135, 179 oating-point numbers, 139 intermediate language, 3, 136, 167, FOLLOW, 77 174 INDEX 263 tree-structured, 176 interpreter, 3, 135, 137, 248 machine code, 3, 135, 137, 167 machine language, 179 memory transfer, 139 Java, 41, 102, 136 MIPS, 169, 170, jump, 139 conditional, 139, 168 171, 176, 218 monotonic, 24 jump-to-jump optimisation, 165, 228 name space, 119, 122 just-in-time compilation, 136 NFA, 17, 91, 93, 105 keyword, 14 combined, 38 converting to DFA, 23, 27 label, 139 LALR(1), 89, 100, 107 language, 10, 61 context-free, 106 high-level, 135, 247 left-associative, 66, 99 left-derivation, 72 left-factorisation, 87 left-recursion, 67, 68, 87, 102 elimination of, 85 indirect, 86 lexer, 9, nested scopes, 212, 214 fragment, 19 non-associative, 67, 99 non-local variable, 210 non-recursive, 68 nonterminal, 56 Nullable, 73, 76 operator, 139 operator hierarchy, 66 optimisations, 174 overloading, 130 37, 71 lexer 
generator, 37, 42 PA-RISC, 169 lexical, 9 parser, 65 analysis, 9 generator, 66, 100, 104 error, 42 predictive, 71, 72, 77 lexical analysis, 2 shift-reduce, 90 lexing, 121 table-driven, 89 linking, 3 top-down, 71 LISP, 217 parsing, 55, 64, 121 live variable, 180, 197 at end of procedure, 182 live-range splitting, 193 liveness, 180 predictive, 76, 77, 80, 81 table-driven, 82 Pascal, 4, 67, 70, 102, 107, 119, liveness analysis, 181 LL(1), 56, 80, bottom-up, 72 82, 107 85, 89, 97, 102, 211, 212 pattern, 170 Pentium, 255 local variables, 197 persistent, 116, 117 longest prex, 41 pointer, 211, 212 lookahead, 80 polymorphism, 131 LR, 89 PowerPC, 169 INDEX 264 precedence, 59, 65, 66, 68, 69, 89, 97 algorithm, 92 construction of table, 91, 96 declaration, 97, 99, 106 SML, 4, 41, 67, 104, 117, 119, 212 rules, 66 source program, 249 prefetch, 237 Sparc, 169 processor, 247 spill, 199 production, 56, 58 spill-code, 190 188 empty, 57, 76 spilling, 179, nullable, 73, 77 stack automaton, 55 stack automaton, 106 prologue, 199, 239 recursive descent, 81 stack pointer, 216 start symbol, 56, 71 reduce, 90, 91, 95 starting state, 16 register, 179 state, 16, 17 for passing function parameters, 204 accepting, 17, 19, 28, 32, 37 dead, 35 register allocation, 3, 167, 179 nal, 17 by graph colouring, 186 initial, 17 global, 185 starting, 16, 17, 19, 27, 38 register allocator, 208 static links, 214 regular expression, subset construction, 27 10, 43 converting to NFA, 19 equivalence of, 31 symbol table, 116, 116, 124 implemented as function, 118 regular language, 31, 44 implemented as list, 117 return address, 198, 204 implemented as stack, 118 right-associative, 67, 99 syntactical category, 122 right-recursion, 68 syntax analysis, 2, 9, 55, 60, 64, RISC, 167, 170, 204 syntax tree, 55, 61, 71, 86 row-major, 156 T-diagram, 248 run-time, 140 tail call, 240 Scheme, 104, 117 tail call optimisation, 240 scope, 115 target program, 249 nested, 212, 214 select, 188 templates, 245 terminal, 56 sequential logical operators, 148, 149 token, 9, 37, 39, 43, 71 set constraints, 78 set equation, 24, shift, 90, 91, 93 24 transition, 16, 17, 28, 32 epsilon, 16, 94 translation simplify, 187 of arrays, 153 SLR, 56, 89, 97 of case-statements, 152 71 INDEX 265 of declarations, 160 of expressions, 140 of function, 209 of index checks, 159 of logical operators, 147, 149 of multi-dimensional arrays, 156 of non-zero-based arrays, 159 of records/structs, 160 of statements, 144 of strings, 159 of of break/exit/continue, goto, 152 152 type checking, 2, 121, 124 of assignments, 130 of data structures, 130 of expressions, 124 of function declarations, 127 of programs, 127 type conversion, 131 type error, 124 undecidable, 65 value numbering, 228 value numbering, 245 variable global, 210 non-local, 210 variable name, 14 white-space, 9, 43 word length, 154 work-list algorithm, 26