Architecture for a Next-Generation GCC Chris Lattner Vikram Adve University of Illinois at Urbana, Champaign {lattner, vadve}@cs.uiuc.edu http://llvm.cs.uiuc.edu Abstract This pap er presents a design and implementation of a whole-program interpro cedural optimizer built in the GCC framework. Through the intro duction of a new language-indep endent intermediate representation, we extend the current GCC architecture to include a p owerful mid-level optimizer and add link-time interpro cedural analysis and optimization capabilities. This intermediate representation is an SSA-based, low-level, strongly-typ ed, representation which is designed to supp ort b oth efficient global optimizations and high-level analyses. Because most of the program is available at link-time, aggressive “whole-program” optimizations and analyses are p ossible, improving the time and space requirements of compiled programs. The final prop osed organization of GCC retains the imp ortant features which make it successful to day, requires almost no mo dification to either the frontor back-ends of GCC, and is completely compatible with user makefiles. compiling a statement at a time, to compiling and optimizing entire functions at a time, to the (still very new) supp ort for unit-at-a-time compilation (compiling and optimizing all of the functions in a translation unit together). As the scop e for analysis and optimizations increases, the compiler is b etter able to reduce the time and space requirements for the generated co de. This pap er prop oses the next logical step for the GCC optimizer: extend it to b e able to analyze and optimize whole programs at link-time1 , enabling new optimizations and making existing analyses and optimizations more p owerful. For example: • • • • • • • inlining across translation units whole-program alias analysis interpro cedural register allo cation interpro cedural constant propagation data layout optimizations exception handling space optimizations sorting initializer priorities at link-time The key challenges to whole-program optimization are to enable p owerful transformations while keeping compile times reasonable, and to keep the uservisible development pro cess unchanged (e.g. user makefiles). The architecture that we prop ose is based on a new language-indep endent low-level co de representation that preserves imp ortant typ e information from the source co de. The use of a low-level, SSA-based representation allows the compiler to p erform a variety of optimizations at compile time, off-loading work from the link-time optimizer. However, the linktime optimizer can only p erform meaningful optimizations on the program if it has enough high-level information ab out the program to prove that aggressive optimizations are safe. Because of this, the lowlevel co de representation is typ ed (using a languageindep endent constructive typ e system) and directly 1 This capability would b e optional and could b e enabled only when the program is compiled at the “-O4” level of optimization, for example. 1 Intro duction The GNU Compiler Collection (GCC) [15] is in many ways the centerpiece of the Free Software movement. It supp orts several source languages and a plethora of back-ends for various targets, providing a unified target for free software. GCC has b een successful b ecause of its extreme p ortability, stability, and b ecause it is able to compile and optimize several p opular source languages (C, C++ , Java, etc) to each target. Unfortunately, despite the success of the GCC compiler suite as a whole, the optimization infrastructure is still not comp etitive with commercial compilers. Over the years, the GCC optimizer has evolved from Front-End Compiler Source File Source File Source File Source File GCC FE Mid-Level Opt GCC FE Mid-Level Opt GCC FE Mid-Level Opt GCC FE Mid-Level Opt "LLVM" Assembly Optimizing Linker .o files Unmodified binutils Toolchain .s file Whole Program RTL Expansion Link Analysis and Optimization Native Codegen Native Assembler native .a /.so files Native Linker ".exe" file .a files Figure 1: High-Level Compiler Architecture for Whole-Program Optimization exp oses information ab out structure and array accesses to the optimizer. The link-time optimizer is designed to combine the translation units of a program together and do the final whole-program optimization. After the program is optimized, machine co de is generated at link-time for the entire program at once, allowing a variety of interpro cedural low-level co de optimizations to b e p erformed. The Low-Level Virtual Machine (LLVM) [10] is an implementation of the architecture and intermediate representation [11] describ ed in this pap er, which allows us to b e more concrete when describing asp ects of the design. This system has served as the host for several research pro jects [7, 13, 12] which require whole-program information as well as a host for a variety of traditional compiler optimizations. We hop e that the lessons learned by the LLVM pro ject will b e useful to the GCC community, and are willing to contribute as much co de to the GNU pro ject as there is interest in. We are planning to have our first public release of LLVM, with a lib eral license, in the Summer of 2003. However, LLVM will only b e discussed when it helps clarify the ideas in the prop osed architecture, this pap er is intended to b e a GCC pap er, not an LLVM pap er. This pap er is organized as follows: Section 2 describ es the prop osed high-level architecture in detail, including mo difications that would need to b e made to the GCC infrastructure. Section 3 describ es imp ortant asp ects of the prop osed intermediate representation for the system. Section 4 describ es LLVM, our existing implementation of the prop osed design. Section 5 describ es other work related to the prop osed design, and Section 6 wraps up the pap er. 2 High-Level Compiler Architecture The prop osed high-level architecture is illustrated in Figure 1. The essential asp ect of this design is that it separates the current cc1 program into two comp onents: a front-end compiler and an optimizing linker. The front-end retains all of the resp onsibilities of current GCC front-ends (prepro cessing, lexical analysis, parsing, semantic analysis, etc..) and should work unmo dified in the new system. After each function is parsed and checked for semantic errors it is “expanded” from the “tree” representation to the new language-indep endent intermediate representation (describ ed in Section 3). Once the entire translation unit has b een translated (and if no errors have o ccurred), a standard set of mid-level optimizations are p erformed on the translated mo dule. After these optimizations are finished, a “.o” file is emitted which contains IR assembly co de for the representation. When the optimizing linker is invoked, it reads in all of the translated IR files and any libraries compiled to the intermediate representation. It links these files together into a single-file representation of the program, on which it can run whole-program analyses and optimizations. Finally, once these analyses and transformations are complete, the GCC backend is invoked to expand the intermediate representation into RTL and use the configured target description to pro duce a native .s file. After the optimizing linker pro duces a native .s file, the compilation pro cess pro ceeds through the standard system assembler and linker (to resolve any symb ols in libraries that were not available in the IR form), finally pro ducing a native executable. 2.1 Compatibility and Implementation One of the key features of this design is that it is compatible with the standard “compile and link” mo dels of compilation, and is thus fully compatible with existing makefiles. In order to provide this compatibility, the link phase of the gcc compiler driver is extended to invoke the optimizing linker and system assembler (if necessary) during the standard link step of the compile pro cess. In this way, any input files that are in the IR format are automatically linked together and optimized without interfering with the compilation and linking of standard translation units and libraries. If no files in the IR format are present, the entire invo cation of the optimizing linker is skipp ed. Another imp ortant asp ect of the design is how the compiler works when whole-program optimization is not enabled. If not enabled, each translation unit is either compiled a function at a time or a unit at a time (dep ending on the setting of the -funit-at-a-time switch), through the midlevel optimizer, RTL expansion, and co de generation phases of the compiler. This pro duces a native .s file, which can b e pro cessed with the standard system assembler and linker, as b efore. For this approach to b e feasible, a large amount of co de must b e shared b etween the optimizing linker and the compiler front-ends. This can either b e accomplished through the use of libraries that are shared b etween the two (which would contain the existing GCC back-end, and any shared optimizations on the IR), or by making b oth logical pieces b e part of the same binary. In either case, the actual organization of the existing GCC co de base would not have to change in any substantial way. piled and the application must b e relinked. In order to reduce the amount of work that must b e done, this design allows most traditional optimizations to b e p erformed in the compiler front-end stage, rather than requiring all optimization to o ccur at the link stage (as is common for whole-program optimizers). Because most aggressive scalar optimizations are p erformed at compile-time, they would not need to b e rerun at link time, reducing the time for compilation. Of course, the compiler p erformance issue do es not even arise unless the user is mo difying the program and recompiling at -O4. Optionally with this design, the compiler could try to minimize the amount of recompilation necessary when a change o ccurs by keeping track of which interpro cedural information is used to mo dify functions in other translation units, building a dep endence graph b etween the mo dules [4]. In practice, however, this would make the compiler much more complicated and prone to subtle bugs that are hard to repro duce. We feel that although the cost of recompilation is still fairly substantial in our system (native co de must b e regenerated for the entire application), that the extra complexity intro duced into the compiler must b e weighed against the recompilation time p enalties, and thus may b e impractical. 3 Co de Representation 2.2 Architectural Issues Affecting Performance In addition to providing the desired functionality and compatibility with existing systems, it is crucial that the compiler do es not slow down unacceptably — even if whole-program optimization is only enabled at -O4. In practical terms, this design addresses the issue by p erforming as much optimization as p ossible at compile time. Any time a source file is changed, it must b e recom- The representation used to analyze and manipulate the program determines what kinds of transformations are p ossible and when in the compilation pro cess they must b e p erformed to b e successful. As mentioned earlier, we prop ose using a languageindep endent, low-level, SSA-based, strongly-typ ed representation as the sole representation used for the mid-level and link-time optimizers. This representation is a first-class assembly language, which includes all of the information necessary to represent the program (and is in fact directly interpretable). Concrete details of the representation used by LLVM are included in Section 3.2. Using a low-level three-address co de representation based on Static Single Assignment [6] form enables the direct application of many well-known and efficient global optimizations. SSA form p ermits sparse optimizations that do not, in general, require bitvector data-flow analysis to compute results. Using a three-address co de representation (as opp osed to an tree structured representation) also makes transformations easy to develop and reason ab out. Many transformations need information ab out the high-level b ehavior of the program to b e effective. In order to preserve this information, we prop ose that the representation maintain a strong (but languageneutral) typ e system, which captures information ab out p ointer, structure, and array accesses in the program. Working with the LLVM system we find that this typ e information allows for a variety of high-level analyses and transformations [7, 13, 12] while the nature of the low-level representation makes it very easy to manipulate. Another advantage of typ e information is that it makes detecting and understanding bugs in transformations much easier. The goal of the program representation is to enable as many different typ es of optimizations as p ossible. Because of this, it is imp ortant that the representation b e able to represent al l parts of a program (including global variables, and file scop e asm statements, for example) in a form that allows transformations to mo dify it. Another useful feature of the representation is a stable textual format (“assembly language”) that can b e read and written by the compiler. Given this, it is trivial to write unit tests for transformations and to debug transformations in isolation from the rest of the compiler, and the representation can b e directly interpreted for immediate feedback on a transformation. ership semantics (eliminating the need for it to live on a garbage collected heap). In our exp erience with LLVM, co de optimizers for a sparse representation can b e several times faster than optimizations on a dense representation like RTL. 3.2 A Concrete LLVM Example Figure 2 gives an example of a C function and the corresp onding LLVM mo dule it compiles to. The example shows several imp ortant asp ects of the LLVM representation. In particular, it gives a simple example of the typ e system, basic instruction flavor, and demonstrates some instructions. More details ab out the LLVM representation can b e found in the LLVM language reference [11]. LLVM uses a simple constructive typ e system comp osed of primitive typ es, structures, arrays, and p ointers. Although this is a very simple typ e system, we b elieve that it contains the key features necessary for a front-end to lower any high-level typ e onto it. For example, the LLVM C++ front-end lowers classes with inheritance into nested structure typ es. Typ es are very imp ortant in the LLVM system, and everything that can b e used as an op erand to an instruction has a typ e. Functions in LLVM contain a list of basic blo cks, and each basic blo ck contains a list of instructions. LLVM has only 29 instructions, which include standard instructions like load, xor, setcc, etc and a phi instruction for representing SSA form2 . Intrapro cedural control flow in LLVM is very simple (consisting of conditional branches, unconditional branches, and the switch instruction). Everything in LLVM is explicit: there are no fallthrough branches, all address arithmetic is exp osed (at the level of structures, p ointers, and arrays), and all references to memory use the load and store instructions. This makes the language more uniform and simple to analyze and transform. The getelementptr instruction in LLVM provides the mechanism for structured address arithmetic3 . The getelementptr instruction is exactly analogous to sequences of array subscript and structure 2 SSA φ-no des are eliminated during the register allo cation phase of native co de generation. 3 LLVM co de can also cast a p ointer to an integer typ e, add an arbitrary offset to it, then cast it back to a p ointer, if unstructured address arithmetic is necessary. 3.1 Performance Asp ects of the Representation Once the optimizing linker brings together the compiled program into one mo dule, the interpro cedural analysis and optimization passes are used to improve the program. Because these passes op erate on the entire program at once, however, the efficiency of each analysis or optimization is critical. For this reason, several asp ects of the representation are designed to make these transformations as efficient as p ossible. In particular, the use of an SSA-based representation allows for efficient, sparse, global optimizations, and can make flow-sensitivity much less imp ortant in many analyses (reducing cost substantially). In addition, the three-address co de representation has a small memory fo otprint and simple memory own- % s t r u c t . Q u a d T r e e = t y p e { d o u b l e , [ 4 x %QT ∗ ] } % T = type % s t r u c t . QuadTree Q typedef s t r u c t Q u a d T r e e { double Data ; s t r u c t QuadTree ∗ C h i l d r e n [ 4 ] ; } QT ; v o i d S u m 3 r d C h i l d r e n (QT ∗ T , double ∗ R e s u l t ) { double R e t ; i f ( T = = 0 ) { Ret = 0 ; } else { QT ∗ C h i l d 3 = T [ 0 ] . Children [ 3 ] ; double V ; S u m 3 r d C h i l d r e n ( C h i l d 3 , &V ) ; Ret = V + T [ 0 ] . Data ; } ∗ R e s u l t = Ret ; } v o i d % S u m 3 r d C h i l d r e n (%QT∗ %T , d o u b l e ∗ e n t r y : % = a l l o c a double V %tmp . 0 = s e t e q %QT∗ %T , n u l l b r b o o l % tmp . 0 , l a b e l % e n d i f , else : %Result ) { ; ; %V i s t y p e ’ d o u b l e ∗ ’ ; ; type ’ bool ’ label % e l s e = F i e l d #1 ubyte 1 , l o n g 3 d o u b l e ∗ %V ) ubyte 0 ; ; tmp . 1 = &T [ 0 ] . C h i l d r e n [ 3 ] ’ Children ’ %tmp . 1 = g e t e l e m e n t p t r %QT∗ %T , l o n g 0 , %C h i l d 3 = l o a d %QT∗ ∗ % tmp . 1 c a l l v o i d % S u m 3 r d C h i l d r e n (%QT∗ % C h i l d 3 , %tmp . 2 = l o a d d o u b l e ∗ %V %tmp . 3 = g e t e l e m e n t p t r %QT∗ %T , l o n g 0 , %tmp . 4 = l o a d d o u b l e ∗ % tmp . 3 %tmp . 5 = add d o u b l e %tmp . 2 , % tmp . 4 br l a b e l % e n d i f endif : } (a) Example function %R e t = p h i d o u b l e [ % tmp . 5 , % e l s e ] , [ 0 . 0 , % e n t r y ] s t o r e double % Ret , double ∗ % R e s u l t ; ; R e t u r n w i t h no v a l u e r e t void (b) Corresp onding LLVM co de Figure 2: C and LLVM co de for a function index expressions, returning the address of the last element indexed4 . For example, the %tmp.1 instruction in Figure 2(b) first indexes into the 0th element from the p ointer, then into the 1st structure element (the “Children” memb er), then into the 3rd element of the array. Structured address arithmetic exp oses the necessary high-level information ab out structure and array accesses directly to analyses and transformations which need it. One imp ortant asp ect of the LLVM language is that all references to memory happ en with load and store instructions, and that there is no “addressof ” op eration. In LLVM, all ob jects which live in memory (global variables, functions, the heap, and the stack) are explicitly allo cated and exp osed by their address, not their value. In Figure 2, for example, the V variable is required to live in memory so that its address may b e passed into a recursive invo cation of Sum3rdChildren. Because it is imp ossible to take the address of a virtual register, stack memory must b e explicitly allo cated with the alloca instruction5 , and any references to V must use load and store instructions. This dramatically simplified def-use chain construction for virtual registers, which would otherwise require some form of alias-analysis to construct. A final example illustrating how LLVM simplifies 4 The example in Figure 2(a) uses the strange syntax ’T[0].x’ instead of using the equivalent ’T->x’ to make the corresp ondence more clear. 5 When the back-end is invoked, all fixed sized allocas in the entry blo ck are treated the same as address-exp osed automatic variables. the development of transformations is the op erators that it lacks. In particular, LLVM do es not have (or need) any unary op erators or a copy instruction. Instead of providing the standard negate and bitwise complement unary op erators, LLVM represents these with standard binary op erators where one op erand is a constant (“neg x” = “sub 0, x” and “not x” = “xor x, -1”). This reduces the dep endence on a “canonical form” for the representation and simply reduces the numb er of instructions that need to b e handled. The lack of a copy instruction is p ossible through the use of SSA form, and b ecause def-use chains are trivially computed and always available. Any time a copy instruction would b e inserted (to replace a redundant computation for example) it is sufficient to replace any uses of the destination with uses of the source op erand (by following the def-use chains), implicitly p erforming copy propagation automatically. This simple feature has actually avoided several phase-ordering issues that would otherwise require unnecessary passes over the representation to do copy propagation b etween other passes. 4 LLVM Compiler Infrastructure The LLVM Compiler Infrastructure [10] currently consists of approximately 130,000 lines of C++ co de and a the front-end, which is a patch against the mainline GCC CVS tree. This co de largely implements the design presented in this pap er, al- though there are some differences. This section describ es these differences, the implementation status of LLVM, some other features of LLVM that make writing transformations simpler, and some insights that we have had while working on LLVM. • Traditional SSA based optimizations: ADCE, GCSE, LICM, PRE, SCCP, induction variable canonicalization, reasso ciation, value numb ering, register promotion, etc... • Control Flow Graph based optimizations and analyses: critical edge elimination, lo op canonicalization, various dominator, p ost-dominator, and control dep endence graph related analyses, interval construction, natural lo op construction, CFG simplification, path profiling instrumentation, etc... • Interpro cedural analyses and transformations: call graph construction, several interpro cedural alias analyses, global variable merging, dead global elimination, inlining, Data Structure Analysis [13], automatic p o ol allo cation [12], interpro cedural mo d/ref, etc... In addition to pure infrastructure, the LLVM system also provides a large test suite. The three main sections of the test suite are the regression tests (which contain thousands of tests for transformations and other to ols), feature tests (which demonstrate how instructions and idioms are used in LLVM), and program tests (which compile b enchmarks and other programs with the various co de generators, ensuring that they pro duce co de whose b ehavior agrees with a native compiler). The LLVM web site also hosts a variety of do cumentation describing asp ects of the infrastructure. LLVM is also still under development. In particular, the C++ front-end is nearing completion (runtime library supp ort for exception handling is the ma jor missing p ortion), Sparc V9 supp ort for the JIT is in development, and a system for runtime optimization of statically compiled binaries is in the research phases. 4.1 Implementation Status The LLVM C front-end is based on the mainline GCC CVS rep ository. It generates co de by calling LLVM versions of functions that are equivalent to the RTL-expansion routines (e.g. llvm expand expr, llvm expand function start, make decl llvm, etc) during compilation. These routines build up an LLVM version of the translation unit, which is then written to the “.s” file all at once (allowing “unit-at-a-time” style transformations to b e p erformed from within GCC in the future). Instead of mo difying the cc1 binary to interface directly to the LLVM optimizations written in C++ , cc1 directly emits the expanded co de without any optimization at all. When the gcc compiler driver invokes the “assembler”, we actually have it invoke a program called gccas which parses the LLVM assembly file, runs a series of LLVM optimizers on it, then emits a compressed byteco de file (the .o file). The interface to gccas is intentionally designed to b e identical to the interface of the standard system as to ol, to avoid having to make changes to sp ec files. When the user (or a makefile) links the program using our gcc compiler driver, it invokes our gccld to ol. This to ol reads the .o files sp ecified, links in the appropriate byteco de files from any .a files, and then runs a series of interpro cedural optimizations on the program. At this time, we directly emit an LLVM byteco de file for the entire program, instead of automatically invoking a native co de generator. Once the program has b een optimized and is available in a single byteco de file, there are several ways to execute the resultant program. LLVM provides a very slow (but p ortable) reference interpreter for byteco de files, a Sparc V9 native co de generator, a C back-end, and a Just-In-Time (JIT) compiler for the IA32 architecture. A large numb er of LLVM optimizations and analyses are available, including passes for: 4.2 Differences from the Prop osal The biggest difference b etween the prop osal and the LLVM implementation is the lack of an LLVM to RTL conversion pass. For our research purp oses, we use a C back-end, which provides much of the same functionality as a full fledges RTL back-end, but is much slower. We exp ect that this comp onent can b e added up on demand. Another big difference b etween the current implementation and the prop osal is the interface b etween the cc1 program and the mid-level optimizer. For exp ediency of implementation we currently have the two to ols as separate executables, although this obviously incurs more overhead than linking the two comp onents together. Once the sub ject of including C++ co de in GCC is b etter decided, we can lo ok to resolve this issue. lecting a sequence of passes to run. bugpoint, another useful to ol, is b est describ ed as an “automated test-case reducer”. Given an LLVM program (or fragment) and a list of passes to run, it attempts to reduce the test-case (and list of passes) to the minimum which still exp oses a problem. bugpoint can currently diagnose passes which crash/assert during optimization and passes which misoptimize the program (by executing the resultant program with a co de generator, assuming a deterministic program)6 . If a test-case causes a pass to crash, bugpoint is usually able to reduce the test-case down to the few LLVM instructions and basic blo ck which cause the problem. If a pass (or combination of passes) miscompiles the test-case, it can isolate a single function which is b eing miscompiled. The bugpoint to ol is p ossible b ecause of the mo dularity of the pass manager and the ability to read, write, and mo dify a representation of whole programs. 4.3 Supp ort for Develop ers One of the strengths of the LLVM infrastructure is that it has some interesting utilities for constructing passes, finding bugs in those passes, and building a compiler around a selection of these passes. This strength is imp ortant for two reasons: it allows new p eople to get into the system and get pro ductive relatively fast, and it also allows exp erienced develop ers to b e more pro ductive than they otherwise would. The most imp ortant features are: a strong consistency checker, a “pass manager”, and a to ol we call “bugpoint”. The LLVM infrastructure includes a stringent checker for LLVM co de, which ensures that typ e relationships, SSA prop erties (e.g., all definitions dominates their uses), and other LLVM invariants haven’t b een violated by a transformation. This checker is automatically run after passes when in development mo de to ensure that these passes are not corrupting the input for other passes that are run. Additionally, when in development mo de, an automated memory leak detector is automatically enabled, which detects violations of the LLVM representation’s ownership mo del. This light-weight checker is implemented using only a few additions to constructors and destructors for the classes which make up the representation, no garbage collector is necessary. The LLVM “Pass Manager” provides a structured environment for passes to execute in. Transformations in LLVM use a declarative syntax to indicate which other passes are prerequisites (e.g. break-critical-edges), which analyses are required (e.g. natural lo op information, alias analysis, value numb ering, interpro cedural mo d/ref info, etc...), and which analyses are preserved or destroyed by the transformation b eing run. This structured pass mo del makes it easier for develop ers to fit co de into the system, and it also makes construction of to ols (e.g. gccas and gccld) a simple matter of handling command-line arguments and se- 4.4 Surprises and Insights from LLVM Through the exp erience of developing LLVM, we have develop ed several insights which may b e useful to a broad audience. First, implementing a typ esafe linker for C is a non-trivial exercise. C programs often rely on implicit prototyp es for called functions, or use prototyp es that are blatantly wrong. We have also seen cases where global data is declared to have different typ es in different translation units (which, in practice, b ehaves similarly to a COMMON blo ck in FORTRAN). A normal binary linker do es not typically have problems with these issues, but they must b e handled explicitly with a typ e-safe linker. On the other hand, this information is often useful to the programmer, like the “lint” to ol. When p erforming interpro cedural analysis, having as much of the program available as p ossible increases the precision of the analyses. For this reason, we have compiled several libraries to LLVM form that allow them to b e analyzed and optimized with the program. This has several interesting consequences: first, the library co de itself can b e sp ecialized and optimized with the program (for example, optimizing qsort by inlining the comparison functions, so indirect calls do not need to b e used). Second, this dramatically reduces the need for adho c annotations on functions indicating prop erties 6A third mo de, for debugging back-end bugs, is planned. Source Filename combine.c expr.c cse.c reload1.c c-decl.c insn-recog.c lo op.c c-typ eck.c wc -l LOC 11103 10747 8779 7117 6968 6957 6648 6604 GCC CSE 1 0.70s 0.52s 0.50s 0.37s 0.42s 0.34s 0.33s 0.46s IC .431s .141s .187s .058s .022s .082s .013s .028s LLVM Pass Times GER GCSE Sum .027s .141s .599s .009s .072s .222s .012s .061s .260s .008s .034s .100s .005s .031s .058s .004s .090s .176s .001s .003s .017s .005s .026s .059s # LLVM Pass xforms IC GER GCSE 16182 141 2734 6540 41 2870 10925 59 1894 5735 86 1830 3299 3 2221 5238 0 654 1671 7 264 4481 14 1993 Table 1: Transformation timings for source files from the SPEC CPU2000 176.gcc b enchmark such as “const” and “pure”. Instead, simple interpro cedural analyses can b e used, which have the advantage of applying to user co de as well as the built-in functions. Finally, we have found that investing in making the system easier to develop for, and debug in, has b een worth it. In particular, the bugpoint to ol can narrow down a test-case from thousands of lines of C co de to a dozen lines of LLVM co de in a few seconds: doing the same manually would take much longer. Making the development environment detect problems early is also extremely valuable to develop ers, making them more pro ductive and making it easier to bring new p eople on. Having a mo dular system also helps keep p eople from getting overwhelmed when they first start on the pro ject. For the LLVM timings, we chose to use a combination of the Instruction Combining, Global Expression Reasso ciation, and Glo cal Common Sub expression Elimination passes. The combination of these three phases is b elieved to b e strictly more p owerful than the cse pass. The Instruction Combining pass sup ersumes value numb ering, constant folding and trivial dead co de elimination phases, plus it p erforms a variety of transformations similar to the GCC “combine” pass (describ ed b elow). The reasso ciation pass transforms chained o ccurrences of commutative op erations to promote b etter co de motion. The GCSE pass is a well known technique to remove common sub expressions. The table shows the execution time for each pass as well as the sum of the three. The table also shows the numb er of transformations that each pass makes (instructions combined, instructions reasso ciated, common sub expressions deleted). From the table, we can see that the LLVM optimizations always run in less time than the cse pass, and with the exception of the “combine.c” case, to ok ab out half as much time. Despite b eing faster overall, the LLVM transformations are more p owerful than the cse pass, which only op erates on extended basic blo cks. The slowest individual transformation by far is the instruction combination pass, which uses a work-list driven approach to p erform “p eephole” style optimization on the SSA graph (giving it global transformation p owers) for a large collection of algebraic identities (such as folding “(A − (A&B ))” into “(A& ∼ B )”), that the cse pass do es not p erform. Together, the three transformations are quite effective. In addition to simple scalar optimizations, LLVM is designed to supp ort aggressive interpro cedural analyses and optimizations at link-time. As an example, we consider the Data Structure Analysis algorithm, 4.5 Optimizer Performance The LLVM representation allows for efficient transformations and analyses, b oth for aggressive interpro cedural transformation and traditional optimizations. In order to quantify this p erformance, we compared the p erformance of the GCC “cse” pass with the p erformance of the LLVM transformations closest to it (see Table 1). For these tests, we compiled the 8 largest single .c files in the SPEC CPU2000 176.gcc b enchmark (which is based on the GCC 2.7.2.2 source co de). The numb ers were collected on a 1.7GHz AMD 2100+ Athlon pro cessor. The timings for the cse pass were collected when compiling with GCC 3.2 and the -O3 option. The actual timings were acquired as the average of 5 runs with the -ftime-report option and the compiler configured for a i686-pc-linux-gnu target. The cse 2 pass was ignored, the timings just include the first invo cation of the cse pass. a context-sensitive flow-insensitive memory analysis framework. On the same hardware as ab ove it is capable of analyzing entire programs in seconds: 2.5s the povray and 1.2s for the 255.vortex programs, which are ab out 136,000 and 67,000 lines of C co de resp ectively [13]. Other simpler algorithms may obviously run much more quickly. 5 Related Work Within the GCC pro ject, several pro jects in development or recently merged onto the mainline are relevant. In particular, the ast-optimizer pro ject and its tree-ssa subpro ject aim to improve optimization in GCC by migrating optimizations from the target-sp ecific RTL representation to a target-indep endent AST representation. The representation prop osed in this pap er is similar to the tree-ssa GIMPLE representation in some ways (b oth are language-indep endent, SSA based, and do not allow nested expressions), but they are different in many other ways. In particular, the GIMPLE representation is not capable of representing the entire translation unit b eing compiled: a lot of information ab out the program is stored only in global variables, or are immediately emitted to the output assembly file. Also, the GIMPLE representation has op erations which are closer to the source level. For example, variable definitions can have their address taken, which makes the def-use chain representation much more complex in the GIMPLE representation. On the other hand, the tree-ssa pro ject is much b etter integrated into GCC, is written in the C language, and do es not require the intro duction of a completely new intermediate representation. There is a vast amount of related work on interpro cedural optimization in research and commercial compilers [1, 8, 2, 9, 3]. To avoid ma jor changes to the build pro cess, all of these compilers combine the program together at link-time in a very high-level representation, b efore any substantial optimization is p erformed. Most often, this representation takes the form of the source language Abstract Syntax Tree (AST) with source languagesp ecific no des removed. Once the program is combined at link-time, optimization for the entire program commences, starting with interpro cedural optimizations. In contrast, the approach describ ed here immediately optimizes and translates the program to a low-level, but strongly-typ ed, intermediate representation which is suitable for optimization b oth at compile- and link-time. Because substantial optimization is p erformed at compile-time, the interpro cedural optimizers have less work to p erform at link-time, reducing the amount of time a recompilation requires. Previous work [13, 7, 10, 12] has shown that a low-level representation with typ e information can supp ort aggressive high-level analyses and transformations. Another successful class of interpro cedural optimizers target very low-level optimizations. These “smart-linkers” typically op erate at the level of the machine co de, p erforming optimizations such as interpro cedural register allo cation and co de layout optimizations [16, 14, 5]. Although these to ols have b een successful, and require little or no mo dification to the source compiler, they are not capable of p erforming high-level optimizations at all. Also, these optimizations can all b e p erformed in our framework, b ecause co de generation o ccurs for the entire program at a time, exp osing the necessary interprocedural information. 6 Conclusion This pap er presents the design for an aggressive, but realistic, interpro cedural optimization comp onent for the GNU Compiler Collection. This design is capable of supp orting a broad range of wholeprogram optimization techniques, is reasonable in terms of compilation time, and has already b een implemented. We hop e our efforts will accelerate the pro cess of making GCC pro duce co de which is more comp etitive with commercial compilers, and p erhaps LLVM can b e directly adopted as an optional part of the compiler itself. We encourage memb ers of the community who are interested in the prop osed architecture or LLVM itself to contact the authors with any feedback, questions, or ideas. References [1] J. Amaral, G. Gao, J. Dehnert, and R. Towle. The SGI Pro64 compiler infrastructure: A tu- torial. The International Conference on Parallel Architeture and Compilation Techniques (PACT2000), Oct. 2000. [2] A. Ayers, S. de Jong, J. Peyton, and R. Scho oler. Scalable cross-mo dule optimization. In Proc. SIGPLAN ’98 Conf. on Programming Language Design and Implementation, pages 301–312, Montreal, June 1998. [3] D. Blickstein, P. Craig, C. Davidson, N. Faiman, K. Glossop, R. G. S. Hobbs, and W. Noyce. The gem optimizing compiler system. Digital Technical Journal, 4(4):121–136, 1992. [4] M. Burke and L. Torczon. Interpro cedural optimization: eliminating unnecessary recompilation. ACM Transactions on Programming Languages and Systems (TOPLAS), 15(3):367– 399, 1993. [5] R. Cohn, D. Go o dwin, and P. Lowney. Optimizing Alpha executables on Windows NT with Spike. Digital Technical Journal, 9(4), 1997. [6] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dep endence graph. ACM Transactions on Programming Languages and Systems, pages 13(4):451–490, Octob er 1991. [7] D. Dhurjati, S. Kowshik, V. Adve, and C. Lattner. Memory safety without runtime checks or garbage collection. In Proc. 2003 ACM SIGPLAN Symposium on Languages, Compilers, and Tools for Embedded Systems, Feb 2003. [8] C. Dulong, R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, J. Ng, and D. Sehr. An overview of the Intel IA-64 compiler. Intel Technology Journal, (Q4), 1999. [9] A. Holler and Hewlett-Packard Company. Compiler optimizations for the PA-8000. In Proc. IEEE International Computer Conference, 1997. [10] C. Lattner. LLVM: An infrastructure for multistage optimization. Master’s thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002. See http://llvm.cs.uiuc.edu. [11] C. Lattner and V. Adve. LLVM Assembly language reference manual, http://llvm.cs.uiuc.edu/docs/langref.html. [12] C. Lattner and V. Adve. Automatic Po ol Allo cation for Disjoint Data Structures. In Proc. ACM SIGPLAN Workshop on Memory System Performance, Berlin, Germany, Jun 2002. [13] C. Lattner and V. Adve. Data structure analysis: An efficient context-sensitive heap analysis. Tech. Rep ort UIUCDCS-R-2003-2340, Computer Science Dept., Univ. of Illinois at Urbana-Champaign, Apr 2003. [14] A. Srivastava and D. W. Wall. A practical system for intermo dule co de optimization at link-time. Journal of Programming Languages, 1(1):1–18, Dec. 1992. [15] R. Stallman. The GNU C compiler. Free Software Foundation, 1991. [16] D. Wall. Global register allo cation at link-time. In Proc. SIGPLAN ’86 Symposium on Compiler Construction, Palo Alto, CA, 1986.