Architecture for a Next-Generation GCC
Chris Lattner
Vikram Adve
University of Illinois at Urbana, Champaign
{lattner, vadve}@cs.uiuc.edu
http://llvm.cs.uiuc.edu
Abstract
compiling a statement at a time, to compiling and
optimizing entire functions at a time, to the (still
very new) support for unit-at-a-time compilation
This paper presents a design and implementation
(compiling and optimizing all of the functions in a
of a whole-program interprocedural optimizer built
translation unit together). As the scope for analysis
in the GCC framework.
Through the introduc-
and optimizations increases, the compiler is better
tion of a new language-independent intermediate
able to reduce the time and space requirements for
representation, we extend the current GCC archi-
the generated code.
tecture to include a powerful mid-level optimizer
and add link-time interprocedural analysis and op-
This paper proposes the next logical step for the
timization capabilities. This intermediate represen-
GCC optimizer: extend it to be able to analyze and
tation is an SSA-based, low-level, strongly-typed,
optimize whole programs at link-time1, enabling new
representation which is designed to support both
optimizations and making existing analyses and op-
efficient global optimizations and high-level anal-
timizations more powerful. For example:
yses.
Because most of the program is available
at link-time, aggressive “whole-program” optimiza-
• inlining across translation units
• whole-program alias analysis
tions and analyses are possible, improving the time
• interprocedural register allocation
and space requirements of compiled programs. The
• interprocedural constant propagation
final proposed organization of GCC retains the im-
• data layout optimizations
portant features which make it successful today, re-
• exception handling space optimizations
quires almost no modification to either the front-
• sorting initializer priorities at link-time
or back-ends of GCC, and is completely compatible
The key challenges to whole-program optimization
with user makefiles.
are to enable powerful transformations while keep-
ing compile times reasonable, and to keep the user-
visible development process unchanged (e.g. user
makefiles).
1
Introduction
The architecture that we propose is based on a new
language-independent low-level code representation
The GNU Compiler Collection (GCC) [15] is in
that preserves important type information from the
many ways the centerpiece of the Free Software
source code. The use of a low-level, SSA-based rep-
movement. It supports several source languages and
resentation allows the compiler to perform a variety
a plethora of back-ends for various targets, provid-
of optimizations at compile time, off-loading work
ing a unified target for free software.
GCC has
from the link-time optimizer. However, the link-
been successful because of its extreme portability,
time optimizer can only perform meaningful opti-
stability, and because it is able to compile and op-
mizations on the program if it has enough high-level
timize several popular source languages (C, C++,
information about the program to prove that aggres-
Java, etc) to each target. Unfortunately, despite the
sive optimizations are safe. Because of this, the low-
success of the GCC compiler suite as a whole, the
level code representation is typed (using a language-
optimization infrastructure is still not competitive
independent constructive type system) and directly
with commercial compilers.
1 This capability would be optional and could be enabled
only when the program is compiled at the “-O4” level of op-
Over the years, the GCC optimizer has evolved from
timization, for example.
Front-End Compiler
"LLVM"
Assembly
Source File
GCC FE Mid-Level Opt
Optimizing Linker
Unmodified binutils Toolchain
.o files
Source File
GCC FE Mid-Level Opt
Whole Program RTL Expansion
Link
Analysis and
Native
Native
".exe"
.s file
Optimization
Native Codegen
Assembler
Linker
file
Source File
GCC FE Mid-Level Opt
native
.a files
.a /.so
Source File
GCC FE Mid-Level Opt
files
Figure 1: High-Level Compiler Architecture for Whole-Program Optimization
exposes information about structure and array ac-
2
High-Level Compiler Architecture
cesses to the optimizer.
The link-time optimizer is designed to combine the
translation units of a program together and do the
final whole-program optimization. After the pro-
gram is optimized, machine code is generated at
link-time for the entire program at once, allowing a
variety of interprocedural low-level code optimiza-
The proposed high-level architecture is illustrated in
tions to be performed.
Figure 1. The essential aspect of this design is that
it separates the current cc1 program into two com-
The Low-Level Virtual Machine (LLVM) [10] is an
ponents: a front-end compiler and an optimizing
implementation of the architecture and intermediate
linker. The front-end retains all of the responsibili-
representation [11] described in this paper, which al-
ties of current GCC front-ends (preprocessing, lex-
lows us to be more concrete when describing aspects
ical analysis, parsing, semantic analysis, etc..) and
of the design. This system has served as the host
should work unmodified in the new system. After
for several research projects [7, 13, 12] which require
each function is parsed and checked for semantic
whole-program information as well as a host for a
errors it is “expanded” from the “tree” representa-
variety of traditional compiler optimizations.
tion to the new language-independent intermediate
representation (described in Section 3). Once the
We hope that the lessons learned by the LLVM
entire translation unit has been translated (and if
project will be useful to the GCC community, and
no errors have occurred), a standard set of mid-level
are willing to contribute as much code to the GNU
optimizations are performed on the translated mod-
project as there is interest in. We are planning to
ule. After these optimizations are finished, a “.o”
have our first public release of LLVM, with a liberal
file is emitted which contains IR assembly code for
license, in the Summer of 2003. However, LLVM will
the representation.
only be discussed when it helps clarify the ideas in
the proposed architecture, this paper is intended to
When the optimizing linker is invoked, it reads in all
be a GCC paper, not an LLVM paper.
of the translated IR files and any libraries compiled
to the intermediate representation. It links these
This paper is organized as follows: Section 2 de-
files together into a single-file representation of the
scribes the proposed high-level architecture in de-
program, on which it can run whole-program analy-
tail, including modifications that would need to be
ses and optimizations. Finally, once these analyses
made to the GCC infrastructure.
Section 3 de-
and transformations are complete, the GCC back-
scribes important aspects of the proposed interme-
end is invoked to expand the intermediate repre-
diate representation for the system. Section 4 de-
sentation into RTL and use the configured target
scribes LLVM, our existing implementation of the
description to produce a native .s file.
proposed design. Section 5 describes other work re-
lated to the proposed design, and Section 6 wraps
After the optimizing linker produces a native .s file,
up the paper.
the compilation process proceeds through the stan-
dard system assembler and linker (to resolve any
symbols in libraries that were not available in the
IR form), finally producing a native executable.
2.1
Compatibility and Implementation
piled and the application must be relinked. In order
to reduce the amount of work that must be done,
this design allows most traditional optimizations to
One of the key features of this design is that it is
be performed in the compiler front-end stage, rather
compatible with the standard “compile and link”
than requiring all optimization to occur at the link
models of compilation, and is thus fully compati-
stage (as is common for whole-program optimizers).
ble with existing makefiles. In order to provide this
Because most aggressive scalar optimizations are
compatibility, the link phase of the gcc compiler
performed at compile-time, they would not need to
driver is extended to invoke the optimizing linker
be rerun at link time, reducing the time for com-
and system assembler (if necessary) during the stan-
pilation. Of course, the compiler performance issue
dard link step of the compile process. In this way,
does not even arise unless the user is modifying the
any input files that are in the IR format are au-
program and recompiling at -O4.
tomatically linked together and optimized without
interfering with the compilation and linking of stan-
Optionally with this design, the compiler could try
dard translation units and libraries. If no files in the
to minimize the amount of recompilation necessary
IR format are present, the entire invocation of the
when a change occurs by keeping track of which in-
optimizing linker is skipped.
terprocedural information is used to modify func-
tions in other translation units, building a depen-
Another important aspect of the design is how
dence graph between the modules [4]. In practice,
the compiler works when whole-program optimiza-
however, this would make the compiler much more
tion is not enabled. If not enabled, each transla-
complicated and prone to subtle bugs that are hard
tion unit is either compiled a function at a time
to reproduce. We feel that although the cost of re-
or a unit at a time (depending on the setting of
compilation is still fairly substantial in our system
the -funit-at-a-time switch), through the mid-
(native code must be regenerated for the entire ap-
level optimizer, RTL expansion, and code genera-
plication), that the extra complexity introduced into
tion phases of the compiler. This produces a native
the compiler must be weighed against the recompi-
.s file, which can be processed with the standard
lation time penalties, and thus may be impractical.
system assembler and linker, as before.
For this approach to be feasible, a large amount of
code must be shared between the optimizing linker
3
Code Representation
and the compiler front-ends. This can either be
accomplished through the use of libraries that are
shared between the two (which would contain the
The representation used to analyze and manipu-
existing GCC back-end, and any shared optimiza-
late the program determines what kinds of transfor-
tions on the IR), or by making both logical pieces be
mations are possible and when in the compilation
part of the same binary. In either case, the actual
process they must be performed to be successful.
organization of the existing GCC code base would
As mentioned earlier, we propose using a language-
not have to change in any substantial way.
independent, low-level, SSA-based, strongly-typed
representation as the sole representation used for
the mid-level and link-time optimizers. This repre-
2.2
Architectural Issues Affecting Per-
sentation is a first-class assembly language, which
formance
includes all of the information necessary to rep-
resent the program (and is in fact directly inter-
pretable). Concrete details of the representation
In addition to providing the desired functionality
used by LLVM are included in Section 3.2.
and compatibility with existing systems, it is cru-
cial that the compiler does not slow down unac-
Using a low-level three-address code representation
ceptably — even if whole-program optimization is
based on Static Single Assignment [6] form enables
only enabled at -O4. In practical terms, this design
the direct application of many well-known and effi-
addresses the issue by performing as much optimiza-
cient global optimizations. SSA form permits sparse
tion as possible at compile time.
optimizations that do not, in general, require bit-
vector data-flow analysis to compute results. Using
Any time a source file is changed, it must be recom-
a three-address code representation (as opposed to
an tree structured representation) also makes trans-
ership semantics (eliminating the need for it to live
formations easy to develop and reason about.
on a garbage collected heap). In our experience with
LLVM, code optimizers for a sparse representation
Many transformations need information about the
can be several times faster than optimizations on a
high-level behavior of the program to be effective. In
dense representation like RTL.
order to preserve this information, we propose that
the representation maintain a strong (but language-
neutral) type system, which captures information
3.2
A Concrete LLVM Example
about pointer, structure, and array accesses in the
program. Working with the LLVM system we find
that this type information allows for a variety of
Figure 2 gives an example of a C function and
high-level analyses and transformations [7, 13, 12]
the corresponding LLVM module it compiles to.
while the nature of the low-level representation
The example shows several important aspects of the
makes it very easy to manipulate. Another advan-
LLVM representation. In particular, it gives a sim-
tage of type information is that it makes detecting
ple example of the type system, basic instruction
and understanding bugs in transformations much
flavor, and demonstrates some instructions. More
easier.
details about the LLVM representation can be found
in the LLVM language reference [11].
The goal of the program representation is to enable
as many different types of optimizations as possible.
LLVM uses a simple constructive type system com-
Because of this, it is important that the represen-
posed of primitive types, structures, arrays, and
tation be able to represent all parts of a program
pointers. Although this is a very simple type sys-
(including global variables, and file scope asm state-
tem, we believe that it contains the key features nec-
ments, for example) in a form that allows transfor-
essary for a front-end to lower any high-level type
mations to modify it. Another useful feature of the
onto it. For example, the LLVM C++ front-end
representation is a stable textual format (“assem-
lowers classes with inheritance into nested structure
bly language”) that can be read and written by the
types. Types are very important in the LLVM sys-
compiler. Given this, it is trivial to write unit tests
tem, and everything that can be used as an operand
for transformations and to debug transformations
to an instruction has a type.
in isolation from the rest of the compiler, and the
representation can be directly interpreted for imme-
Functions in LLVM contain a list of basic blocks,
diate feedback on a transformation.
and each basic block contains a list of instructions.
LLVM has only 29 instructions, which include stan-
dard instructions like load, xor, setcc, etc and a
3.1
Performance Aspects of the Repre-
phi instruction for representing SSA form2.
In-
sentation
traprocedural control flow in LLVM is very simple
(consisting of conditional branches, unconditional
branches, and the switch instruction).
Once the optimizing linker brings together the com-
Everything in LLVM is explicit: there are no fall-
piled program into one module, the interprocedu-
through branches, all address arithmetic is exposed
ral analysis and optimization passes are used to im-
(at the level of structures, pointers, and arrays), and
prove the program. Because these passes operate on
all references to memory use the load and store
the entire program at once, however, the efficiency
instructions. This makes the language more uniform
of each analysis or optimization is critical. For this
and simple to analyze and transform.
reason, several aspects of the representation are de-
signed to make these transformations as efficient as
The getelementptr instruction in LLVM provides
possible.
the mechanism for structured address arithmetic3.
The getelementptr instruction is exactly analo-
In particular, the use of an SSA-based representa-
gous to sequences of array subscript and structure
tion allows for efficient, sparse, global optimizations,
and can make flow-sensitivity much less important
2 SSA φ-nodes are eliminated during the register allocation
in many analyses (reducing cost substantially). In
phase of native code generation.
3 LLVM code can also cast a pointer to an integer type,
addition, the three-address code representation has
add an arbitrary offset to it, then cast it back to a pointer, if
a small memory footprint and simple memory own-
unstructured address arithmetic is necessary.
%s t r u c t . QuadTree = type { double , [ 4 x %QT∗ ] }
%QT = type % s t r u c t . QuadTree
typedef s t r u c t QuadTree {
double Data ;
void % Sum3rdChildren(%QT∗ %T , double∗ % R e s u l t ) {
s t r u c t QuadTree ∗ C h i l d r e n [ 4 ] ;
e n t r y :
%V = a l l o c a double
; ; %V i s t y p e ’ d o u b l e ∗ ’
} QT;
%tmp . 0 = seteq %QT∗ %T , n u l l
; ; t y p e ’ b o o l ’
br bool %tmp . 0 , l a b e l % e n d i f , l a b e l % e l s e
void Sum3rdChildren (QT ∗T,
double ∗ R e s u l t ) {
e l s e :
; ; tmp.1 = &T [ 0 ] . C h i l d r e n [ 3 ]
’ C h i l d r e n ’ = F i e l d #1
double Ret ;
%tmp . 1 = getelementptr %QT∗ %T , long 0 , ubyte 1 , long 3
i f
( T = = 0 ) { Ret = 0 ;
%C h i l d 3 = load %QT∗∗ %tmp . 1
} e l s e {
c a l l
void % Sum3rdChildren(%QT∗ % C h i l d 3 , double∗ %V)
QT ∗ C h i l d 3 =
%tmp . 2 = load double∗ %V
T [ 0 ] . C h i l d r e n [ 3 ] ;
%tmp . 3 = getelementptr %QT∗ %T , long 0 , ubyte 0
double V ;
%tmp . 4 = load double∗ %tmp . 3
Sum3rdChildren ( C h i l d 3 , &V ) ;
%tmp . 5 = add double %tmp . 2 , % tmp . 4
Ret = V + T [ 0 ] . Data ;
br l a b e l % e n d i f
}
∗ R e s u l t = Ret ;
e n d i f :
%Ret = phi double [ % tmp . 5 , % e l s e ] , [ 0 . 0 , % e n t r y ]
}
s t o r e double %Ret , double∗ % R e s u l t
(a) Example function
r e t void
; ; Return w i t h no v a l u e
}
(b) Corresponding LLVM code
Figure 2: C and LLVM code for a function
index expressions, returning the address of the last
the development of transformations is the operators
element indexed4. For example, the %tmp.1 instruc-
that it lacks. In particular, LLVM does not have
tion in Figure 2(b) first indexes into the 0th element
(or need) any unary operators or a copy instruc-
from the pointer, then into the 1st structure element
tion. Instead of providing the standard negate and
(the “Children” member), then into the 3rd element
bitwise complement unary operators, LLVM repre-
of the array. Structured address arithmetic exposes
sents these with standard binary operators where
the necessary high-level information about structure
one operand is a constant (“neg x” = “sub 0, x”
and array accesses directly to analyses and transfor-
and “not x” = “xor x, -1”). This reduces the de-
mations which need it.
pendence on a “canonical form” for the representa-
tion and simply reduces the number of instructions
One important aspect of the LLVM language is that
that need to be handled.
all references to memory happen with load and
store instructions, and that there is no “address-
The lack of a copy instruction is possible through
of” operation. In LLVM, all objects which live in
the use of SSA form, and because def-use chains are
memory (global variables, functions, the heap, and
trivially computed and always available. Any time
the stack) are explicitly allocated and exposed by
a copy instruction would be inserted (to replace a
their address, not their value. In Figure 2, for ex-
redundant computation for example) it is sufficient
ample, the V variable is required to live in memory
to replace any uses of the destination with uses of
so that its address may be passed into a recursive
the source operand (by following the def-use chains),
invocation of Sum3rdChildren. Because it is im-
implicitly performing copy propagation automati-
possible to take the address of a virtual register,
cally. This simple feature has actually avoided sev-
stack memory must be explicitly allocated with the
eral phase-ordering issues that would otherwise re-
alloca instruction5, and any references to V must
quire unnecessary passes over the representation to
use load and store instructions. This dramatically
do copy propagation between other passes.
simplified def-use chain construction for virtual reg-
isters, which would otherwise require some form of
alias-analysis to construct.
4
LLVM Compiler Infrastructure
A final example illustrating how LLVM simplifies
4 The example in Figure 2(a) uses the strange syntax
The LLVM Compiler Infrastructure [10] currently
’T[0].x’ instead of using the equivalent ’T->x’ to make the
consists of approximately 130,000 lines of C++ code
correspondence more clear.
and a the front-end, which is a patch against the
5 When the back-end is invoked, all fixed sized allocas
in the entry block are treated the same as address-exposed
mainline GCC CVS tree.
This code largely im-
automatic variables.
plements the design presented in this paper, al-
though there are some differences. This section de-
• Traditional SSA based optimizations: ADCE,
scribes these differences, the implementation status
GCSE, LICM, PRE, SCCP, induction variable
of LLVM, some other features of LLVM that make
canonicalization, reassociation, value number-
writing transformations simpler, and some insights
ing, register promotion, etc...
that we have had while working on LLVM.
• Control Flow Graph based optimizations and
analyses: critical edge elimination, loop canon-
icalization, various dominator, post-dominator,
4.1
Implementation Status
and control dependence graph related analyses,
interval construction, natural loop construc-
tion, CFG simplification, path profiling instru-
The LLVM C front-end is based on the main-
mentation, etc...
line GCC CVS repository.
It generates code
by calling LLVM versions of functions that are
• Interprocedural analyses and transformations:
equivalent to the RTL-expansion routines (e.g.
call graph construction, several interprocedu-
llvm expand expr, llvm expand function start,
ral alias analyses, global variable merging, dead
make decl llvm, etc) during compilation.
These
global elimination, inlining, Data Structure
routines build up an LLVM version of the trans-
Analysis [13], automatic pool allocation [12],
lation unit, which is then written to the “.s” file
interprocedural mod/ref, etc...
all at once (allowing “unit-at-a-time” style transfor-
mations to be performed from within GCC in the
In addition to pure infrastructure, the LLVM system
future).
also provides a large test suite. The three main sec-
tions of the test suite are the regression tests (which
Instead of modifying the cc1 binary to interface di-
contain thousands of tests for transformations and
rectly to the LLVM optimizations written in C++,
other tools), feature tests (which demonstrate how
cc1 directly emits the expanded code without any
instructions and idioms are used in LLVM), and pro-
optimization at all. When the gcc compiler driver
gram tests (which compile benchmarks and other
invokes the “assembler”, we actually have it invoke
programs with the various code generators, ensur-
a program called gccas which parses the LLVM as-
ing that they produce code whose behavior agrees
sembly file, runs a series of LLVM optimizers on it,
with a native compiler). The LLVM web site also
then emits a compressed bytecode file (the .o file).
hosts a variety of documentation describing aspects
The interface to gccas is intentionally designed to
of the infrastructure.
be identical to the interface of the standard system
as tool, to avoid having to make changes to spec
LLVM is also still under development. In particular,
files.
the C++ front-end is nearing completion (runtime li-
brary support for exception handling is the major
When the user (or a makefile) links the program
missing portion), Sparc V9 support for the JIT is
using our gcc compiler driver, it invokes our gccld
in development, and a system for runtime optimiza-
tool. This tool reads the .o files specified, links in
tion of statically compiled binaries is in the research
the appropriate bytecode files from any .a files, and
phases.
then runs a series of interprocedural optimizations
on the program. At this time, we directly emit an
LLVM bytecode file for the entire program, instead
4.2
Differences from the Proposal
of automatically invoking a native code generator.
Once the program has been optimized and is avail-
The biggest difference between the proposal and the
able in a single bytecode file, there are several ways
LLVM implementation is the lack of an LLVM to
to execute the resultant program. LLVM provides
RTL conversion pass. For our research purposes,
a very slow (but portable) reference interpreter for
we use a C back-end, which provides much of the
bytecode files, a Sparc V9 native code generator, a
same functionality as a full fledges RTL back-end,
C back-end, and a Just-In-Time (JIT) compiler for
but is much slower. We expect that this component
the IA32 architecture.
can be added upon demand.
A large number of LLVM optimizations and analy-
Another big difference between the current imple-
ses are available, including passes for:
mentation and the proposal is the interface between
the cc1 program and the mid-level optimizer. For
lecting a sequence of passes to run.
expediency of implementation we currently have the
two tools as separate executables, although this ob-
bugpoint, another useful tool, is best described
viously incurs more overhead than linking the two
as an “automated test-case reducer”.
Given an
components together. Once the subject of including
LLVM program (or fragment) and a list of passes to
C++ code in GCC is better decided, we can look to
run, it attempts to reduce the test-case (and list of
resolve this issue.
passes) to the minimum which still exposes a prob-
lem. bugpoint can currently diagnose passes which
crash/assert during optimization and passes which
4.3
Support for Developers
misoptimize the program (by executing the resul-
tant program with a code generator, assuming a de-
terministic program)6. If a test-case causes a pass
One of the strengths of the LLVM infrastructure is
to crash, bugpoint is usually able to reduce the
that it has some interesting utilities for constructing
test-case down to the few LLVM instructions and
passes, finding bugs in those passes, and building a
basic block which cause the problem. If a pass (or
compiler around a selection of these passes. This
combination of passes) miscompiles the test-case, it
strength is important for two reasons: it allows new
can isolate a single function which is being miscom-
people to get into the system and get productive
piled. The bugpoint tool is possible because of the
relatively fast, and it also allows experienced devel-
modularity of the pass manager and the ability to
opers to be more productive than they otherwise
read, write, and modify a representation of whole
would. The most important features are: a strong
programs.
consistency checker, a “pass manager”, and a tool
we call “bugpoint”.
4.4
Surprises and Insights from LLVM
The LLVM infrastructure includes a stringent
checker for LLVM code, which ensures that type
Through the experience of developing LLVM, we
relationships, SSA properties (e.g., all definitions
have developed several insights which may be use-
dominates their uses), and other LLVM invariants
ful to a broad audience. First, implementing a type-
haven’t been violated by a transformation. This
safe linker for C is a non-trivial exercise. C programs
checker is automatically run after passes when in
often rely on implicit prototypes for called functions,
development mode to ensure that these passes are
or use prototypes that are blatantly wrong. We have
not corrupting the input for other passes that are
also seen cases where global data is declared to have
run. Additionally, when in development mode, an
different types in different translation units (which,
automated memory leak detector is automatically
in practice, behaves similarly to a COMMON block
enabled, which detects violations of the LLVM rep-
in FORTRAN). A normal binary linker does not
resentation’s ownership model. This light-weight
typically have problems with these issues, but they
checker is implemented using only a few additions
must be handled explicitly with a type-safe linker.
to constructors and destructors for the classes which
On the other hand, this information is often useful
make up the representation, no garbage collector is
to the programmer, like the “lint” tool.
necessary.
When performing interprocedural analysis, having
The LLVM “Pass Manager” provides a structured
as much of the program available as possible in-
environment for passes to execute in.
Transfor-
creases the precision of the analyses. For this rea-
mations in LLVM use a declarative syntax to in-
son, we have compiled several libraries to LLVM
dicate which other passes are prerequisites (e.g.
form that allow them to be analyzed and optimized
break-critical-edges), which analyses are re-
with the program. This has several interesting con-
quired (e.g. natural loop information, alias analy-
sequences: first, the library code itself can be spe-
sis, value numbering, interprocedural mod/ref info,
cialized and optimized with the program (for exam-
etc...), and which analyses are preserved or de-
ple, optimizing qsort by inlining the comparison
stroyed by the transformation being run.
This
functions, so indirect calls do not need to be used).
structured pass model makes it easier for developers
Second, this dramatically reduces the need for ad-
to fit code into the system, and it also makes con-
hoc annotations on functions indicating properties
struction of tools (e.g. gccas and gccld) a simple
matter of handling command-line arguments and se-
6 A third mode, for debugging back-end bugs, is planned.
Source
wc -l
GCC
LLVM Pass Times
# LLVM Pass xforms
Filename
LOC
CSE 1
IC
GER
GCSE
Sum
IC
GER
GCSE
combine.c
11103
0.70s
.431s
.027s
.141s
.599s
16182
141
2734
expr.c
10747
0.52s
.141s
.009s
.072s
.222s
6540
41
2870
cse.c
8779
0.50s
.187s
.012s
.061s
.260s
10925
59
1894
reload1.c
7117
0.37s
.058s
.008s
.034s
.100s
5735
86
1830
c-decl.c
6968
0.42s
.022s
.005s
.031s
.058s
3299
3
2221
insn-recog.c
6957
0.34s
.082s
.004s
.090s
.176s
5238
0
654
loop.c
6648
0.33s
.013s
.001s
.003s
.017s
1671
7
264
c-typeck.c
6604
0.46s
.028s
.005s
.026s
.059s
4481
14
1993
Table 1: Transformation timings for source files from the SPEC CPU2000 176.gcc benchmark
such as “const” and “pure”. Instead, simple in-
For the LLVM timings, we chose to use a com-
terprocedural analyses can be used, which have the
bination of the Instruction Combining, Global
advantage of applying to user code as well as the
Expression Reassociation, and Glocal Common
built-in functions.
Subexpression Elimination passes. The combina-
tion of these three phases is believed to be strictly
Finally, we have found that investing in making the
more powerful than the cse pass.
The Instruc-
system easier to develop for, and debug in, has been
tion Combining pass supersumes value numbering,
worth it. In particular, the bugpoint tool can nar-
constant folding and trivial dead code elimination
row down a test-case from thousands of lines of C
phases, plus it performs a variety of transforma-
code to a dozen lines of LLVM code in a few seconds:
tions similar to the GCC “combine” pass (described
doing the same manually would take much longer.
below). The reassociation pass transforms chained
Making the development environment detect prob-
occurrences of commutative operations to promote
lems early is also extremely valuable to developers,
better code motion. The GCSE pass is a well known
making them more productive and making it easier
technique to remove common subexpressions. The
to bring new people on. Having a modular system
table shows the execution time for each pass as well
also helps keep people from getting overwhelmed
as the sum of the three. The table also shows the
when they first start on the project.
number of transformations that each pass makes
(instructions combined, instructions reassociated,
common subexpressions deleted).
4.5
Optimizer Performance
From the table, we can see that the LLVM op-
timizations always run in less time than the cse
The LLVM representation allows for efficient trans-
pass, and with the exception of the “combine.c”
formations and analyses, both for aggressive inter-
case, took about half as much time. Despite being
procedural transformation and traditional optimiza-
faster overall, the LLVM transformations are more
tions. In order to quantify this performance, we
powerful than the cse pass, which only operates
compared the performance of the GCC “cse” pass
on extended basic blocks. The slowest individual
with the performance of the LLVM transformations
transformation by far is the instruction combina-
closest to it (see Table 1).
For these tests, we
tion pass, which uses a work-list driven approach to
compiled the 8 largest single .c files in the SPEC
perform “peephole” style optimization on the SSA
CPU2000 176.gcc benchmark (which is based on the
graph (giving it global transformation powers) for a
GCC 2.7.2.2 source code). The numbers were col-
large collection of algebraic identities (such as fold-
lected on a 1.7GHz AMD 2100+ Athlon processor.
ing “(A − (A&B))” into “(A& ∼ B)”), that the cse
pass does not perform. Together, the three trans-
The timings for the cse pass were collected when
formations are quite effective.
compiling with GCC 3.2 and the -O3 option. The
actual timings were acquired as the average of 5 runs
In addition to simple scalar optimizations, LLVM is
with the -ftime-report option and the compiler
designed to support aggressive interprocedural anal-
configured for a i686-pc-linux-gnu target. The
yses and optimizations at link-time. As an example,
cse 2 pass was ignored, the timings just include
we consider the Data Structure Analysis algorithm,
the first invocation of the cse pass.
a context-sensitive flow-insensitive memory analy-
Within the GCC project, several projects in de-
sis framework. On the same hardware as above it
velopment or recently merged onto the mainline
is capable of analyzing entire programs in seconds:
are relevant.
In particular, the ast-optimizer
2.5s the povray and 1.2s for the 255.vortex pro-
project and its tree-ssa subproject aim to im-
grams, which are about 136,000 and 67,000 lines of
prove optimization in GCC by migrating optimiza-
C code respectively [13]. Other simpler algorithms
tions from the target-specific RTL representation
may obviously run much more quickly.
to a target-independent AST representation. The
representation proposed in this paper is similar to
the tree-ssa GIMPLE representation in some ways
(both are language-independent, SSA based, and do
not allow nested expressions), but they are different
5
Related Work
in many other ways.
In particular, the GIMPLE representation is not ca-
There is a vast amount of related work on inter-
pable of representing the entire translation unit be-
procedural optimization in research and commercial
ing compiled: a lot of information about the pro-
compilers [1, 8, 2, 9, 3]. To avoid major changes
gram is stored only in global variables, or are im-
to the build process, all of these compilers com-
mediately emitted to the output assembly file. Also,
bine the program together at link-time in a very
the GIMPLE representation has operations which
high-level representation, before any substantial op-
are closer to the source level. For example, vari-
timization is performed. Most often, this represen-
able definitions can have their address taken, which
tation takes the form of the source language Ab-
makes the def-use chain representation much more
stract Syntax Tree (AST) with source language-
complex in the GIMPLE representation. On the
specific nodes removed. Once the program is com-
other hand, the tree-ssa project is much better in-
bined at link-time, optimization for the entire pro-
tegrated into GCC, is written in the C language, and
gram commences, starting with interprocedural op-
does not require the introduction of a completely
timizations.
new intermediate representation.
In contrast, the approach described here immedi-
ately optimizes and translates the program to a
low-level, but strongly-typed, intermediate repre-
6
Conclusion
sentation which is suitable for optimization both at
compile- and link-time. Because substantial opti-
mization is performed at compile-time, the inter-
This paper presents the design for an aggressive,
procedural optimizers have less work to perform at
but realistic, interprocedural optimization compo-
link-time, reducing the amount of time a recompi-
nent for the GNU Compiler Collection. This design
lation requires. Previous work [13, 7, 10, 12] has
is capable of supporting a broad range of whole-
shown that a low-level representation with type in-
program optimization techniques, is reasonable in
formation can support aggressive high-level analyses
terms of compilation time, and has already been
and transformations.
implemented. We hope our efforts will accelerate
the process of making GCC produce code which is
Another successful class of interprocedural opti-
more competitive with commercial compilers, and
mizers target very low-level optimizations. These
perhaps LLVM can be directly adopted as an op-
“smart-linkers” typically operate at the level of the
tional part of the compiler itself. We encourage
machine code, performing optimizations such as in-
members of the community who are interested in
terprocedural register allocation and code layout op-
the proposed architecture or LLVM itself to contact
timizations [16, 14, 5]. Although these tools have
the authors with any feedback, questions, or ideas.
been successful, and require little or no modification
to the source compiler, they are not capable of per-
forming high-level optimizations at all. Also, these
optimizations can all be performed in our frame-
References
work, because code generation occurs for the entire
program at a time, exposing the necessary interpro-
[1] J. Amaral, G. Gao, J. Dehnert, and R. Towle.
cedural information.
The SGI Pro64 compiler infrastructure: A tu-
torial. The International Conference on Par-
[12] C. Lattner and V. Adve. Automatic Pool Al-
allel Architeture and Compilation Techniques
location for Disjoint Data Structures. In Proc.
(PACT2000), Oct. 2000.
ACM SIGPLAN Workshop on Memory System
Performance, Berlin, Germany, Jun 2002.
[2] A. Ayers,
S. de Jong,
J. Peyton,
and
R. Schooler. Scalable cross-module optimiza-
[13] C. Lattner and V. Adve. Data structure anal-
tion. In Proc. SIGPLAN ’98 Conf. on Pro-
ysis: An efficient context-sensitive heap anal-
gramming Language Design and Implementa-
ysis.
Tech. Report UIUCDCS-R-2003-2340,
tion, pages 301–312, Montreal, June 1998.
Computer Science Dept., Univ. of Illinois at
Urbana-Champaign, Apr 2003.
[3] D.
Blickstein,
P.
Craig,
C.
Davidson,
N. Faiman, K. Glossop, R. G. S. Hobbs,
[14] A. Srivastava and D. W. Wall. A practical
and W. Noyce.
The gem optimizing com-
system for intermodule code optimization at
piler system.
Digital Technical Journal,
link-time. Journal of Programming Languages,
4(4):121–136, 1992.
1(1):1–18, Dec. 1992.
[4] M. Burke and L. Torczon. Interprocedural op-
[15] R. Stallman. The GNU C compiler. Free Soft-
timization: eliminating unnecessary recompi-
ware Foundation, 1991.
lation.
ACM Transactions on Programming
Languages and Systems (TOPLAS), 15(3):367–
[16] D. Wall. Global register allocation at link-time.
399, 1993.
In Proc. SIGPLAN ’86 Symposium on Com-
piler Construction, Palo Alto, CA, 1986.
[5] R. Cohn, D. Goodwin, and P. Lowney. Opti-
mizing Alpha executables on Windows NT with
Spike. Digital Technical Journal, 9(4), 1997.
[6] R. Cytron, J. Ferrante, B. K. Rosen, M. N.
Wegman, and F. K. Zadeck. Efficiently com-
puting static single assignment form and the
control dependence graph.
ACM Transac-
tions on Programming Languages and Systems,
pages 13(4):451–490, October 1991.
[7] D. Dhurjati, S. Kowshik, V. Adve, and C. Lat-
tner. Memory safety without runtime checks or
garbage collection. In Proc. 2003 ACM SIG-
PLAN Symposium on Languages, Compilers,
and Tools for Embedded Systems, Feb 2003.
[8] C. Dulong, R. Krishnaiyer, D. Kulkarni,
D. Lavery, W. Li, J. Ng, and D. Sehr. An
overview of the Intel IA-64 compiler.
Intel
Technology Journal, (Q4), 1999.
[9] A. Holler and Hewlett-Packard Company.
Compiler optimizations for the PA-8000. In
Proc. IEEE International Computer Confer-
ence, 1997.
[10] C. Lattner. LLVM: An infrastructure for multi-
stage optimization.
Master’s thesis, Com-
puter Science Dept., University of Illinois at
Urbana-Champaign, Urbana, IL, Dec 2002. See
http://llvm.cs.uiuc.edu.
[11] C. Lattner and V. Adve.
LLVM As-
sembly
language
reference
manual,
http://llvm.cs.uiuc.edu/docs/langref.html.