The Compressor: Concurrent, Incremental, and Parallel
Compaction ∗
Haim Kermany

Erez Petrank

Dept. of Computer Science
Technion - Israel Institute of Technology
Haifa 32000, Israel
{haimk,erez}@cs.technion.ac.il

Abstract
The widely used Mark-and-Sweep garbage collector has a drawback in that it does not move objects during collection. As a result, large long-running realistic applications, such as Web application servers, frequently face the fragmentation problem. To eliminate fragmentation, a heap compaction is run periodically. However, compaction typically imposes very long undesirable pauses
in the application. While efficient concurrent collectors are ubiquitous in production runtime systems (such as JVMs), an efficient
non-intrusive compactor is still missing.
In this paper we present the Compressor, a novel compaction
algorithm that is concurrent, parallel, and incremental. The Compressor compacts the entire heap to a single condensed area, while
preserving the objects’ order, but reduces pause times significantly,
thereby allowing acceptable runs on large heaps. Furthermore, the
Compressor is the first compactor that requires only a single heap
pass. As such, it is the most efficient compactors known today,
even when run in a parallel Stop-the-World manner (i.e., when the
program threads are halted). Thus, to the best of our knowledge,
the Compressor is the most efficient compactor known today. The
Compressor was implemented on a Jikes Research RVM and we
provide measurements demonstrating its qualities.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors—Memory management (garbage collection)
General Terms

Languages, Performance, Algorithms.

Keywords Runtime systems, Memory management, Compaction,
Garbage collection, Concurrent garbage collection.

1. Introduction
Today’s SMP machines that run modern applications using large
heaps present new challenges in designing suitable garbage collectors. In particular, modern servers are required to operate continuously and remain highly responsive to extremely frequent client
∗ This

research was supported by Intel Corporation.

requests and very large heaps. To allow non-intrusive garbage collection, concurrent collectors have been designed and implemented
in various modern runtime systems. Concurrent collectors run in
parallel to the application on a separate thread using part of the
overall computing resources, while the application continues to run
on the rest of these resources.
With the exception of the Sapphire collector [14], concurrent
garbage collectors do not move objects during collection. In other
words, the garbage collector only reclaims unreachable objects.
Consequently, the heap may become more and more fragmented as
holes are created in between the objects that survive the collections.
These holes make allocation more costly. Furthermore, the application may fail in allocating a large object even though the heap does
possesses the overall free space required for the requested allocation.
Compaction is used to solve this problem by grouping live objects together in the heap and freeing up large contiguous spaces
available thereafter for future allocation. However, known compaction algorithms execute all, or a substantial part of the compaction while the application is halted, freezing the application
for a large amount of time. Compaction is notorious for imposing
lengthy pause times. Thus, even if the runtime system employs concurrent garbage collectors, eventually a compaction may be triggered causing infrequent, yet extended, pauses.
In this paper, we present the design and implementation of the
Compressor: a new efficient compaction algorithm that requires
only short pauses and is suitable for Java and C# running on modern
SMPs and supporting large heaps. The Compressor is incremental
(i.e., application threads contribute some compaction work during
each allocation or mutation), parallel (i.e., compaction work can
run in parallel on multiple processors), and concurrent (i.e., parts
or all of the collection can run concurrently while the application is
executing). In addition to reducing pause times, the Compressor is
the first compactor that requires only one heap pass while achieving
full compaction. Namely, it compacts the entire heap to a single
packed area and preserves the order of allocated objects. All known
compactors require at least two passes. Therefore, a parallel version
of the Compressor, designed to work while the program is halted
(in a Stop-the-World manner) is, to the best of our knowledge, the
most efficient compactor known today.
1.1 Technique used

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
PLDI’06 June 11–14, 2006, Ottawa, Ontario, Canada.
c 2006 ACM 1-59593-320-4/06/0006. . . $5.00.
Copyright °

To achieve concurrency, the Compressor makes use of the system’s
page protection mechanisms as in some previous work [3, 4, 10,
22]. In the beginning, the application threads are directed to work
with objects as if they had already moved. However, when they
try to access objects not yet moved, a trap is sprung and an area
surrounding the attempted access location is moved. Its pointers

are updated to point to the new locations of the objects’ referents.
More details appear in Section 2 below.
For a more detailed introduction to garbage collection and memory management, the reader is referred to the book by Jones [15].
1.2 Organization
An overview of the Compressor design is provided in Section 2.
The design details are given in Section 3. The implementation
is described in Section 4 and the measurements are presented in
Section 5. Related work is discussed in Section 6 and we conclude
in Section 7.

it highly efficient. Typically, a stop-the-world garbage collector is
more efficient than a concurrent one, but imposes longer pauses.
However, the concurrent Compressor has an important virtue that
makes it more efficient: it is cache conscious. The concurrent Compressor is cache conscious because a page is copied only when the
objects on this page are needed by the program. Thus, the operation
of the Compressor is highly coordinated with the operation of the
program with respect to locality. We now provide the details of the
algorithm.

3. The design details

2. An overview of the Compressor design

The details of the algorithm are given in the sections below.

We start with an overview of the compactor. A detailed description
appears in Section 3 below. The Compressor does not actually compact the heap into itself. Instead, akin to a copying collector, it compacts the heap into a second space. Yet, unlike copying collectors,
the Compressor preserves the order of objects in the heap and does
not require the safeguarding of a large space for the collection. This
desirable behavior is achieved through the use of (standard) virtual
memory operations. The Compressor also satisfies the following
useful property. After compacting a page (or several pages), it is
able to return this page (or pages) to the operating system. Thus,
during compaction the algorithm repeatedly allocates new virtual
pages, but at the same time returns the same number of (or more)
pages to the operating system.
The Compressor assumes as input a standard markbit vector
that is typically output by any marking procedure. For example,
the state-of-the-art On-the-Fly marking procedure presented by
Azatchi et al. [5] can be used to produce a markbit vector on-thefly. This vector has one bit for each heap word,1 and for each live
object in the heap, the two bits representing the first and last words
of the live object are set.
A small auxiliary offset table in the spirit of the IBM’s compactor [1] is used to compute the relocation function. Namely, given
an address of an object in the heap, this table can be used to compute the new address to which the object is moved. This table is
first computed based on the markbit vector. This computation can
be executed concurrently and without accessing the actual objects
in the heap.
We start with the simpler, parallel version of the Compressor.
In this version, several compaction threads may move and update
pointers in parallel. Each Compressor thread finds a page that has
not yet been moved. It moves the objects on this page to their new
location according to the new addresses computed from the offset
table. A new virtual page may be allocated at this point to accept the
moved objects. Next, the thread traverses the objects it has moved
and fixes their pointers using the offset table again. The pages from
which the objects were moved can then be safely returned to the
operating system. In practice, the size of an area handled by a thread
may be (much) larger than a single page.
To obtain a concurrent compactor, we use traps to protect the
virtual space into which the objects are moved. After computing
the offset table concurrently, the above-mentioned virtual space is
protected and the roots are updated to point to the target locations
(using the offset table). Next, a concurrent thread moves the pages
of the heap while the program threads provide some help via execution of traps. A program thread trying to access a not-yet-moved
object springs the trap, which moves the object and its adjacent objects to their target location in order to fill a target page. The trap
also updates the pointers of the moved objects.
Note that each page of the heap is touched only once by the
Compressor, whether parallel or concurrent, which is what makes

3.1 Nomenclature, data structures, and virtual memory
operations

1 Assuming

The Compressor uses two virtual address spaces, each of them of
the same size as the heap. We call these spaces virtual since they are
not always mapped to physical addresses. During each compaction
objects will be moved from one virtual space (denoted from-virtualspace) to the other virtual space (denoted to-virtual-space) and
their roles will change thereafter.
Four major data structures are used. First, the Compressor assumes a markbit vector that is output by the garbage collector’s
marking phase. This vector has one bit for each word in fromvirtual-space. Assuming objects are word aligned, this vector will
hold set bits corresponding to the first and last words of each live
object. Second, the Compressor employs an offset vector. We divide
the heap (or actually the from-virtual-space) into blocks. The size
of a block is a parameter that is typically set to 512 bytes. For each
such block, there is an entry in the offset vector holding a pointer to
the location in to-virtual-space to which the first object (that start)
in the block is moved. Finally, we use two smaller tables, denoted
the first-object vector and the status-table. The first-object vector
has a pointer for each page P of to-virtual-space (typically a page
size is 4096 bytes) referencing the location in from-virtual-space of
the first object that is moved into Page P by the Compressor. The
status-table has a state for each to-virtual-space page. The size of a
state should be the minimum size that can be used with a cmp-andswap operation, typically a word or a byte. The status-table signifies for each to-virtual-space page whether it is UNHANDLED,
i.e., has not been moved to yet, or BUSY, meaning that it is currently being moved to, or HANDLED, meaning that all its objects
were already moved and their pointers updated.
The Compressor will use the following virtual memory services
widely available on standard operating systems (we used Linux).
These services and some of their usage are further described by
Appel and Li [4].
Map: Map a virtual page to a physical page.
UnMap: Unmap a virtual page from its associated physical page.
ProtN: Protect a range of virtual pages from read and write access.
UnProt: Remove the protection from a virtual page.
TRAP: Perform a specified routine upon access to a protected
virtual page.
DoubleMap: Map one physical page to two different virtual
pages.2
The double mapping in the above last item is used when working
with a protected page during trap handling. While the page remains
protected from access by the program threads, the Compressor
2 This

objects are word-aligned.

service is actually implemented using the Linux system calls
shmget() and shmat().

-live object

Procedure Get-New-Address(old: address)
begin
1.
blockNumber := Get-Block-Number(old)
2.
blockAddress := Get-Block-Address(old)
3.
offsetInBlock := Total-Live-Data(blockAddress,old)
4.
newAddress :=to-virtual-space+
offsetVector[blockNumber] + offsetInBlock
end

Offset vector

0

50

100

125

200

275

325

350

markbit vector
from space

0 1
1000

2

3 4 5 6 78

9

1100 1200 1300 1400 1500 1600 1700

Figure 2. An example: Getting the new address of an object.

Figure 1. Get-New-Address() Procedure
will touch objects on it by using a second virtual view that is not
protected.
3.2 The parallel stop-the-world Compressor
We start by describing the details of the simpler parallel compactor.
The more involved Compressor, which is also incremental and
concurrent, is described later in Section 3.3.
The Compressor starts by computing the addresses to which
objects move, recording this information succinctly. The order of
objects in from-virtual-space is preserved during the move to tovirtual-space. Thus, the new address of an object O is the bottom
address of to-virtual-space plus the total accumulated size of the
live objects residing there before the object O from from-virtualspace arrived. We note that the markbit vector has the required
information to compute the new address for each object, and so
a heap pass is not necessary at this stage. In order to facilitate fast
computation of a new address for a given object, the Compressor
prepares helpful information in the offset vector. Using a single pass
over the markbit vector, the Compressor computes, for each block
B, the sum of live space before it, and stores this number in the
offset vector. This is exactly the location in to-virtual-space into
which the first live object in the block B is moved.
A second computation executed during the same pass over the
markbit vector is one that finds, for each to-virtual-space page P ,
the from-virtual-space location of the first object that moves into
P during the compaction. This information is obtained in the same
(single) pass over the markbit vector and is stored in the first-object
vector.
We stress that the use of a forwarding pointer is avoided and
this process does not access the actual objects in the heap. Only
the small offset and markbit vectors are touched. The pseudocode for using the offset vector to translate an old address into
a new one (i.e., determine the address to which a given object
should be moved) appears in Figure 1. This procedure first assumes a macro Get-Block-Number that returns the block number of a given address (using a shift and a subtraction), a second
macro Get-Block-Address computing the beginning address of
the block that contains a given address (by zeroing the address’
least bits), and a final macro Total-Live-Data, which computes
the size of live space in the given block between its beginning and
the given address. This macro uses the markbit vector to obtain the
required information.
Figure 2 depicts an example of how the updated new address of
a from-virtual-space object is computed. In this example the base
address of from-virtual-space is 1000. The block size is 100 bytes.
Objects 2 and 4 are 50 bytes each, object 6 is 75 bytes and the other
objects are 25 bytes each. The markbit vector has a bit set for the
beginning and end of each object. For every block B, the entry at
the offset vector of B amounts to the total size of the live objects
until B. To get the new address of object 8, whose old address
is 1575, we first obtain the block number (block number 5, start
counting from 0) by rounding down the value of (1575 - 1000)/100.
The address of the block is calculated by zeroing the least two digits
of 1575 to 1500. Using the markbit vector, we calculate the total

Procedure Handle-Move-To-Page(pageNumber: int)
begin
1.
pageAddress := Get-Page-Address(pageNumber)
2.
map(pageAddress,PAGE-SIZE)
3.
F := first-object[pageNumber]
4.
L := first-object[pageNumber+1]
5.
O := F
6.
to := Get-New-Address(O)
7.
while O < L do
8.
Fix-References(O)
9.
Copy(O,to)
10.
to := to + size-of(O)
11.
O := Get-Next-Object(O)
12.
unmap(F, L - F)
end

Figure 3. Handle-Move-To-Page() Procedure

size of live objects inside the block, i.e., from 1500 to 1575. This
amounts to the size of object number 7, which is 25 bytes. These 25
bytes are added to the entry of the block at the offset vector which
is 275 bytes and we get that the total data until object 8 becomes
300. To get the new address of the object we add this 300 to the
bottom address of to-virtual-space.
After completing the preparation of the offset vector and the
first-object vector, the actual compaction may begin. Objects are
moved page by page according to the target pages at to-virtualspace, and their pointers are immediately updated to point to the
new locations. These two vectors provide all the information required for the move after which the pointers are conveniently updated.
The compaction itself starts by updating the roots. The job of
moving and fixing the rest of the objects is partitioned into tasks.
Each task is associated with one or more to-virtual-space pages
and consists of moving objects into the associated to-virtual-space
pages and updating the pointers of the moved objects to point to the
new locations of their referents. Note that the objects in a task are
determined by the target location in to-virtual-space and not by the
original location. Each compaction thread undertakes a task upon
itself via synchronization with the other Compressor threads. Compaction tasks can be executed in parallel with no synchronization.
To move objects into a given to-virtual-space page P , the Compressor starts by mapping a new physical page to the to-virtualspace page P . The objects are then moved starting at the first object
specified by the first-object vector for Page P . The first location is
computed using the offset vector and the rest of the objects follow
thereafter. Then, their references are updated using the offset vector.
Finally, the virtual pages in from-virtual-space whose objects were
moved are unmapped.3 Using this technique, alternating pages are
3 In order to determine whether a from-virtual-space page can be unmapped,
the Compressor needs to know that all the data in the from-virtual-space
page has been copied already. This can be determined by checking the
status-table entries of the to-virtual-space pages residing just before and/or
after the said page.

Procedure Parallel-Compact
begin
1.
stop-Program-Threads
2.
Calculate-Offset-Vector-and-First-Object-Vector
3.
for every root r do
4.
r := Get-New-Address(r)
5.
Spawn compacting threads to perform:
6.
P := Get-Next-Unhandeled-Page
7.
while P != NULL do
8.
Handle-Move-To-Page(P)
9.
P := Get-Next-Unhandeled-Page
10.
If all threads done then Resume-Mutators
end

Figure 4. Parallel-Compact() Procedure

mapped and unmapped, ensuring that no physical memory overhead is required for the compaction.4
The pseudo-code for handling a to-virtual-space page appears
in Figure 3. The code uses the Fix-Reference(O) procedure that
finds the pointers in object O and replaces them with the new
locations of the referents using the Get-NewAddress procedure
described above. The Get-Next-Object(O) procedure finds the
next live object after O using the markbit vector.
A delicate point that should be addressed is the handling of
objects that begin on one to-virtual-space page and end on another.
For the parallel algorithm we may arbitrarily decide that an object
is associated with the to-virtual-space page on which it begins and
work with this assumption. We will need a more involved treatment
for the concurrent collector which is handled in Section 3.3 below.
The pseudo-code for the parallel Compressor appears in Figure 4. All program threads are halted and the two vectors are computed. Since this operation is fast, we did not bother parallelizing
it, although such parallelization can be done using simple tricks.
Parallelization of the main work is straightforward. We use several
compaction threads, one for each available processor, and let each
obtain a task and execute it. The pseudo-code equates a task with
a single page, but, of course, a task may consist of several pages
needing to be fixed. The Get-Next-Unhandled-Page procedure
dispatches tasks to the compaction thread in a synchronized manner. The standard trade-off between balancing the work and minimizing synchronization needs to be fine-tuned here: larger tasks
mean less synchronization during dispatching but also less work
balance, and vice versa.
An important property of this algorithm is that it requires only
one heap pass. Previous algorithms required (at least) two heap
passes for the compaction. Typically, one pass was used to move
the object and another pass was used to adjust the references so that
they point to the new locations. The pre-computation of the offset
vector gives the algorithm its main efficiency advantage. Both the
move and the pointer updates can be executed during the same heap
pass. This simple idea has never been used before. Note that this
idea may also be used with other earlier collectors such as IBM’s
compactor [1].
3.3 A concurrent, incremental, and parallel compactor
In this section, we note that most of the parallel compaction can
be run incrementally by the program threads. Depending on the
availability of idle processor time, low priority background threads
can also execute some compaction work.
4 We address the situation of worst case behavior in Section 4.1. In that case,

the trap routine ensure that pages are unmapped when a worst case scenario
is detected.

The two major steps of the compaction algorithm described
above can be executed while the application is running. First, it
is possible to compute the offset vector and the first-object vector
concurrently with the running threads. Second, using some modifications, we can also move pages and update pointers concurrently
with the program run. The program threads need to be stopped only
to fix the roots.
Since fixing the roots requires the information in the offset vector, we must compute it before stopping the program threads. Thus,
the first phase still consists of computing the offset and the firstobject vectors which is done incrementally. Each program thread
contributes some computation when allocating a new object. The
first problem that arises here is that new objects may be allocated
to holes in the heap between live objects. Such allocations interfere
with the computation of the offset vector, as they change the values
that should be output.
In order to prevent interference of newly allocated objects during this computation, new objects are allocated to to-virtual-space.
Thus, they will not be moved in later stages. However, such objects cannot be completely ignored. Their pointers point to the objects’ original locations and we must now update them to point to
the objects’ new locations. We denote the pages in to-virtual-space
containing newly created objects as require-update pages.
After calculating the offset and the first-object vectors concurrently, we move on to protecting all the virtual to-virtual-space
pages. Note that the program threads are not using this virtual space
at this time, and therefore, the protection of these pages can be
set concurrently while the program is running and using the fromvirtual-space. We then stop the program threads. While the program threads are halted, we update the roots to point to their referents’ new locations and we protect the require-update pages. Observe that we cannot protect the require-update pages concurrently
with the program run, because the program threads are using these
pages. At this point the program threads are resumed.
From this point and on, the program threads will never access the from-virtual-space. The roots only reference addresses
in to-virtual-space, which is protected, and therefore, traps are
bound to be triggered soon thereafter. When a program thread accesses a to-virtual-space location and gets trapped, the trap routine moves the appropriate objects from from-virtual-space into the
to-virtual-space page on which the trap was sprung. The parallel
algorithm was designed for moving objects according to their target to-virtual-space pages. Therefore, the trap operation is readily
available to us. However, we must distinguish two cases. Traps on
require-update pages do not move objects, but only update references. Traps on the rest of the to-virtual-space pages move the objects and update references as before. After handling a to-virtualspace page, we can unprotect it and the program threads can go on
working with unprotected pages. Our invariant of letting the program threads access only pointers in to-virtual-space holds since
these pointers, which the program can read, are already updated
and point to to-virtual-space only.
While executing the trap, the program threads use the DoubleMap virtual memory primitive in order to access the protected
page. A second virtual page is mapped to the same physical page
but is not protected and may be used by the trap code to read and
write to the protected to-virtual-space page. Other program threads
that try to access the same to-virtual-space page while it is being
handled by a trap, will still be trapped. These threads will wait (and
yield the processor) until the first trap finishes and the protection of
the page is lifted.
To coordinate the handling of pages, the Compressor threads
use an additional structure denoted the status-table. The status-

Procedure Trap-Routine(A: address)
begin
1.
P := Get-Page-Number(A)
2.
oldStatus :=
cmp-and-swap(status-table[P],UNHANDLED,BUSY)
3.
if oldStatus = UNHANDLED then
4.
if P is a require-update page then
5.
Fix-Page(P)
6.
else //P is a standard to-virtual-space page
7.
Move-To-Virtual-Page(P)
8.
unprotect(P)
9.
set(statusTable[P],HANDLED)
10.
elseif oldStatus = BUSY then
11.
while test(statusTable[P]) != HANDLED do
12.
wait// yield processor.
end

Figure 5. Trap-Routine() Procedure
r
o
o
ts

Figure 7. Concurrent-Compact() Procedure
A

0

1 2 3 45

6 7 8

9

r
o
o
ts

B

0

1 2 3 45

6 7 8

9

4 567

r
o
o
ts

C

0

12 3

8

9

4567
to-space pages

from space
-live object

-unmapped page

to space
-page protection

Figure 6. An example: The execution of a trap.

table contains a byte (or word) for each to-virtual-space page.5 The
status of all to-virtual-space pages is initialized to UNHANDLED.
In the beginning of the trap routine, the trap tries to modify the
status of the page from UNHANDLED to BUSY using a cmp-andswap operation. If the status is modified successfully, then the trap
handles the page and eventually changes the status to HANDLED
using the atomic write. If the cmp-and-swap operation fails, then
another mutator is already handling the page and the trap routine
just waits until the status of the page is changed by the other thread
into HANDLED.
Allocations that occur concurrently with the run of the Compressor are handled as follows. All allocations performed after the
mark phase is completed are put in the require-update pages, as
discussed above. Once the roots are updated, newly created objects
do not require pointer updates. From this point and on, we allocate objects in to-virtual-space pages that are marked HANDLED.
These pages are not protected and are not touched further by the
Compressor.
Let us refer again to the delicate issue of objects that stretch
along more than one to-virtual-space page. In order to remove the
protection of a to-virtual-space page, we must copy all the data that
belongs to this page, even if this data does not consist of complete
objects. Indeed, we copy the end of the object that starts on the
previous page and the beginning of an object that ends on the next
page. To save physical pages (and involved management), we chose
not to copy the other parts of these objects, and thus, we do not
5 The

Procedure Concurrent-Compact
begin
1.
Execute the following incrementally and/or concurrently
2.
Calculate-Offset-Vector
3.
protect to-virtual-space pages
4.
stop-Mutators
5.
protect require-update pages
6.
for every root r do
7.
r := Get-New-Address(r)
8.
Resume-Mutators
9.
// Now traps occur on accessed protected pages.
10.
If an idle processor is available, execute concurrently:
11.
P := Get-Next-Unhandeled-Page
12.
while P != NULL do
13.
Handle-Move-To-Page(P)
14.
P := Get-Next-Unhandeled-Page
end

size of a status-table entry is the minimum entity on which we can
invoke a cmp-and-swap synchronization operation.

need to map the adjacent pages until they need to be copied in their
entirety.
The code of the trap routine appears in Figure 5. The cmp-andswap routine atomically compares the contents of a memory location (the first parameter) to a given value (the second parameter)
and, if they are the same, modifies the contents of this memory
location to a given new value (the third parameter). It returns the
value that existed in the memory location before the operation. In
our case, if the returned value is UNHANDLED, then the trap becomes the mover of this page. Otherwise, it executes wait until the
other thread handling this page marks it HANDLED.
Figure 6 illustrates the course of a single trap (for simplicity,
the first one.) In the example, ten live objects reside in the fromvirtual-space marked 0 to 9. Objects 0 to 3 should move to the first
to-virtual-space page, objects 4 to 7 should move to the second
to-virtual-space page and the objects 8 and 9 should move to the
third to-virtual-space page. The state of the heap before the trap
is depicted in the upper part (State A). Since only three to-virtualspace pages will be used, only these three pages are protected. At
Stage A no object has been moved yet, but the root has already
been fixed and is pointing to the new location of Object 6. When
the application tries to touch Object 6, a trap is triggered, invoking
the trap routine that moves and fixes the references of all the objects
that move to the second to-virtual-space page, i.e., objects 4 to 7.
Object 4 has a reference to Object 2 and it now points to the new
location of Object 2. After moving and fixing the objects (State
B), the trap routine unprotects the second to-virtual-space page and
unmaps the appropriate pages in from-virtual-space (State C). Note
that at the end of the trap Object 8 (in from-virtual-space) is still
pointing to the old address of Object 7 despite the fact that Object 7
already moved. This pointer will be updated when Object 8 moves.
The overall operation of the concurrent compaction appears in
Figure 7. First, the auxiliary vectors are computed and pages are
protected incrementally. Each thread contributes a bit to this computation when it allocates. Furthermore, if an idle processor exists,
a concurrent low-priority thread is spawned and it helps in finishing this computation more rapidly. Next, while the program threads
are halted, the require-update pages are protected and the roots get
updated. Then, program threads are resumed and the moving of objects and updating of pointers are carried out incrementally by traps
that occur on protected pages. Again, if an idle processor is available, it is used to aid the program threads and finish the compaction
more rapidly.

3.4 Special treatment of dense blocks
In some (rather frequent) cases, the objects inside a block are
already dense. This usually happens to blocks of older objects
that were compacted earlier and remained reachable or blocks that
contain only one object. Recall that a typical block size is 512 bytes.
For objects in such blocks, the computation of the new address is
simpler. It is enough to add a single number ∆ to the address of
each of these objects to obtain its new location. When such a case
is identified (during the preparation of the offset vector), the value
of ∆ is put in the offset vector for this block. To identify this special
case, the least significant bit of the stored value is set. This method
turned out to be highly effective in improving the efficiency of the
Compressor. A naive version of this method was proposed in [1],
where they either treated all blocks as condensed (and obtained
a compacted heap that is not fully compacted) or they did not
use this method at all. Another collector that made use of dense
areas is the MC2 collector [24]. This collector divides the heap into
windows and avoids copying objects in windows with high density.
The windows employed by MC2 are typically much larger than
the Compressor’s blocks. Thus, while the MC2 collector looks for
mostly-dense windows, the Compressor may expect to find a large
number of perfectly dense blocks.
3.5 More Improvements
Some improvements were added to the basic algorithm to enhance
its performance. We list these improvements here.
Moving more than one page When a trap occurs, it imposes
some overhead. Thus, it makes sense to move more than one page
in each trap. We typically moved eight pages per trap. This optimization also reduces the space overhead of the status-table and
the first-object table.
Double mapping to-virtual-space in the beginning The trap routine needs to touch a protected page without lifting the protection.
Thus, a second virtual mapping of the heap is required, which is
not protected and is available for use by the trap routine. To reduce
the trap time, we DoubleMap the entire to-virtual-space to a third
virtual space when compaction begins. Of course, this mapping can
be run concurrently with the program run.

4.

An Implementation for Java

We implemented our algorithm on a Jikes RVM [2], a Java virtual
machine, using the Fast-Adaptive compiler of version 2.3.4 upon
Linux Red-Hat 7.2. The entire system, including the collector itself,
is written in Java (extended with unsafe primitives available only to
the Java Virtual Machine implementation to access raw memory).
To mark the live objects before the compaction, we used the
Jikes provided mark-and-sweep algorithm with some modifications. First, we modified the segregated-fit allocation scheme to use
allocation-caches instead. This was done in order to be able to slide
the objects down the heap without worrying about their sizes. Second, we modified the collector to switch the sweep phase with our
compaction algorithm when compaction is triggered. Note that the
Jikes mark-and-sweep collector is a parallel stop-the-world collector, but a concurrent collector (which does not currently exist in
Jikes) could be equally used.
4.1 Space overhead
In our implementation we chose a block size of 512 bytes, and
moved eight pages per trap. With this choice, the space overhead
for the offset vector consists of a single word for each 512 heap
bytes, i.e. 1/128. The first-object and the status-table table require
a word per page, contributing an overhead of 2 ∗ 1/1024 of the
heap, which is negligible.

The markbit vector is a data structure that is attributed to the
garbage collector, but requires another 1/32 of the heap. If an 8byte alignment is employed by the JVM (such as the IBM JVM),
then half the overhead is required for the markbit vector.
More space overhead that should be considered is the physical
space required during the alternation of mapping to-virtual-space
pages and unmapping from-virtual-space pages during the compaction. Note that if the from-virtual-space objects that are moved
into the to-virtual-space are spread among only two pages and these
pages contains more live objects that have not yet been copied, then
no pages will be unmapped. Such occurrences add to the space
overhead. To avoid this violation, we monitor these occurrences,
and when the budget of additional mapped pages surpasses a constant threshold, we let the trap routine move some more pages to
ensure full evacuation of from-virtual-space pages. Such an occasion has never appeared in practice.
A limitation of our compactor on a 32-bit architecture is that the
size of the virtual memory must be large enough to contain three
virtual heap spaces. These include from-virtual-space to-virtualspace and the additional view that allow modifying a protected tovirtual-space location. This limitation may create some problem for
large heaps on a 32-bit machine, but the problem disappears with
the modern 64-bit architectures.

5. Measurements
The large server benchmark that we used was the SPECjbb2000
benchmark [11]. This is probably the more interesting benchmark
for the Compressor, employing several program threads and larger
heaps. For clients benchmarks we used the SPECjvm98 benchmark suite [11] and the Dacapo benchmark suite [25] version
beta051009.
The platform used to run the multithreaded SPECjbb2000 was
a 2-way HP workstation xw8000 with a 2.4Hz Intel Pentium III
Xeon processor and 2GB of physical memory, running RedHat
Linux version 2.4.20-31.9smp. The SPECjvm98 benchmarks and
the Dacapo benchmarks were run on a 2.8 GHz Pentium 4 Intel
uniprocessor with 512M RAM, running RedHat Linux version
2.6.5-1.
To justify the use of compaction, we chose relatively small heap
sizes (for the Jikes RVM), yet, not tiny ones. For SPECjbb2000 we
used a heap size of 256MB, for the Dacapo and for the SPECjvm98
we used various heap sizes: 24M for fop and ps; 32M for antlr,
bloat, jython and pmd; 48M for hsqldb; 72M for xalan; 16MB for
jess, jack, and db; 18MB for mtrt; and 20MB for javac. The Jikes
RVM requires larger heaps than other JVMs since it uses the same
heap for the JVM runtime data structures, the Compressor data
structures (including the tables) and the application. Each reported
measurement is the average of five runs.
As noted earlier, in our measurements the size of a block is 512
bytes and the number of pages that are moved on a trap is eight.
The compaction was invoked every 10 collections in SPECjbb2000
and every 5 collections on the clients benchmarks. We specifically
mention when we deviated from these parameters, usually for measurement used to tune these parameters. The two versions of the
Compressor, the parallel stop-the-world and the concurrent, are denoted STW and CON, correspondingly.
Unfortunately, there is no previous compaction algorithm provided with Jikes to which we can compare ours. Comparing it to a
standard garbage collector does not seem fair, because compaction
is notorious for being much slower than a collection, sometimes by
a factor of 10; yet no reports on this factor appear in the literature.
To check the efficiency of the Compressor, we compared it against
two other collectors that are implemented in the Jikes RVM. The
first was the Mark-and-Sweep, denoted MS. The second was the
generational Appel collector denoted GenMS. We also tried a copy-

CON
STW
GenMS
MS

17000

)s 15000
po
(t
up 13000
hg
uo
rh 11000
T

80

s 70
/m
bK60
sn 50
oit 40
cao 30
lA20
10

9000
7000

2WH
4WH
6WH
8WH

90

0
-200

1

2

3

4

5

6

Number of WareHouses

7

8

Figure 8. Throughput of Specjbb2000.

jbb 2-WH
jbb 4-WH
jbb 6-WH
jbb 8-WH

STW

MS

319.55
516.89
641.53
770.14

229.37
287.32
315.71
372.46

GemMS
(nursery only)
279.73(29.16)
323.64(36.72)
347.42(31.17)
374.41(22.90)

Table 1. The pause time of the stop-the-world algorithms (ms).
ing collector, but it failed to run on most benchmarks with the tight
heap used. The implementations of these collectors are described
on [9].
5.1 Server performance
Figure 8 depicts the results running the SPECjbb2000 benchmark
over one to eight warehouses and using the four collectors. In
this benchmark, the applications starts with one warehouse and increases the number of warehouses to eight, where each warehouse
corresponds to an application thread. For each number of warehouses the throughput is checked in operations per second (ops).
Higher ops means higher throughput. The better throughput stems
from the low efficiency overhead together with the advantage of
running with a compacted heap.
Interestingly, the concurrent Compressor provides almost equal,
and sometimes even better performance than the stop-the-world
parallel Compressor. This can be explained by the cache consciousness of the concurrent Compressor. In particular, objects are moved
to their new location in the to-virtual-space only when the application accesses them.
In addition to checking the impact of the Compressor on the
overall throughput, we also measured the pauses that it imposes.
For the parallel Compressor, we measured the time it took to run
the compaction while the program was halted. A similar measure
was run on the other garbage collectors. The results are presented
in Table 1. Traditionally, compaction has been considered a hazard
to pause time, as its execution used to take much more time than
the execution of a typical (full heap) collection (sometimes by a
factor of 10). We can see that the Compressor still takes more time
than a typical collection, but its running time is not that far from
the collectors measured, and is usually less than a factor of 2. Of
course, platforms with more parallel processors will benefit more
from its parallelism.
Measuring the behavior of the concurrent Compressor is more
problematic. The program threads are interrupted by traps that they
execute. We would like to evaluate how much of the CPU time is
really used to serve the application and separate it from the time
spent on executing the traps. Since the traps are frequent and short,
and since the traps start and end in the operating system code, this

0

200

400

600

800 1000 1200 1400

Time From resuming the application(ms)

Figure 9. SPECjbb2000: the allocation rate of the program as a
function of time.
3

5

10

15

20

30

jbb2wh

0.987

0.994

1.000

1.006

1.006

0.995

jbb 4wh

0.983

0.980

1.000

1.006

1.010

1.008

jbb 6wh

0.958

0.985

1.000

1.013

1.019

1.014

jbb 8wh

0.949

0.979

1.000

1.012

1.022

1.024

jess

1.126

1.075

1.000

0.953

0.904

0.830

db

1.015

1.005

1.000

1.006

1.001

0.996

javac

1.198

1.100

1.000

0.973

0.990

0.988

mtrt

1.005

1.016

1.000

0.977

0.991

0.961

jack

1.065

1.023

1.000

0.992

0.957

0.953

antlr

1.003

0.991

1.000

0.996

0.998

0.991

bloat

1.090

1.084

1.000

0.936

0.873

fop

1.300

1.198

1.000

1.004

1.031

1.006

hsqldb

1.279

1.111

1.000

1.007

1.030

0.995

jython

0.968

0.983

1.000

0.983

0.999

0.998

pmd

1.025

1.010

1.000

0.966

0.971

0.999

ps

0.995

0.990

1.000

0.999

0.997

0.988

xalan

0.888

0.948

1.000

1.012

1.024

1.030

n/a

Table 2. The performance speedup as a function of the number of
collections between compactions.
cannot be easily measured. We chose to measure the application
use of the processors by checking the allocation rate as a function
of time. Although the rate of allocation is not perfectly stable, it
gives a pretty good approximation of the amount of work done by
the program. The results are depicted in Figure 9. The X-axis in this
figure represents the time (in ms) from the point that the mutators
incrementally compute the offset and first-object vectors. At time
0, the program threads are stopped to fix the roots and protect the
require-update pages. The program threads resume thereafter, but
due to the traps, they do not show full activity until a bit later.
The results in this graph should be compared to Table 1.
Namely, we would like to compare the pause that is imposed by a
stop-the-world Compressor to the behavior shown in Figure 9. For
two warehouses, a pause of 319ms is imposed by a stop-the-world
Compressor. Yet the concurrent Compressor allows a noticeable
run of the program threads after approximately 200ms. For four
warehouses a stop-the-world compaction would take 520ms and
the program would start executing after around half that period
with the concurrent Compressor.
5.2 Collector characteristics
Tuning compaction triggering Table 2 compares the performance of the benchmarks with various triggering frequencies of
the Compressor. The best result of each benchmark is highlighted.
In all the benchmarks, excluding SPECjbb2000 and xalan, frequent
triggering improves the overall performance of the benchmarks.
Tuning the number of pages that are moved in each trap Since
the execution of a trap carries an overhead, the Compressor moves
more than one page in each trap. Moving a small number of pages

CON
STW
MS
GenMS

90

s
m
/b
K
sn
oit
ac
lo
A

2p
8p
128 p

80
70
60
50
40
30
20
10

20
c)e
s(
e 15
m
iT
la 10
to
T 5
0

0
-200

25

0

200

400

600

800

1000

1200

Time From resuming the application(ms)

Figure 10. The allocation rate as a function of time, using a different number of pages per trap, in Specjbb2000 with four warehouses.

d
b

je
s
s

1400

ja
v
a
c

ja
c
k

m
t
r
t

Figure 12. Specjvm98 benchmarks’ overall running times with the
various collectors.
25

125

20

100

CON
STW
MS
GenMS

90

s
m
/b
K
sn
oit
ac
lo
A

2p
8p
128 p

80
70
60
50
40
30

75
50

5

20
10

0

0
-200

c)e
(s 15
e
m
Til
at 10
To

0

200

400

600

800

1000

1200

1400

Time From resuming the application(ms)

Figure 11. The allocation rate as a function of time, using a different number of pages per trap, in Specjbb2000 with eight warehouses.
during a trap increases the total number of traps but reduces the
pause time of each trap. Such a reduction may allow more mutator activity during the concurrent compaction, but the increase in
the number of traps will extend the time it takes to finish the compaction and thus, postpone the time where the application may regain full CPU strength. An attempted tuning of this parameter did
not noticeably affect the throughput of the program but it did affect
the concurrency. In Figures 10 and 11 we can see that when only
two pages are moved on a trap, the program receives some CPU
share almost right after it resumes, but more time is required until
it reaches full activity.
Offset calculation overhead We measured the fraction of time
spent on calculating the offset vector and the first-object vector
in the beginning of the collection. It turned out that this part of
the compaction took around 5-7% on the SPECjvm98 benchmarks
and 11-13% on the SPECjbb2000 benchmark. Since this part of the
Compressor execution is short, we did not bother parallelizing its
operation and it is run on a single thread in our implementation.
5.3 Client performance
In Figures 12 and 13 the overall running times of the SPECjvm98
and Dacapo benchmarks with various collectors is presented. For
these small benchmarks, the Appel generational collector beats the
performance of all non-generational collectors. It actually manages
to almost refrain completely from running full collections. This
phenomenon does not occur at all with the larger SPECjbb2000
benchmark. Concentrating on the non-generational collectors, we
first note the similarly to the measurements of the Specjbb2000
benchmarks, the concurrent Compressor yields almost equal, and
sometimes even better performance than the stop-the-world parallel

25
a
n
tlr

b
lo
a
t

fo
p

h
sq
ld
b

jy
th
o
n

p
m
d

p
s

0

xa
la
n

Figure 13. Dacapo benchmarks’ overall running times with the
various collectors.
Compressor. Second, the Compressor obtains better performance
than the MS collector.
The pause times of the Compressor on these benchmarks is depicted again by checking the allocation rate over time in Figures
14, 15 and 16. These graphs should be compared to the stop-theworld pause times presented in Table 3. We can see that though
the pause time of the stop-the-word Compressor is relatively high
(22-83 ms), with the concurrent Compressor the application does
meaningful allocations after a much smaller period of time (typically 5-10 ms).

6. Related work
Compaction algorithms have been known since the 1970s. Older
compactors used the simple two-finger technique or the LISP2
algorithm [15]. A more elaborate (and elegant) solution requiring no extra auxiliary data structures is the threaded algorithm of
Jonkers [16] and Morris [19]. It was shown in [1] that modern compactors demonstrate a substantial increase in efficiency over these
collectors.
Two parallel compactors were presented in [12] and [1]. Flood
et al. [12] offered the first parallel compactor. However, their algorithm required three passes over the heap and did not move all
objects to a single compacted area. Instead, the heap was split into
N areas (where N is the number of processors) and N threads
were used to compact the heap into N/2 chunks of live objects. An
improved algorithm was proposed by Abuaiad et al. [1]. Their algorithm required only two heap passes and offered an almost perfect
compaction, in the sense that the resulting heap (after compaction)
was almost fully compacted and the order of objects was mostly
preserved. The parallel version of the Compressor does better than
the above compactors, requiring only a single heap pass, obtaining
equally balanced parallelism, and achieving a perfect compaction:

45

jess
jack

40

s 35
m
b/k 30
tea 25
r
sn 20
iot
ac 15
ol 10
A

db
mtrt
javac

5
0
-10

0

10

20

30

40

50

Time from resuming the application (ms)

Figure 14. SPECjvm98: The allocation rate of the benchmarks as
a function of time.
antlr
bloat
fop
hsqldb

40
35

et 30
ra 25
no
tia 20
co 15
lA
10
5
0
-20

-10

0

10

20

30

40

50

60

70

Time from resuming the application

Figure 15. Dacapo: The allocation rate of the benchmarks as a
function of time.
jython

30

pmd
25

ps

et
ra
no
it
ac
ol
A

xalan

20

15

10

5

0
-20

-10

0

10

20

30

40

50

60

70

Time from resuming the application

Figure 16. Dacapo: The allocation rate of the benchmarks as a
function of time.
STW
MS
GemMS
(nursery only)
jess
30.26
120.20
n/a(5.15)
jack
83.22
115.62
191.79(8.18)
db
67.77
141.51
149.47(5.34)
mtrt
61.31
160.13
156.24(4.49)
javac
67.23
179.59
178.23(8.25)
antlr
29.86
132.26
136.11(4.32)
bloat
75.68
158.48
153.34(6.43)
fop
54.69
194.01
207.32(10.97)
hsqldb
77.61
151.73
223.31(8.95)
jython
41.13
145.82
206.19(9.78)
pmd
47.28
163.48
169.47(5.34)
ps
22.94
123.16
n/a(2.72)
xalan
68.32
177.01
231.15(7.35)
Table 3. The pause time of the stop-the-world algorithms on
jvm98 and Dacapo (ms).

the order of objects is perfectly preserved and the resulting heap is
perfectly compacted. Thus, the Compressor, as far as we know, is
the best parallel compactor existing today.
Incremental compaction was suggested by Lang and Dupont
[17] and a modern variant was presented by Ben Yitzhak et al. [8].
The idea was to split the heap into regions, and compact one
region at a time by evacuating all objects from the selected region.
Extending these works to compact the full heap does not yield
the efficient parallel compaction algorithm we need. Extending
the first algorithm yields a standard copying collector (that keeps
half the heap empty for collection use). Extending the latter is
also problematic, since it creates a list of all pointers pointing into
the evacuated area; this is not appropriate for the full heap. Also,
objects cannot be moved into the evacuated area, since forwarding
pointers are kept in place of evacuated objects. Ossia et al. [22]
attempted to reduce the compaction pause times by running the
pointer updates phase concurrently using virtual memory traps such
as ours. However, moving the objects was executed in a stop-theworld manner. They proposed to reduce the pause times further by
giving up full compaction and moving only a fraction of the heap
objects. The Compressor is more efficient than their compactor as
it requires only a single heap pass. Furthermore, the concurrent
version of the Compressor runs both the move of the objects and the
pointer updates concurrently with the program threads, achieving
perfect compaction with shorter pause times.
The Metronome [6] and the bookmarking collector [13] use
segregated free lists6 allocation to achieve compaction with a single
objects traversal. However, unlike the compressor, the obtained
compaction in these works does not preserve the objects’ order and
does not compact the objects to a single area in the heap.
Some compaction algorithms (e.g., [18, 6]) use handles to provide an extra level of indirection when accessing objects. Since
only the handles need to be updated when an object moves, the
fix-up phase, in which pointers are updated to point to the new location, is not required. Nevertheless, a design employing handles is
not appropriate for a high performance language runtime system.
Copying collectors obtain compaction for free. However, they
differ from compactors because they utilize only half of the heap’s
space; they move the objects in each collection; and they do not
preserve the objects’ order. Compaction uses small auxiliary data
structures; it may be invoked when necessary; and it preserves
the allocation order of objects. The Mark-Copy collector, and its
sequel, the M C 2 collector [23, 24] are copying collectors which
minimize the additional space required for copying by running
a marking phase before the copying begins, and executing the
coping incrementally. M C 2 divides the heap into windows, and
builds remembered sets for each window during the mark phase.
These remembered sets are used to copy each window separately,
while each window is copied to the space that the former window
evacuated. Using this technic, the additional space required is at
most one window size, and the application can resume between
the copying one window and the next. Unlike the Compressor, the
M C 2 does not preserve the order of objects, its copying phase
is not concurrent and it has to scan the roots for each window
copying. The Mark-Copy algorithm uses memory services to save
memory use in a similar way to the Compressor, but this technique
is not used in the sequel M C 2 . Two more notable incremental and
concurrent copying collectors are the Baker algorithm [7] and the
Sapphire [14].

6 with

segregated free lists, memory is divided into fixed-sized pages, and
each page is divided into blocks of a particular size. Objects are allocated
from the smallest size class that can contain the objects.

7.

Conclusions

Mark-and-Sweep garbage collectors suffer from fragmentation,
which is handled (infrequently) by compaction algorithms. Compaction executions are notoriously long and impose a high overhead
on execution times and extended pauses. Reducing compaction
time and its obtrusiveness is an important goal for memory managers today, especially on modern platforms.
In this paper we presented the Compressor: a novel compacting algorithm that requires only a single heap pass. The Compressor is more efficient than previously reported compactors. First,
we presented a parallel version of the Compressor that runs efficiently on an SMP utilizing all processors while program threads
are halted. Second, we presented a concurrent version of the Compressor that runs incrementally and concurrently, with the program
threads achieving high efficiency and shorter pauses. The concurrent collector has high cache consciousness as it moves pages when
they are touched by the program threads. Because of this nice cache
behavior, its efficiency is not much below, and sometimes may even
be higher, than the efficiency of the parallel version, whereas its
pauses are much shorter.
The Compressor was implemented on the Jikes Research JVM
and measurements demonstrating its efficiency and non intrusiveness were presented.

Acknowledgments
We thank Yoav Ossia, Avi Mendelson, and Harel Paz for helpful
discussions. We thank Steve Blackburn and Daniel Frampton for
the effort they invested in making the Jikes RVM supportive of
compaction. Without their initial work, our implementation work
would have been much harder.

References

[9] Stephen M. Blackburn, Perry Cheng, and Kathryn S. McKinley.
Oil and water? high performance garbage collection in Java with
MMTk. In ICSE 2004, 26th International Conference on Software
Engineering, Edinburgh, May 2004.
[10] Hans-Juergen Boehm, Alan J. Demers, and Scott Shenker. Mostly
parallel garbage collection. ACM SIGPLAN Notices, 26(6):157–164,
1991.
[11] Spec: The Standard Performance Evaluation Corporation.
http://www.spec.org/.
[12] Christine Flood, Dave Detlefs, Nir Shavit, and Catherine Zhang.
Parallel garbage collection for shared memory multiprocessors. In
Usenix Java Virtual Machine Research and Technology Symposium
(JVM ’01), Monterey, CA, April 2001.
[13] Matthew Hertz, Yi Feng, and Emery D. Berger. Garbage collection
without paging. In Proceedings of SIGPLAN 2005 Conference
on Programming Languages Design and Implementation, ACM
SIGPLAN Notices, Chicago, IL, June 2005. ACM Press.
[14] Richard L. Hudson and J. Eliot B. Moss. Sapphire: Copying GC
without stopping the world. In Joint ACM Java Grande — ISCOPE
2001 Conference, Stanford University, CA, June 2001.
[15] Richard E. Jones. Garbage Collection: Algorithms for Automatic
Dynamic Memory Management. Wiley, Chichester, July 1996. With
a chapter on Distributed Garbage Collection by R. Lins.
[16] H. B. M. Jonkers. A fast garbage compaction algorithm. Information
Processing Letters, 9(1):25–30, July 1979.
[17] Bernard Lang and Francis Dupont. Incremental incrementally
compacting garbage collection. In SIGPLAN’87 Symposium on
Interpreters and Interpretive Techniques, volume 22(7) of ACM
SIGPLAN Notices, pages 253–263. ACM Press, 1987.
[18] Martin Larose and Marc Feeley. A compacting incremental collector
and its performance in a production quality compiler. In Richard
Jones, editor, ISMM’98 Proceedings of the First International
Symposium on Memory Management, volume 34(3) of ACM
SIGPLAN Notices, pages 1–9, Vancouver, October 1998. ACM Press.

[1] Diab Abuaiadh, Yoav Ossia, Erez Petrank, and Uri Silbershtein. An
efficient parallel heap compaction algorithm. In OOPSLA [21].

[19] F. Lockwood Morris. A time- and space-efficient garbage compaction
algorithm. Communications of the ACM, 21(8):662–5, 1978.

[2] Bowen Alpern, C. R. Attanasio, Anthony Cocchi, Derek Lieber,
Stephen Smith, Ton Ngo, John J. Barton, Susan Flynn Hummel,
Janice C. Sheperd, and Mark Mergen. Implementing Jalapeño in
Java. In OOPSLA’99 ACM Conference on Object-Oriented Systems,
Languages and Applications, volume 34(10) of ACM SIGPLAN
Notices, pages 314–324, Denver, CO, October 1999. ACM Press.

[20] OOPSLA’03 ACM Conference on Object-Oriented Systems, Languages and Applications, ACM SIGPLAN Notices, Anaheim, CA,
November 2003. ACM Press.

[3] Andrew W. Appel, John R. Ellis, and Kai Li. Real-time concurrent
collection on stock multiprocessors. ACM SIGPLAN Notices,
23(7):11–20, 1988.

[22] Yoav Ossia, Ori Ben-Yitzhak, and Marc Segal. Mostly concurrent
compaction for mark-sweep GC. In Amer Diwan, editor, ISMM’04
Proceedings of the Third International Symposium on Memory
Management, ACM SIGPLAN Notices, Vancouver, October 2004.
ACM Press.

[4] Andrew W. Appel and Kai Li. Virtual memory primitives for user
programs. ACM SIGPLAN Notices, 26(4):96–107, 1991. Also
in SIGARCH Computer Architecture News 19 (2) and SIGOPS
Operating Systems Review 25.

[21] OOPSLA’04 ACM Conference on Object-Oriented Systems, Languages and Applications, ACM SIGPLAN Notices, Vancouver, October 2004. ACM Press.

[23] Narendran Sachindran and Eliot Moss. MarkCopy: Fast copying GC
with less space overhead. In OOPSLA [20].

[5] Hezi Azatchi, Yossi Levanoni, Harel Paz, and Erez Petrank. An onthe-fly mark and sweep garbage collector based on sliding view. In
OOPSLA [20].

[24] Narendran Sachindran, J. Eliot B. Moss, and Emery D. Berger.
MC2 : High-performance garbage collection for memory-constrained
environments. In OOPSLA [21].

[6] David F. Bacon, Perry Cheng, and V.T. Rajan. Controlling fragmentation and space consumption in the Metronome, a real-time garbage
collector for Java. In ACM SIGPLAN 2003 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES’2003),
San Diego, CA, June 2003. ACM Press.

[25] Dacapo Project: The DaCapo Benchmark Suite. http://wwwali.cs.umass.edu/dacapo/index.html.

[7] Henry G. Baker. List processing in real-time on a serial computer.
Communications of the ACM, 21(4):280–94, 1978. Also AI
Laboratory Working Paper 139, 1977.
[8] Ori Ben-Yitzhak, Irit Goft, Elliot Kolodner, Kean Kuiper, and Victor
Leikehman. An algorithm for parallel incremental compaction.
In David Detlefs, editor, ISMM’02 Proceedings of the Third
International Symposium on Memory Management, ACM SIGPLAN
Notices, pages 100–105, Berlin, June 2002. ACM Press.