The Secret Lives of Garbage Collectors
Jonathan Worthington

Things I work on
Writing and teaching courses, mostly about software architecture, TDD and C#
Lead developer and architect of the Rakudo Perl 6 compiler; focus on OO, the type system, etc.
Various bits of mentoring and consulting
Various other contributions (native calling, debugger, ...)

So, this is Build Stuff and...
...if I'm going to talk about GCs here, I'd better have been building one, right?
I have been, during the last year
It's a small VM centred around meta-object programming, built as part of my Perl 6 project work
Not just me; ~15 contributors so far. I'm doing both architectural and implementation work, with a focus on the object system and the GC.

Suddenly, I'm a GC designer/hacker!
I already had a reasonable grasp of how GCs work
I had debugged one before, and was quite used to explaining the basics of the .NET one when teaching
But explaining and fixing bugs are rather different from doing design
Doing design well means understanding lots of options and being able to make sensible trade-offs

Isn't GC a, like, really specialist area?
What do you specialize in? Being a generalist.
GC is a very well researched area, with loads of well-documented algorithms and many decades of experience to learn from. I didn't need to invent, "just" select and implement.

The bad news
When I design systems, I like to collect concerns into strongly focused, loosely coupled units
Garbage collection is a real challenge here, because it is interested in memory allocation and even memory accesses - which happen everywhere!
Additionally, while many of the algorithms are quite pretty on paper, real-world implementation is full of subtleties (threads block, CPU caches can be weird, optimizers and CPUs re-order things...)

I picked a decent time to work on this...
2012 gave us a brand new edition of the leading handbook on garbage collection!
A bit over 500 pages, with loads of references
That sounds like a lot at first, but it's still 400 pages shorter than the second edition of "Programming Entity Framework"...

So, the basics...
As we execute code, we allocate objects:

    sub word_histogram($text) {
        my @words = $text.lc.comb(/\w+/);
        my %histogram;
        for @words -> $w { %histogram{$w}++ }
        return %histogram;
    }
    my %hist = word_histogram('Badger badger badger mushroom mushroom');
    my @top_5;
    # ...

Calling word_histogram allocates @words and %histogram, among other things.
So, the basics...
When we return, some things go out of scope... @words, for example, is only ever referenced from inside word_histogram

So, the basics...
At some point, we can allocate no more. Oh noes, out of memory!

Reachability analysis
The vast majority of automatic memory management schemes are based on reachability
Reachability analysis starts out from a set of roots: things referenced from local variables (the call stack), statics, etc.
It then looks at what objects the roots reference, and then their references, and so forth...
Anything that we never discover is unreachable, meaning the program can never use it again
The memory associated with these objects can therefore be released - in our example, %histogram stays reachable through %hist, while @words does not
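To make that concrete, here's a minimal sketch of reachability marking in C (not any real VM's code; the Obj layout and helper names are invented for illustration, assuming each object carries a mark flag and can enumerate its references):

    #include <stddef.h>

    typedef struct Obj Obj;
    struct Obj {
        int     marked;     /* set once the object is found reachable */
        size_t  num_refs;   /* how many references this object holds */
        Obj    *refs[];     /* the references themselves */
    };

    /* Mark everything reachable from a single root. */
    static void mark(Obj *o) {
        if (o == NULL || o->marked)
            return;                /* already seen; this also breaks cycles */
        o->marked = 1;
        for (size_t i = 0; i < o->num_refs; i++)
            mark(o->refs[i]);      /* real collectors use an explicit
                                      worklist to avoid deep recursion */
    }

    /* Kick off from the root set (call stack, statics, ...). */
    static void mark_roots(Obj **roots, size_t num_roots) {
        for (size_t i = 0; i < num_roots; i++)
            mark(roots[i]);
    }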
And, that's basically it
So, now you understand what a GC does. Beer time!

Well, actually...
There are some not-so-basics too...
How do we find a piece of memory to allocate?
What is the set of roots we start the reachability analysis from, and how do we find them?
How do we find the references held in an object?
How do we keep track of where all the pieces of memory are, so we can redeem the memory we discover is no longer in use?

Mark and sweep
Let's start out simple
The simplest way to handle allocation is to not handle it at all, but instead delegate to malloc
Our allocator keeps an array of pointers to all the pieces of memory we obtained from malloc

Mark and sweep
Each object in memory should point to some kind of type table, saying what type of object it is and which of its fields are references to other objects
Furthermore, each object needs storage for a "mark bit", to be used in reachability analysis:

    | Type Table Pointer | Mark Bit | Field 1 | Field 2 | ...

Mark and sweep
Marking is done by reachability analysis
Whenever we reach an object, if its mark bit is not set, we set it, then also mark its references
Don't re-process already-marked objects; otherwise, we'd never terminate on cyclic data structures

Mark and sweep
The sweep phase moves through the objects array, redeeming memory and clearing mark bits (see the sketch below)
If the mark bit was not set, the memory is freed - in our very simple collector, just by calling free
If the mark bit is set, we clear it, and then copy the pointer to the first free slot to the left, so future allocations will be easy
By the end, we've redeemed the memory of the unreachable, cleared all the mark bits, and can go back to running code, allocating memory, etc.

Mark and sweep
As GCs go, this is pretty easy to implement
Unfortunately, it's going to be rather slow as soon as we have any non-trivial number of objects, since...
malloc itself is rather slow
We have to consider every allocated object
We have to touch every object twice (bad for the cache)
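Here's a sketch of that sweep in C, under the same simple scheme: a global pointer array filled by malloc-based allocation and a hypothetical Header with a mark bit (the names and the MAX_OBJECTS bound are invented):

    #include <stdlib.h>

    #define MAX_OBJECTS 1024

    typedef struct {
        void *type_table;   /* which type, and which fields are references */
        int   mark;         /* set during the mark phase */
        /* object fields follow */
    } Header;

    static Header *all_objects[MAX_OBJECTS];  /* every piece we malloc'd */
    static size_t  num_objects;

    /* Sweep: free unmarked objects, clear mark bits on survivors, and
     * compact the pointer array so live entries sit at the front. */
    static void sweep(void) {
        size_t live = 0;
        for (size_t i = 0; i < num_objects; i++) {
            Header *obj = all_objects[i];
            if (obj->mark) {
                obj->mark = 0;               /* ready for the next GC run */
                all_objects[live++] = obj;   /* slide left to first free slot */
            } else {
                free(obj);                   /* unreachable: redeem the memory */
            }
        }
        num_objects = live;
    }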
Finding the roots
Things in static variables are not so bad to track down, but local variables are another matter
These may live on the system stack if you are doing some kind of JIT compilation or have a recursive interpreter
Even if they don't, and your runtime allocates its own stack frames, you may still have object references in your runtime implementation code - which, if you're in C, are on the system stack!

Conservative GC
The system stack is just an area of memory, and you are allowed to access it at random
So, we can go hunting for object references on it, using our pointer array to check whether things that look like pointers really are GC-managed pointers
We may get some false positives, but it's still safe
But walking the pointer list is O(n), each time...

Precise GC
By contrast, a precise GC always knows where all of the pointers to objects are. No guessing!
If you JIT, you need to keep stack maps
For VM implementation code, you need to track each of the local variables in scope when GC may happen
This is typically done by keeping a list of temporarily rooted things, which are considered by the GC

Temporary rooting
In an attempt to do this in a structured way, I ended up defining a macro for it:

    MVMROOT(tc, cu, {
        MVM_bytecode_unpack(tc, cu);
    });

Which is defined as:

    #define MVMROOT(tc, obj, block) do {\
        MVM_gc_root_temp_push(tc, (MVMCollectable **)&(obj)); \
        block \
        MVM_gc_root_temp_pop(tc); \
    } while (0)

Taking allocation into our own hands
GC may be mostly about deallocation, but we can do a better job of that if we handle allocation ourselves
Just use malloc to get big blocks of memory, and allocate objects within those
Heck, we can just "bump the pointer", allocating our way sequentially through the buffer! That'll be fast! (There's a sketch of this after the caches aside below.)

Ummm... not so simple!
After a GC run, we will have freed up some of the memory - but some will still be in use
Our nice memory block now resembles a tasty morsel of Swiss cheese
So, what to do? There are data structures that can help with finding memory blocks of the right size
Another popular scheme is sized pools: have a block of memory dedicated to objects that need 24 bytes, another for 32 bytes, then 40 bytes, 48 bytes, etc. Then you just chain a free list through each pool.
Naturally, all of this is slower than the trivial bump-the-pointer allocation we'd like

Aside: fun with caches
I once hunted a GC performance bug
It used conservative GC, and walked a linked list of fixed-size blocks to see if a pointer was within them
In theory, fairly cheap
In reality, the fixed-size blocks were page aligned, and some CPUs just use the least significant bits as the key into their L1 cache. The result: awful cache thrashing. Keeping a compact lookup table instead gave a 20% win.
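And here is roughly what bump-the-pointer allocation looks like, as a sketch; the BumpRegion type and the 8-byte alignment choice are mine, for illustration, not MoarVM's:

    #include <stdlib.h>

    /* One big block from malloc; we hand out consecutive slices of it,
     * so allocation is just a pointer increment and a bounds check. */
    typedef struct {
        char *start;   /* beginning of the block */
        char *next;    /* next free byte ("the pointer" we bump) */
        char *limit;   /* one past the end of the block */
    } BumpRegion;

    static void region_init(BumpRegion *r, size_t size) {
        r->start = malloc(size);
        r->next  = r->start;
        r->limit = r->start + size;
    }

    static void *region_alloc(BumpRegion *r, size_t size) {
        size = (size + 7) & ~(size_t)7;   /* keep allocations 8-byte aligned */
        if (r->next + size > r->limit)
            return NULL;                  /* full: time to run the GC */
        void *mem = r->next;
        r->next += size;                  /* the whole "bump" */
        return mem;
    }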
Compacting collection
A useful insight: if we know where all the pointers to an object are (which precise collection gives us), then we can move the object during a GC run!
We just need to be sure to update all the pointers (this is why precise GC matters)
This opens the door to numerous alternative algorithms involving compaction or copying

Compacting collection
Do bump-the-pointer allocation until the memory block is filled
Then, do the usual reachability analysis and marking, as seen in the mark-and-sweep collector
Next, we need to compute a new address for each of the living objects, such that they will all end up at the start of the block. This address mapping needs to be stored, perhaps in some kind of hash table.
We then go through the living objects. For each one, we copy it to its new address, clear the mark bit, and update any references within it.
Finally, we zero the rest of the area. (A sketch of the whole process follows the pros and cons below.)

Compacting collection: improvements
This algorithm ended up making three passes, though there are tricks to help with that
Computing new addresses during reachability analysis is tempting, but then you get them in an arbitrary order, and compaction becomes much harder
We can build a pointers-to-update list as we mark
Then do new-address computation, copying and pointer updates in a single pass

Compacting collection: pros
Cheap bump-the-pointer allocation
Objects are bunched together post-collect (good for cache hit rate on them)
In theory, careful algorithm choice means we can rearrange objects for cache locality by understanding how they reference each other
In practice, fancy approaches to this don't seem to yield more benefit than the analysis they require costs

Compacting collection: cons
We must be precise (know all the pointers)
If we pass an object to native code, then we must pin it (meaning we promise not to move it). This complicates new-address computation.
Interior pointers are tricky to support
We must make at least two passes over an object: one to mark it and look at its references, and another to move it; this is not so cache friendly
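Here's a sketch of the three-pass sliding compactor described above (the classic "Lisp-2" shape). To keep it short, the address mapping lives in a forwarding slot in each header rather than a hash table, and the layout and names are hypothetical:

    #include <stddef.h>
    #include <string.h>

    /* Objects live contiguously in one block; each header records its
     * size, mark bit, forwarding slot, and references into the block. */
    typedef struct CObj CObj;
    struct CObj {
        size_t size;        /* total size of this object in bytes */
        int    marked;
        CObj  *new_addr;    /* where this object will move to */
        size_t num_refs;
        CObj  *refs[1];     /* references held in the object */
    };

    /* Pass 1: walk the block in address order, assigning each live
     * object the next slot at the start of the block. */
    static void compute_addresses(char *block, char *top) {
        char *dest = block;
        for (char *p = block; p < top; p += ((CObj *)p)->size) {
            CObj *o = (CObj *)p;
            if (o->marked) {
                o->new_addr = (CObj *)dest;
                dest += o->size;
            }
        }
    }

    /* Pass 2: update every reference in every live object to the
     * referee's new address (roots get updated the same way). */
    static void update_references(char *block, char *top) {
        for (char *p = block; p < top; p += ((CObj *)p)->size) {
            CObj *o = (CObj *)p;
            if (!o->marked) continue;
            for (size_t i = 0; i < o->num_refs; i++)
                if (o->refs[i])
                    o->refs[i] = o->refs[i]->new_addr;
        }
    }

    /* Pass 3: slide each live object down to its new home, clearing
     * the mark bit; moves are always toward lower addresses, so the
     * yet-to-be-visited objects are never overwritten. */
    static void move_objects(char *block, char *top) {
        for (char *p = block; p < top; ) {
            CObj  *o    = (CObj *)p;
            size_t size = o->size;
            if (o->marked) {
                o->marked = 0;
                memmove(o->new_addr, o, size);
            }
            p += size;
        }
    }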
Semi-space copying
What if we could do bump-the-pointer allocation and make just one pass over the objects?
It turns out we can - at a cost
A semi-space collector uses two equally sized regions of memory
We use one of the regions to allocate new objects in, and keep allocating until it is full
For this memory block, we can use the nice, cheap bump-the-pointer allocation

Semi-space copying
The basic idea of the algorithm is to copy each of the reachable objects into the other memory space
This is a one-pass process. However, we need to store the new address for each object; the easy way is a forwarding pointer in the header:

    | Type Table Pointer | Forwarding Pointer | Field 1 | Field 2 | ...

Semi-space copying
Do reachability analysis, but instead of just marking:
Calculate a new address in the second space
Copy the object to the new address
Write the address into the header

Semi-space copying
We update pointers as we go
When we first copy an object, we immediately update the pointer we saw to it with the address we copied it to
If we see a pointer to an object that has a forwarder, then we already copied it; just update the pointer
If we see a pointer into the new memory region, it's already updated, so ignore it
(A sketch of this follows the pros and cons below.)

Semi-space copying
Once we're done, all the reachable objects have been copied into the second semi-space
We now continue allocating objects in there, using bump-the-pointer, until it is full. Then the roles flip.

Semi-space copying: pros and cons
Really quite easy to implement
We get cheap bump-the-pointer allocation
Very cache friendly: we only visit each object once, and we recently touched all the memory we copied living objects into, so it should be hot
However, we have to double the memory space, on top of the usual overhead! Surely this can't be practical? Visit ALL the heap?!
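Here's a sketch in the spirit of Cheney's copying algorithm, which realizes the one-pass copy described above. The object layout and names are hypothetical; a real collector would derive size and num_refs from the type table:

    #include <stddef.h>
    #include <string.h>

    typedef struct SObj SObj;
    struct SObj {
        void  *type_table;
        SObj  *forward;      /* NULL until copied; then the new address */
        size_t size;         /* total object size in bytes */
        size_t num_refs;
        SObj  *refs[1];
    };

    static char *to_space;    /* base of the destination semi-space */
    static char *alloc_ptr;   /* bump pointer within to-space */

    /* Copy one object into to-space (if not already copied) and
     * return its new address, so callers can update their pointers. */
    static SObj *copy(SObj *o) {
        if (o == NULL)
            return NULL;
        if (o->forward)
            return o->forward;           /* already copied: reuse forwarder */
        SObj *new_addr = (SObj *)alloc_ptr;
        memcpy(new_addr, o, o->size);    /* bump-allocate the copy */
        alloc_ptr += o->size;
        o->forward = new_addr;           /* leave a forwarder behind */
        return new_addr;
    }

    /* Collect: copy the roots, then scan to-space left to right,
     * copying whatever the copied objects reference - one pass. */
    static void collect(SObj **roots, size_t num_roots) {
        alloc_ptr = to_space;
        char *scan = to_space;
        for (size_t i = 0; i < num_roots; i++)
            roots[i] = copy(roots[i]);
        while (scan < alloc_ptr) {
            SObj *o = (SObj *)scan;
            for (size_t i = 0; i < o->num_refs; i++)
                o->refs[i] = copy(o->refs[i]);
            scan += o->size;
        }
        /* afterwards, the two semi-spaces swap roles */
    }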
Generational collection
Most objects don't last long. They are allocated, used for a short amount of time, and then become unreferenced. They don't survive a single GC run.
Most objects that survive 1-2 GC runs will likely also survive quite a few more
This is the generational hypothesis: most objects are either short-lived or long-lived
Additionally, long-lived objects are often mutated less, whereas short-lived ones are in active use, and so are mutated lots

Generational collection
A generational collector breaks objects up into at least two generations (2-3 is the norm)
Objects are allocated in the young generation, sometimes known as the nursery
If they survive a certain number of collections, they are promoted to the old generation
The trick is that we only consider the young generation in most garbage collection runs. In MoarVM, marking simply skips old-generation objects during a nursery-only collection:

    item_gen2 = item->flags & MVM_CF_SECOND_GEN;
    if (item_gen2 && collecting == MVMGCGenerations_Nursery)
        continue;

Generational collection
The thing that makes this difficult is when the only remaining reference to a young-generation object is from an old-generation object
If we're only looking at young objects and ignoring old (gen-2) ones, we'll miss it!

Generational collection
To cope with this, we use a write barrier
Every time we write a pointer to a young object into an old object, we put the old object into a remembered set, and treat it as a root:

    #define MVM_WB(tc, update_root, referenced) \
    { \
        MVMCollectable *u = (MVMCollectable *)update_root; \
        MVMCollectable *r = (MVMCollectable *)referenced; \
        if ((u->flags & MVM_CF_SECOND_GEN) && r && \
                !(r->flags & MVM_CF_SECOND_GEN)) \
            MVM_gc_write_barrier_hit(tc, u); \
    }

Generational collection
Isn't the write barrier terribly costly?!
No, not really
It uses pointers we'd already have in CPU registers and memory we'd have in the cache anyway
It fits well with superscalar CPU architectures
And it comes out vastly cheaper than having to consider the entire heap every collection!
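The key to any write barrier scheme is that every pointer store into an object's field goes through it. A hypothetical store helper (not MoarVM's actual API; the signature is invented for illustration) might wrap the macro like this:

    /* All code that writes an object reference into another object's
     * field calls this instead of writing directly, so that no
     * old-to-young store is ever missed. */
    static void write_ref(MVMThreadContext *tc, MVMObject *holder,
                          MVMObject **field_in_holder, MVMObject *value) {
        MVM_WB(tc, holder, value);   /* maybe adds holder to remembered set */
        *field_in_holder = value;    /* the actual store */
    }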
Concurrent collection
One problem with all of this is that running the GC involves a reasonable amount of work
If you are building a graphical application, or something that needs to feel very responsive to a user, the pauses can become a UX issue
Therefore, a range of concurrent GC algorithms exist, which run the GC at the same time as the program, typically on another thread

Concurrent collection: terrifying
We'll not cover concurrent GC algorithms in this session, partly due to lack of time, and partly for our collective sanity
In short, they are difficult to implement
Read barriers may be involved. That is, every time you read a memory address, you may need to check that the object didn't move underneath you!
Interesting, but a whole other talk

The pause/throughput trade-off
While a concurrent GC can reduce or practically eliminate pause time, the extra bookkeeping required to implement it comes at a cost
The .NET CLR actually comes with two collectors: a client one and a server one
The client one is a concurrent collector. The server one is not.
Why? Because on a server you typically care about overall throughput, not keeping up a certain frame rate

Parallel collection
Actually, much easier
Still stop all threads to do the GC run; just parallelize the work
Many GC algorithms parallelize quite reasonably
Good enough for now, though once we need to deal with 16+ cores, the synchronization overhead may be a killer - and may force us to go concurrent anyway

So what did I choose?
MoarVM has a generational collector
The young objects are managed by a semi-space copying collector, for fast allocation/cleanup
The old objects live in sized pools, with a free list chained through them
Once in the old generation, objects never move
Pinning = allocating right away in the old generation

Takeaways
GCs do all kinds of things behind the scenes
You'll probably never need to implement one, but performance programming in a language with a GC means understanding roughly what it's doing
Also, the JVM offers a choice of collectors, and knowing how each of them basically works may help with choosing an appropriate one
In reality, benchmarking will help you much more

Things to remember
Allocations make work. Reducing allocations helps. C# programmers, learn about when to use struct!
Allocating lots of large objects may also have a negative impact
Repeated string concatenation or regular collection resizing can be pain points
Since VMs tend to assume the generational hypothesis, it's now something of a performance rule: avoid a mid-life crisis; have short-lived and long-lived objects, but not medium-lived ones

Thank you! Hunt me down...
Email: jonathan@edument.se
Twitter: @jnthnwrthngtn
Questions?

P.S. Think I'd be fun to work with? Edument is hiring. Not for writing GCs... but if you like teaching/mentoring and building quality stuff, come and say hi. kthx.