GC Writer's Guide for Open Runtime Environment
Version 1.5 of September 09, 2000
Copyright (C)  2000 Intel Corporation.  All rights reserved.
Rick Hudson (Document Owner)

Introduction

Garbage collectors for languages such as Lisp, Smalltalk, Modula-3,
and ML have existed for many years. While each of these languages had
a following, it was not until recently that garbage collection became
part of mainstream run-time environments. This document describes
Intel Microprocessor Research Laboratory's implementation of
garbage collection for the Intel Open Runtime Platform (ORP). It
draws considerable intellectual heritage from our implementation of
the Intel high performance ORP on IA32 architecture, as well as our
work on the IA64 implementation. This guide serves as the basis for
understanding the IA32 implementation and looks forward to the IA64
implementation.

We have arranged the guide as follows: Section 1 describes how we
partition and lay out memory. Section 2 discusses our implementation
of a write barrier. The read barrier is discussed in Section
3. Section 4 discusses the allocation sequence including code that can
be used to zero out memory. Section 5 discusses scanning
objects. Section 6 briefly describes the Train Algorithm that we use
to collect the Mature Object Space (MOS).  Section 7 describes a
general mechanism that can be used to monitor information that the GC
as well as other parts of the system can use to improve
performance. Section 8 describes various knobs that can be used when
the GC code is compiled to set up the memory layout and other policy
related information. Finally the interface between the GC and the ORP
is described in Section 9.

Memory Organization 

We divide memory into three areas: collected, traced, and
untraced. The collected area holds all objects that are allocated and
reclaimed by our collector. The traced area may contain pointers that
refer to collected objects. These pointers must be examined and
updated by the collector, but the collector does not otherwise manage
the traced area. The traced area may include statically allocated
data, the runtime stacks and hardware registers, and ORP managed
(explicit allocation/deallocation) areas. The untraced area holds data
that is neither collected nor examined by the collector. This includes
application code that uses JNI. The untraced area will be ignored for
the rest of this paper.

The collected area, sometimes referred to as the heap, is divided into
two areas. Young object space (YOS) is the youngest, and holds the
most recently allocated objects.  As objects survive repeated
scavenges, they are promoted to the older Mature Object Space (MOS).
MOS is typically scavenged using a train algorithm. This allows us to
focus collection activity on young objects, which typically die
(become garbage) more rapidly than older objects. The number of
generations in YOS can be easily reconfigured when the garbage
collector is recompiled.

Each YOS generation is divided into one or more steps. Step 1 is the
youngest step in a generation. Typically, as a generation is
scavenged, surviving objects are moved from their current step to the
next older step, and objects in the oldest step of a generation are
promoted into the youngest step in the next generation. [Shaw, 1988]
describes a similar scheme called a bucket brigade.

Objects promoted into the oldest, or train, generation are promoted
into some train other than the oldest train. Currently we promote into
the second oldest train but this policy can be easily changed when the
garbage collector is recompiled.

In our scheme, all the survivors of a step move together. This avoids
attaching age information to individual objects. Rather, age is
encoded in the step. Our technique saves space for age counters in the
objects and time for manipulating the age counters. The bucket brigade
and similar schemes also eliminate age counters, but we avoid an
additional overhead: since all objects of a step move together, we do
not make a promotion decision for each individual object. Instead we
only follow a previously determined plan that indicates where to place
survivors of each step of each generation being scavenged. The step
approach allows age information to be as fine grained or coarse as
desired. The current implementation allows the number of steps in each
generation to be specified when the garbage collector is created.

Each step is stored as a number of fixed size blocks that need not be
adjacent. A block consists of 2^i bytes aligned on a 2^i-byte
boundary, which allows the generation of an object to be computed
efficiently from the object's address (or in fact any address in the
object). This avoids the need for generation or step tags in objects,
which helps keep the garbage collector independent of language
implementation concerns such as object formats, as well as reducing
space and time costs in maintaining such information in each object.

Other schemes have been suggested that attempt to maintain objects in
the address order they were created. This approach has the advantage
of allowing a simple compare to determine which object is younger. We
have chosen not to use this scheme, preferring the flexibility of being
able to allocate a block into whatever part of the collected area is
convenient.

Blocks may be added to a step at will. The price we pay for such
flexibility is fragmentation when objects do not fill blocks
completely. In the youngest generation we avoid fragmentation by
having the plan specify that contiguous blocks be used for step 1,
with survivors being promoted into another step that uses fixed-size
blocks. The contiguous nursery area also reduces the number of page
traps if we use an access protected guard page to signal memory
overflow. Rather than one trap per block, we incur one trap per
scavenge. Currently we do not use a page trap scheme. When a
generation is scavenged, the collector allocates only enough blocks to
hold the surviving objects. When a scavenge is complete, the original
blocks can be reused immediately.

Zorn [1990] claims that since the youngest generation needs to fit in
primary memory, "the mark and sweep algorithm requires less
memory because each generation is half the size of copying algorithm
generations." An important contribution of our scheme is that it
immediately reuses freed blocks following a scavenge. This means that
the amount of space needed for the youngest generation consists of the
area used for allocation plus the number of blocks needed to hold
objects surviving a scavenge. Depending on the length of time between
scavenges, Zorn [1990] claims a survival rate of between 3% and 24% of
objects allocated since the last scavenge. Therefore, our scheme
requires between 52% and 62% of the space needed by the traditional
stop-and-copy schemes described by Zorn. The key to our
advantage is that using fixed blocks eliminates the need for
contiguous memory areas.

Write barrier

We use the Sobalvarro-Chambers-Hölzle-Hosking-Hudson algorithm to
implement write barriers. This is also referred to as card marking
with remembered sets. The point of the write barrier is to avoid
having to scan the entire older generation by recording where there
have been pointer modifications. A card is an aligned 2^k-byte
unit. Thus, one determines the card number corresponding to a modified
location simply by shifting the address of the location right by k
bits. While similar to dirty page schemes used by operating systems,
card marking typically tracks only pointer writes (as opposed to all
writes). The first proposed card marking algorithms maintained one bit
per card, but, as suggested by later researchers, we use a byte to
record whether a card is marked. This reduces the card marking code
from read, modify a bit, and store, back to simply storing a byte. The
entire resulting sequence is then: shift the address right k bits, add
the virtual base of the card table (maintained in a dedicated
register), and store a 0xff byte. Some researchers have suggested
using the value 0 to indicate a modified card, but since most OSes
give us freshly allocated memory already zeroed, keeping 0 as the
unmodified value avoids having to initialize the card table to 0xff.

Using the virtual base instead of the actual base eliminates the need
to subtract the base of the heap from an object reference before
shifting the reference to calculate the card index. The
straightforward code would be as follows:

card_table_base[(object_ref - heap_base) >> bits_to_shift] = MARK;

If, however, we precompute

virtual_card_table_base = card_table_base - (heap_base >> bits_to_shift)

then the card marking logic becomes the less expensive

virtual_card_table_base[object_ref >> bits_to_shift] = MARK;

When doing a store, the IA32 provides instructions that combine a base
address with an offset. Since the compiler never needs to materialize
a pointer to the actual slot being updated (except when indexing
arrays with a non-constant index), the card marking code marks the
card corresponding to the base of the object instead of the slot being
modified. This means that the scanning code needs to scan all objects
starting in each marked card. If an object overlaps several cards then
all these cards need to be scanned. Urs Hölzle was the first to
discover this optimization.

Objects can overlap cards and the card mark only indicates which card
the start of an object lies in. This requires that given a card we
must find the start of the first object in the card. This is trivial
if the card is the first one in a block. It is also trivial if we have
the start of the object that overlaps into the card: with that object
we can simply scan forward to the first object in the card that
immediately follows it. To help locate the last object in a
card, each card has associated with it the location of the last object
allocated in the card. This is typically maintained in a last object
array. Whenever an object is dropped into a card, this array is
updated. As multiple objects are dropped into the same card, the
value of the last object in the card is overwritten. Given a card, we
scan previous last objects until we find the object that overlaps into
the marked card. Since nurseries do not require card marking, the
logic to maintain the last object array is not part of the allocation
routine.

I will note that an alternative would be to have the Cheney scan note
the first object in each card as it does the scan. This might greatly
simplify the card scanning logic at the expense of complicating the
Cheney scan. We have not run the experiments to see if this is a
worthwhile tradeoff.

When we collect and scan cards, we look for interesting slots. An
interesting slot is one pointing from an older generation to a younger
one. If an interesting slot contains a pointer into a generation
currently being collected, then the slot is processed. If an
interesting slot contains a pointer into a generation not being
collected, then the address of the slot is inserted into the
remembered set of the generation referred to, for use in future
collections. Likewise, as we process each slot in the new copies of
objects during collection, if a slot contains an interesting pointer
(after being forwarded), we insert it into the target generation's
remembered set.

Thus, we avoid repeatedly scanning the cards at each collection by
summarizing interesting slots into remembered sets (and clearing the
card's mark). Hosking and Hudson showed that this hybrid
card-marking and remembered-set scheme is more efficient than either
remembered sets or card marking alone.

Another approach that we considered was to have the OS allow us access
to the OS's page dirty bits. There are at least two problems with this
approach. First, modern architectures and operating systems typically
use pages larger than what we believe is optimal for cards. We use a
card size of only 256 bytes, which minimizes the time spent scanning
cards. Second, OS collection of page dirty information or reflection
of page traps to user code typically has high overhead. This need not
be the case but is generally beyond our control.

When we scan the card table we can safely skip the parts that are not
in use or that correspond to areas being collected. In addition we
have found that since the card table is a sparse array, loading full
words and checking for zero is faster than loading and checking a
single byte at a time.
 
Object Scanning

The garbage collector needs to determine the location of all pointers
within an object in order to scan the object. Scanning is needed in
three places in our garbage collector.  The card marking logic scans
objects. When the GC moves an object the Cheney scan logic scans
objects. Finally, we must use scanning logic to find pointer slots
within objects in fixed (or pinned) object space. Since scanning is a
large part of what the GC does, it is important that the mechanism
that locates pointers within heap objects is efficient. This section
discusses the design of a mechanism that scans objects efficiently.

The interface used for scanning an object consists of two
routines. The first routine initializes the scanner. The second
routine passes back a pointer to a slot within an object holding a
reference. Successive calls return successive locations, and when all
slots have been enumerated, it returns the null pointer. The
underlying data structure available to the second routine is built by
the class loader and provided to the garbage collector. It consists of
a zero-terminated array of offsets to each pointer slot in an
object. A pointer to this array is stored in the class structure in a
place known to the GC.  To materialize all the slots in an object the
GC iterates over the offset array adding each offset to the object
base, as illustrated immediately below:

Assume that an object has reference slots at offsets 16 and 24; the
gcInfo structure would then consist of 16, 24, 0.

 
           Header -----  
    +0      Vtable     |       
    +4      Double     |
    +8                 |--> gcInfo ->    16
   +12      Int                          24 
   +16      Reference                     0           
   +24      Reference 
 

Assuming that p_obj is a pointer to an object to be scanned, the
following sequence iterates over the slots in the object.

unsigned int *offset_scanner = init_object_scanner(p_obj);
lang_Object **pp_obj;
while ((pp_obj = p_get_ref(offset_scanner, p_obj)) != NULL) {
    // Move the scanner to the next reference.
    offset_scanner = p_next_ref(offset_scanner);
    // ... perform required actions using pp_obj ...
}

The array mechanism is similar except it does not need the help of a
pointer offset array. The mechanism steps through the array in reverse
order. This is useful since it means that the check for termination is
a zero-compare as opposed to comparison against some other limit. This
means that only the initialization code needs to determine the array
size, and the nextReference code just checks for an offset of 0.

unsigned int offset = init_array_scanner(p_obj);
lang_Object **pp_target_obj;
while ((pp_target_obj = p_get_array_ref(p_obj, offset)) != NULL) {
    offset = next_array_ref(offset);
    // ... perform required actions using pp_target_obj ...
}

Clearing allocation areas

Newly allocated objects need to be type safe, and good language design
will arrange for a zero-filled field to be type safe. The ORP GC
ensures that newly created objects will have all the user visible
fields cleared. But when should we clear the bytes?

There seem to be three possible times to clear objects. When the
object is allocated is one obvious choice. Unfortunately, the time to
set up for a memset can dominate the time to clear objects when the
objects are small. Rather than calling a generic routine such as
memset, we could put stores of zero inline, but that will put more
pressure on the instruction cache. Further, clearing larger areas at
once may exhibit economies of scale. This suggests a second time to
clear objects, namely clearing the entire space during system
initialization. This might be done via the virtual memory abstraction
offered by an operating system, i.e., some operating systems clear
memory before they pass it to applications such as the ORP.  To do
clearing this way, the collector would free virtual memory associated
with evacuated regions after each garbage collection, and have the OS
zero fresh pages on demand. Since this approach requires context
switching into the operating system, it may increase the latency of
garbage collection. This leads to our third possible time for clearing
memory: clear each nursery just before the first object is allocated
in it.

Experiments on the IA32 architecture indicate that clearing the
nurseries just prior to allocation of the first object is the best
policy, since we provide one nursery at a time and the latency will be
the time it takes to clear a single nursery. The fastest code that can
be used to clear large blocks of memory is typically available from
the OS.

Allocating

The allocation mechanism is designed for the common case where there
is sufficient room to allocate the object in a nursery and the
allocation of the object involves no special consideration such as
specific alignment or being pinned. A pinned object is one that is
allocated and then not moved by the collector. Diwan and Tarditi
showed that the cost of allocation is not determined solely by the
instructions in the short code sequence that does the allocation. It
is important to consider more global issues such as the constraints
these sequences place on highly optimized code. It is our (unmeasured)
belief that the majority of allocation requests for a typical
application will be for objects not needing pinning or special
alignment. The issues we face in designing a fast-case allocation
sequence fall into two categories: avoiding locking, and optimizing
the code sequence while placing minimal constraints on the code
surrounding it.

The garbage collector provides a simple allocator, gc_malloc_or_null,
that places minimal constraints on the caller as well as the GC. In
particular the caller does not need to provide the GC with any
guarantees about the state of the system.  Our design introduces the
novel notion of permitting allocation at places that are not GC
safe. In other words the caller does not need to "get its house in
order" each time it allocates an object. The point is that since
there are fewer constraints on state at allocation points, the
optimizer will be less constrained and can thus produce better
code. But gc_malloc_or_null is not the whole story, of course.

As part of an engineering tradeoff gc_malloc_or_null is allowed to
return NULL if it is unable to allocate an object without invoking a
GC.  This means that the caller must be prepared to receive NULL as an
answer. If NULL is returned, the caller should use the fully general
routine gc_malloc to allocate space. Locations that call gc_malloc
must be GC safe points. Thus the usual allocation sequence will
consist of a call to gc_malloc_or_null followed by a test for NULL and
a conditional branch to rarely executed code that establishes GC
safety and then calls gc_malloc. The important thing to notice is that
with this interface the normal case operations place minimal
constraints on the caller.  In this design, if it proves to be
worthwhile, one could inline gc_malloc_or_null.

The ORP, typically as part of the class loader, is responsible for
determining the size of an object and whether there are any
constraints on the allocation of an object. If there are constraints
then the class loader arranges for the size of the object to have its
high bit set. This is intended to cause the simple gc_malloc_or_null
allocator to give up and return NULL, while the gc_malloc routine must
handle the general case.

This interface allows gc_malloc_or_null to fold the check for
constraints into the heap limit check: it simply takes the free
pointer, adds the unsigned size of the object to it, and compares the
result with the allocation area limit. If there does not appear to be
enough room, it returns NULL.

Evaluating this approach is more difficult than simply counting the
cycles required by the allocation sequence. That said, it does not
relieve us of the need to minimize the number of cycles used to
allocate an object.

The current IA32 implementation gives a separate nursery (allocation
area) to each thread. A pool of nurseries is maintained, and nurseries
are passed out to threads as required. When a nursery is required but
none is available, a garbage collection takes place.

Collection Algorithms

For a description of the train algorithm the reader should refer to
Hudson, R. L. and Moss, J. E. B. Incremental collection of mature
objects. In Proceedings of the International Workshop on Memory
Management, number 637 in Lecture Notes in Computer Science (LNCS),
St. Malo, France, Sept. 17-19, 1992, pp. 388-403. Springer-Verlag.

In the years since this paper was published there has been an
unpublished improvement. In the above paper the complexity of the
algorithm was stated to be O(n^2). This complexity can be improved to
O(n) simply by adding a marking phase to the collection of each
train. The marking phase starts from all pointers into the train and
marks each transitively reachable object with the train from which it
is reachable. Such a table could be kept on the side without having
to change the object layout. Once this phase has completed, the cars
can be collected one at a time: all unreachable objects in the train
are discarded, and reachable objects are moved immediately out of the
train into the referencing train. Notice that, unlike in the original
train algorithm, objects reachable only from cars within the train
are discarded immediately. This avoids the worst-case scenario that
led to the O(n^2) complexity. Until we encounter an application that
exhibits the O(n^2) behavior, we will not implement this improvement.
The important thing to note is that the problem has been solved and
its implementation is now a programmer time and resource issue.

Stack Scanning and Root Set Enumeration

A previous paper, [Amer Diwan, J. Eliot B. Moss, and Richard
L. Hudson.  Compiler support for garbage collection in a statically
typed language. In Conference on Programming Language Design and
Implementation, pages 273-282, San Francisco, California, June
1992. SIGPLAN, ACM press.] lays out basic techniques that can be used
to enumerate pointers on the stack. Later papers [Stichnoth,?? In
Conference on Programming Language Design and Implementation,
Montreal, June 1998. SIGPLAN, ACM press.] refined these
approaches. Our work closely follows these papers.

Since how stacks are scanned and how the root set is enumerated are
on the ORP and JIT side of the world, they will not be discussed here
beyond saying that accurate enumeration is required.

Moving Threads to a GC Safepoint

How the GC thread moves running application threads to a safe
point is an area that is not well covered in the
literature. Halting threads while avoiding latency and race conditions
is an inherently difficult programming problem. What follows is a
description of the algorithm we chose for bringing all threads to a GC
safepoint. The GC thread is the thread that the GC is running in. The
other application threads are known as target threads. One job of the
GC thread is to make sure that each target thread is at a GC
safe point so that the GC thread can enumerate the pointers in each
target thread's stack.

The GC thread can observe what state a target thread is in without
suspending the thread. One state that a thread can be in is running
JIT compiled code. If a thread is running JIT compiled code then the
GC thread suspends it and determines if it is at a GC safe-point. Our
JITs are designed such that a thread running JIT compiled code is
statistically likely to be at a GC safe-point. There are
a few rare corner cases where multiple instructions need to be
performed atomically with respect to the GC and therefore are not
GC-safe. One such sequence is during card marking. If we write to a
card and then mark the card and a GC happens between the two
instructions, the current GC will not know about the write. If on the
other hand we mark the card first and then write the value, the
current GC will scan the old value and the next GC will not
be aware of the newly written value. Fortunately such occurrences are
statistically rare and in the unlikely chance that the target thread
is suspended in such a sequence restarting the thread should move it
to a GC safe-point. Another more complicated method would be to
recognize that we were in such a sequence and have the GC thread
provide compensation code. Such code would mark the correct card and
adjust the program counter of the target thread and continue. We chose
the simpler implementation of resuming the thread and then repeating
the GC safe-point algorithm. In any case, once the thread has been
suspended at a GC safe-point the GC thread will enumerate the target
thread's stack.

The GC thread can observe if a target thread is in native code. If the
target thread is in native code, then it can be in one of three
states. If the thread's roots cannot be enumerated, then the
thread is in the GC disabled state. If the GC observes this state it
lets the thread continue to run until it is no longer in the GC
disabled state. By convention, native code is in a GC disabled state
for only short periods of time so the GC thread will never be delayed
long waiting for the target's thread state to change.

If a target thread is in native code and it is possible to enumerate
the target thread's roots, then the thread is in either the GC enabled
state or the GC enabled will block state. Without suspending the
target thread, the GC thread will atomically move a target thread from
the GC enabled state to the GC enabled will block state. The GC enabled
will block state allows the code to run until it is about to
transition into another state. Prior to transitioning into another
state, the target thread will wait for a thread specific event
signaled by the GC thread. The GC thread can enumerate a target thread
that is in the GC enabled will block state even as the target thread
continues to run.

Restarting Threads 

Target threads that are suspended in the running JIT compiled code
state are simply resumed by the GC thread. Target threads that are
suspended in the GC enabled will block state have a thread specific
event signaled, allowing them to continue.

Threads that are in the GC enabled will block state but have not been
suspended are left in that state; since the GC thread has signaled
the resume event, they will be able to transition into other states.
Because the GC thread resets the resume event at the start of a GC, a
target thread in the GC enabled will block state will block waiting
for the GC to signal the resume event.

Clues about how we debug garbage collectors

Debugging garbage collectors is one of the most challenging jobs any
of us has ever done. We have developed several tools that have helped
us understand problems.  The first is a good interface that fully
explains the contract between the GC and the VM/JIT. A clear
understanding and agreement on this contract is important.

The second most valuable tool we have is the ability to trace an
object from when it is born to when it dies. This was implemented by
placing debugging code in every interesting routine in the GC. This
debugging code calls a routine called gc_trace with every object the
routine encounters along with an informative statement about which
routine gc_trace is being called from. When referential integrity is
broken, we have been able to trace the broken object and determine why
it was broken. This has been of great help to not only the GC writers
but also the VM and JIT writers attempting to debug the system.

Heap walking routines have also been very valuable. These routines
iterate over all the objects in the heap, executing whatever ad hoc
queries are needed.

The Interface  

The interface into the GC is defined by the routines listed in the
Appendix. There are places where, if the JIT used a call interface,
performance would suffer. For example, the card marking write barrier
can be implemented by the JIT by simply inlining the code. This
is a performance tradeoff and in no way a replacement for a well
defined high level interface such as provided in the Appendix.

The interface is divided into several sections. These include routines
to support the initialization and termination of GC, routines to
support finalization of objects, routines to support querying the
layout of objects, routines to support barriers such as write
barriers, routines to support building the root set during a
collection, routines to support the allocation and initialization of
objects, routines to support soft, weak, and phantom reference
objects, routines to support threads, routines to support the
functionality required by the various language specifications, and
finally routines to support pinned objects. We have also defined some
unsupported routines that can be used for debugging the garbage
collector.

The interface attempts to reduce, to a great extent, the knowledge it
has about the structures in the heap. It does this by providing a
partially revealed type for objects as well as their virtual table
constructs. This is not enforced by the code but is there as an aid if
we ever need to generate a dynamically loaded library for the garbage
collector.

The actual interface is embodied in the file gc_for_orp.h.

