Adding native thread support to SBCL - or n things every programmer should know about signal handling

Daniel Barlow
metacircles Ltd
dan@metacircles.com

July 14, 2003

Abstract

Threads (lightweight processes sharing a common address space) are now a near-standard technique used in many large applications, and supported - in varying ways and to different extents - in many languages and most operating systems. We added thread support to the Steel Bank Common Lisp environment on Linux platforms, using the clone() syscall: this presents interesting challenges.

• The de-facto standard for CL multithreaded programming was based on special-purpose Lisp hardware (where, for example, user hooks could be installed directly into the scheduler) and is less suited to today's mainstream computing systems.

• Lisp's introspective abilities and runtime environment require OS support for things like writing to our own instruction stream, setting breakpoints on ourselves, and finding the register contents and faulting address in a SIGSEGV handler. These features are often provided, more seldom tested, and hardly ever documented.

• The primary interface to the Linux kernel is through glibc, and is defined in terms of the glibc header files. Accessing this functionality from languages other than C is sometimes harder than it needs to be.

Topics include thread-local storage, garbage collection, mmap, breakpoints and ptrace, signal handling, floating point, terminal handling, dynamic linking, and atomic operations. We conclude with a look at some tools (cparse, SWIG, FFIGEN) that help in this area, and some recommendations for kernel and library authors about how they can make life easier for users of non-C languages.

1 The challenge

Traditionally, Lisp and Unix were very different cultures. Unix vendors of the 80s produced workstations that, although expensive by PC standards, actually had a very attractive price/performance ratio. Unix is based around a philosophy of multiple independent processes which communicate by sending streams of bytes to each other - typically using sockets, files or pipes.

Lisp companies, on the other hand, produced fiendishly expensive single-user workstations with Lisp-based operating systems and hardware support for fast type and bounds checking. Because these systems were single-user, and the hardware protected against most kinds of accidental damage, Lisp systems used to allow each task to see all the others: no marshalling overhead, easy object sharing, no memory protection. They called their tasks "processes", but in today's terms they're more like threads.

Unfortunately for Lisp machine vendors, "fiendishly expensive" and "custom hardware" could mean only one thing. The Lisp hardware customers were mostly the AI companies: the combined effect of Moore's Law making general-purpose hardware a lot faster, and the bottom falling out of the AI market [1], mostly killed them.

Thanks to the ongoing march of technology, in this decade we can put a fully functional Lisp onto a standard hardware platform, and it's not even the biggest process - Mozilla or OpenOffice.org comfortably exceed the resource requirements of Unix-based Lisps. Stock hardware has the CPU power, and standard OSes such as Linux can - with a small amount of forcing - make good software platforms for a Lisp.

2 The language

Why Lisp, anyway? If you ask ten different Lisp programmers you'll get ten different answers, but some of the common traits are:

• Interactive: like the interpreted BASICs of old, Lisp implementations usually start up in a state where expressions can be typed into them to be executed immediately. This makes testing and prototyping much faster. Note that, unlike interpreted BASICs, any self-respecting Lisp also integrates a compiler into this environment.

• Extensible: the parentheses that plague so many Lisp newbies are actually a feature. Because the printed syntax corresponds so closely to the parse trees, macros make it easy to rewrite source forms to extend the syntax of the language in exciting ways. This eases the creation of domain-specific languages, making bottom-up programming a lot more fun.

• Multiparadigm: flexible syntax has meant that Lisp has often been used by language researchers experimenting with new language paradigms. Out of the box, Common Lisp has support for programming in imperative, functional, or OO style, or some combination of these. Nor is it hard to write, for example, a Prolog interpreter in Lisp, if the problem is best expressed in declarative style, or to add support for lazy evaluation.

• Standard: Common Lisp was standardised by ANSI in 1994 (making it the first ANSI-standard object-orientated language, incidentally). The standard was mostly a formalisation of what existing Lisps were already doing, so it tends to be pragmatic rather than idealistic.

Steel Bank Common Lisp (SBCL) is a native-code Common Lisp environment that runs on a number of Unix platforms. It started life as a fork of CMU Common Lisp, which was originally part of the Spice project at CMU - so its lineage (and some of its code) can be traced back for decades. Over the course of the last nine months we added support for Linux kernel threads to SBCL: in this paper we look generally at the places where we have to do unusual things for Unix, but with especial emphasis on features we need for threads.

[1] Comparison with the dot-com bust is left as an exercise for the reader.
2.1 Multiprocessing in Lisp

The Common Lisp standard does not include a threading interface. Although the Lisp hardware systems of old had Lisp functions to inspect and manipulate multiple tasks, other Lisp implementations of the day were running on Unix or on microcomputers, so they didn't usually support threading and would not have voted for a change that required them to. Regardless, the interface has since become something of a de facto standard for Lisps that did support threads, usually using userland threads with a scheduler written in Lisp. However, it makes some assumptions about the thread implementation that make it a poor fit for kernel threads.

For example, without-interrupts is a macro that wraps around code that must be executed without interruption. On a Lisp machine, it really does turn interrupts off. On Unix, it masks signals. This means that the scheduler won't run for the duration, so Lisp code often called this to disable multitasking while doing some operation that must be atomic. Needless to say, this is slow and difficult to arrange when the kernel is doing the scheduling - and bad news for scalability anyway.

process-wait is another problematic function: it takes a predicate function as a parameter, and stops the current process until the predicate returns true. With a Lisp-based scheduler this is pretty simple to implement, but with an OS scheduler we're going to have to call the predicate every time we switch to this process, then switch away again if it's false. Although we could reduce the system load by adding a delay, this is still not a sensible use of cycles.

The problem is not that this function exists - after all, sometimes you need to do this - the problem is that there are no other functions in the standard interface to make a thread wait for an event, so people end up writing their own queue implementations on top of it. If we provided - and encouraged people to use - a queue implementation that didn't do all these extra context switches to processes that we already know aren't ready to run, things would go much more smoothly.

Given these problems, and with the intention of creating an interface that looks a bit more familiar to today's programmers, we mostly ignored the Lispm interface. We're calling our threads threads, not processes. Threads are identified by ids, and we have functions like make-thread, current-thread-id and destroy-thread to manipulate them.

It's likely that a Lispm-like compatibility layer will be added to ease porting from other Lisps, but even in that layer it's unlikely that we'll ever implement without-interrupts.

3 The platform

Linux: "because it's there". Our interest is in creating a free Lisp development environment, so a proprietary OS would be a non-starter. The alternative, to implement an OS from scratch, would be a seriously mammoth task, and to implement one from scratch with the same attention to performance and scalability as has been lavished on Linux would probably take as long as Linux did. Why reinvent the wheel when there's a perfectly good antigravity belt there for the taking?

In this section, we look at some of the differences between SBCL and "normal" Unix programs.

3.1 Calling convention

In C, functions accept a fixed number of arguments, each of a fixed type. A limited degree of support for variable numbers of arguments is available with stdarg. There is a single return value.

In Lisp, we can (in general) pass any type of object to any function, and additionally we have "keyword", "optional" and "rest" arguments [2], and multiple return values. It's somewhere between difficult and impossible to shoehorn all of this into a C-compatible calling convention, and SBCL doesn't even try. Calls to C functions therefore involve a glue layer which translates from one argument-passing convention to the other, so they are probably best avoided in time-critical inner loops.

[2] An optional argument is declared with a name in the parameter list, which is bound to a default value if unsupplied. A rest argument is a name to which a list of all the unnamed arguments is bound.
3.2 Preprocessing

Even if we did use the C calling convention, we'd still have a problem: parts of a C language API are specified as preprocessor macros. To inline C code into our Lisp functions would involve doing things to gcc that rms would really not be quite happy with, so we have to write wrapper functions for these macros, usually in C, and suffer the additional function-call overhead.

In the particular context of adding thread support, this presents a particular problem for thread-local data access. As we'll see later, Lisp needs a lot of thread-local data, and having to call into C on every access is a good deal slower than we'd really like it to be.

3.3 Signal handling and debugging

It is unusual for a Unix program to continue running for very long after receiving SIGSEGV, SIGBUS, or SIGTRAP. It's not unexpected for it to stop after receiving SIGFPE (e.g. on a divide-by-zero error).

However, dropping back to the shell prompt is usually undesirable in an interactive system: post-mortem debugging is generally far less productive than in-process debugging, where all the state (including open files, network connections, etc.) is available, and besides, we'd rather not kill off any of the other threads that are running unrelated jobs. So we really want to catch these signals and run error handlers that can fix the problem and continue.

(Indeed, sometimes we intentionally ask for these signals in normal operation. For example, the garbage collector tracks writes to old pages by mapping them read-only and catching SIGSEGV when they're written to.)

To diagnose and recover effectively we need more than just the signal number. We declare all our signal handlers using sigaction() with the SA_SIGINFO flag. This causes our handler to be called with three arguments: the second is a siginfo structure, and the third is usually a ucontext structure [3] containing a wealth of information, including the program counter at the time of signalling, the register contents, and other per-signal information - for example, in the case of a SIGSEGV, the faulting address. Much of this information is used internally in SBCL or passed on to the debugger.

Libraries that install signal handlers are, therefore, bad news. Even if their handlers still work in our runtime, and even if they then call the previously installed handler instead of bailing out, they rarely do it properly with correct siginfo and ucontext - and almost certainly don't think to check whether we were using sigaltstack() to get an alternate signal stack. So at best our handlers will not know why they're running, and more likely we'll get a core dump with a fairly meaningless backtrace, or an infinite loop of segmentation faults. Libraries that call exit() are bad news in any circumstances.

SBCL users have seen this problem in practice with SDL, for example. It installs a SIGSEGV handler, presumably to free up resources and return the console to text mode if necessary. This is bad, because we use SIGSEGV for our page write protection. When SDL quits it restores the old handler, but it doesn't restore it with the SA_SIGINFO flag, so it still loses [4].

3.3.1 Introspection

SBCL includes an interactive compiler - in fact, there is no interpreter; everything that is evaluated is compiled first. When the user asks to evaluate a function, we compile it directly into memory, and then jump to it. This presents cache-consistency issues somewhat like those for self-modifying code, though happily far more tractable, as we are very unlikely to be executing in the same page as we're modifying. Although I am aware of no standard function to 'flush the icache', it's usually possible somehow - or, on x86, happens automatically.

There are other circumstances in which we write to code pages that need similar treatment: for example, when setting breakpoints, or during GC when moving functions from one place to another. It should be noted that hitting a breakpoint causes us to receive SIGTRAP in our own process - this works pretty well in practice, but tends to confuse gdb if you're also running under gdb. Ideally it would be able to tell when a SIGTRAP was due to a breakpoint it had inserted and when it should instead be delivered to the process it's attached to.

4 Threading

In this section we look at the specific issues we encountered adding thread support to SBCL.

4.1 Thread creation, destruction, accounting

In Linux 2.4, the POSIX threads interface is implemented by LinuxThreads. Writing to this interface would make us compatible with all platforms that offer POSIX threads, if it were possible. There's a problem, though: in kernel versions up to and including 2.4 (it gets a lot better in 2.6) there's a pretty poor match between the kernel's thread model and the requirements of the POSIX thread standard. This requires LinuxThreads to do some pretty offensive things.

The most important problem is signal handling. POSIX threading has some fairly strict requirements about which thread(s) a signal is delivered to, which in LinuxThreads are implemented by having one thread catch a signal and resignal it to another. As we've seen already, though, resignalling is a no-no when we're using three-argument signal handlers.

We decided instead to use the kernel threads directly. After all, POSIX specifies interfaces for C, not for high-level languages. So we call clone() itself, or at least the small C library wrapper for it that allocates a stack for the child.

[3] This holds on Solaris, Tru64, FreeBSD, and all Linux architectures we have ports for, other than the SPARC. I don't know why it has to be different.
[4] This was determined about a year ago by an SBCL user, and I believe was reported as a bug to the SDL maintainers. It may well have been changed since.
We're not directly using NPTL. This is partly because we can't - so much of it is defined as gcc or binutils extensions, and we're using neither gcc nor binutils - and partly because POSIX semantics are not a requirement for us. We should emphasise, though, that the work done on the kernel for NPTL is important to us: by using kernel threads we benefit directly from the speed and scalability optimisations it brought, and when 2.6 is available with new kernel facilities like futexes, we'll use those too.

4.1.1 Garbage Collection

Garbage collection works by periodically pausing the program, then, starting with the 'root set' - the variables that we know are in use, such as the CPU registers and the local variables on the stack - following all the references from those variables to other objects, and copying everything we find into some new area. When we run out of objects, anything left uncopied is evidently no longer needed and can be deallocated. [5]

This is going to be simpler if we know that nothing is changing the data behind our backs, so we have to stop the other threads while it happens. To make life slightly simpler, we dedicate the parent thread of SBCL exclusively to cleaning up after its children: usually it sits in a loop calling waitpid() and reaping children. When signalled by a child to perform GC, it uses ptrace() to attach to and suspend each of the children, and when they're all safely stopped (we're notified by waitpid() for each child) we can do GC. The PTRACE_PEEKUSR call is used to get each child's registers to add to the root set.

There is a wrinkle. Sometimes a thread may set its 'pseudo-atomic' flag: this says that it's in the middle of some operation that shouldn't be interrupted by signals or garbage collection - for example, it has allocated but not yet initialised memory for a new object. After we stop the world for GC, we have to check the flag for each thread and resume any that were in pseudo-atomic, letting them run until they reach a safe suspension point.

4.2 Special variables, thread-local storage

Common Lisp has a feature called "special variables". A special is declared using defvar or defparameter, and behaves almost but not quite like a global variable. If we write

    (defvar *foo* 1)

then *foo* is set to 1 in all parts of the program - except that we can temporarily rebind it to some other value:

    (let ((*foo* 2))
      (some computation))

During the body of the let form it will temporarily be set to 2. Even if the body calls some other function, *foo* will still be 2 in that other function. When the let form finishes executing it reverts to the value 1. Computer science graduates will recognise this as dynamic scoping; Perl programmers will recognise it as a 'local' variable (as opposed to the more normal 'my' variables).

But what do we do if two threads refer to *foo* at once, and one or both of them has rebound it? If it only has a single value at any one time this could get really messy: are rebindings in thread A visible in thread B? We'd rather they weren't: a change to a bound special variable should affect only the thread it happens in.

[5] This is an oversimplification.
So we need some form of thread-local storage, and we need it to be at least reasonably fast. All of memory is the same in both threads, and the only thing that differs between threads is the register contents. We have a few options:

We could dedicate a register to the thread storage base. On an x86 we don't have a lot of registers available, though. Tying one up permanently for this purpose would make code run more slowly, and implementing the change would also need quite a lot of work on the compiler (which already has uses for most of them wired in).

Each thread has its own stack: if we could identify the current thread from the stack pointer, we could use that as an index into a table of thread-local values. But we only have registers (%esp, %ebp) for the bottom of the stack - which keeps moving up and down - and for the current frame. Either we'd have to push the thread base address onto the stack on every function call, or we could make sure that each stack is aligned on, say, a 2Mb boundary, and then find the top of the stack just by masking the stack pointer appropriately. But 2Mb is a lot of stack if we want to run lots and lots of threads.

Or we could use a segment register. Linux uses the 32-bit 'flat' mode of the 386, so segment registers don't see as much use as they did back in the "good old" MSDOS days - usually they're all set to 0, in fact - but they're still available. We use the %fs register, having been warned that %gs would be used by NPTL.

Setting up segment registers in flat mode is a little more complicated than it used to be in real mode. Rather than being just an offset to the memory area we want to access, the %fs register is now a pointer to an entry in a descriptor table, which stores various attributes of the memory area: size, location, permissions etc. This can be a Local Descriptor Table or the Global Descriptor Table. Manipulating the values in a descriptor table is a root-only operation, but the kernel people in their infinite foresight have created a modify_ldt() system call (mostly for the use of Wine, which needs to fake %fs and %gs convincingly for Windows apps to run) that we can use.

The eventual upshot is that we can set up a block of memory for each thread, assign an offset to every variable that gets dynamically bound, then write e.g. %fs:20 to access the thread-local value of the variable at offset 20. Because %fs points to a different place in each thread, the value that comes back will be different in each thread.

There is a downside to using the LDT: chiefly, that it limits us to 8192 entries and hence 8192 threads. As part of the 2.6/NPTL work, Ingo Molnar introduced a user-specified slot in the GDT (again, primarily for Wine to use) which is reloaded per-process on the kernel context switch. When 2.6 is more widely available we'll probably switch to using this, although it should be noted that we currently have fixed 2Mb stacks, so until we fix that to allow variable stack sizes we can't get nearly that many threads anyway.

4.3 Locks

Most of the complexity of thread programming is in correctly arbitrating access to shared resources, or to portions of code that aren't safe to be run by multiple threads at once. In fact the latter is the same thing as the former, given that the reason the code isn't safe to run more than once is that it accesses shared resources. So, any threaded environment must offer some kind of locking constructs.

Spinlocks are built on top of lock cmpxchg. They're not directly available to end users - though note that this is Lisp, so it's intentionally not impossible to call internal functions if you know what you're doing.

User-available locks have queues attached to them: a waiting thread puts itself on a queue, then sleeps by calling sigtimedwait(), and can be woken later by the lock holder when it relinquishes the lock.
When 2.6 is available we'll provide futex-based locks, which should be a little more efficient than this and allow fewer possibilities for missed signals.

We provide ordinary mutexes (one thread may hold a mutex; others have to wait). These are not recursive: if you own a lock and you try to get it again, you will hang, potentially forever, because it's not free - attempting to get a lock you already have usually indicates a problem. Recursive locks are also available for situations where they're necessary: if you attempt to get a recursive lock and you have it already, you just carry on holding it.

We also have condition variables, similar to those in the POSIX thread model and in Java. These provide a race-free means for a thread that holds a lock to release it and go to sleep on a queue until woken by some other thread. When it's woken, it's guaranteed to reacquire the lock before being allowed to do anything else.

Timeouts are important: we usually don't want to wait indefinitely for a lock, and there are other situations too where we want to abort an operation that takes too long. For example, in a web server there's usually little point in serving a response if it's taken more than about a minute to compute, because the user will have gone elsewhere. Timeouts are trivial to implement in a native-threads system: all we need do is call alarm() and install a SIGALRM handler that signals a Lisp condition (similar to an exception in languages like Java). Then we can handle the timeout at whatever point is appropriate using standard Lisp facilities.

5 Future work

There is more work to be done on the threading implementation to make it really robust. This includes resizing the TLS area when we run out of room in it, adding parameters to control the stack sizes, and so forth.

Larger projects for the thread support include the ability to benefit from many of the features introduced in 2.6 by and for NPTL, such as futexes and the per-process GDT slot, and ports to non-x86 architectures (which can use an ordinary register for TLS instead of needing a segment selector) and non-Linux OSes. Andreas Fuchs has already reported some initial success on a FreeBSD port using their rfork thread call.

Other SBCL projects in current development include Unicode support, MacOS X and Windows ports, and changes to take advantage of 64-bit address spaces.
6 Existing tools for FFI

Creating Lisp language bindings for a library is to some extent very easy. Given a function name and a shared library, we just use dlopen() and dlsym(), and call it. The tedious part is knowing what to call it with. Some tools exist that help with digging constants and structure layouts out of header files, but it's not a completely automatic process.

• SB-GROVEL is part of SBCL. It takes a configuration file listing the constants and structure/union names/fields of interest, then writes, compiles and runs a short C program that uses sizeof and offsetof to determine the appropriate numbers, and writes them out as a Lisp source file. It still has rough edges, and obviously doesn't help with API calls or with variables that are only implemented as preprocessor macros (e.g. errno).

• cparse [6] does a moderately good but not perfect job of parsing C headers. Unfortunately, anything less than gcc-style perfection is probably insufficient to parse GNU libc headers.

[6] http://bricoworks.com/~moore/cparse/
• FFIGEN [7] is a patched gcc that can be used to build a dbm file of library calls in a format that makes it easy to create language bindings from them. OpenMCL (another free Lisp compiler, for PPC systems only) uses this, and it's claimed to be pretty slick. As it's based on gcc it's not trivial to build, though, and someone will need to keep updating it against new versions of gcc as they're released. Presently it only works with OpenMCL.

• SWIG, the Simplified Wrapper and Interface Generator, takes annotated C/C++ header files as its input specification, and can generate bindings for several high-level languages. It's reported that it now has s-expressions as an output format, and experimental support for transforming these to UFFI [8] declarations. If it became commonplace for library authors to annotate their header files appropriately, this could be a very powerful tool.

[7] http://openmcl.clozure.com/Doc/interface-translation.html
[8] UFFI is a portable CL package for foreign language interfaces.

7 Recommendations

Today's Unix-like systems are definitely easier to do interesting things with than those of a few years ago. It has long been possible to get the fault address from a SIGSEGV, but looking at old CMUCL code it's apparent that the exact method for doing this used to vary widely between different Unices, and often involved looking in kernel source to find the signal-handler stack frame layout. These days - except, as we noted, on Linux/SPARC [9] - it's a pretty safe bet that SA_SIGINFO and a ucontext will do the trick.

However, the kernel/C library alone is no longer the platform for useful programs - today programmers depend on a host of layered libraries, whether toolkits such as GNOME or KDE, or smaller special-purpose libraries like cdparanoia. In a spirit of continuous improvement, then, I'd like to close by offering the following suggestions to library authors:

• When the usual interface to something involves use of the preprocessor (e.g. errno or the stat family of calls), provide and document an alternative functional interface that can be called with dlsym().

• Avoid mandating "convenience" features such as installing signal handlers for cleanup or calling exit() in error situations. You can't anticipate what the user will or won't find convenient, so allow some way to disable these things.

• If you can avoid monopolising the event loop, this is a good thing. Everyone wants to monopolise the event loop, and only one of the contenders can win.

• For extra points, provide some machine-readable file describing your exported functions and data structures. This could be as simple as running your header files through SWIG to make sure it can parse them and output appropriate interfaces.

[9] And MacOS 10.1, but happily they've fixed it in 10.2.