Adding native thread support to SBCL - or n things every programmer should know about signal handling

Daniel Barlow
metacircles Ltd
dan@metacircles.com

July 14, 2003

Abstract

Threads (lightweight processes sharing a common address space) are now a near-standard technique used in many large applications, and supported - in varying ways and to different extents - in many languages and most operating systems. We added thread support to the Steel Bank Common Lisp environment on Linux platforms, using the clone() syscall: this presents interesting challenges.

• The de-facto standard for CL multithreaded programming was based on special-purpose Lisp hardware (where, for example, user hooks could be installed directly into the scheduler) and is less suited to today's mainstream computing systems.

• Lisp's introspective abilities and runtime environment require OS support for things like writing to our own instruction stream, setting breakpoints on ourselves, and finding the register contents and faulting address in a SIGSEGV handler. These features are often provided, more seldom tested, and hardly ever documented.

• The primary interface to the Linux kernel is through glibc, and is defined in terms of the glibc header files. Accessing this functionality from languages other than C is sometimes harder than it needs to be.

Topics include thread-local storage, garbage collection, mmap, breakpoints and ptrace, signal handling, floating point, terminal handling, dynamic linking, and atomic operations. We conclude with a look at some tools (cparse, SWIG, FFIGEN) that help in this area, and some recommendations for kernel and library authors about how they can make life easier for users of non-C languages.

1 The challenge

Traditionally, Lisp and Unix were very different cultures. Unix vendors of the 80s produced workstations that, although expensive by PC standards, actually had a very attractive price/performance ratio. Unix is based around a philosophy of multiple independent processes which communicate by sending streams of bytes to each other - typically using sockets, files or pipes.

Lisp companies, on the other hand, produced fiendishly expensive single-user workstations with Lisp-based operating systems and hardware support for fast type and bounds checking. Because these systems were single-user, and the hardware protected against most kinds of accidental damage, Lisp systems used to allow each task to see all the others: no marshalling overhead, easy object sharing, no memory protection. They called their tasks "processes", but in today's terms they're more like threads.

Unfortunately for Lisp machine vendors, "fiendishly expensive" and "custom hardware" could mean only one thing. The Lisp hardware customers were mostly the AI companies: the combined effect of Moore's Law making general-purpose hardware a lot faster, and the bottom falling out of the AI market [1], mostly killed them.

Thanks to the ongoing march of technology, in this decade we can put a fully functional Lisp onto a standard hardware platform, and it's not even the biggest process - Mozilla or OpenOffice.org comfortably exceed the resource requirements of Unix-based Lisps. Stock hardware has the CPU power, and standard OSes such as Linux can - with a small amount of forcing - make good software platforms for a Lisp.

2 The language

Why Lisp, anyway? If you ask ten different Lisp programmers you'll get ten different answers, but some of the common traits are:

• Interactive: like the interpreted BASICs of old, Lisp implementations usually start up in a state where expressions can be typed into them to be executed immediately. This makes testing and prototyping much faster. Note that, unlike interpreted BASICs, any self-respecting Lisp also integrates a compiler into this environment.

• Extensible: the parentheses that plague so many Lisp newbies are actually a feature. Because the printed syntax corresponds so closely to the parse trees, macros make it easy to rewrite source forms to extend the syntax of the language in exciting ways. This eases the creation of domain-specific languages, making bottom-up programming a lot more fun.

• Multiparadigm: flexible syntax has meant that Lisp has often been used by language researchers experimenting with new language paradigms. Out of the box, Common Lisp has support for programming in imperative, functional, or OO style, or some combination of these. Nor is it hard to write, for example, a Prolog interpreter in Lisp, if the problem is best expressed in declarative style, or to add support for lazy evaluation.

• Standard: Common Lisp was standardised by ANSI in 1994 (making it the first ANSI-standard object-orientated language, incidentally). The standard was mostly a formalisation of what existing Lisps were already doing, so it tends to be pragmatic rather than idealistic.

Steel Bank Common Lisp (SBCL) is a native-code Common Lisp environment that runs on a number of Unix platforms. It started life as a fork of CMU Common Lisp, which was originally part of the Spice project at CMU - so its lineage (and some of its code) can be traced back for decades. Over the course of the last nine months we added support for Linux kernel threads to SBCL: in this paper we look generally at the places where we have to do unusual things for Unix, but with especial emphasis on features we need for threads.

[1] Comparison with the dot-com bust is left as an exercise for the reader.
2.1 Multiprocessing in Lisp

The Common Lisp standard does not include a threading interface. Although the Lisp hardware systems of old had Lisp functions to inspect and manipulate multiple tasks, other Lisp implementations of the day were running on Unix or on microcomputers, so they didn't usually support threading and would not have voted for a change that required them to. Regardless, the interface has since become something of a de facto standard for Lisps that did support threads, usually using userland threads with a scheduler written in Lisp. However, it makes some assumptions about the thread implementation that make it a poor fit for kernel threads.

For example, without-interrupts is a macro that wraps around code that must be executed without interruption. On a Lisp machine, it really does turn interrupts off. On Unix, it masks signals. This means that the scheduler won't run for the duration, so Lisp code often called this to disable multitasking while doing some operation that must be atomic. Needless to say, this is slow and difficult to arrange when the kernel is doing the scheduling - and bad news for scalability anyway.

process-wait is another problematic function: it takes a predicate function as a parameter, and stops the current process until the predicate returns true. With a Lisp-based scheduler this is pretty simple to implement, but with an OS scheduler we're going to have to call the predicate every time we switch to this process, then switch away again if it's false. Although we could reduce the system load by adding a delay, this is still not a sensible use of cycles.

The problem is not that this function exists - after all, sometimes you need to do this - the problem is that there are no other functions in the standard interface to make a thread wait for an event, so people end up writing their own queue implementations on top of it. If we provided - and encouraged people to use - a queue implementation that didn't do all these extra context switches to processes that we already know aren't ready to run, things would go much more smoothly.

Given these problems, and with the intention of creating an interface that looks a bit more familiar to today's programmers, we mostly ignored the Lispm interface. We're calling our threads threads, not processes. Threads are identified by ids, and we have functions like make-thread, current-thread-id and destroy-thread to manipulate them.

It's likely that a Lispm-like compatibility layer will be added to ease porting from other Lisps, but even in that layer it's unlikely that we'll ever implement without-interrupts.

3 The platform

Linux: "because it's there". Our interest is in creating a free Lisp development environment, so a proprietary OS would be a non-starter. The alternative, to implement an OS from scratch, would be a seriously mammoth task, and to implement one from scratch with the same attention to performance and scalability as has been lavished on Linux would probably take as long as Linux did. Why reinvent the wheel when there's a perfectly good antigravity belt there for the taking?

In this section, we look at some of the differences between SBCL and "normal" Unix programs.

3.1 Calling convention

In C, functions accept a fixed number of arguments, each of a fixed type. A limited degree of support for variable numbers of arguments is available with stdarg. There is a single return value.

In Lisp, we can (in general) pass any type of object to any function, and additionally we have "keyword", "optional" and "rest" arguments [2], and multiple return values. It's somewhere between difficult and impossible to shoehorn all of this into a C-compatible calling convention, and SBCL doesn't even try. Calls to C functions therefore involve a glue layer which translates from one argument-passing convention to the other, so they are probably best avoided in time-critical inner loops.

[2] An optional argument is declared with a name in the parameter list, which is bound to a default value if unsupplied. A rest argument is a name to which a list of all the unnamed arguments is bound.
3.2 Preprocessing

Even if we did use the C calling convention, we'd still have a problem: parts of a C language API are specified as preprocessor macros. To inline C code into our Lisp functions would involve doing things to gcc that rms would really not be quite happy with, so we have to write wrapper functions for these macros, usually in C, and suffer the additional function-call overhead.

In the particular context of adding thread support, this presents a particular problem for thread-local data access. As we'll see later, Lisp needs a lot of thread-local data, and having to call into C on every access is a good deal slower than we'd really like it to be.

3.3 Signal handling and debugging

It is unusual for a Unix program to continue running for very long after receiving SIGSEGV, SIGBUS, or SIGTRAP. It's not unexpected for it to stop after receiving SIGFPE (e.g. on a divide-by-zero error).

However, dropping back to the shell prompt is usually undesirable in an interactive system: post-mortem debugging is generally far less productive than in-process debugging, where all the state (including open files, network connections, etc.) is available, and besides, we'd rather not kill off any of the other threads that are running unrelated jobs. So we really want to catch these signals and run error handlers that can fix the problem and continue.

(Indeed, sometimes we intentionally ask for these signals in normal operation. For example, the garbage collector tracks writes to old pages by mapping them read-only and catching SIGSEGV when they're written to.)

To diagnose and recover effectively we need more than just the signal number. We declare all our signal handlers using sigaction() with the SA_SIGINFO flag. This causes our handler to be called with three arguments: the second is a siginfo structure, and the third is usually a ucontext structure [3] containing a wealth of information, including the program counter at the time of signalling, the register contents, and other per-signal information - for example, in the case of a SIGSEGV, the faulting address. Much of this information is used internally in SBCL or passed on to the debugger.

Libraries that install signal handlers are, therefore, bad news. Even if their handlers still work in our runtime, and even if they then call the previously installed handler instead of bailing out, they rarely do it properly with correct siginfo and ucontext - and almost certainly don't think to check whether we were using sigaltstack() to get an alternate signal stack. So at best our handlers will not know why they're running, and more likely we'll get a core dump with a fairly meaningless backtrace, or an infinite loop of segmentation faults. Libraries that call exit() are bad news in any circumstances.

SBCL users have seen this problem in practice with SDL, for example. It installs a SIGSEGV handler, presumably to free up resources and return the console to text mode if necessary. This is bad, because we use SIGSEGV for our page write protection. When SDL quits it restores the old handler, but it doesn't restore it with the SA_SIGINFO flag, so it still loses [4].

3.3.1 Introspection

SBCL includes an interactive compiler - in fact, there is no interpreter; everything that is evaluated is compiled first. When the user asks to evaluate a function, we compile it directly into memory, and then jump to it. This presents cache-consistency issues somewhat like those for self-modifying code, though happily far more tractable, as we are very unlikely to be executing in the same page as we're modifying. Although I am aware of no standard function to 'flush the icache', it's usually possible somehow - or, on x86, happens automatically.

There are other circumstances in which we write to code pages that need similar treatment: for example, when setting breakpoints, or during GC when moving functions from one place to another. It should be noted that hitting a breakpoint causes us to receive SIGTRAP in our own process - this works pretty well in practice, but tends to confuse gdb if you're also running under gdb. Ideally it would be able to tell when a SIGTRAP was due to a breakpoint it had inserted and when it should instead be delivered to the process it's attached to.

4 Threading

In this section we look at the specific issues we encountered adding thread support to SBCL.

4.1 Thread creation, destruction, accounting

In Linux 2.4, the POSIX threads interface is implemented by LinuxThreads. Writing to this interface would make us compatible with all platforms that offer POSIX threads, if it were possible. There's a problem, though: in kernel versions up to and including 2.4 (it gets a lot better in 2.6) there's a pretty poor match between the kernel's thread model and the requirements of the POSIX thread standard. This requires LinuxThreads to do some pretty offensive things.

The most important problem is signal handling. POSIX threading has some fairly strict requirements about which thread(s) a signal is delivered to, which in LinuxThreads are implemented by having one thread catch a signal and resignal it to another. As we've seen already, though, resignalling is a no-no when we're using three-argument signal handlers.

We decided instead to use the kernel threads directly. After all, POSIX specifies interfaces for C, not for high-level languages. So we call clone() itself, or at least the small C library wrapper for it that allocates a stack for the child.

[3] This holds on Solaris, Tru64, FreeBSD, and all Linux architectures we have ports for, other than the SPARC. I don't know why it has to be different.
[4] This was determined about a year ago by an SBCL user, and I believe was reported as a bug to the SDL maintainers. It may well have been changed since.
We're not directly using NPTL. This is partly because we can't - so much of it is defined as gcc or binutils extensions, and we're using neither gcc nor binutils - and partly because POSIX semantics are not a requirement for us. We should emphasise, though, that the work done on the kernel for NPTL is important to us: by using kernel threads we benefit directly from the speed and scalability optimisations it brought, and when 2.6 is available with new kernel facilities like futexes, we'll use those too.

4.1.1 Garbage Collection

Garbage collection works by periodically pausing the program, then, starting with the 'root set' - the variables that we know are in use, such as the CPU registers and the local variables on the stack - following all the references from those variables to other objects, and copying everything we find into some new area. When we run out of objects, anything left uncopied is evidently no longer needed and can be deallocated. [5]

This is going to be simpler if we know that nothing is changing the data behind our backs, so we have to stop the other threads while it happens. To make life slightly simpler, we dedicate the parent thread of SBCL exclusively to cleaning up after its children: usually it sits in a loop calling waitpid() and reaping children. When signalled by a child to perform GC, it uses ptrace() to attach to and suspend each of the children, and when they're all safely stopped (we're notified by waitpid() for each child) we can do GC. The PTRACE_PEEKUSR call is used to get each child's registers to add to the root set.

There is a wrinkle. Sometimes a thread may set its 'pseudo-atomic' flag: this says that it's in the middle of some operation that shouldn't be interrupted by signals or garbage collection - for example, it has allocated but not yet initialised memory for a new object. After we stop the world for GC, we have to check the flag for each thread and resume any that were in pseudo-atomic, letting them run until they reach a safe suspension point.

4.2 Special variables, thread-local storage

Common Lisp has a feature called "special variables". A special is declared using defvar or defparameter, and behaves almost but not quite like a global variable. If we write

    (defvar *foo* 1)

then *foo* is set to 1 in all parts of the program - except that we can temporarily rebind it to some other value:

    (let ((*foo* 2))
      (some computation))

During the body of the let form it will temporarily be set to 2. Even if the body calls some other function, *foo* will still be 2 in that other function. When the let form finishes executing it reverts to the value 1. Computer science graduates will recognise this as dynamic scoping; Perl programmers will recognise it as a 'local' variable (as opposed to the more normal 'my' variables).

But what do we do if two threads refer to *foo* at once, and one or both of them has rebound it? If it only has a single value at any one time this could get really messy: are rebindings in thread A visible in thread B? We'd rather they weren't: a change to a bound special variable should affect only the thread it happens in.

[5] This is an oversimplification.
So we need some form of thread-local storage, and we need it to be at least reasonably fast. All of memory is the same in both threads, and the only thing that differs between threads is the register contents. We have a few options:

We could dedicate a register to the thread storage base. On an x86 we don't have a lot of registers available, though. Tying one up permanently for this purpose would make code run more slowly, and implementing the change would also need quite a lot of work on the compiler (which already has uses for most of them wired in).

Each thread has its own stack: if we could identify the current thread from the stack pointer, we could use that as an index into a table of thread-local values. But we only have registers (%esp, %ebp) for the bottom of the stack - which keeps moving up and down - and for the current frame. Either we'd have to push the thread base address onto the stack on every function call, or we could make sure that each stack is aligned on, say, a 2Mb boundary, and then find the top of the stack just by masking the stack pointer appropriately. But 2Mb is a lot of stack if we want to run lots and lots of threads.

Or we could use a segment register. Linux uses the 32-bit 'flat' mode of the 386, so segment registers don't see as much use as they did back in the "good old" MSDOS days - usually they're all set to 0, in fact - but they're still available. We use the %fs register, having been warned that %gs would be used by NPTL.

Setting up segment registers in flat mode is a little more complicated than it used to be in real mode. Rather than being just an offset to the memory area we want to access, the %fs register is now a pointer to an entry in a descriptor table, which stores various attributes of the memory area: size, location, permissions etc. This can be a Local Descriptor Table or the Global Descriptor Table. Manipulating the values in a descriptor table is a root-only operation, but the kernel people in their infinite foresight have created a modify_ldt() system call (mostly for the use of Wine, which needs to fake %fs and %gs convincingly for Windows apps to run) that we can use.

The eventual upshot is that we can set up a block of memory for each thread, assign an offset to every variable that gets dynamically bound, then write e.g. %fs:20 to access the thread-local value of the variable at offset 20. Because %fs points to a different place in each thread, the value that comes back will be different in each thread.

There is a downside to using the LDT: chiefly, that it limits us to 8192 entries and hence 8192 threads. As part of the 2.6/NPTL work, Ingo Molnar introduced a user-specified slot in the GDT (again, primarily for Wine to use) which is reloaded per-process on the kernel context switch. When 2.6 is more widely available we'll probably switch to using this, although it should be noted that we currently have fixed 2Mb stacks, so until we fix that to allow variable stack sizes we can't get nearly that many threads anyway.

4.3 Locks

Most of the complexity of thread programming is in correctly arbitrating access to shared resources, or to portions of code that aren't safe to be run by multiple threads at once. In fact the latter is the same thing as the former, given that the reason the code isn't safe to run more than once is that it accesses shared resources. So, any threaded environment must offer some kind of locking constructs.

Spinlocks are built on top of lock cmpxchg. They're not directly available to end users - though note that this is Lisp, so it's intentionally not impossible to call internal functions if you know what you're doing.

User-available locks have queues attached to them: a waiting thread puts itself on a queue, then sleeps by calling sigtimedwait(), and can be woken later by the lock holder when it relinquishes the lock.
When 2.6 is available we'll provide futex-based locks, which should be a little more efficient than this and allow fewer possibilities for missed signals.

We provide ordinary mutexes (one thread may hold a mutex; others have to wait). These are not recursive: if you own a lock and you try to get it again, you will hang, potentially forever, because it's not free - attempting to get a lock you already have usually indicates a problem. Recursive locks are also available for situations where they're necessary: if you attempt to get a recursive lock and you have it already, you just carry on holding it.

We also have condition variables, similar to those in the POSIX thread model and in Java. These provide a race-free means for a thread that holds a lock to release it and go to sleep on a queue until woken by some other thread. When it's woken, it's guaranteed to reacquire the lock before being allowed to do anything else.

Timeouts are important: we usually don't want to wait indefinitely for a lock, and there are other situations too where we want to abort an operation that takes too long. For example, in a web server there's usually little point in serving a response if it's taken more than about a minute to compute, because the user will have gone elsewhere. Timeouts are trivial to implement in a native-threads system: all we need do is call alarm() and install a SIGALRM handler that signals a Lisp condition (similar to an exception in languages like Java). Then we can handle the timeout at whatever point is appropriate using standard Lisp facilities.

5 Future work

There is more work to be done on the threading implementation to make it really robust. This includes resizing the TLS area when we run out of room in it, adding parameters to control the stack sizes, and so forth.

Larger projects for the thread support include the ability to benefit from many of the features introduced in 2.6 by and for NPTL, such as futexes and the per-process GDT slot, and ports to non-x86 architectures (which can use an ordinary register for TLS instead of needing a segment selector) and non-Linux OSes. Andreas Fuchs has already reported some initial success on a FreeBSD port using their rfork thread call.

Other SBCL projects in current development include Unicode support, MacOS X and Windows ports, and changes to take advantage of 64-bit address spaces.
6 Existing tools for FFI

Creating Lisp language bindings for a library is to some extent very easy. Given a function name and a shared library, we just use dlopen() and dlsym(), and call it. The tedious part is knowing what to call it with. Some tools exist that help with digging constants and structure layouts out of header files, but it's not a completely automatic process.

• SB-GROVEL is part of SBCL. It takes a configuration file listing the constants and structure/union names/fields of interest, then writes, compiles and runs a short C program that uses sizeof and offsetof to determine the appropriate numbers, and writes them out as a Lisp source file. It still has rough edges, and obviously doesn't help with API calls or with variables that are only implemented as preprocessor macros (e.g. errno).

• cparse [6] does a moderately good but not perfect job of parsing C headers. Unfortunately, anything less than gcc-style perfection is probably insufficient to parse GNU libc headers.

[6] http://bricoworks.com/~moore/cparse/
• FFIGEN [7] is a patched gcc that can be used to build a dbm file of library calls in a format that makes it easy to create language bindings from them. OpenMCL (another free Lisp compiler, for PPC systems only) uses this, and it's claimed to be pretty slick. As it's based on gcc it's not trivial to build, though, and someone will need to keep updating it against new versions of gcc as they're released. Presently it only works with OpenMCL.

• SWIG, the Simplified Wrapper and Interface Generator, takes annotated C/C++ header files as its input specification, and can generate bindings for several high-level languages. It's reported that it now has s-expressions as an output format, and experimental support for transforming these to UFFI [8] declarations. If it became commonplace for library authors to annotate their header files appropriately, this could be a very powerful tool.

[7] http://openmcl.clozure.com/Doc/interface-translation.html
[8] UFFI is a portable CL package for foreign language interfaces.

7 Recommendations

Today's Unix-like systems are definitely easier to do interesting things with than those of a few years ago. It has long been possible to get the fault address from a SIGSEGV, but looking at old CMUCL code it's apparent that the exact method for doing this used to vary widely between different Unices, and often involved looking in kernel source to find the signal-handler stack frame layout. These days - except, as we noted, on Linux/SPARC [9] - it's a pretty safe bet that SA_SIGINFO and a ucontext will do the trick.

However, the kernel/C library alone is no longer the platform for useful programs - today programmers depend on a host of layered libraries, whether toolkits such as GNOME or KDE, or smaller special-purpose libraries like cdparanoia. In a spirit of continuous improvement, then, I'd like to close by offering the following suggestions to library authors:

• When the usual interface to something involves use of the preprocessor (e.g. errno or the stat family of calls), provide and document an alternative functional interface that can be called with dlsym().

• Avoid mandating "convenience" features such as installing signal handlers for cleanup or calling exit() in error situations. You can't anticipate what the user will or won't find convenient, so allow some way to disable these things.

• If you can avoid monopolising the event loop, this is a good thing. Everyone wants to monopolise the event loop, and only one of the contenders can win.

• For extra points, provide some machine-readable file describing your exported functions and data structures. This could be as simple as running your header files through SWIG to make sure it can parse them and output appropriate interfaces.

[9] And MacOS 10.1, but happily they've fixed it in 10.2.