Extending the Haskell Foreign Function Interface with Concurrency

Simon Marlow and Simon Peyton Jones, Microsoft Research Ltd., Cambridge, U.K. ({simonmar,simonpj}@microsoft.com)
Wolfgang Thaller (wolfgang.thaller@gmx.net)

Abstract

1 Introduction

We express the design as a concrete set of proposed extensions or clarifications to the existing designs for the Haskell FFI and Concurrent Haskell (Section 4). We give a precise specification of the extension in terms of an operational semantics (Section 5). Two of the longest-standing and most widely-used extensions to Haskell 98 are Concurrent Haskell [10] and the Haskell Foreign Function Interface [7]. These two features were specified independently, but their combination is quite tricky, especially when the FFI is used to interact with multi-threaded foreign programs. We sketch three possible implementations, one of which has been implemented in a production compiler, the Glasgow Haskell Compiler [1] (Section 6). The core question is this: what is the relationship between the native threads supported by the operating system (the OS threads), and the lightweight threads offered by Concurrent Haskell (the Haskell threads)? From the programmer's point of view, the simplest solution is to require that the two are in one-to-one correspondence. However, part of the design philosophy of Concurrent Haskell is to support extremely numerous, lightweight threads, which (in today's technology at any rate) is incompatible with a one-to-one mapping to OS threads. Instead, in the absence of FFI issues, the natural implementation for Concurrent Haskell is to multiplex all the Haskell threads onto a single OS thread.

2 Background

Concurrency and the foreign-function interface are two of the longest-standing and most widely used extensions to Haskell 98. In this section we briefly introduce both features and establish our context and terminology.

2.1 The Foreign Function Interface

The Foreign Function Interface extends Haskell 98 with the ability to call, and be called by, external programs written in some other language (usually C). A foreign import declaration declares a foreign function that may be called from Haskell, and gives its Haskell type. For example:

In this paper we show how to combine the clean semantics of the one-to-one programming model with the performance of the multiplexed implementation. Specifically we make the following contributions: First we tease out a number of non-obvious ways in which this basic multiplexed model conflicts with the simple one-to-one programming model, and sketch how the multiplexed model can be elaborated to cope (Section 3).

foreign import ccall safe "malloc"
  malloc :: CSize -> IO (Ptr ())

declares that the external function malloc may be invoked from Haskell using the ccall calling convention. It takes one argument of type CSize (a type which is equivalent to C's size_t type), and returns a value of type Ptr () (a pointer type parameterised with (), which normally indicates a pointer to an untyped value). We propose a modest extension to the language, namely bound threads. The key idea is to distinguish the Haskell threads which must be in one-to-one correspondence with an OS thread for the purpose of making foreign calls, from those that don't need to be.
This gives the programmer the benefits of one-to-one correspondence when necessary, while still allowing the implementation to provide efficient lightweight threads.

Similarly, a foreign export declaration identifies a particular Haskell function as externally-callable, and causes the generation of some impedance matching code to provide the Haskell function with the foreign calling convention. For example:

foreign export ccall "plus"
  addInt :: CInt -> CInt -> CInt

declares the Haskell function addInt (which should be in scope at this point) as an externally-callable function plus with the ccall calling convention. A foreign import declaration may be annotated with the modifiers "safe" or "unsafe", where safe is the default and may be omitted. A foreign function declared as safe may indirectly invoke Haskell functions, whereas an unsafe one may not (that is, doing so results in undefined behaviour). The distinction is motivated purely by performance considerations: an unsafe call is likely to be faster than a safe call, because it does not need to save the state of the Haskell system in such a way that re-entry, and hence garbage collection, can be performed safely—for example, saving all temporary values on the stack such that the garbage collector can find them. It is intended that, in a compiled implementation, an unsafe foreign call can be implemented as a simple inline function call.

There are three approaches to mapping a system of lightweight threads, such as Haskell threads, onto the underlying OS threads: We use Haskell-centric terminology: the act of a Haskell program calling a foreign function (via foreign import) is termed a foreign out-call or sometimes just foreign call; and the act of a foreign program calling a Haskell function (via foreign export) is a foreign in-call.

Multiplexed. All Haskell threads are multiplexed, by the Haskell runtime system, onto a single OS thread, the Haskell execution thread. Context switching between Haskell threads occurs only at yield points, which the compiler must inject. This approach allows extremely lightweight threads with small stacks. Interaction between Haskell threads requires no OS interaction because it all takes place within a single OS thread.

One-to-one. Each Haskell thread is executed by a dedicated OS thread. This system is simple but expensive: Haskell threads are supposed to be lightweight, with hundreds or thousands of threads being entirely reasonable, but most operating systems struggle to support so many OS threads. Furthermore, any interaction between Haskell threads (using MVars) must use expensive OS-thread facilities for synchronisation.

2.2 Concurrent Haskell

Hybrid. Models combining the benefits of the previous two are possible. For example, one might have a pool of OS "worker threads", onto which the Haskell threads are multiplexed in some way, and we will explore some such models in what follows.

Concurrent Haskell is an extension to Haskell that offers lightweight concurrent threads that can perform input/output [10]. It aims to increase expressiveness, rather than performance; a Concurrent Haskell program is typically executed on a uni-processor, and may have dozens or hundreds of threads, most of which are blocked. Non-determinism is part of the specification; for example, two threads may draw to the same window and the results are, by design, dependent on the order in which they execute.
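To make this concrete, here is a small example of our own, using the forkIO primitive introduced below; because the forked thread and the main thread run concurrently, the two messages may appear in either order.

import Control.Concurrent (forkIO, threadDelay)

main :: IO ()
main = do
  -- Fork a child thread; it runs concurrently with the main thread.
  _ <- forkIO (putStrLn "hello from the forked thread")
  putStrLn "hello from the main thread"
  -- Crude: pause briefly so the forked thread gets a chance to run
  -- before the program exits.
  threadDelay 100000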
Concurrent Haskell provides a mechanism, called MVars, through which threads may synchronise and cooperate safely, but we will not need to discuss MVars in this paper. All of these are reasonable implementations of Concurrent Haskell, each with different tradeoffs between performance and implementation complexity. We do not want to inadvertently rule any of these out by overspecifying the FFI extensions for concurrency. Note that our focus is on expressiveness, and not on increasing performance through parallelism. In fact, all existing implementations of Concurrent Haskell serialise the Haskell threads, even on a multiprocessor. Whether a Haskell implementation can efficiently take advantage of a multiprocessor is an open research question, which we discuss further in Section 7. For the purposes of this paper, the only Concurrent Haskell facilities that we need to consider are the following: data ThreadId -- abstract, instance of Eq, Ord myThreadId :: IO ThreadId forkIO :: IO () -> IO ThreadId 3 The problem we are trying to solve The abstract type ThreadId type represents the identity of a Haskell thread. The function myThreadId provides a way to obtain the ThreadId of the current thread. New threads are created using forkIO, which takes an IO computation to perform in the new thread, and returns the ThreadId of the new thread. The newlycreated thread runs concurrently with the other Haskell threads in the system. Concurrent Haskell and the Haskell FFI were developed independently, but they interact in subtle and sometimes unexpected ways. That interaction is the problem we are trying to solve. Our design principle is this: the system should behave as if it was implemented with one OS thread implementing each Haskell thread. This behaviour is simple to explain, and avoids having to expose two “layers” of threads to the programmer. Details on the rest of the operations provided by Concurrent Haskell can be found in [10] and the documentation for the Control.Concurrent library distributed with GHC[1]. However, if implemented naively, the one-to-one model is expensive to implement. The multiplexed model is much cheaper and, where no foreign calls (out or in) are involved, the one-to-one model and the multiplexed model cannot be distinguished by the Haskell program. When foreign interaction enters the picture, matters become more complicated. In the rest of this section we identify several implicit consequences of our design principle, and discuss how the multiplexed implementation technique can accommodate them. 2.3 Haskell threads and OS threads Every operating system natively supports some notion of “threads”, so it is natural to ask how these threads map onto Concurrent Haskell’s notion of a “thread”. Furthermore, this mapping is intimately tied up with the FFI/concurrency interaction that we explore in this paper. 3.1 Foreign calls that block To avoid confusion in the following discussion, we define the following terminology:   Some foreign calls, such as the C function read(), may block awaiting some event, or may simply take a long time to complete. In the absence of concurrency, the Haskell program making the call must also block or take a long time, but not so for Concurrent Haskell. Indeed, our design principle requires the opposite: A Haskell thread is a thread in a Concurrent Haskell program. An operating system thread (or OS thread) is a thread managed by the operating system. 2 Requirement 1: a safe foreign call that blocks should block only the Haskell thread making the call. 
Other Haskell threads should proceed unimpeded. because all foreign calls are executed by a single OS thread, but only at the cost of violating Requirement 1. Alas, satisfying Requirement 1 using the variant described in Section 3.1 seems to be incompatible with Requirement 2, because this variant deliberately uses a pool of interchangeable OS threads. The hybrid model suffers from the same problem. Notice that we only require that a safe foreign call be non-blocking to the other Haskell threads. It would be difficult to make a high-performance unsafe foreign call non-blocking, because that would force the implementation to perform the same state-saving as for a safe call, since the Haskell system must continue running during the call. We are forced, therefore, to propose a small extension to Concurrent Haskell, in which we divide the Haskell threads into two groups: Requirement 1 seems obvious enough, but the Haskell FFI specification is silent on this point, and indeed until recently GHC did not satisfy the requirement. This caused confusion to Haskell programmers, who were surprised when a foreign call blocked their entire Concurrent Haskell program. A bound thread has a fixed associated OS thread for making FFI calls. An unbound thread has no associated OS thread: FFI calls from this thread may be made in any OS thread. The idea is that each bound Haskell thread has a dedicated associated OS thread. It is guaranteed that any FFI calls made by a bound Haskell thread are made by its associated OS thread, although pure Haskell execution can, of course, be carried out by any OS thread. A group of foreign calls can thus be guaranteed to be carried out by the same OS thread if they are all performed in a single bound Haskell thread. Requirement 1 might seem to completely rule out the multiplexed implementation, because if the Haskell execution thread blocks, then execution of Haskell threads will halt. However, a variant of the multiplexed model solves the problem: At a foreign call, arrange that the foreign function is executed by some other OS thread (freshly spawned, or drawn from a pool), while execution of other Haskell threads is continued by the single Haskell execution thread. This approach pays the cost of an OS thread switch at every (safe) foreign call. We do not specify that all Haskell threads are bound, because doing so would specify that Haskell threads and OS threads are in one-to-one correspondence, which leaves the one-to-one implementation model as the only contender. A hybrid model can also be designed to satisfy this requirement: Can several Haskell threads be bound to the same OS thread? No: this would prevent the one-to-one implementation model and cause difficulties for the others. For each OS thread, there is at most a single bound Haskell thread. Have a pool of OS threads, each of which can play the role of the Haskell execution thread, but only one at a time does. At a safe foreign call, the Haskell execution thread leaves the Haskell world to execute the foreign call, allowing one (and only one) member of the pool to become the new Haskell execution thread. No OS thread switch is required on a call, but on the return some inter-OS-thread communication is required to obtain permission to become the Haskell execution thread again.

3.3 Multi-threaded clients

Suppose a C program is using a library written in Haskell, and it invokes a Haskell function (via foreign export). This Haskell function forks a Haskell thread, and then returns to the C program.
Should the spawned Haskell thread continue to run? According to our design principle, it certainly should – as far as the programmer is concerned, there is not much difference between forking a Haskell thread and forking an OS thread.

3.2 Fixing the OS thread for a foreign call

Some C libraries that one might wish to call from Haskell have an awkward property: it matters which calls to the library are made from which OS thread. For example, many OpenGL functions have an implicit "rendering context" parameter, which the library stores in OS-thread-local state. The (perfectly reasonable) idea is that OpenGL can be used from multiple threads, for example to render into independent windows simultaneously.

Requirement 3a: Haskell threads spawned by a foreign in-call continue to run after the in-call returns.

A closely related issue is this. Suppose the C program using the Haskell library itself makes use of multiple OS threads. Then our design principle implies that if one invocation runs Haskell code that blocks (on an MVar, say, or in another foreign call), that should not impede the progress of the other call: This in turn means that to use the OpenGL library from Concurrent Haskell, the FFI must satisfy:

Requirement 2: it must be possible for a programmer to specify that a related group of foreign calls are all made by the same OS thread.

Requirement 3b: multiple OS threads may concurrently invoke multiple Haskell functions (via foreign export), and these invocations should run concurrently.

Notice that there is no constraint on which OS thread executes any particular Haskell thread – we need only control which OS thread executes the foreign calls. To support this behaviour in the multiplexed model is not difficult, but requires some specific mechanism. In both cases, the current Haskell execution OS thread must pay attention to the possibility of another OS thread wanting to make an in-call, lest the latter wait indefinitely while the former chunters away. In fact, the same mechanism is necessary to respond to an OS thread returning to Haskell from a safe foreign out-call. Requirement 2 is automatically satisfied by the one-to-one execution model, provided we are willing to say that the "related" calls are all carried out by a single Haskell thread. The multiplexed model (basic version) also automatically satisfies Requirement 2,

3.4 Callbacks

An OS thread can be associated with at most one Haskell thread. A common idiom in GUI libraries is for the application to make a call to the event loop in the library, which in turn makes calls back into the application in the form of callbacks. Callbacks are registered prior to invoking the event loop. The new function isCurrentThreadBound provides a way for the Haskell programmer to find out whether the current thread is bound or not:

isCurrentThreadBound :: IO Bool

We define isCurrentThreadBound to always return True when the current Haskell thread is a bound thread. It may also return True when the current Haskell thread is indistinguishable from a bound thread by both Haskell code and foreign code called by it. Consider how this works for a Haskell application calling an external GUI library. The callbacks will be foreign-export-ed Haskell functions, so the event loop (in C) will call the callback (in Haskell), which may in turn make a foreign call to a GUI function (in C again). It is essential that this latter call is made using the OS thread that runs the event loop, since the two share thread-local state.
Hence:

Requirement 4: it must be possible to ensure that a foreign out-call from Haskell is made by the same OS thread that made the foreign in-call.

Therefore, an implementation using the one-to-one threading model (see Section 6.1) may return True for all threads, even for Haskell threads created using forkIO, because every Haskell thread has its associated OS thread and can safely access thread-local state. With the notion of bound threads in hand, this is not hard to achieve. We simply specify that a foreign in-call creates a bound thread, associated with the OS thread that performed the in-call. Any foreign out-calls made by that (bound) Haskell thread will therefore be executed by the invoking OS thread.

Foreign import. When a Haskell thread invokes a foreign import annotated with safe, other Haskell threads in the program will continue to run unimpeded. This is not necessarily true if a Haskell thread invokes a foreign import annotated with unsafe. Indeed, this is the only primitive mechanism for creating a bound thread: Notice that unsafe calls are not required to block Haskell threads if the foreign call blocks; instead the behaviour is unspecified. In particular, it is legitimate for a simple, low-performance implementation to implement unsafe calls as safe calls. An unbound Haskell thread is created using Concurrent Haskell's existing forkIO combinator.

Foreign export. Invoking a function declared with foreign export creates a new Haskell thread which is bound to the OS thread making the call. A bound thread is created by a foreign invocation. We provide a forkOS combinator, which allows a Haskell thread (rather than a foreign invocation) to create a new bound thread, but it works by making a foreign call which invokes a callback (see Section 4.2.1).

The main thread. In a complete, standalone Haskell program, the system should run Main.main in a bound Haskell thread, whose associated OS thread is the main OS thread of the program. It is as if the program contained the declaration

foreign export ccall "haskellMain" Main.main :: IO ()

and the Haskell program was started from C by invoking haskellMain().

3.5 Summary

This concludes our description of the problems we address, and of the core of our design. There is no new syntax, and only an implicit distinction between two types of Haskell threads, depending on the way in which the thread was created. The next section describes the language extension in detail, including the small number of combinators that we provide as library functions to allow programmers to work with bound threads.

4.2 Derived combinators

Given the basic functionality outlined above, we can define some useful combinators. These are provided to the programmer via the Control.Concurrent library.

4 The Concurrent Haskell Foreign Function Interface

4.2.1 forkOS

The forkOS function has the same type as forkIO: Thus motivated, we now summarise our proposed changes to the existing Concurrent Haskell design, and the Haskell FFI specification.

forkOS :: IO () -> IO ThreadId

Like forkIO, it also creates a new Haskell thread, but additionally it creates a new OS thread and binds the new Haskell thread to it. This is accomplished by simply making a foreign call to an external function that (a) creates the new OS thread, and (b) in the new OS thread, invokes the requested action via a callback, thus creating a new bound Haskell thread.

4.1 Specific extensions

We propose the following specific additions:

Bound threads.
There are two types of Haskell threads, bound and unbound. A bound thread is permanently associated with a particular OS thread, and it is guaranteed that all foreign functions invoked from that bound thread will be run in the associated OS thread. In all other ways, bound and unbound threads behave identically. We give the implementation of forkOS below for reference, although we have not introduced all of the concepts used in it. It assumes the existence of an external function createOSThread to create the OS thread; its implementation is simple, but depends on the particular thread creation primitives used on the current operating system.

forkOS action = do
    mv <- newEmptyMVar
    entry <- wrapIO $ do
                t <- myThreadId
                putMVar mv t
                action
    createOSThread entry
    tid <- takeMVar mv
    freeHaskellFunPtr entry
    return tid

The syntax of a native thread is given in Figure 1. A native thread of the form N[S] has thread identifier N, while S is an abstraction of its call stack. If H is on top of the stack, the thread is willing to execute a Haskell thread. If F si h is on top of the stack, the thread is in the process of dealing with a call to a foreign function, which will return its result to the Haskell thread h. The safety of the foreign call is given by si, which is either u, meaning unsafe, or s, meaning safe. A native thread of the form N[ε] is a thread which originates in foreign code; it does not have any Haskell calls anywhere on its stack.

foreign import ccall "createOSThread"
    createOSThread :: FunPtr (IO ()) -> IO ()
foreign import ccall "wrapper"
    wrapIO :: IO () -> IO (FunPtr (IO ()))

A native thread of the form N[H] has a stack that exists only to serve Haskell threads, and so can safely block inside a foreign call without impeding other Haskell threads. We call these threads "worker threads".

4.2.2 runInBoundThread

The runInBoundThread combinator runs a computation in a bound thread. If the current thread is bound, then that is used; otherwise a new bound thread is created for the purpose. The combinator is useful when the program is about to make a group of related foreign calls that must all be made in the same OS thread. The syntax of a Haskell thread is given in Figure 2. A Haskell thread h of the form (a)N has action a. The annotation N identifies the native thread N to which the Haskell thread is bound. An action a is a sequence of operations, finishing with a return of some kind. An operation is either an unspecified IO operation (such as performing some evaluation, or operating on an MVar), a call to the primitive forkIO, or a call to a foreign function f. The implementation is straightforward:

runInBoundThread :: IO a -> IO a
runInBoundThread action = do
    bound <- isCurrentThreadBound
    if bound
        then action
        else do
            mv <- newEmptyMVar
            forkOS (action >>= putMVar mv)
            takeMVar mv

We do not model the data passed to, or returned from, a foreign call, nor any details of what the IO operations are. (Further relevant semantics for IO code in Haskell can be found in Peyton Jones's "Tackling the Awkward Squad" [9] and the original Concurrent Haskell paper [10].) Note that forkOS is not mentioned alongside forkIO here. While spawning a new unbound thread requires direct support by the runtime system, creating a new bound thread is done by making a foreign call to the operating system-provided thread creation primitive. Therefore, forkOS need not be considered when we discuss the semantics. Note that runInBoundThread does not return until the IO action completes.
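As a usage sketch (our own illustration; makeCurrentContext and renderFrame are hypothetical bindings to an OpenGL-style C library that keeps its current context in OS-thread-local state), runInBoundThread is the natural way to wrap a group of calls that must all be issued from the same OS thread:

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
module DrawScene where

import Control.Concurrent (runInBoundThread)
import Foreign.Ptr (Ptr)

-- Opaque handle to the library's rendering-context type.
data Context

-- Hypothetical C functions; the first stores the context in
-- OS-thread-local state, the second reads it back.
foreign import ccall safe "makeCurrentContext"
  makeCurrentContext :: Ptr Context -> IO ()
foreign import ccall safe "renderFrame"
  renderFrame :: IO ()

-- Because both calls are made from a single bound Haskell thread,
-- they are guaranteed to be executed by the same OS thread, so the
-- thread-local context set by the first call is still current for
-- the second.
drawScene :: Ptr Context -> IO ()
drawScene ctx = runInBoundThread $ do
  makeCurrentContext ctx
  renderFrame

If drawScene is called from a thread that is already bound (for example, a thread created by a foreign in-call), runInBoundThread simply runs the action in place; otherwise it forks a fresh bound thread for the duration of the calls.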
5 Operational Semantics

In order to make the design for our language extension precise, we now give an operational semantics for Concurrent Haskell with the FFI and bound threads. The operational semantics is highly abstract: it does not model any details of actual computation at all. Instead, it models only the operations and interactions we are interested in:

- The running system consists of a pool of native (OS) threads and a pool of Haskell threads.
- Haskell threads may fork new Haskell threads (forkIO), make foreign calls, and perform unspecified IO operations.
- Native threads have an identifier and a stack. A native thread may be currently executing Haskell code or foreign code, depending on what is on top of the stack. Native threads executing foreign code may make a call to Haskell code, creating a new Haskell thread.

The semantics models the relationship between native threads and Haskell threads, and the difference between bound and unbound Haskell threads.

[Figure 1: Syntax of Native Threads. A native thread t has the form N[S]; a stack S is either ε (empty), H : S (executing Haskell), or F si h : S (executing a foreign call); a native thread with an empty stack is executing foreign code only; the call safety si is either u (unsafe) or s (safe).]

[Figure 2: Syntax of Haskell Threads. A Haskell thread h has the form (a)bt, where the bound-thread annotation bt is either ε (not bound) or N (bound to native thread N); a Haskell action a is either a sequence p >> a or RET (return from a call into Haskell); an operation p is either τ (an IO operation), forkIO a (fork a thread), or F si f (a foreign call).]

5.1 Evolution

The symbol N refers to a set of native threads, and H to a set of Haskell threads. An executing program consists of a combination of the two, written N; H. We describe how the system evolves in a very standard way, using transition rules of the form N; H ⟹ N'; H'. The structural rules are these: if N; H ⟹ N'; H', then N ∪ {t}; H ⟹ N' ∪ {t}; H'; and if N; H ⟹ N'; H', then N; H ∪ {h} ⟹ N'; H' ∪ {h}. These standard rules allow us to write the interesting transitions with less clutter. The transition rules for the system are given in Figure 3. We informally describe each rule in the semantics below:

[Figure 3: Operational Semantics. The formal transition rules IO, FORKIO, FCALL1, FCALL2, FRET, HCALL1, HCALL2, HRET, HEND, WKR, WKREND, EXT and NEND.]

IO A Haskell thread may perform an arbitrary IO operation. Note that we only require that there is one native thread ready to execute Haskell code (with H on top of its stack). The native thread may or may not be the same as the native thread bound to this Haskell thread, if any.

FORKIO A Haskell thread invokes forkIO, giving rise to a new, unbound, Haskell thread.

FCALL1 A bound Haskell thread makes a foreign call. The call must be made in the native thread bound to this Haskell thread.

FCALL2 An unbound Haskell thread makes a foreign call. The call is made in a worker thread.

We will also describe the issue of I/O multiplexing, i.e. how to transparently speed up I/O beyond the levels attainable using safe foreign calls.

6.1 One OS Thread for one Haskell Thread

A very simple and straightforward implementation would be to use the operating system supplied thread library to create exactly one OS thread for each thread in the system.
In such an implementation, there is no distinction between bound and unbound threads; every thread can be considered a bound thread. This is entirely in accordance with the operational semantics outlined in Section 5. Only three of the rules (FORKIO, FCALL2 and HEND) explicitly refer to unbound Haskell threads; it’s easy to see that nothing in these rules prevents each Haskell thread from having its dedicated OS thread. FRET A foreign call returns; the Haskell thread which made the call is reintroduced into the pool of Haskell threads. HCALL1 A native thread running exclusively foreign code (no Haskell frames on the stack) makes a call to a Haskell function. A new bound Haskell thread is created. HCALL2 A native thread currently executing a safe foreign call from Haskell invokes another Haskell function. A bound Haskell thread is created for the new call. In Section 4.1, we defined that isCurrentThreadBound may return True whenever the calling Haskell thread is indistinguishable from a bound thread; we can therefore define isCurrentThreadBound = return True, and we do not need to keep track of how Haskell threads were created. HRET A bound Haskell thread returns; the native thread which made the call continues from the call site. HEND An unbound Haskell thread returns. Concurrent Haskell’s thread creation and synchronisation primitives are simply mapped to the corresponding operating system functions. The forkIO primitive can be implemented the same way as forkOS. WKR This rule models the birth of new worker OS threads, in case they should all be blocked in a foreign call. WKREND A worker thread exits. EXT A new native thread is created by foreign code. The only challenge is managing access to the heap; it is very hard to support truly simultaneous multithreaded Haskell execution (on SMP systems), so it will be necessary to have a global mutual exclusion lock that prevents more than one thread from executing Haskell code at the same time. NEND A native thread, executing foreign code only, exits. In a executing program, there may be multiple valid transitions according to the semantics at any given time. We do not specify which, if any, of the valid transitions must be performed by the implementation. This global lock would have to be released periodically to allow other threads to run; it would also have to be released for safe foreign calls. Note that in particular this admits an implementation that does nothing at all; legislating against such behaviour in the semantics is entirely non-trivial, so we do not attempt it. Rather, we informally state the behaviour we expect from an implementation: Incidentally, this is exactly the strategy used by the native-code O’Caml implementation (see Section 7.2).   If a Haskell thread is performing an unsafe foreign call, then the implementation is allowed to refrain from making any further transitions until the call returns.  If a Haskell thread is performing a safe call, then the implementation should continue to make valid transitions in respect of other Haskell threads in the system. 6.2 All Haskell Threads in one OS Thread The second approach is to extend the fully multiplexed scheme to include bound threads. This is a natural extension for an existing single-threaded Haskell implementation where performance of foreign calls is not critical. The implementation should not otherwise starve any Haskell threads. A single OS thread (the Haskell execution thread) is allocated for the Haskell system, and is used exclusively to execute Haskell code. 
All Haskell threads are multiplexed using this OS thread. Transitions which are not part of this semantics are erroneous. An implementation may either detect an report the error, or it may behave in some other implementation-defined manner. An application which depends on implementation-defined behaviour is, of course, non-portable.  Additionally, the Haskell execution thread must keep track of: Any OS threads which have made in-calls. Each of these has given rise to a bound Haskell thread.  A pool of OS threads that can be used to make “safe” foreign calls. 6 Implementation In this section, we will outline three different implementations of the language extensions described in this paper. The third of these has been implemented in the Glasgow Haskell Compiler; it is the most complex of the three, but it provides the best performance. When a Haskell thread makes an out-call, there are two cases to consider:  The Haskell thread is bound. The Haskell execution thread 7  must pass a message to the appropriate OS thread in order to make the call, and the OS thread must return the result via another message back to the Haskell execution thread. Haskell thread terminates by returning, the OS thread will release the Capability and the scheduler loop will exit, returning to the foreign code that called it. The Haskell thread is unbound. The situation is similar, except that the OS thread to make the call can be drawn from the pool. Threads that are just returning from a foreign call and threads that are handling a call to Haskell from foreign code are given priority over other threads; whenever it enters the scheduler, the thread that holds the capability checks whether it should yield its capability to a higher-priority thread (items 2 and 3 in the above list). The complicated part of this implementation is the passing of messages between OS threads to make foreign calls and return results: essentially this is a remote procedure call mechanism. However, if the Haskell system is an interpreter, it may already have support for making dynamic foreign calls in order to implement foreign import. After yielding the capability and after passing the capability to another thread (item 4 in the list), the thread will immediately try to reacquire the capability; the thread will be blocked until another thread passes a capability to it again (via item 4 above), or until the Capability becomes free without being explicitly passed anywhere (item 5). A compiled implementation is unlikely to want use this scheme, due to the extra overhead on foreign calls. For an interpreter, however, this implementation strategy may be less complicated than the hybrid scheme discussed in the next section. 6.4 I/O Multiplexing 6.3 GHC’s Implementation Traditional “multiplexing” run time systems that do not support non-blocking foreign calls usually still provide support for nonblocking input and output. GHC’s run-time system employs one OS thread for every bound thread; additionally, there is a variable number of so-called “worker” OS threads that are used to execute the unbound (lightweight) threads. The obvious way to do this on POSIX systems is to use the select or poll system calls together with non-blocking I/O. When a read or write request fails to return the requested amount of data, the Haskell thread in question will be suspended. 
The scheduler loop will periodically use select or poll to check whether any suspended Haskell threads need to be woken up; if there are no runnable Haskell threads, the entire run-time system will block in the select or poll system call. Only one of these threads can execute Haskell code at any one time; the global lock that ensures this is referred to as “the Capability”. GHC’s main scheduler loop is invoked in all threads; all but one of the scheduler loops are waiting for the Capability at any one time. Having more than one Capability available would indicate that truly simultaneous multithreaded Haskell execution is available; our current implementation does not however support this, because it would require synchronised access to the heap and other shared state. Whether the implementation can be practically extended in this direction is an open question. The Concurrent FFI makes this machinery unnecessary; a “safe” foreign call to read or write will have the desired effect for a multi-threaded Haskell program. However, using select or poll it is possible to achieve much better performance than using safe foreign calls, because it does not require an extra OS thread for each potentially-blocking I/O operation. 6.3.1 Passing around the Capability At first, we tried extending GHC’s existing (single-OS-thread) implementation of I/O multiplexing to work with the hybrid threading model described above. In this scheme, an OS thread that blocks inside select still held the Capability to prevent multiple OS threads from using select simultaneously. When foreign code called in to or returned to Haskell while the RTS was waiting for I/O, it was necessary to interrupt the select by sending a dummy byte across a pipe, which slowed down foreign calls (both incoming and outgoing) a lot. A thread will relinquish its Capability (i.e. execution of Haskell code will continue in a different OS thread) under the following conditions: 1. A safe (i.e. non-blocking) foreign call is made (FCALL1/2). For an unsafe call, we just hold on to the Capability, thereby preventing any other threads from running. 2. Another OS thread is waiting to regain the Capability after returning from a foreign call. Fortunately, it turned out that a more efficient solution can be implemented entirely in Haskell, with no special support from the run time system beyond the extensions described in this paper. 3. Another OS thread is waiting for the Capability because that thread is handling a foreign call-in. 4. The scheduler loop determines that the next Haskell thread to run may not be run in the OS thread that holds the Capability. The Haskell I/O library spawns an unbound Haskell thread, called the “I/O Service Thread”, which uses a foreign call to select or a similar system call to watch a set of file descriptors. One of these file descriptors is the read end of a dedicated “wakeup pipe” which will be used to notify the service thread when the set of file descriptors to be watched has changed. When a scheduler loop encounters a Haskell thread that is bound to a different OS thread, it has to pass the Capability to that OS thread. When a scheduler in a bound OS thread encounters an unbound thread, it has to pass the Capability to a worker OS thread. When an unbound Haskell thread needs to block in order to wait for some I/O, it will do the following: 5. The Haskell thread bound to the current OS thread terminates (HRET). 1. 
Store the file descriptor in question in a global mutable variable (an MVar). If the current OS thread has a bound Haskell thread and this

7.1 Java

2. Wake up the service thread by writing a byte to the wakeup pipe.

Java [2] began with a lightweight threading implementation, with all Java threads managed by a single OS thread (Java calls this "green threads"). Later implementations of Java moved to a native threading model, where each Java thread is mapped to its own OS thread. The reasons for the switch seem to be primarily

3. Wait for the service thread to notify us via an MVar.

The I/O service thread will repeatedly do the following:

1. Grab the set of file descriptors to be watched from the global mutable variable.

- Non-scalability of green threads to SMP machines
- Inability to call functions in external libraries which may block, without blocking the entire system

2. Do a safe foreign call to select or a similar system call in order to block until the status of one of the file descriptors or of the wakeup pipe changes.

And perhaps the motivation was partly due to the JNI, which works smoothly because native threads are one-to-one with Java threads.

3. If a byte has arrived on the wakeup pipe, read it from there in order to reset the pipe's state to non-readable.

In contrast to Java, scaling to SMP machines is not a goal for us. There is no efficient SMP-capable Concurrent Haskell implementation, because doing so is still an open research question; the main sticking point is how to synchronise access to the main shared resource (the heap) without killing performance of the system. Furthermore, scaling to multiprocessors can often be achieved in the same ways as scaling to a cluster, by using multiple processes with explicit communication.

4. Notify all Haskell threads waiting for file descriptors that have become readable or writable via their MVars.

5. Repeat.

When a bound thread needs to wait for a file descriptor to become readable, it should just safe-call select for that file descriptor, because that will be more efficient than waking the I/O service thread. This scheme manages to keep the number of separate OS threads used when n unbound threads are doing I/O at the same time down to just two, as opposed to n when safe foreign calls are used. GHC's previous scheme (released in GHC 6.2) needed just one OS thread in the same situation, but at the cost of one call to select every time through the scheduler loop, a write() to a pipe for every (safe) foreign call, and a lot of additional complexity in the run time system.

7.2 O'Caml

O'Caml [4] supports a choice between user-level and native threads for its bytecode interpreter, but compiled code must use native threads. An O'Caml programmer using native threads may currently assume that each O'Caml thread is mapped to a single OS thread for the purposes of calling external libraries.

The new scheme requires no help from the run time system, removes the periodic call to select and supports more efficient foreign calls, at the cost of some inter-OS-thread messaging for every read or write that actually needs to block. According to our measurements, this overhead can be neglected.
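The waiting side of this protocol can be expressed directly in Concurrent Haskell. The following is a minimal sketch of our own (the names IOService, pending, pokeWakeup and waitForRead are ours, not GHC's), assuming the I/O library has already created the shared state and the wakeup pipe:

import Control.Concurrent.MVar
import System.Posix.Types (Fd)

-- State shared with the I/O service thread: the set of file
-- descriptors being waited on (each paired with the MVar used to
-- wake its waiter), plus an action that writes one byte to the
-- wakeup pipe.
data IOService = IOService
  { pending    :: MVar [(Fd, MVar ())]
  , pokeWakeup :: IO ()
  }

-- Block the calling (unbound) Haskell thread until fd is readable,
-- following the three steps described above.
waitForRead :: IOService -> Fd -> IO ()
waitForRead svc fd = do
  done <- newEmptyMVar
  modifyMVar_ (pending svc) (return . ((fd, done) :))  -- 1. register the request
  pokeWakeup svc                                       -- 2. wake the service thread
  takeMVar done                                        -- 3. wait to be notified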
Native threads were chosen over user-level threads for compiled code for the following reasons:

- The difficulty of implementing multithreaded I/O in a user-level scheduler across multiple platforms is high. Using native threads allows this issue to be handed off to the operating system, which significantly reduces implementation complexity and improves portability.
- Compiled O'Caml threads use the machine stack. With user-level threads, the scheduler must therefore be able to manage multiple machine stacks, which is heavily platform-dependent.

7 Related Work

We believe there is nothing in the literature that bears directly on the particular issues addressed in this paper. However, there is a great deal of folklore and existing practice in the form of language implementations, which we review here. In O'Caml, a non-blocking foreign call is made by defining a C function which wraps the foreign call between the special calls enter_blocking_section() and leave_blocking_section(); this may only be done when using the native implementation of threads. Similarly, calls to O'Caml functions from C must be wrapped between leave_blocking_section() and enter_blocking_section(). This is equivalent to, if slightly less convenient than, Haskell's safe foreign calls and callbacks. To summarise the related work: there is a general trend amongst languages with concurrency support to move from lightweight threads to OS threads in one-to-one mapping with the language's own threads. The most commonly quoted reasons for the switch are for accessing foreign library functions that might block, and scaling to SMP machines. O'Caml could straightforwardly be extended with the concept of bound threads, which would leave the implementation free to use user-level threads with a pool of native threads for foreign calls in the same way as GHC. This would of course entail more implementation complexity, which may be worse than for GHC due to the use of the machine stack by O'Caml native code as noted above (GHC uses separate thread stacks managed by the runtime). In relation to this paper, all of these languages could support the bound/unbound thread concept, which would then give the implementation freedom to use cheaper lightweight threads for the unbound threads. To our knowledge, there are no other languages that actually do support this idea.

7.3 C# and .NET

to use a one-to-one implementation for Concurrent Haskell, and the bound/unbound concept would be redundant. However, there are reasons to believe that this is unlikely to happen, at least in the near future: The .NET Common Language Runtime (CLR) uses a one-to-one mapping between CLR threads and native Windows threads. Hence, threads in C# are fairly heavyweight.

- The Native POSIX Threads Library [6] is 1:1, and claims better performance than a competing N:M implementation [3]. The improvement is largely attributed to the complexity of implementing the N:M scheme.

To mitigate this, the .NET base class libraries include the ThreadPool class, which manages a pool of worker threads and a queue of tasks to be performed, including asynchronous I/O and timers.
The ThreadPool class multiplexes the waiting operations onto a single thread, which significantly reduces the cost of a blocking operation compared with using a new thread. Computation tasks can also be submitted to the ThreadPool, and will be performed whenever there is a free thread in the pool. Therefore, ThreadPools achieve cheaper concurrency by avoiding repeated thread creation/deletion, at the expense of possibly having to wait for a computation to be performed if the thread pool is empty.

Each OS thread by definition needs its own machine stack. Machine stacks are immovable, so must be allocated a fixed portion of the address space with enough room for the stack to grow. Since the library doesn't know ahead of time how much stack space a thread will need, it must guess, and inevitably this will end up wasting a lot of address space, which on a 32-bit machine is a scarce resource. In contrast, Haskell threads have stacks that are fully garbage collectable, and can be moved and grown at will.

When used for multiple I/O requests, the ThreadPool concept is basically equivalent to the I/O multiplexing scheme used in GHC (Section 6.4). The main difference is that GHC's scheme is hidden from the programmer, who automatically gets the benefit of optimised multiplexing for all I/O operations provided the underlying implementation supports it. Anderson et al. [5] proposed a way to effectively combine the benefits of user-level threads and kernel threads by having explicit communication between the kernel scheduler and the user-level thread scheduler. A derivative of this scheme is currently being implemented in the FreeBSD operating system; no performance measurements were available at the time of writing.

7.4 User-level vs. kernel threads

8 Conclusion

Why should we care about lightweight threads? Many other languages have ditched the concept in favour of a one-to-one mapping between the language's own threads and native OS threads. We have designed a simple extension to the Haskell Foreign Function Interface, for specifying precisely the interaction between the FFI and Concurrent Haskell. It allows for the following features: The reason is that lightweight Haskell threads can still be significantly cheaper than using OS threads. For example, the fastest implementation of native threads on Linux, the Native POSIX Threads Library [6], claims 20µsec per thread creation/exit, whereas GHC's implementation of Concurrent Haskell can achieve an order of magnitude improvement over this: the conc004 test from GHC's test suite performed 10⁶ thread creation/exit operations in 1.3sec on a 1GHz PIII, giving a thread creation/exit time of 1.3µsec. The NPTL paper doesn't give details on what hardware was used for their measurements, but a 1GHz PIII would seem to be a reasonable guess, being a midrange system at the date of that publication.

- Non-blocking foreign calls
- Callbacks and call-ins from multithreaded applications
- Interacting with multithreaded foreign libraries, and foreign libraries that make use of thread-local state

Furthermore, the extensions require no new syntax, and have a simple operational semantics. A few simple library functions are provided for the programmer to work with the extensions. Moreover, we have done this without requiring any fundamental restructuring of existing Haskell implementations: there is no requirement that the Haskell runtime be multithreaded, or that particular OS threads are used to execute Haskell code.
However, we do accommodate an efficient implementation based on lightweight Haskell threads and a pool of OS worker threads for execution. Native threads on other OSs are even more expensive; Windows, for example, has notoriously expensive operating system threads. These implementations of OS threads map OS threads onto kernel threads, and the kernel manages the scheduling of threads. This is the reason for much of the overhead: many thread operations require a trip into the kernel. There is an implementation of the efficient scheme in a production Haskell compiler (GHC), and we are currently gathering experience in using it.

So can OS threads be implemented in user space? Certainly; there are many implementations of purely user-space threading libraries, and these are indeed often faster than kernel threads. One problem, however, is that this doesn't let the multithreaded application take advantage of a multiprocessor; for that you need at least one kernel thread for each processor, so to this end hybrid models have been developed [8, 3] which use a mixture of user-space and kernel threads (sometimes called an M : N threading model, indicating that M user-space threads are mapped onto N kernel threads). It remains to be seen whether an implementation of OS threads can approach the performance of lightweight Concurrent Haskell threads. If that were to happen, then there would be no reason not

9 References

[1] The Glasgow Haskell Compiler. http://www.haskell.org/ghc.
[2] The Java language. http://java.sun.com/.
[3] Next Generation POSIX Threading. http://www-124.ibm.com/pthreads/.
[4] The O'Caml language. http://www.ocaml.org/.
[5] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.
[6] Ulrich Drepper and Ingo Molnar. The Native POSIX Thread Library for Linux. Technical report, Red Hat, February 2003. http://www.redhat.com/whitepapers/developer/POSIX_Linux_Threading.pdf.
[7] Manuel Chakravarty (ed.). The Haskell 98 Foreign Function Interface 1.0: An addendum to the Haskell 98 report. http://www.cse.unsw.edu.au/~chak/haskell/ffi/.
[8] Richard McDougall and Jim Mauro. Solaris Internals. Prentice Hall, 2000.
[9] Simon Peyton Jones. Tackling the awkward squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell. In C. A. R. Hoare, M. Broy, and R. Steinbrueggen, editors, Engineering Theories of Software Construction, Marktoberdorf Summer School 2000, NATO ASI Series, pages 47–96. IOS Press, 2001.
[10] Simon Peyton Jones, Andrew Gordon, and Sigbjorn Finne. Concurrent Haskell. In Conference Record of the 23rd Annual ACM Symposium on Principles of Programming Languages, pages 295–308, St Petersburg Beach, Florida, January 1996. ACM.