Event-driven Programming for Robust Software

Frank Dabek, Nickolai Zeldovich, Frans Kaashoek, David Mazières, Robert Morris
MIT Laboratory for Computer Science
{fdabek,kolya,kaashoek,rtm}@lcs.mit.edu, dm@scs.cs.nyu.edu

Abstract

Events are a better means of managing I/O concurrency in server software than threads: events help avoid bugs caused by the unnecessary CPU concurrency introduced by threads. Event-based programs also tend to have more stable performance under heavy load than threaded programs. We argue that our libasync non-blocking I/O library makes event-based programming convenient, and we evaluate extensions to the library that allow event-based programs to take advantage of multi-processors. We conclude that events provide all the benefits of threads with substantially less complexity; the result is more robust software.

1 Introduction

The debate over whether threads or events are best suited to systems software has been raging for decades. The central question of this debate is whether threads or events should be used to manage concurrent I/O. Many prefer threads because they preserve the appearance of serial programming and can take advantage of multi-processor hardware. In practice, however, programmers find threads difficult to use and, as a result, produce buggy software. This paper contributes to the debate by showing that event-based programming can provide a convenient programming model, that it is naturally robust, and that it can be extended to take advantage of multi-processors. For these reasons, we argue that there is no reason to use threads for managing concurrent I/O in system programs: events provide all the benefits that threads provide, but in a more robust way.

Thread-based programs use multiple threads of control within a single program in a single address space [2]. Threaded programs provide I/O concurrency by suspending a thread blocked on I/O and resuming execution in a different thread. Under this model, the programmer must carefully protect shared data structures with locks and use condition variables to coordinate the execution of threads.

Event-based programs are organized around the processing of events. When a program cannot complete an operation immediately because it has to wait for an event (e.g., the arrival of a packet or the completion of a disk transfer), it registers a callback: a function that will be invoked when the event occurs. Event-based programs are typically driven by a loop that polls for events and executes the appropriate callback when an event occurs. A callback executes indivisibly until it hits a blocking operation, at which point it registers a new callback and then returns.
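To make this structure concrete, the following is a minimal sketch of such an event loop, assuming a select()-based poller and a table mapping file descriptors to callbacks; none of these names come from libasync itself.

    // Minimal event-loop sketch (not libasync's actual implementation):
    // callbacks are registered per file descriptor and run to completion.
    #include <sys/select.h>
    #include <functional>
    #include <map>
    #include <utility>

    std::map<int, std::function<void()>> read_callbacks;

    // Register a callback to run once 'fd' becomes readable.
    void on_readable(int fd, std::function<void()> cb) {
        read_callbacks[fd] = std::move(cb);
    }

    void event_loop() {
        while (!read_callbacks.empty()) {
            fd_set rfds;
            FD_ZERO(&rfds);
            int maxfd = -1;
            for (auto &e : read_callbacks) {
                FD_SET(e.first, &rfds);
                if (e.first > maxfd) maxfd = e.first;
            }
            if (select(maxfd + 1, &rfds, nullptr, nullptr, nullptr) <= 0)
                continue;
            // Run each ready callback; a callback that must wait again
            // simply re-registers itself (via on_readable) before returning.
            for (auto it = read_callbacks.begin(); it != read_callbacks.end(); ) {
                if (FD_ISSET(it->first, &rfds)) {
                    auto cb = std::move(it->second);
                    it = read_callbacks.erase(it);
                    cb();
                } else {
                    ++it;
                }
            }
        }
    }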
In this position paper, we show that writing programs in the event-based model can be convenient, and we show how to extend the event-based model to exploit multi-processors in a way that requires programmers to make only minor changes to their code. This multi-processor event model makes it easy to capture the levels of CPU concurrency typically available to systems programs, but does not require the programmer to insert locks or condition variables. The multi-processor support is part of the libasync asynchronous programming library. Unlike programming with threads, libasync doesn't force programmers to manage thread synchronization, doesn't force programmers to guess how much space to reserve for stacks, provides high performance under high load, and doesn't have hidden performance costs. As a result, programs written with libasync tend to be robust. Based on our experience with the library, we believe there is no reason to use threads for managing concurrent I/O.

2 Threads lead to unreliable software

The main advantage of threads is that they allow the programmer to overlap I/O and computation while preserving the appearance of a serial programming model. The main disadvantage of threads is that they introduce concurrent execution even where it is not needed, and this unnecessary concurrency forces programmers to cope with synchronization between threads. In practice, threaded programs almost always have latent data races and deadlocks and, as a result, are not robust. In our view, programmers must be spared the complexities of concurrency to make software more robust.

The problems with thread-based programming have long been recognized. Ousterhout argues that the convenience of threads is not worth the errors they encourage, except for multi-processor code [10]. Engler et al. demonstrate that synchronization errors, particularly potential deadlocks, are common in the Linux kernel [4]. Savage et al. [12] found races in both student code and production servers, and summarize the difficulties of threaded programming as follows:

    Multithreaded programming is difficult and error prone. It is easy to make a mistake in synchronization that produces a data race, yet it can be extremely hard to locate this mistake during debugging.

While it might seem that one could eliminate synchronization by moving to non-pre-emptive threads on a uniprocessor, errors will still arise if a thread blocks (yields the CPU) unexpectedly. In complex systems, it is easy for the programmers of one module not to understand the exact blocking behavior of code in other modules. Thus, one can easily call a blocking function within a critical section without realizing that pre-emption may occur [1]; the sketch below illustrates this hazard. Deadlocks are also a possibility with non-pre-emptive threads, as one still needs to hold locks across known yield points.
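As an illustration (a hypothetical example of ours, not code from any of the cited systems), consider a cache update under a non-pre-emptive thread package. The call to log_update() looks harmless, but it performs blocking I/O internally and therefore yields while the cache invariant is broken:

    // Hypothetical non-pre-emptive thread code. The bug: log_update()
    // blocks on disk I/O, yielding the CPU inside what the author believed
    // was an atomic critical section; another thread can then observe
    // 'cache' while it is only half-updated.
    #include <map>
    #include <string>

    std::map<std::string, std::string> cache;

    // Imagine this lives in another module: the caller may not know that
    // it writes to a log file and yields until the disk I/O completes.
    void log_update(const std::string &key) {
        (void)key;  // stand-in for a blocking write() to a log
    }

    void update_cache(const std::string &key, const std::string &val) {
        cache.erase(key);   // invariant broken: old entry removed...
        log_update(key);    // ...and the thread yields here, mid-update
        cache[key] = val;   // invariant restored only after the yield
    }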
Another complication of threads is the need to predict the maximum size of a stack. Most systems conservatively use one or two pages of memory per thread stack, far more than the few hundred bytes typically required per callback. The designers of the Mach kernel found stack memory overhead so high that they restructured the kernel to use "continuations," essentially rewriting the kernel in an event-driven style [3]. Stack memory overhead is especially burdensome in embedded environments where memory is scarce. The standard practice of allocating full pages for thread stacks also causes additional TLB and cache pressure, particularly with direct-mapped caches [3]. Cohort scheduling attempts to increase data and instruction locality in threaded programs by running related computations, organized as "stages," consecutively [6].

Robust software must gracefully handle overload conditions. Both Pai et al. [11] and Welsh et al. [13] explore the advantages of event-driven programs under high load. Welsh demonstrates that the throughput of a simple server using kernel-supported threads degrades rapidly as many concurrent threads are required. Pai extends the traditional event-driven architecture to overcome the lack of support for non-blocking disk I/O in most UNIX implementations by using helper processes to handle disk reads. On workloads involving many concurrent clients, Pai's event-based Web server provides higher performance than a server using kernel-based threads.

Writing programs in the event-driven style alleviates all of the problems mentioned above. Data races are not a concern, since event-based programs use a single thread of execution. Event-driven programs need only allocate the memory required to hold a callback function pointer and its arguments, not a whole thread stack; this reduces overall memory usage. In addition, these pointers and arguments can be allocated roughly contiguously, reducing TLB pressure. Event-driven programs also avoid the livelock-like behavior of threaded servers under high load by queuing events that cannot be serviced, rather than dedicating resources to them [9].

While this paper focuses on user-level servers, the arguments apply to operating system kernels as well. In this realm, the event-driven architecture corresponds to a kernel driven by interrupts and traps. Ford et al. compare event-driven kernels with threaded kernels in the context of Fluke [5]. Lauer and Needham argue that kernels based on message passing (which corresponds to the event-driven model) have a dual constructed in the procedure-oriented (threaded) model, and vice versa [7]. Based on this observation, they conclude that neither model is inherently preferable. This argument is predicated on a thread model that is non-pre-emptive and supported by monitors in the language; it neglects the problems created by pre-emption and locking that we observe in thread packages commonly in use today.

3 Making asynchronous programming easy

The most cited drawback of the event-driven model is programming difficulty. Threaded programs can be structured as a single flow of control, using standard linguistic constructs such as loops across blocking calls. Event-driven programs, in contrast, require a series of small callback functions, one for each blocking operation. Any stack-allocated variables disappear across callbacks. Thus, event-driven programs rely heavily on dynamic memory allocation and are consequently more prone to memory errors in low-level languages such as C and C++. As an example, consider the following hypothetical asynchronous write function:

    void awrite (int fd, char *buf, size_t size, void (*cb) (void *), void *arg);

awrite might return after arranging for the following to happen: as soon as the file descriptor becomes writeable, write the contents of buf to the descriptor, and then call cb with arg. arg is state to preserve across the callback, state that would likely be stack-allocated in a threaded program. A number of bugs arise with functions like awrite. For example, awrite probably assumes buf will remain intact until the callback, while a programmer might be tempted to pass a stack-allocated buffer. Moreover, the cast of arg to a void pointer and back is not type-safe.

The C++ non-blocking I/O library in use by the authors, libasync, provides several features to eliminate such memory problems. It supplies a generic reference-counted garbage collector, freeing programmers from worrying about which function is responsible for deallocating what data. libasync also provides a type-safe method of passing state between callbacks: a heavily overloaded template function, wrap, allows the programmer to pass data between callbacks with function currying. wrap takes as arguments a function or method pointer and one or more arguments, and returns a function object accepting the original function's remaining arguments. Thus, the state of an operation can be bundled as arguments to successive callbacks; the arguments are type-checked at compile time.
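The following is a minimal sketch of wrap-style currying, written as a standalone template rather than libasync's actual wrap implementation; the write_done callback and its arguments are hypothetical.

    // Sketch of wrap-style currying (C++14), assuming nothing about the
    // real libasync internals: wrap() bundles a function with some leading
    // arguments and returns a callable taking the remaining ones.
    #include <cstddef>
    #include <cstdio>

    template <typename F, typename... Bound>
    auto wrap(F fn, Bound... bound) {
        // The bound arguments are copied into the returned function object,
        // so the state survives until the callback fires (no reliance on a
        // stack frame that will have unwound by then).
        return [fn, bound...](auto... rest) { return fn(bound..., rest...); };
    }

    // Hypothetical callback: invoked when a write on 'fd' completes.
    void write_done(int fd, const char *tag, std::size_t nbytes) {
        std::printf("%s: wrote %zu bytes on fd %d\n", tag, nbytes, fd);
    }

    int main() {
        // Bundle fd and tag now; the event loop supplies nbytes later.
        auto cb = wrap(write_done, 4, "client");
        cb(std::size_t{512});  // simulates the event loop firing the callback
    }

In libasync itself, such a wrapped callback would be handed to an event-registration call; the key property is that the bundled state lives in a heap-allocated, type-checked function object rather than in casts through void pointers.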
Finally, the library provides classes to help deal with the complications of short I/O operations (i.e., when kernel buffers fill up and a system call such as writev writes only part of the data). The suio class can hold a reference-counted object until the "printed" data is consumed by an output call.

Experience with libasync shows that it is easy to learn and use. We use the library day-to-day when implementing network applications. In both undergraduate and graduate classes, students have used the library to complete laboratory assignments, including web proxies and cryptographic file systems. Several of the students have gone on to use the library in projects unrelated to the classes.

4 Multi-processor event programming

We have modified the asynchronous programming library described in Section 3 to take advantage of multi-processors. The modified library (libasync-mp) provides a simple but effective model for executing callbacks on multiple CPUs, while avoiding most of the synchronization complexity of threaded programming models.

Programs written with libasync-mp take advantage of multi-processors through a simple concurrency mechanism: the programmer assigns each callback a color, and the system guarantees that no two callbacks with the same color will run concurrently. Because callbacks are assigned a default color, libasync-mp is backwards compatible with existing event-based applications. This allows the programmer to add parallelism incrementally, focusing on just the event callbacks that are likely to benefit from parallel execution. In contrast, typical uses of threads parallelize the entire computation, by (for example) creating one server thread per client; this means that access to all mutable non-thread-private data must be synchronized.

The concurrency control model provided by libasync-mp also avoids deadlocks: a callback has only one color, so waits-for cycles cannot arise; even if multi-color callbacks were allowed, the colors are declared in advance, so deadlock avoidance would be easy. The libasync-mp model is more restrictive than many thread synchronization models; for example, there is currently no concept of a read-only color. However, our experience suggests that the model is sufficient to obtain about as much parallel speedup as could typically be obtained with threads.

libasync-mp maintains a single queue of pending callbacks and uses one kernel thread per CPU to execute them. Each kernel thread repeatedly dequeues a callback that is eligible to run under the color constraints and executes it. An additional callback, inserted by the library, calls the select() system call to add new callbacks to the queue as the events they correspond to occur. A sketch of this color-constrained dispatch appears below.

Ideal support for multi-processor programming would make it easy to port existing programs to a multi-processor, and easy to modify programs to get good parallel speedup. Thus we are interested in two metrics: performance and ease of programming. We evaluate both criteria on an example application: the SFS file server [8].
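Here is a compact sketch of the color rule, not libasync-mp's actual scheduler: worker threads share one queue and may dequeue a callback only while no other callback of the same color is running. All names (ColoredQueue, add, worker) are ours.

    // Sketch of color-constrained dispatch, assuming one worker() per CPU
    // (e.g., one std::thread each). Not the real libasync-mp scheduler.
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <set>
    #include <utility>

    struct ColoredQueue {
        std::mutex mu;
        std::condition_variable cv;
        std::deque<std::pair<int, std::function<void()>>> pending; // (color, cb)
        std::set<int> running;  // colors currently executing

        void add(int color, std::function<void()> cb) {
            std::lock_guard<std::mutex> l(mu);
            pending.emplace_back(color, std::move(cb));
            cv.notify_one();
        }

        // Runs forever in this sketch; a real library would also poll for
        // new events (e.g., via a select() callback) and support shutdown.
        void worker() {
            for (;;) {
                std::unique_lock<std::mutex> l(mu);
                // Find the first pending callback whose color is not running.
                auto it = pending.begin();
                while (it != pending.end() && running.count(it->first)) ++it;
                if (it == pending.end()) { cv.wait(l); continue; }
                int color = it->first;
                auto cb = std::move(it->second);
                pending.erase(it);
                running.insert(color);
                l.unlock();
                cb();                // runs without holding the queue lock
                l.lock();
                running.erase(color);
                cv.notify_all();     // a blocked color may now be eligible
            }
        }
    };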
Our performance results were obtained on a 700 MHz 4-processor Pentium III Xeon system running Linux kernel 2.4.18. Processor scaling results were obtained by disabling all but the desired number of processors on the server while running the benchmark. All communication between the SFS server and its clients is encrypted with a symmetric stream cipher and authenticated with a keyed cryptographic hash. Because of this use of encryption, the SFS server is compute-bound under heavy workloads, so we expect that libasync-mp can extract significant multi-processor speedup.

The modifications to the SFS server are concentrated in the code that encrypts, decrypts, and authenticates data sent to and received from clients. The callback responsible for sending data to a client is representative of how we parallelized the server: we split it into three smaller callbacks. The first and last remain synchronized with the rest of the server code (i.e., have the default color) and copy data to be transmitted into and out of a per-client buffer. The second callback encrypts the data in the client buffer and runs in parallel with other callbacks (i.e., has a different color for each client). Achieving this parallel speedup required changing approximately 90 lines of code, out of roughly 12,000 in the SFS server.

We measured the total throughput of the file server to all clients, in bits per second, when multiple clients read a 200 MByte file whose contents remained in the server's disk buffer cache. We repeated this experiment for different numbers of processors.

[Figure 1. Performance of the SFS file server using different numbers of CPUs, relative to the performance on one CPU. The y-axis shows throughput in MBytes/second (0 to 30); the x-axis shows the number of CPUs (1 to 4); bars compare "libasync-mp" against "N-copy".]

The bars labeled "libasync-mp" in Figure 1 show the performance of the parallelized SFS server on the throughput test. On a single CPU, the parallelized server is 0.95 times as fast as the original uniprocessor server. The parallelized server is 1.62, 2.18, and 2.55 times as fast as the original uniprocessor server on two, three, and four CPUs, respectively. Further parallelization of the SFS server code would allow it to incrementally take advantage of more processors.

To explore the performance limits imposed by the hardware and operating system, we also measured the total performance of multiple independent copies (as many as CPUs were available) of the original libasync SFS server. In practice, such a configuration would not work unless each server were serving a distinct file system: an SFS server maintains mutable per-file-system state, such as attribute leases, that would require synchronization among the server processes. This test therefore gives an upper bound on the performance that SFS with libasync-mp could achieve. The results are labeled "N-copy" in Figure 1. The SFS server implemented with libasync-mp closely follows the aggregate performance of multiple independent server copies for up to three CPUs. The performance difference in the 2- and 3-processor cases is the penalty incurred by state the server shares across clients, such as file lease data and user ID mapping tables.
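To make the three-way split described above concrete, here is a sketch building on the ColoredQueue example from Section 4. All names (Client, copy_in, encrypt_stage, copy_out, DEFAULT_COLOR) are hypothetical, and the XOR loop merely stands in for the real stream cipher.

    // Builds on the ColoredQueue sketch above; not the actual SFS code.
    #include <unistd.h>
    #include <utility>
    #include <vector>

    const int DEFAULT_COLOR = 0;   // color shared with the rest of the server

    struct Client {
        int id;                    // used to derive a per-client color
        std::vector<char> buf;     // per-client staging buffer
    };

    void encrypt_stage(ColoredQueue *q, Client *c);
    void copy_out(ColoredQueue *q, Client *c);

    // Stage 1 (default color): copy data into the per-client buffer;
    // safe to touch state shared with the rest of the server.
    void copy_in(ColoredQueue *q, Client *c, std::vector<char> data) {
        c->buf = std::move(data);
        q->add(1 + c->id, [q, c] { encrypt_stage(q, c); }); // per-client color
    }

    // Stage 2 (per-client color): CPU-heavy work that runs in parallel
    // with other clients' encryption, but never with itself.
    void encrypt_stage(ColoredQueue *q, Client *c) {
        for (char &b : c->buf) b ^= 0x5a;  // stand-in for the stream cipher
        q->add(DEFAULT_COLOR, [q, c] { copy_out(q, c); });
    }

    // Stage 3 (default color): hand encrypted data to the shared
    // transmit path.
    void copy_out(ColoredQueue *, Client *c) {
        write(1, c->buf.data(), c->buf.size());
    }

Only the middle stage runs outside the default color, so the rest of the server continues to behave exactly like a single-threaded event loop.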
5 Conclusions

The traditional wisdom regarding event-driven programming holds that it offers superior performance but is too difficult to program and cannot take advantage of SMP hardware. We have shown that not only does event-driven programming offer better performance and robustness than threads, but that with libasync-mp, programmers can easily write event-driven applications that take advantage of multiple processors.

References

[1] The Amoeba reference manual: Programming guide. http://www.cs.vu.nl/pub/amoeba/manuals/pro.pdf.

[2] Birrell, A. D. An introduction to programming with threads. Tech. Rep. SRC 35, Digital SRC, 1989. http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-035.html.

[3] Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. Using continuations to implement thread management and communication in operating systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (1991), pp. 122–136.

[4] Engler, D., Chelf, B., Chou, A., and Hallem, S. Checking system rules using system-specific, programmer-written compiler extensions. In USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2000).

[5] Ford, B., Hibler, M., Lepreau, J., McGrath, R., and Tullmann, P. Interface and execution models in the Fluke kernel. In Operating Systems Design and Implementation (1999), pp. 101–115.

[6] Larus, J. R., and Parkes, M. Using cohort scheduling to enhance server performance. In Proc. USENIX Technical Conference (2002).

[7] Lauer, H. C., and Needham, R. M. On the duality of operating system structures. In Proc. Second International Symposium on Operating Systems, IRIA (Oct. 1978). Reprinted in Operating Systems Review 13, 2 (April 1979).

[8] Mazières, D., Kaminsky, M., Kaashoek, M. F., and Witchel, E. Separating key management from file system security. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP '99) (Kiawah Island, South Carolina, Dec. 1999).

[9] Mogul, J. C., and Ramakrishnan, K. K. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems 15, 3 (Aug. 1997), 217–252.

[10] Ousterhout, J. K. Why threads are a bad idea (for most purposes). Invited talk at the 1996 USENIX Technical Conference, 1996.

[11] Pai, V., Druschel, P., and Zwaenepoel, W. Flash: An efficient and portable web server. In Proceedings of the 1999 USENIX Annual Technical Conference (June 1999).

[12] Savage, S., Burrows, M., Nelson, G., Sobalvarro, P., and Anderson, T. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems 15, 4 (1997), 391–411.

[13] Welsh, M., Culler, D., and Brewer, E. SEDA: An architecture for well-conditioned, scalable Internet services. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (Oct. 2001), pp. 230–243.