Using Time Instead of Timeout for Fault-Tolerant Distributed Systems

LESLIE LAMPORT
SRI International

A general method is described for implementing a distributed system with any desired degree of fault-tolerance. Instead of relying upon explicit timeouts, processes execute a simple clock-driven algorithm. Reliable clock synchronization and a solution to the Byzantine Generals Problem are assumed.

Categories and Subject Descriptors: C.2.4 [Computer-Communications Networks]: Distributed Systems--network operating systems; D.1.3 [Programming Techniques]: Concurrent Programming; D.4.1 [Operating Systems]: Process Management--synchronization; D.4.3 [Operating Systems]: File Systems Management--distributed file systems; D.4.5 [Operating Systems]: Reliability--fault-tolerance; D.4.7 [Operating Systems]: Organization and Design--distributed systems; real-time systems

General Terms: Design, Reliability

Additional Key Words and Phrases: Clocks, transaction commit, timestamps, interactive consistency, Byzantine Generals Problem

This work was supported in part by the National Science Foundation under Grant No. MCS 7816783, and in part by the Ballistic Missile Defense Agency under Contract No. DASG60-78-C-0046. Author's address: Computer Science Laboratory, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025.

1. INTRODUCTION

In programming asynchronous multiprocess systems, the customary approach has been to make process synchronization independent of the execution rates of any components. This requires synchronization algorithms in which one process must wait for another to do something before it can proceed. In distributed systems, this means waiting for a message from the other process. These time-independent algorithms cannot be fault-tolerant, because a process could fail by doing nothing, and such a failure manifests itself only as a reduction of the process's execution rate [5].

The usual method of obtaining fault-tolerant synchronization in distributed systems is to add timeouts to time-independent algorithms. A process sets a timer whenever it begins waiting for another process, and a failure is assumed to have occurred if a certain period of time elapses without a response from the other process. A number of fault-tolerant synchronization algorithms have been proposed that use timeouts in this way. However, these algorithms provide only a limited degree of fault-tolerance. Every previously published synchronization algorithm that we know of can be defeated by the failure of a single component. The "fault-tolerant" algorithms are tolerant only of restricted kinds of failure, and can fail if a faulty process sends conflicting information to two other processes. Moreover, most of the algorithms provide only ad hoc solutions to individual problems. An exception is the method of [12], which provides a general approach to distributed synchronization that is similar to ours, but it assumes "nice" failures.
The use of timeouts rests upon assumptions about the real-time behavior of the system. It is assumed that a waiting process can infer from the occurrence of a timeout that a failure has occurred. In this paper, we show how assumptions about real-time behavior can be used to infer information other than the existence of a failure. We describe a general algorithm to achieve any desired form of synchronization with any desired degree of fault-tolerance--including the ability to tolerate totally arbitrary and malicious failures. The algorithm is based on the use of absolute times instead of timeouts, and can be considered an extension of the approach of [6], achieving fault-tolerance by using physical instead of logical clocks. The generality of the algorithm is demonstrated by applying it to several distributed computing problems. A resource allocation problem is considered in some detail, and other applications are briefly sketched, including a robust distributed database and a reliable transaction commit protocol.

Achieving high reliability is expensive; the exact cost is discussed in the conclusion. It may therefore be impractical to implement the entire system with our algorithm. In this case, the algorithm can be used to implement a reliable synchronization kernel, which can maintain system integrity even though component failures cause the loss of some functionality. This is illustrated by a brief discussion of three examples--a distributed file system, the transaction commit protocol, and a reliable fault detection scheme.

2. THE ASSUMPTIONS

We model a distributed system as a network of processes joined by communication links (not necessarily completely connected). Each process executes an event-driven algorithm, where an event is the arrival of a message or the process's clock reaching a certain value. The use of timeouts rests upon the following assumption.

Assumption UC1. For any event e that causes process i to send a message to process j, there is a δ such that if event e occurs at time T and processes i and j and the communication link joining them are nonfaulty, then the message arrives at process j by time T + δ--where time T and the time when the message arrives are either both measured according to process i's clock, or are both measured according to process j's clock.

In general, the value of δ may depend upon the processes i and j and the particular event e. For simplicity, we assume that there is a single constant that works for all i, j, and e. This saves us from having to keep track of many different δ's. The important assumption being made in UC1 is that the value of δ is fixed in advance, and does not depend upon other message traffic or other demands for processing resources. There are two parts to the delay δ:

--The time needed to process the event and generate the message.
--The time needed to transmit the message across the communication link.

Bounding the message transmission time requires a high enough bandwidth to handle all the messages that may be generated. Bounding the processing time requires enough processing power to handle all the events that may occur. In practice, it may not be possible to find an absolute bound on the processing and message transmission times. In this case, a value of δ should be chosen which is large enough so that the messages generated by almost all events e arrive within δ seconds.
If a burst of message traffic over the communication link or a burst of other demands on the processing resource prevents the message generated by e from arriving within δ seconds, then UC1 defines this to be a failure. Our fault-tolerant algorithms can handle such failures. With a statistical distribution of message transmission and response times, the probability of a failure can be decreased by making δ larger.

When process i sends a message to process j and waits for a response, it can time out 2δ seconds after deciding to send the message--δ seconds for the message generated by the "decision event" to reach process j, and another δ seconds for the acknowledgment message (generated by the "receipt-of-message event") to reach i. By applying UC1 twice, we see that the timeout will only occur if there is a failure. Note how this uses the fact that UC1 bounds the length of time as measured by both the sender i and the receiver j.

It often simplifies the description of an algorithm if we allow a process to send messages to itself. We therefore assume that there is a communication link joining any process with itself, and that UC1 still applies when i = j. Since sending a message to itself is really an internal processor operation, this means we are assuming that any internal action initiated by i at time T will be completed by time T + δ.

To satisfy UC1, all nonfaulty processes must maintain clocks whose running rates are more or less the same.¹ Actually, it is enough that their rates be within a constant multiple of one another. In practice, the clocks will run at the same rate to within one part in 10^4, and often to within one part in 10^6. Although the different processes' clocks must run at approximately the same rate, the use of timeouts does not require that they maintain the same absolute time. In this paper, we make the additional assumption that the clocks are synchronized to keep approximately the same absolute time, and show how they can be used in solving the synchronization problems that arise in distributed systems. More precisely, we make the following assumption.²

¹ It is customary to use the term "clock" for a device that tells absolute time, and "timer" for a device that measures time intervals. Since it is easy to build a clock given such a timer, we will not discriminate between the two terms.
² Note that UC2 uses the notion of simultaneity, so it tacitly assumes a preferred global time frame--i.e., "Newtonian" time.

Assumption UC2. At any time, the clocks of any two nonfaulty processes differ by at most ε.

Since real clocks run at slightly different rates, they tend to drift apart. Hence, satisfying UC2 requires that the clocks be periodically resynchronized. In the presence of completely arbitrary failures, this resynchronization is quite difficult. Fault-tolerant clock synchronization algorithms do exist, and are described in [8]. Here, we simply assume UC2.

It is easy to see that UC1 and UC2 together imply the following result.

PROPOSITION 1. If an event e occurring at time T on process i's clock causes process i to send a message to process j, and processes i and j and the communication link joining them are nonfaulty, then the message arrives at process j by time T + δ + ε on its clock.

In the expression T + δ + ε, δ represents the maximum time needed to generate and transmit the message, while ε represents the maximum difference between the two clocks.
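To make the two bounds concrete, the following Python sketch computes the sender-side round-trip timeout obtained by applying UC1 twice and the receiver-side deadline given by Proposition 1. The numeric values of δ and ε, and the helper names, are assumptions chosen purely for illustration.

DELTA = 0.010    # assumed 10 ms bound on processing + transmission (UC1's delta)
EPSILON = 0.001  # assumed 1 ms bound on clock skew (UC2's epsilon)

def round_trip_timeout(send_time):
    """Sender-side timeout: delta for the request to reach j, plus delta for the
    acknowledgment to return, both measured on the sender's own clock (UC1 applied
    twice). Expiry without a reply implies that a failure has occurred."""
    return send_time + 2 * DELTA

def receiver_deadline(timestamp):
    """Proposition 1: a message generated at time T on the sender's clock reaches a
    nonfaulty receiver by T + delta + epsilon on the receiver's clock."""
    return timestamp + DELTA + EPSILON

if __name__ == "__main__":
    T = 100.0
    print("sender times out at", round_trip_timeout(T))     # 100.02
    print("receiver may wait until", receiver_deadline(T))  # 100.011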
There is one further assumption that we will need.

Assumption UC3. A process can determine the immediate source of any message that it receives.

Process i is the "immediate source" of a message sent to process j over the communication link joining i and j. If i is simply relaying the message to j from some other process, UC3 does not imply that j can determine the original source of the message--only that j knows it just came from i. Identifying the source of a message becomes a problem only if a failure can cause an incorrect message to be delivered; otherwise the source can be included as part of the message. Without assumption UC3, it appears that a single failed process can defeat any distributed algorithm by "impersonating" nonfaulty processes, thereby disrupting all communication. UC3 is easy to achieve if interprocessor communication is by direct wire between the processors. It may be very difficult to achieve if interprocess communication is by a broadcast medium such as an ethernet.

3. HOW TO SEND A MESSAGE BY NOT SENDING A MESSAGE

Using clocks, one can convey information by not doing something. We formulate this ability with a method for sending a message by doing nothing, letting the nonreceipt of a message be interpreted as the receipt of a special NULL message. This is accomplished with the following algorithm, by which process i sends process j a message m at time T over the communication link joining them.

Algorithm I: send message timestamped T from i to j

  Process i:
    WHEN clock = T DO IF m ≠ NULL THEN send message T:m to j FI OD

  Process j:
    WHEN clock = T + δ + ε DO
      IF exactly one T:m message has been received from i
        THEN message received := m
        ELSE message received := NULL FI OD

The "exactly one" in the last IF statement is needed to handle the case in which a failed process i sends multiple messages. In Algorithm I, process i sends a non-NULL timestamped message in the ordinary way. However, it sends a NULL message timestamped T by simply doing nothing. If a process can fail only by "dying"--i.e., by simply halting and doing nothing--then a failed process i can only send a NULL message. Hence, one possible meaning of NULL is "I have failed". In general, the following result holds. It is an easy consequence of Proposition 1 and Assumption UC3.

PROPOSITION 2. Algorithm I has the property that, if process j is nonfaulty, then:

1. By time T + δ + ε on its clock, process j receives a message timestamped T from process i.
2. If process i and the communication link joining i and j are also nonfaulty, then this message is the one sent by process i.

We can use Algorithm I to send a separate message from process i to process j at every clock tick. If the clocks give the time in nanoseconds, then this means that a billion messages per second are being sent. Of course, almost all of them will be NULL messages, which do not involve any real message transmission. However, the logical existence of all these messages lies at the heart of our general method for implementing distributed synchronization.
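The receiver's rule in Algorithm I can be rendered as a short Python sketch. The list-of-triples message buffer and the constant values are assumptions made for the example, not part of the algorithm as stated above.

DELTA = 0.010    # assumed bound from UC1
EPSILON = 0.001  # assumed bound from UC2
NULL = None

def receive_timestamped(buffer, sender, T):
    """At local clock time T + DELTA + EPSILON, decide what message timestamped T was
    sent by `sender`: the unique real message if exactly one arrived, otherwise the
    special NULL message (which costs no transmission)."""
    matching = [m for (src, ts, m) in buffer if src == sender and ts == T]
    return matching[0] if len(matching) == 1 else NULL

if __name__ == "__main__":
    buf = [("i", 100.0, "request")]
    print(receive_timestamped(buf, "i", 100.0))  # "request"
    print(receive_timestamped(buf, "i", 101.0))  # None -> NULL; no message was sent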
If there is no direct communication link joining processes i and j, then a message sent from i to j must be relayed via intermediate processes k1, ..., kn-1. Algorithm I is generalized as follows.

Algorithm II: send message timestamped T from i to j along γ, where γ is the path i = k0, k1, ..., kn = j

  Process i:
    WHEN clock = T DO IF m ≠ NULL THEN send message T:i:m to k1 FI OD

  Process ks: (0 < s < n)
    WHEN message T:k0: ... :ks-1:m received from ks-1 DO
      send message T:k0: ... :ks:m to ks+1 OD

  Process j:
    WHEN clock = T + nδ + ε DO
      IF exactly one T:k0: ... :kn-1:m message received from kn-1
        THEN message received := m
        ELSE message received := NULL FI OD

The following generalization of Proposition 2 holds.

PROPOSITION 3. Algorithm II has the property that, if process j is nonfaulty, then:

1. By time T + nδ + ε on its clock, process j receives a message timestamped T along path γ.
2. If all the processes and communication links on the path γ are nonfaulty, then this message is the one sent by process i.

The truth of this proposition is almost obvious. The only difficulty is seeing that the failure of some process not on the path γ cannot cause j to receive an extra T:k0: ... :kn-1:m' message. A little thought will show why this is so, and there is a simple inductive proof of the proposition whose discovery we leave to the reader.

Like Algorithm I, Algorithm II has the property that the NULL message is transmitted without any real messages being sent. Hence, we can use it to send a separate message every clock tick over the path γ.

To achieve fault-tolerant communication, so that two nonfaulty processes can send messages to one another despite the failure of communication links or other processes, Algorithm II is applied several times for different γ to send the same message along disjoint paths. The number of different paths depends upon the type of failure to be handled. If a failure cannot cause the transmission of an incorrect (real) message, then up to f failures can be tolerated by using f + 1 disjoint paths.³ To handle f arbitrary failures, one must use 2f + 1 paths and majority voting.

³ This is also the case if one assumes unforgeable digital signatures.

Synchronizing processes requires not just reliable communication between two processes, but a reliable broadcasting algorithm by which a process can communicate with all other processes. We therefore assume such an algorithm, implemented with the underlying network message-passing operations. The precise condition that the algorithm must satisfy is the following, where Δ is some unspecified constant.

IC. If process i initiates the broadcast of a message at time T on its clock, then:

1. If i is nonfaulty, then every nonfaulty process j receives the message sent by i by time T + Δ on its clock.
2. If j and j' are both nonfaulty, then either each of them receives the same message by time T + Δ on its clock, or each of them receives no message by time T + Δ on its clock.

We are assuming that process i broadcasts the message to itself as well as to the other processes. It turns out to be convenient to let i send messages to itself in this way. Note that condition 2 follows from condition 1 if process i is nonfaulty. We can obviously assume that Δ > δ, since one cannot implement such a broadcast algorithm on a network having a message-passing delay greater than Δ.
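Returning to the use of Algorithm II over several disjoint paths: a minimal sketch of the receiving end's majority vote, assuming 2f + 1 paths and a simple list of per-path deliveries, might look as follows. The representation is an assumption of the sketch; NULL stands for a path that delivered nothing by its deadline.

from collections import Counter

NULL = None

def majority_vote(copies, f):
    """`copies` holds the value delivered along each of the 2f + 1 disjoint paths.
    A value backed by at least f + 1 paths must have traveled at least one path
    containing no faulty component, so it is the value actually sent."""
    assert len(copies) == 2 * f + 1
    value, count = Counter(copies).most_common(1)[0]
    return value if count >= f + 1 else NULL

if __name__ == "__main__":
    # One faulty relay altered its copy; the two good paths outvote it (f = 1).
    print(majority_vote(["commit", "abort", "commit"], f=1))  # "commit"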
Condition IC is essentially the interactive consistency condition discussed and solved in [11], [9], and [1]--it differs only in stating an explicit time bound, and considering the possibility of no message being received within that time. The solutions require the ability to send a message along a path in such a way that:

1. The message always arrives at its destination.
2. If all the processes and communication links are nonfaulty, then the message arrives correctly.

However, Proposition 3 shows that this is guaranteed by Algorithm II. It is easy to see that by using Algorithm II for message transmission, each solution to the Byzantine Generals Problem that works in the presence of f "traitors" (failed processes) provides an algorithm that satisfies condition IC when there are at most f process and communication link failures. The ability of the Byzantine Generals solutions to handle arbitrary malicious behavior by the traitors means that the resulting broadcast algorithms work despite completely arbitrary failures.

Restricting the class of failures permits less costly solutions. For example, assuming that failures cannot cause incorrect messages to be transmitted makes the implementation of message authenticators trivial, so the algorithm using authenticators in [11] and its generalization to the "signed, written message" algorithm of [9] can be employed. Making still stronger assumptions can permit even simpler solutions [13]. Thus, assuming a perfectly reliable interprocess communication network permits trivial solutions.

Using Algorithm II, we can transmit NULL messages by doing nothing. An examination of the Byzantine Generals solutions shows it is easy to implement them so that receiving only NULL messages results in the broadcast of the NULL value. We then obtain a broadcast algorithm satisfying IC in which a NULL message is broadcast by doing nothing. With such an algorithm, we can let each process broadcast a separate value at every clock tick--so long as almost all the messages are NULL.

We will not say any more about what kind of solution is used; we simply assume some broadcast algorithm that satisfies Condition IC. However, we must note that this can be achieved only if any two processes can always communicate at least indirectly with one another. By assuming it, we are therefore assuming that failures cannot partition the network of nonfaulty processes and communication links into two disjoint components. This is tantamount to assuming a sufficient degree of connectivity for the communication graph.

4. HOW TO SOLVE ANY SYNCHRONIZATION PROBLEM

4.1 The State Machine Algorithm

We now describe our general method for implementing any kind of distributed synchronization for N processes. It involves an algorithm for reliably implementing an arbitrary state machine. The desired synchronization is achieved by using an appropriate state machine. The basic idea is that at every instant of time T, each process i issues a command Ci,T by broadcasting it to all processes. At time T + Δ, the commands C1,T, ..., CN,T are executed in order, and any "timeout" actions that are supposed to occur then are performed by executing a special timeaction command. To specify this precisely, we define a command to be a subroutine that has three arguments: a state, a time, and a process number.
(The timeaction command does not have a process number argument.) Executing a command may change the value of the state argument (so it involves a "call by reference") and may produce output. We regard the output as a message from the process to itself. The generation and execution of commands by process i is performed according to the following algorithm. We assume that this is begun at some initial values of clock and state.

Algorithm III:

  LOOP FOREVER
    FOR j := 1 UNTIL N DO execute Cj,clock-Δ[state, clock - Δ, j] OD;
    execute timeaction[state, clock - Δ];
    generate command Ci,clock and initiate its broadcast to all other processes;
    clock := clock + 1
  END LOOP

The incrementing of the clock (the last statement in the loop) is actually done by the passage of time, so the first three statements of the loop must be performed as if they were instantaneous atomic actions. These actions are all internal to the process, so there is no difficulty achieving the effect of instantaneous execution. An example given below shows how this can be implemented in practice. For now, we simply pretend that the actions are instantaneous.

The first two of these actions, which execute all the commands Cj,T for T = clock - Δ and the timeaction command, form the state machine's time T execution step. We let stateT denote the value of state immediately after the execution of the time T execution step. Thus, the time T execution step begins with state = stateT-1 and ends with state = stateT. When its clock strikes time T + Δ, each process does the following:

1. It executes the time T execution step of the state machine--executing every other process's command plus the timeaction command for time T, and computing stateT.
2. It initiates the broadcast of its command for time T + Δ.

We can prove the following result about this algorithm.

PROPOSITION 4. If the broadcasting mechanism satisfies condition IC, then Algorithm III satisfies the following conditions at any time T, where Δ' = Δ + δ:

--Every process i that has not failed by time T + Δ' on its clock will by then have executed the time T state machine step, using the command it generated at time T as the command Ci,T.
--Any two processes that have not failed by time T + Δ' execute the identical sequence of state machine execution steps through time T.

PROOF. Condition IC implies that a nonfaulty process receives all the commands Ci,T by time T + Δ, so it guarantees that a nonfaulty process knows the commands that it needs to execute the first action. If we were to take Algorithm III literally, with the execution of a state machine step being instantaneous, then a simple induction argument would prove the result with Δ' = Δ. However, we must take into account the time needed to perform the step. We can consider the execution of the time T state machine step to be the act of generating a message from the process to itself, initiated by the clock striking T + Δ. (The message is actually the output generated by the step, or NULL if no output is generated.) By assumption UC1, letting i = j, we see that the time needed to execute the step is at most δ, leading to the value Δ + δ for Δ'. □
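A compact Python rendering of one process's loop in Algorithm III may help fix ideas. The command table, broadcast interface, and tick source are assumptions of the sketch rather than part of the algorithm.

NULL = None

def null_command(state, T, j):
    # the NULL command does nothing
    pass

def run_state_machine(i, N, DELTA_CAP, clock_ticks, commands, timeaction,
                      my_command, broadcast, state):
    """Sketch of Algorithm III for process i.
    commands[(T, j)] -> command received from process j timestamped T (absent = NULL)
    timeaction(state, T) -> the state machine's time action
    my_command(T)        -> command this process issues at time T
    broadcast(T, cmd)    -> initiate the IC broadcast of cmd
    DELTA_CAP is the broadcast bound called Delta in the text."""
    for clock in clock_ticks:
        T = clock - DELTA_CAP
        # time T execution step: every process's command, executed in process order
        for j in range(1, N + 1):
            cmd = commands.get((T, j), NULL) or null_command
            cmd(state, T, j)
        timeaction(state, T)
        # issue and broadcast this process's own command for the current time
        broadcast(clock, my_command(clock))
    return state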
The reader should note that all the interprocess message delays are included in the Δ term, and come from the broadcasting of the commands; the extra delay δ represents an execution time within a single process.

With a nanosecond clock, Algorithm III performs a billion state machine execution steps per second. Just as broadcasting a billion messages per second is easy so long as most of them are NULL messages, performing a billion execution steps per second is easy so long as:

--The NULL command does nothing--i.e., it does not change the value of state or produce output.
--For any given value of state, executing timeaction[state, T] does nothing for almost all values of T.

In this case, almost all of those billion execution steps per second are performed by doing nothing.

5. EXAMPLES

5.1 A Resource Allocation Problem

We first consider a simple resource allocation problem used as an example in [6], extended to include fault-tolerance. The problem is to synchronize access to a shared resource by N processes so that only one process at a time can use it. Each process must issue a request for the resource, and wait until the request is granted before using the resource. The solution must satisfy the following three conditions.

1. A process that has been granted the resource must release it or fail before the resource can be granted to another process.
2. Different requests for the resource must be granted in the order in which they are made, unless they are made at approximately the same time.
3. If every process that is granted the resource eventually releases it or fails, then every request is eventually granted.

The first and third conditions were obtained by modifying the corresponding conditions in [6] in an obvious way to specify fault-tolerance. However, we interpret "before" to have the ordinary meaning of "occurring earlier in time", and not as the → relation defined in [6]. This is necessary because → is defined in terms of the sending and receiving of messages, and our method for sending information (NULL messages) without sending real messages eliminates too many messages for → to be a useful relation. With the stronger definition of "before", it is impossible to ensure that requests are granted in the order they are made. Thus, condition 2 requires this to be true only if the requests are made at sufficiently different times. In our algorithm "sufficiently different" will mean more than ε seconds apart, as measured by every process's clock.

In order to satisfy condition 3, it must be possible to discover if the process that has acquired the resource has failed. For simplicity, we assume that a nonfaulty process will release the resource within ω seconds of when it acquires it--where ω is some arbitrary constant. Condition 3 can then be satisfied by automatically releasing the resource ω seconds after a process has acquired it.
To apply Algorithm III to this problem, we define a state machine having two commands--request and release--plus the timeaction and NULL commands. The state consists of two components:

  state.queue--a queue of process numbers, initially empty.
  state.qtime--a time or the value ∞, initially equal to ∞.

The state.queue component has as its head the process that owns the resource, and as its tail the queue of waiting processes. The value of state.qtime is the time of the state machine step at which the resource was granted to its current owner, or ∞ if no process currently owns the resource. A request command issued by process i adds i to the end of the queue, and a release command issued by i deletes i from the head of the queue. (A process issuing a release command when it is not at the head of the queue must be faulty, and its command is ignored.) The timeaction command effectively executes a release command when a process has been at the head of the queue for too long. (The process must be given an extra Δ' seconds to allow for the time it took to be notified that it had acquired the resource.) When a new process reaches the head of the queue, thereby acquiring the resource, this is indicated by the appropriate output.

The precise definitions of the commands are given in Figure 1, where insert(item, queue) returns the new queue obtained by inserting item at the end of queue, tail(queue) returns the new queue obtained by deleting the head of queue, and head(queue) returns the head of queue.

  DEFINE request[state, T, i] TO BE:
    IF state.queue empty THEN state.qtime := T;
                              output "i granted resource" FI;
    state.queue := insert[i, state.queue]
  END DEFINITION

  DEFINE release[state, T, i] TO BE:
    IF head(state.queue) = i
      THEN state.queue := tail[state.queue];
           IF state.queue empty THEN state.qtime := ∞
             ELSE state.qtime := T;
                  output "head(state.queue) granted resource" FI FI
  END DEFINITION

  DEFINE timeaction[state, T] TO BE:
    IF T = state.qtime + ω + Δ'
      THEN state.queue := tail[state.queue];
           IF state.queue empty THEN state.qtime := ∞
             ELSE state.qtime := T;
                  output "head(state.queue) granted resource" FI FI
  END DEFINITION

  Fig. 1. Command definitions for resource allocation example.

Of course, the NULL command is always defined to be a "no-op"--i.e., by:

  DEFINE NULL[state, T, i] TO BE: SKIP END DEFINITION

Each process executes Algorithm III using this state machine. Process i owns the resource from the time it generates the "i granted resource" output until it either executes a release command or fails.
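For readers who prefer executable notation, the Figure 1 commands can be transcribed into Python roughly as follows. The State class, the print-based output, and the numeric values standing in for ω and Δ' are assumptions made for the sketch.

import math

OMEGA = 5.0        # assumed bound on how long a nonfaulty process holds the resource
DELTA_PRIME = 0.5  # assumed value of Delta' = Delta + delta

class State:
    def __init__(self):
        self.queue = []        # head of the queue owns the resource
        self.qtime = math.inf  # time the current owner was granted the resource

def output(msg):
    print(msg)                 # stands in for the state machine's output

def request(state, T, i):
    if not state.queue:
        state.qtime = T
        output(f"{i} granted resource")
    state.queue.append(i)

def release(state, T, i):
    if state.queue and state.queue[0] == i:
        state.queue.pop(0)
        if not state.queue:
            state.qtime = math.inf
        else:
            state.qtime = T
            output(f"{state.queue[0]} granted resource")

def timeaction(state, T):
    # forcibly release the resource if the owner has held it too long
    if T == state.qtime + OMEGA + DELTA_PRIME:
        state.queue.pop(0)
        if not state.queue:
            state.qtime = math.inf
        else:
            state.qtime = T
            output(f"{state.queue[0]} granted resource")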
Note that the state machine output is simply used to notify the process of the result of executing the state machine; it is not sent from one process to another.

Thus far, we have described our solution in terms of Algorithm III, which executes a state machine step every clock tick. We now show how it can be implemented by a more practical "interrupt-driven" program, in which a process performs an action only in response to one of the following events:

--The receipt of a message. We assume that all messages are placed in a common message buffer.
--A clock interrupt--generated by the clock reaching a specified value.
--The process's "desire" to request or release the resource.

In our algorithm, the receipt of a message timestamped T causes the setting of a clock interrupt for time T + Δ, which will cause the message to be processed at that time. The actions taken in response to clock interrupts and "desires" are described in Figure 2.

  1. Condition: (clock = T + Δ) and (there is a message in buffer timestamped T)
     Action: FOR j := 1 TO N DO
               IF "T:j-request" message in buffer
                 THEN remove message from buffer; execute request(state, T, j) FI;
               IF "T:j-release" message in buffer
                 THEN remove message from buffer; execute release(state, T, j) FI
             OD

  2. Condition: clock = state.qtime + ω + Δ' + Δ
     Action: delete head of state.queue;
             IF state.queue empty THEN state.qtime := ∞
               ELSE state.qtime := state.qtime + ω + Δ';
                    output "head(state.queue) granted resource" FI

  3. Condition: (wants to request resource) and (clock = T)
     Action: initiate broadcast of "T:i-request" message to all processes

  4. Condition: (wants to release resource) and (clock = T)
     Action: initiate broadcast of "T:i-release" message to all processes

  Fig. 2. Process i's program for the resource allocation algorithm.

Instead of thinking of the actions as initiated by events, we think of them as initiated by the holding of conditions. Thus, the first two actions, which represent clock interrupts, occur when the value of the clock satisfies certain conditions. Action 1 implements the execution of the commands Cj,T, and action 2 implements the timeaction command. Actions 3 and 4 implement the generation and broadcasting of the request and release commands. We assume that the condition of "wanting" to request or release a resource no longer holds after the action is executed. We also assume that if two different conditions hold at the same time, then the actions are performed in the indicated order.

We now prove the following result about this algorithm, where a "dead" process is one that sends no messages.

PROPOSITION 5. If

--a process can fail only by "dying",
--a nonfailed process will issue a release command within ω seconds (on its clock) of when it is notified that it has been granted the resource,

then the algorithm of Figure 2 satisfies correctness conditions 1-3.

PROOF. It is easy to check that the program implements Algorithm III for this particular state machine. Hence, by Proposition 4, every nonfailed process executes the identical time T state machine step by time T + Δ' on its clock. We can therefore prove that the program satisfies conditions 1-3 by the following simple reasoning about the behavior of the state machine.

1. To prove condition 1, assume that process i is granted the resource by the time T state machine step. No other process can be granted the resource until i is not at the head of state.queue. This can happen only by i executing a release command or by the timeaction command for the time T + ω + Δ' state machine step. The latter case can only occur if i did not release the resource by time T + ω + Δ'. Since i received notification by time T + Δ' that it had acquired the resource, by hypothesis, this means that i must have failed by that time.
2. Requests are granted in the order of the time at which they are issued, where that time is measured by the clock of the requesting process. Interpreting "approximately the same time" to mean "within ε seconds", condition 2 follows from Assumption UC2.
3. A request command places the requesting process at the tail of state.queue, and processes can only be removed from the head of that queue. Since the current head must be deleted within ω + Δ' seconds, every requesting process must eventually reach the head of the queue and acquire ownership of the resource. □
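To make the interrupt-driven structure of Figure 2 concrete, here is a minimal Python skeleton of its key mechanism--buffering a message timestamped T and scheduling a clock interrupt for T + Δ, at which point the buffered commands for time T are executed. The class and its interfaces are illustrative assumptions, not part of the algorithm.

import heapq

DELTA_CAP = 2.0  # assumed value of the broadcast bound Delta

class EventDrivenProcess:
    """Illustrative skeleton of the interrupt-driven style of Figure 2."""

    def __init__(self):
        self.buffer = []   # (timestamp, sender, command) messages received
        self.wakeups = []  # min-heap of scheduled clock-interrupt times

    def on_message(self, timestamp, sender, command):
        # Buffer the message and schedule a clock interrupt for timestamp + Delta,
        # when the time-`timestamp` state machine step can safely be executed.
        self.buffer.append((timestamp, sender, command))
        heapq.heappush(self.wakeups, timestamp + DELTA_CAP)

    def on_clock_interrupt(self, now):
        # Execute the time (now - Delta) step using the buffered commands.
        T = now - DELTA_CAP
        due = sorted(m for m in self.buffer if m[0] == T)
        self.buffer = [m for m in self.buffer if m[0] != T]
        for _, sender, command in due:
            print(f"executing {command} from process {sender} for time {T}")

if __name__ == "__main__":
    p = EventDrivenProcess()
    p.on_message(10.0, 1, "request")
    p.on_clock_interrupt(heapq.heappop(p.wakeups))  # fires at 10.0 + DELTA_CAP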
The event-driven program of Figure 2 therefore provides a very practical implementation of Algorithm III for this particular choice of a state machine--an algorithm that may execute a billion state machine steps per second. The implementation does this by doing nothing almost a billion times a second.

Thinking in terms of executing a state machine step at every clock tick provides a method for writing distributed synchronization programs. One first defines the state machine commands as in Figure 1, and then constructs an implementation by an event-driven program as in Figure 2. This implementation can exploit the details of the particular state machine. In our example we could have allowed some commands to be executed out of order--for example, it does not matter in which order a release and a request command are executed within a single state machine step. This is the same kind of refinement done by optimizing compilers--changing the way a program computes something without changing the results it produces.

It is instructive to compare this algorithm with the one given in [6]. In the latter algorithm, each message requesting the resource must be explicitly acknowledged. This acknowledgment is necessary for a time-independent asynchronous algorithm, since without it there would be no way for a requesting process to be sure that there are no concurrently issued conflicting requests.⁴ The present algorithm does not need the acknowledgment because process i knows that any requests conflicting with a request issued at time T must be received by time T + Δ. Thus, the nonoccurrence of an event (the receipt of a conflicting request) by a certain time conveys an important piece of information (the absence of any conflicting requests) that in an asynchronous algorithm can only be sent by an explicit message.

⁴ The use of timestamps makes it easy to "piggy-back" the acknowledgment on another message, but it must be sent.

Although the present solution uses fewer messages, it does so at the cost of taking longer to process a request. With the present solution, a process requesting the resource when it is free must wait Δ seconds before acquiring it. With the solution of [6], it just waits for all the acknowledgments, which will usually take less than Δ seconds if no process has failed. In Section 7, we show how our method can incorporate the use of explicit acknowledgments to speed response in the absence of failure. Of course, unlike the solution of [6], the present approach works even if there are failures.

It is easy to modify our solution to the resource allocation problem by changing the state machine definition. For example, although in principle our algorithm works in the presence of arbitrary failures, in practice it would be defeated by a failed process issuing a continual stream of request commands, since each command adds another element to state.queue and allows the process to seize the resource for an additional ω seconds. To prevent this, we simply change the definition of the request command so that it does nothing if the requesting process is already on the queue. It should be obvious how to change the formal definition of the request command to accomplish this; a sketch of such a guarded command appears after the list below.

This observation illustrates that there are two aspects to achieving fault-tolerant synchronization with our method.

--Using a fault-tolerant broadcasting algorithm to achieve reliable execution of the state machine.
--Designing the state machine to be tolerant of "bad" commands issued by faulty processes.
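As promised above, here is a sketch of the guarded request command, written as a drop-in replacement for the request command of the earlier Python sketch (the state layout and print-based output are the same assumptions):

def guarded_request(state, T, i):
    """Ignore a request from a process that is already waiting for or holding the
    resource, so a faulty process cannot monopolize it by flooding requests."""
    if i in state.queue:
        return
    if not state.queue:
        state.qtime = T
        print(f"{i} granted resource")  # stands in for the state machine's output
    state.queue.append(i)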
5.2. Further Examples

Our method works not only for solving the resource allocation problem, but for obtaining a fault-tolerant implementation of any desired form of synchronization in a distributed system. We now give some other examples.

Distributed Semaphore. To implement a distributed semaphore, one lets the state have two components--the value of the semaphore and a queue of waiting processes. The P and V commands are then defined in the obvious way. By using a sufficiently fault-tolerant broadcasting mechanism, one obtains a semaphore that functions correctly despite the faults of individual components.
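One plausible rendering of the P and V commands as state machine commands is sketched below in Python; the representation of the semaphore state and the print-based output are assumptions of the sketch.

class SemState:
    def __init__(self, initial=1):
        self.value = initial  # the semaphore's value
        self.waiting = []     # queue of processes blocked on P

def P(state, T, i):
    # acquire: decrement if positive, otherwise join the waiting queue
    if state.value > 0:
        state.value -= 1
        print(f"{i} passed P")   # stands in for the state machine's output
    else:
        state.waiting.append(i)

def V(state, T, i):
    # release: wake the longest-waiting process, or just increment the value
    if state.waiting:
        j = state.waiting.pop(0)
        print(f"{j} passed P")
    else:
        state.value += 1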
Replicated Database. Suppose that there are m data items D1, ..., Dm, which, for robustness, are replicated at each of the N processes. Each process wants to execute operations on these data items--for example, the "money transfer" operation of subtracting a quantity from one data item and adding it to another. The effect of executing all the processes' operations must be the same as if they were executed in some specific order, which should be approximately the order in which they were issued. (See [3] for a more detailed discussion of this problem.)

Using our method, we solve this problem by defining a state machine in which the state consists of the values of the data items, and the commands are the operations that the processes wish to perform. For example, the "money transfer" command is defined by:

  DEFINE transfer x from p to q [state, T, i] TO BE:
    IF state.Dp ≥ x THEN state.Dp := state.Dp - x;
                         state.Dq := state.Dq + x FI
  END DEFINITION

(In the terminology we have been using, this is a separate command for each triple of values (x, p, q).) Using a broadcast mechanism that satisfies Condition IC produces a very reliable database system. In this system, "atomicity" of the operations is maintained--each command is either performed or not performed. The failure of a process while a command is being executed cannot leave the other processes' versions of the database in an incorrect or inconsistent state.

Transaction Commit. A process wishes to issue a transaction complete request, whereupon every other process issues either a commit or abort request. If all processes issue a commit request, then the transaction must be committed; otherwise it must be aborted. A nonresponse from a failed process must be treated as an abort request.

To solve this problem, we let the state consist of a mapping from transaction identifiers to (time, status) pairs. For any identifier I, state(I).time is the time at which a transaction complete request was issued for transaction I, and state(I).status denotes the status of that request. A status is an N-tuple whose ith component equals "commit" or "undecided"--depending upon whether process i has issued a commit request or no request for that transaction. (A transaction complete request is considered to include an implicit commit request.) Initially, for any identifier I, state(I).time equals ∞. The commands are defined as shown in Figure 3.

  DEFINE transaction complete I [state, T, i] TO BE:
    state(I).time := T;
    state(I).status := ("undecided", ..., "undecided");
    state(I).status.i := "commit"
  END DEFINITION

  DEFINE commit I [state, T, i] TO BE:
    IF state(I).time ≠ ∞
      THEN state(I).status.i := "commit";
           IF state(I).status = ("commit", ..., "commit")
             THEN output "transaction I committed";
                  state(I).time := ∞ FI FI
  END DEFINITION

  DEFINE abort I [state, T, i] TO BE:
    IF state(I).time ≠ ∞ THEN state(I).time := ∞;
                              output "transaction I aborted" FI
  END DEFINITION

  DEFINE timeaction[state, T] TO BE:
    FOR ALL identifiers I DO
      IF state(I).time = T - Ω THEN state(I).time := ∞;
                                    output "transaction I aborted" FI OD
  END DEFINITION

  Fig. 3. The state machine commands for transaction commit.

In order for its response to a transaction complete request issued at time T to be counted, a process must respond by time T + Ω. The constant Ω must be chosen greater than Δ' to ensure that the process will have learned about the transaction request in time to issue its response. (A process can also issue an implicit abort request by failing to issue any command.) With this algorithm, the failure of any process means that all transactions will be aborted.

The problem can be generalized to allow a transaction to be committed despite some failures. For example, with a replicated database, we might want the transaction to be committed if each update can be performed to some copy of the data. By defining the appropriate state machine, it is easy to extend our solution to the more general problem.

Note that unlike more traditional transaction commit algorithms, our solutions have no "window of vulnerability"; processes and communication lines can fail at any time without delaying the system. Of course, we are assuming that our Byzantine Generals solution works, which requires that enough processes and communication lines keep working. In particular, handling a single arbitrary failure requires that each process be connected by at least three separate communication lines to the rest of the network.

Real-Time Process Control. A real-time process control system can often be designed as an iterative procedure, in which each step consists of first reading input from sensors and then generating output to actuators. For reliability, one wants the reading of input and generation of output to be performed by physically separate processors. To apply our method, each processor is represented by a process. A process's command consists of the input read by that process, and the state machine describes the algorithm for generating actuator output from the sensor inputs and the current system state. The interval between successive "clock ticks" is the repetition period for the successive read input/generate output steps--typically tens or hundreds of milliseconds. (These clocks will actually be the higher-order bits of finer-grained clocks that are maintained by the system.) The SIFT system [14] can be viewed as an elaboration of this basic idea.

Once again, we observe that there are two parts to the problem of achieving fault-tolerance:

--Reliably executing the state machine, so each nonfaulty process produces the same output.
--Designing a state machine that performs properly (generates the right actuator outputs) despite the presence of some bad inputs (incorrect sensor readings produced by faults in the sensors or in the processes reading them).

We hope that these examples will serve to substantiate our claim that the method can be used to implement any desired form of distributed synchronization. We urge the reader to try applying the method to his own favorite distributed synchronization problem.

6. FURTHER REFINEMENTS

A method as simple as ours cannot by itself provide practical solutions to all synchronization problems; it requires further refinement to handle the unique details of specific subclasses of problems. We now discuss three important refinements.

6.1. Restart

Proposition 4 says nothing about the behavior of a process after it fails. In most distributed systems, one must be able to restart failed processes without having to restart the entire system. The basic idea behind restarting a process is simple--in order for the restarted process to resume normal operation at time T, by time T + Δ it must be able to determine stateT-1. In order to determine stateT-1, it is sufficient for there to be some earlier time T' for which the process knows:

--stateT',
--all commands issued between times T' + 1 and T - 1.

The restarted process must obtain from other processes the information it needs for resuming execution of the state machine. How it does this will depend upon the type of failure assumed and the details of the particular state machine. If one assumes that the only kind of failure is simple "dying", then the restarted process can obtain the state information from a single other process. With arbitrary failures, it must use majority voting on information obtained from several sources.

If the state contains only a small amount of information, then the restarting process can easily obtain the value of the entire state. If there is a great deal of state information, then a process must periodically put a copy of the state (a "checkpoint") onto a stable storage medium, and also save a log of the commands that have been issued. (This log need not be on stable storage.) A restarting process can then learn the current state by obtaining the log of all commands issued since the last checkpoint recoverable from its stable storage. The problem of restarting failed processes is also discussed in [12].

To handle arbitrary failures, one must also consider the possibility that a process fails and restarts without any indication that it has failed--for example, due to a transient malfunction. In that case, the process might be quite happily executing the state machine using the wrong state, thereby producing incorrect output. There are two possible ways to handle this problem:

1. Have each process periodically take a "checksum" of its state and compare its result with that of the other processes.
2. Design the state machine to be "self-stabilizing", so that, for some value of τ, the state at any time T is completely determined by the input commands issued at times T - τ through T - 1.

How these approaches are implemented depends upon the details of the particular application, and will not be discussed.
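The checkpoint-plus-log idea described above can be sketched as follows (illustrative Python; the checkpoint format, the command log, and the timeaction interface are assumptions): the restarting process replays every logged command from the checkpoint time up to time T - 1.

import copy

def recover_state(checkpoint_state, checkpoint_time, log, T, timeaction, N):
    """Rebuild state_{T-1} from a checkpoint taken after the time `checkpoint_time`
    step and a log of all commands issued at later times.
    log[(t, j)] is the command process j issued at time t; omitted entries are NULL."""
    state = copy.deepcopy(checkpoint_state)
    for t in range(checkpoint_time + 1, T):
        for j in range(1, N + 1):
            cmd = log.get((t, j))
            if cmd is not None:      # NULL commands do nothing
                cmd(state, t, j)
        timeaction(state, t)
    return state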
6.2. Reconfiguration

We have been assuming that the network of processes is fixed. However, in some applications it is desirable to reconfigure the network to eliminate processes or include new ones. This is done by having reconfiguration controlled by the state machine. Formally, reconfiguration is performed by terminating the current N-process algorithm after execution of a time T state machine step, then starting a new version of the algorithm at time T + 1 using a different network of processes. The reconfiguration is initiated by a time T' state machine execution step, with T' < T - Δ', which produces output notifying the processes that the reconfiguration will take place after the time T step. To perform the reconfiguration, each process needs the following information.

--The new network of processes.
--The definition of the new state machine.
--The initial state of the new state machine.

As an example, we can modify the above transaction commit solution by allowing a process to eliminate itself from the network by executing a quit command. This command is defined so that when process j issues it at time T, it generates the output "eliminate j after time T + Δ'". This causes a reconfiguration in which process j is removed from the set of processes, the new state machine is the same as the old one except that state(I).status is an (N - 1)-tuple instead of an N-tuple, and the initial state of the new state machine is obtained from the final state (stateT) of the old one by deleting the jth component of each status N-tuple. (The processes also need to be renumbered.)

If the reconfiguration adds a process, then the information needed to execute the new state machine must be transmitted to that process by the processes in the original network. For the initial state information, this is the same problem encountered in restarting a failed process.

Note that changing the network of processes involves changing the broadcast algorithm. That is why performing a reconfiguration requires knowing the new network of processes, including the communication links, and not just the set of processes. For example, one can easily define the new state machine when reconfiguring the transaction commit solution to add a new process (numbered N + 1). However, defining the new broadcast algorithm requires knowing precisely where the new process lies in the total network of processes and communication links.

6.3. Explicit Acknowledgments

In Algorithm III, a process always waits until time T + Δ before executing the time T state machine step. The algorithm is easily modified to execute this step as soon as (1) the time T - 1 step has been executed and (2) all the commands Ci,T have been received. Condition IC guarantees that every command is received within Δ seconds of when it is broadcast, so the modified algorithm always executes the time T step by time T + Δ'. However, it will execute the step sooner if commands are received earlier.

For systems like the real-time process control example in which nonfaulty processes never issue NULL commands, no further modification of Algorithm III is needed. However, a NULL command broadcast by not sending any messages is detected only when Δ seconds have elapsed without the receipt of any messages, so it always takes Δ seconds to receive such a command. Hence, to execute the time T step before T + Δ', all commands issued before time T--including NULL commands--should be explicitly broadcast. There is no reason to hasten the execution of a state machine step that does nothing.
The time T step can perform an action only if one of the commands Ci,T is non-NULL, or if the timeaction command does something at time T. As soon as a process learns that this is the case, it can explicitly broadcast a single message containing all the relevant NULL commands. This is done as follows. Let the message [T', T]:m broadcast by process i denote that Ci,T = m and Ci,T'' = NULL for all T'' with T' < T'' < T; and let T(i) be the last time at which process i explicitly broadcast a command. If i discovers at time T that a non-NULL state machine step is to occur at some time T' with T(i) < T' ≤ T--for example, by receiving a command timestamped T' from another process--then it explicitly broadcasts the command [T(i), T]:Ci,T and sets T(i) equal to T.

Constructing this new algorithm is straightforward if we assume that processes and communication links can only fail by "dying". Without this assumption, care must be taken to handle the situation in which a faulty process issues a non-NULL command at time T' and then a [T'', T]:m command with T'' < T' < T. This requires modifying Algorithm II so that two different messages with the same timestamp cannot be sent (implicitly or explicitly) over the same path. We will not describe exactly how this is done, but simply assume the existence of the modified version of Algorithm III, which we call Algorithm IV. The method described in [7] may be viewed as a particular implementation of Algorithm IV.

The important thing to notice about Algorithm IV is that the earlier execution of the state machine step is achieved by using explicit acknowledgment messages. When the algorithm is applied to the resource allocation solution, receipt of a request timestamped T causes processes to broadcast acknowledgment messages of the form "[T', T]:NULL". However, unlike the acknowledgments in the solution of [6], which are sent only to the requesting process, these are broadcast to all processes. These extra messages are needed to prevent a process from sending a NULL command to one process and a non-NULL command to another, and are a necessary cost of achieving fault-tolerance.
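The bookkeeping behind the [T(i), T]:m messages can be sketched as follows; the broadcast interface and the interval representation are assumptions of this illustrative Python fragment.

NULL = None

class ExplicitBroadcaster:
    """Sketch of the explicit-acknowledgment rule used by Algorithm IV."""

    def __init__(self, broadcast):
        self.broadcast = broadcast  # broadcast(interval, command) initiates IC broadcast
        self.last_explicit = 0      # T(i): last time this process broadcast explicitly

    def on_nonnull_step(self, T, my_command_at_T):
        # Having learned (e.g., from another process's command) that some step up to
        # time T will be non-NULL, explicitly broadcast [T(i), T]:C_{i,T}, which also
        # asserts that this process's commands for the times in between were NULL.
        if T > self.last_explicit:
            self.broadcast((self.last_explicit, T), my_command_at_T)
            self.last_explicit = T

if __name__ == "__main__":
    b = ExplicitBroadcaster(lambda interval, cmd: print(f"broadcast {interval}: {cmd}"))
    b.on_nonnull_step(17, NULL)  # acknowledges NULL commands for the times up to 17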
7. REDUCING THE COST

Our state machine approach can be used to solve any distributed synchronization problem. With a sufficiently robust "Byzantine Generals" algorithm for broadcasting commands, one can achieve any desired degree of fault-tolerance. In this respect, our approach is superior to most previous work on distributed synchronization, which has given ad hoc solutions, providing only limited fault-tolerance, to specific problems. However, achieving fault-tolerance is expensive. In this section, we investigate the cost of these fault-tolerant solutions, and how it can be reduced.

7.1. What It Costs

Time. Almost all the time needed to perform an operation is spent in achieving the interactive consistency condition IC. It has been shown that even with a completely connected communication graph, achieving interactive consistency in the presence of f faults requires a worst-case delay of (f + 1)δ [4]. If n is the diameter of the communication graph--i.e., the smallest number such that any two processes are joined by a path of length at most n--then the minimum worst-case delay is probably (f + n)δ.

These worst-case delays seem to be necessary only with rather malicious failures. By making assumptions about how processes fail, one can reduce the delay Δ. For example, by assuming that processes can fail only by "dying", and that two processes will not both die within 2(δ + ε) seconds of one another, one can obtain a solution with Δ = 2(δ + ε).

Although the worst-case delay Δ may be quite large, some Byzantine Generals solutions have a delay much less than Δ when there are no failures. For example, the first algorithm of [11] requires only two message delays in the absence of failure. We conjecture that for an arbitrary communication graph, the expected delay when there is no failure can be made as small as n + 1 message delays, where n is the diameter of the communication graph. This is important for Algorithm IV, where it is the actual rather than the worst-case delay that determines response time.

Number of Messages. With a completely connected communication graph, the original Byzantine Generals algorithms of [11] require on the order of N^(f+1) messages to handle f failures. They can be implemented so that, in the absence of failure, they require about N^2 messages. An algorithm in [2] reduces the number of messages to O(fN) by adding an extra message delay, but requires the use of digital signatures. The algorithm of [10] requires O(f^2) messages without digital signatures, but doubles the amount of time required.

Amount of Replication. With our method, each of the N processes maintains the complete system state. This implies replicated storage and redundant processing. Reliability requires redundancy. If a system is to tolerate f process failures, then it must maintain at least f + 1 copies of its state information. It is shown in [11] that a Byzantine Generals solution to handle f arbitrary process failures requires more than 3f processes. Assuming digital signatures permits a solution with as few as f + 1 processes. If one assumes that processes and communication links can fail only by dying, then such a Byzantine Generals solution can be used to implement an (f + 1)-process system that will tolerate f failures. With more realistic assumptions, one probably needs at least 2f + 1 processes to handle f failures.

Clock Synchronization. In addition to the costs of executing Algorithm III, we must also include the cost of synchronizing the clocks that the algorithm depends upon. Real clocks do not run at precisely the same rates, so satisfying UC2 requires periodically resynchronizing the clocks. Synchronizing clocks seems to require that each process reliably broadcast its clock value to every other process, making the cost of clock synchronization the same as that of solving the Byzantine Generals Problem. Each of the known Byzantine Generals solutions has a related clock synchronization algorithm that uses the same number of processes to handle the same number of failures. These algorithms are described in [8].

The frequency with which the clocks need to be resynchronized depends upon the accuracy of the clocks and the maximum permitted difference ε. If the clocks are accurate to one part in 10^6, then they can drift apart at the rate of about one microsecond per second. For ε on the order of a millisecond, this implies that resynchronization must be performed several times per hour. As these figures indicate, the cost of clock synchronization should be negligible for systems with reasonably accurate clocks.
7.2. Reducing the Costs

There are two ways to reduce the cost of using our method to implement a system: (i) modify Algorithm III and (ii) choose a different state machine. We describe two ways of reducing the cost of Algorithm III, and then discuss the choice of state machine.

Reducing the Cost of Broadcasting Commands. Most of the cost of Algorithm III comes from the Byzantine Generals solution used to broadcast the commands. As we described above, there are certain fundamental costs required of any solution. The only way to reduce these costs is to reduce the degree of fault-tolerance--by either decreasing the number of faults or restricting the class of faults that can be handled. Reducing the cost by limiting the class of faults was discussed above, and invariably decreases the system's reliability. However, it is often possible to reduce the number of faults that can be handled without seriously affecting the overall system reliability.

The Byzantine Generals algorithms to handle f failures work even when all the failures occur simultaneously. In many cases, processes tend to fail one at a time, and multiple failures are highly improbable. One can then implement an N-process system that can tolerate a single failure. When a failure is detected, the system is reconfigured--as described in Section 6--to remove the failed process, forming an (N - 1)-process system that can again tolerate a single failure. This approach works so long as no failure occurs before the previous failure has been diagnosed and the reconfiguration accomplished. How faulty processes are diagnosed will depend upon the details of the underlying network of processors. Additional reliability is achieved by using a system that can tolerate two failures. This approach is used in the SIFT system [14].

The need for a Byzantine Generals solution in achieving fault-tolerance has not been widely recognized because most previous work considers only the kinds of failure that make the problem trivial. These previous solutions therefore appear much less costly than ours. However, by restricting ourselves to such simple failure modes, we can use an inexpensive solution to the Byzantine Generals Problem, obtaining a less costly implementation of Algorithm III. For example, if we assume a perfectly reliable communication medium having a broadcast facility--i.e., one solving the Byzantine Generals Problem for us--then the delay Δ becomes equal to δ + ε. In this case, as we will see in Section 8, our method may produce faster response than algorithms requiring explicit acknowledgment messages. Many published algorithms make exactly this assumption about the communication medium.

Executing Commands Sooner. Using Algorithm IV instead of Algorithm III can allow commands to be executed sooner, and in some cases will significantly improve the system response time in the absence of failures. The response times of the two algorithms are compared in Section 8. The response time of Algorithm IV is the same as that of Algorithm III if any process issues implicit NULL commands. Since this is exactly what a failed process is likely to do, Algorithm IV should be used with a reconfiguration scheme that removes failed processes.
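The reconfiguration idea described above can be sketched as follows. The class Configuration and the required-size rule (3f + 1 processes for f arbitrary failures, per [11]) are illustrative assumptions; the fault-diagnosis mechanism and the state machine command used to agree on the removal are assumed to exist and are not shown.

```python
# Illustrative sketch (not from the paper) of removing a diagnosed-faulty
# process from the active configuration, so that the remaining system can
# again tolerate a single failure.

class Configuration:
    def __init__(self, processes, required):
        self.active = set(processes)   # processes currently in the configuration
        self.required = required       # minimum number needed to tolerate one more failure

    def tolerates_a_failure(self):
        return len(self.active) >= self.required

    def remove_faulty(self, pid):
        # Invoked after all nonfaulty processes have agreed, via a state
        # machine command, that `pid` is faulty.
        self.active.discard(pid)
        if not self.tolerates_a_failure():
            print("warning: another failure can no longer be tolerated")

# Example: five processes, of which 3f + 1 = 4 are needed to handle f = 1
# arbitrary failure.  Removing one diagnosed-faulty process leaves a
# four-process system that can still tolerate a single failure.
config = Configuration({"p1", "p2", "p3", "p4", "p5"}, required=4)
config.remove_faulty("p3")
assert config.tolerates_a_failure()
```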
Choosing the State Machine--the Kernel Approach. Algorithm III gives a very reliable implementation of the functions performed by the state machine. To minimize the cost, the state machine should perform only those system functions that must be executed very reliably. Other functions should be performed by some separate mechanism. Thus, instead of using the state machine to control the entire system, one defines a state machine for a synchronizing kernel of the system. The kernel machine is chosen to minimize the following:

--The frequency with which nonNULL commands are issued.
--The amount of information in the system state.
--The number N of processes.

The use of a synchronizing kernel is illustrated with three examples.

Distributed File System. The replicated database solution given in Section 5 provides a simple way of implementing a distributed file system--we just let the data items be the individual files. However, this would be too expensive, for the following reasons.

--There is too much state information. The state contains the contents of each file, so every process must maintain a copy of every file, which requires too much storage.
--NonNULL commands are issued too frequently. Each read or write to a file requires a state machine command.

Although multiple copies of some critical files might need to be kept, most files probably need not be replicated. However, the directory of files would have to be reliably maintained--without it, no files could be accessed. One would therefore use Algorithm III to implement a synchronization kernel that maintains the directory. The directory would be part of the state of the kernel machine, and state machine commands would be used to find the location of the current version of a file and to update that information when a new version is created. The actual reading and writing of the files would be external to the kernel system. Thus, only the directory would be replicated, and state machine commands would be needed only for opening and closing a file, not for individual read and write operations.
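The following is a minimal sketch of such a directory kernel. The class DirectoryKernel and the "open"/"close" command names are hypothetical; what matters is that the entire kernel state is the directory, that every process executes the same commands in the same order (as delivered by Algorithm III), and that file contents themselves never pass through the kernel.

```python
# Illustrative sketch (not from the paper) of a kernel state machine whose
# entire state is the file directory.  File contents live outside the kernel
# and are not replicated.

class DirectoryKernel:
    def __init__(self):
        # Directory: file name -> (location of current version, version number).
        self.directory = {}

    def execute(self, command):
        # `command` is a state machine command agreed upon by all processes.
        kind, name = command[0], command[1]
        if kind == "open":
            # Find the location of the current version of the file.
            return self.directory.get(name)
        elif kind == "close":
            # A new version has been written somewhere outside the kernel;
            # record its location so later opens find it.
            location = command[2]
            _, version = self.directory.get(name, (None, 0))
            self.directory[name] = (location, version + 1)
            return self.directory[name]

# The actual read and write operations go directly to the recorded location
# and never involve the kernel machine.
kernel = DirectoryKernel()
kernel.execute(("close", "report.txt", "node7:/disk2/report.v1"))
print(kernel.execute(("open", "report.txt")))   # -> ('node7:/disk2/report.v1', 1)
```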
Transaction Commit. In the transaction commit algorithm described above, the amount of replication is determined by the total number of processes involved in the transaction. However, the amount of replication should be determined by the degree of fault-tolerance required. In many cases, less replication will suffice. Assume that N-fold replication is required for reliability and that there are N + M processes in all. Processes numbered 1 through N are designated "commit coordinators", and use Algorithm III to implement a "commit machine". Each of the M other processes is connected to one or more commit coordinators. Any commit coordinator can issue a transaction complete request, which is relayed by the commit coordinators to the other processes. Each process then sends a commit or abort request to one or more commit coordinators, who enter the request as a state machine command. The final "commit" or "abort" decision is relayed to the other processes by the commit coordinators. The state of the commit machine is the same as before, except that state(I).status is an (N + M)-tuple instead of an N-tuple. The transaction complete, abort, and timeaction commands are the same as before, except with a larger value of ~2. Instead of the simple commit command, a commit coordinator can issue a command to register a commit request for itself and a subset of the M other processes, causing the appropriate components of state(I).status to be set to "commit". The details are left to the reader.

In this example, the synchronizing kernel consists of the algorithm implemented by the N commit coordinators. All communication between them and the other M processes is external to the kernel system. We have effectively defined a standard two-phase commit protocol, except with a highly reliable distributed commit coordinator instead of a single central coordinator. This kernel solution involves three more message delays than the solution of Section 5--one to inform the other processes of the transaction complete request, one to send their commit and abort requests to the commit coordinators, and a third to inform them of the final decision. However, the number of messages has been reduced from O((N + M)^2) to O(N^2 + M) (in the absence of failure).

Fault Detection. There are very simple methods for achieving a limited degree of fault-tolerance that are based upon detecting when a process fails and taking appropriate corrective action. For example, one can rely upon a single controller process, and switch to a different controller if the current one fails. This does not provide a high degree of fault-tolerance, since a failure can cause incorrect behavior until corrective action is taken. However, this may be acceptable in some situations.

For these methods to work, there must be a reliable mechanism by which the nonfaulty processes agree upon which processes are faulty and what corrective action is to be taken. With a single controller process, it is crucial that all nonfaulty processes agree upon who the controller is. Our approach can be used to achieve this agreement. A process issues a state machine command whenever it detects a failure, and the state machine output indicates when a new controller is selected. The precise specification of the state machine will depend upon the mechanism by which processes recognize failures and the actions to be taken after a failure is discovered. Using our state machine approach ensures that the same decision is reached by all nonfaulty processes.

In this example, a highly fault-tolerant kernel is used to perform the most critical function--maintaining agreement. The kernel machine has a small amount of state information (the failure status of the processes), and commands are issued infrequently (only when a fault is detected or a process restarted).
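A minimal sketch of such a fault-detection kernel machine follows. The class FaultStatusMachine, the command names, and the rule "the controller is the lowest-numbered nonfaulty process" are illustrative assumptions; the point is only that the state is the failure status of the processes and that, because every nonfaulty process executes the same commands in the same order, all of them select the same controller.

```python
# Illustrative sketch (not from the paper) of a fault-detection kernel machine.

class FaultStatusMachine:
    def __init__(self, processes):
        self.failed = {p: False for p in processes}   # the entire kernel state

    def controller(self):
        # Deterministic rule (an assumption): the lowest-numbered nonfaulty process.
        working = [p for p, down in sorted(self.failed.items()) if not down]
        return working[0] if working else None

    def execute(self, command):
        kind, pid = command
        if kind == "failure_detected":
            self.failed[pid] = True
        elif kind == "process_restarted":
            self.failed[pid] = False
        return self.controller()    # output: who the controller now is

machine = FaultStatusMachine([1, 2, 3, 4])
print(machine.execute(("failure_detected", 1)))   # all processes switch to controller 2
```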
8. RESPONSE TIMES: TIME VERSUS TIMEOUT VERSUS ACKNOWLEDGMENTS

We now compare Algorithm III with more traditional methods that use timeouts instead of absolute times. We know of no method using timeout that is comparable to Algorithm III in its generality and degree of fault-tolerance. However, it is possible to draw some general conclusions about the comparative response times of the two methods.

Algorithm III uses clocks to avoid sending explicit acknowledgment messages. It is possible to do the same thing with timeouts, letting the absence of a response within a certain length of time denote a particular NULL response. This kind of algorithm will be called a "pure timeout" algorithm. A "traditional timeout" algorithm uses explicit acknowledgments, with a timeout occurring only when there is a failure. A traditional timeout algorithm will have the same response time as Algorithm IV when the response is generated by the receipt of acknowledgments and not by a timeout. When a failure prevents the receipt of the acknowledgments needed to generate the response, a traditional timeout algorithm has the same response time as a pure timeout one, and Algorithm IV has the same response time as Algorithm III. There are thus three response times that need to be compared: the response times for Algorithm III and for pure timeout algorithms, and the time required with acknowledgments in the absence of failure.

Recall that δ is the maximum value of the delay between the occurrence of an event and the delivery of a message generated by that event. We define δ_exp to be the expected value of this delay, and δ_var to be the difference between the maximum and minimum possible delays. Thus, an event at time T is processed and generates a message that arrives at another process between times T + δ - δ_var and T + δ, with an expected arrival time of T + δ_exp.

To make our comparison, we consider the resource allocation problem under the assumptions that there are no failures and the communication graph is completely connected. We then have Δ = δ + ε in Algorithm III. Suppose that at time T on its clock, process i issues a request for the resource. Furthermore, suppose that the resource is free and no other process wants to acquire it. With Algorithm III, process i learns by time T + δ + ε on its clock that there are no conflicting requests, so it can acquire the resource then. (In all algorithms, we ignore the time needed by a process to generate the response after it has the necessary information.) With a pure timeout algorithm, process i can set a timer and acquire the resource when a timeout occurs and no conflicting request has arrived. Assumption UC1 implies that the timeout can be set to occur at time T + 2δ on its clock. With explicit acknowledgments, process i can acquire the resource after receiving an acknowledgment from every other process. The expected time for this to occur is about 2δ_exp.

Comparing the three methods, we see that the times taken to acquire the resource are:

Algorithm III    Pure Timeout    Acknowledgment
δ + ε            2δ              2δ_exp

These times are for the resource allocation problem in a completely connected communication network with no failures, but they seem to reflect the general relation among the methods. All synchronization problems seem to require waiting for explicit or implicit acknowledgments, and relaying messages along paths--to handle failures or missing communication links--just causes all the delays to be multiplied by the path lengths. Of course, the delay for explicit acknowledgments is only meaningful when there is no failure.

The delay in sending and processing a message depends upon the system level of the processes being synchronized. For low-level processes having complete access to and control over the hardware, it equals the actual hardware delay in transmitting and processing a message. There tends to be relatively little variation in these delays, so in this case we have:

Low-level: δ_var << δ_exp ≈ δ.

For high-level processes, which must depend upon lower-level system services to transmit messages and to allocate processing resources, the delay includes a great deal of system overhead. There is a relatively large variation in the delays, so we have:

High-level: δ_exp < δ_var ≈ δ.

Within a single system, the value of δ can be two orders of magnitude greater for high-level processes than for low-level ones.
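To illustrate the low-level versus high-level distinction, the following plugs some hypothetical values into the three expressions above. All of the numbers are assumptions chosen only for illustration; only the expressions themselves come from the comparison above.

```python
# Illustrative numbers (assumptions, not from the paper) for the three
# failure-free response times in a completely connected network.

def response_times(delta, delta_exp, epsilon):
    return {
        "Algorithm III": delta + epsilon,   # wait out the worst case, plus clock skew
        "Pure timeout": 2 * delta,          # two worst-case delays
        "Acknowledgment": 2 * delta_exp,    # two expected delays
    }

# Low-level processes: little delay variation, good clock synchronization.
print(response_times(delta=1e-3, delta_exp=0.9e-3, epsilon=0.1e-3))
# Algorithm III (~1.1 ms) beats both the 2 ms timeout and the 1.8 ms acknowledgments.

# High-level processes: large variation, so delta_exp is well below delta,
# and epsilon is larger.
print(response_times(delta=100e-3, delta_exp=20e-3, epsilon=5e-3))
# Acknowledgments (~40 ms) now win, and Algorithm III (~105 ms) is about half
# the pure-timeout figure (200 ms), matching the summary given below.
```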
The value of ε depends upon how clock synchronization is performed. If clocks are synchronized by some mechanism completely external to the system--e.g., the radio signals broadcast by the National Bureau of Standards--then there need be no relation between ε and any other system parameters. However, if clock synchronization is done by sending messages over the communication links, then there is a relation between ε and δ_var. Let n be the diameter of the communication graph. Clock synchronization algorithms that can handle only simple types of failure yield values of ε approximately equal to nδ_var. The sophisticated algorithms of [8] that can handle f completely arbitrary failures, based upon the Byzantine Generals solutions, give a value for ε of about (f + n)δ_var.⁵

⁵ We assume here that clocks are accurate enough and resynchronized often enough so that differences in their rates of advance do not contribute significantly to ε.

These considerations lead to the conclusion that when response time is important, Algorithm III should be used only if clock synchronization is performed with low-level processes. This is not surprising, since one cannot accurately synchronize clocks using high-level system calls. When low-level clock synchronization is used, the results of our analysis can be summarized as follows.

--For synchronizing low-level processes: explicit acknowledgment algorithms offer no advantage; and Algorithm III provides better response time than timeout algorithms (except for very sparse communication graphs).
--For synchronizing high-level processes: explicit acknowledgment algorithms provide the fastest response (in the absence of failure); and Algorithm III responds about twice as fast as pure timeout algorithms.

9. CONCLUSION

We have described a general method for implementing any desired form of synchronization in a distributed system with any desired degree of fault-tolerance. The synchronization is specified in terms of a state machine, and Algorithm III is used to execute that state machine. The fault-tolerance of Algorithm III is obtained by the use of a solution to the Byzantine Generals problem.

Algorithm III is conceptually simplified by considering each process to issue a command on every clock tick, and executing all these commands in sequence. By using synchronized clocks that read absolute time, the algorithm can be implemented in such a way that most of the message transmissions and state machine executions are performed by doing nothing. In practice, each process executes an interrupt-driven program.

The idea of obtaining a general synchronization method by implementing an arbitrary state machine was given in [6]. There, the state machine worked in "logical time", and an explicit response from every other process was needed to advance a process's logical clock. Failure of any process prevented further progress, so no fault-tolerance was obtained. With Algorithm III, the operation of the state machine occurs in real clock time, which advances regardless of the actions of any other process. Hence, process failure does not impede system progress.

The use of clocks allows the elimination of acknowledgment messages. Instead of waiting for a response, the normal situation can become waiting for a nonresponse. However, this may increase the delay, because when waiting for a nonresponse a process always has to wait for the worst-case response time.
Acknowledgment messages can also be eliminated by using timeouts, but Algorithm III gives smaller delays when good clock synchronization mechanisms are employed.

We have given a number of examples to substantiate our claim that Algorithm III can be used to solve any synchronization problem, with any degree of fault-tolerance. It is difficult to compare these solutions with previously published ones. The Byzantine Generals algorithms that have been studied are highly fault-tolerant, and using them in Algorithm III makes our solutions much more costly than other solutions. However, all previous distributed synchronization algorithms we know of can be defeated by a single failure; they can at best tolerate a restricted class of failure. It is not meaningful to compare them with our much more fault-tolerant solutions. It is possible to reduce the cost of our solutions by using a less fault-tolerant Byzantine Generals algorithm. We believe that the resulting algorithms will be quite competitive with more traditional ones using timeouts.

By assuming a solution to the Byzantine Generals Problem, we have assumed that failures cannot partition the system into two noncommunicating components. This requires sufficient connectivity in the network of communication links. The possibility of the system becoming partitioned complicates the problem. Just stating the properties required of a solution is quite difficult, as can be seen from the discussion in [7].

Perhaps the most striking difference between algorithms obtained by our method and ones based upon timeout is that using timeout produces a traditional distributed algorithm in which the processes operate asynchronously, while our method produces a globally synchronous one in which every process does the same thing at (approximately) the same time. Our method seems to contradict the whole purpose of distributed processing, which is to permit different processes to operate independently and perform different functions. However, if a distributed system is really a single system, then the processes must be synchronized in some way. Conceptually, the easiest way to synchronize processes is to get them all to do the same thing at the same time. Therefore, our method is used to implement a kernel that performs the necessary synchronization--for example, making sure that two different processes do not try to modify a file at the same time. Processes might spend only a small fraction of their time executing the synchronizing kernel; the rest of the time, they can operate independently--e.g., accessing different files. This is an approach we have advocated even when fault-tolerance is not required [6]. The method's basic simplicity makes it easier to understand the precise properties of a system, which is crucial if one is to know just how fault-tolerant the system is.

REFERENCES

1. DOLEV, D. The Byzantine Generals strike again. J. Algorithms 3, 1 (1982), 14-30.
2. DOLEV, D., AND STRONG, H.R. Authenticated algorithms for Byzantine agreement. Rep. RJ3416 (40682), IBM Research Laboratory, San Jose, Calif., Mar. 1982. SIAM J. Comput. To be published.
3. ESWARAN, K.P., GRAY, J.N., LORIE, R.A., AND TRAIGER, I.L. The notions of consistency and predicate locks in a database system. Commun. ACM 19, 11 (Nov. 1976), 624-633.
4. FISCHER, M., AND LYNCH, N. A lower bound for the time to assure interactive consistency. Inf. Process. Lett. 14, 4 (1982), 183-186.
5. FISCHER, M.J., LYNCH, N.A., AND PATERSON, M.S. Impossibility of distributed consensus with one faulty process. Tech. Rep. MIT/LCS/TR-282, M.I.T., Cambridge, Mass., Sept. 1982.
6. LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.
7. LAMPORT, L. The implementation of reliable distributed multiprocess systems. Comput. Networks 2 (1978), 95-114.
8. LAMPORT, L., AND MELLIAR-SMITH, P.M. Synchronizing clocks in the presence of faults (submitted for publication).
9. LAMPORT, L., SHOSTAK, R., AND PEASE, M. The Byzantine Generals Problem. ACM Trans. Prog. Lang. Syst. 4, 3 (July 1982), 382-401.
10. LYNCH, N.A., FISCHER, M.J., AND FOWLER, R. A simple and efficient Byzantine Generals algorithm. In Proceedings of the 2nd Symposium on Reliability in Distributed Software and Database Systems (Pittsburgh, Pa., July 1982).
11. PEASE, M., SHOSTAK, R., AND LAMPORT, L. Reaching agreement in the presence of faults. J. ACM 27, 2 (Apr. 1980), 228-234.
12. SCHNEIDER, F.B. Synchronization in distributed programs. ACM Trans. Prog. Lang. Syst. 4, 2 (Apr. 1982), 125-148.
13. SCHNEIDER, F.B., GRIES, D., AND SCHLICHTING, R.D. Fault-tolerant broadcasts. Science of Computer Programming. To be published.
14. WENSLEY, J.H., LAMPORT, L., GOLDBERG, J., GREEN, S.W., LEVITT, K.N., MELLIAR-SMITH, P.M., SHOSTAK, R.E., WEINSTOCK, C.B., AND BERSON, D. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct. 1978), 1240-1254.

Received March 1982; revised December 1982; accepted April 1983