An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors

A Dissertation Presented to The Faculty of the Graduate School of Arts and Sciences, Brandeis University, Department of Computer Science

James S. Miller, advisor

In Partial Fulfillment of the Requirements of the Degree of Doctor of Philosophy

by Marc Feeley

April 1993

This dissertation, directed and approved by the candidate's committee, has been accepted and approved by the Graduate Faculty of Brandeis University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean, Graduate School of Arts and Sciences

Dissertation Committee:
Dr. James S. Miller (chair) (Digital Equipment Corporation)
Prof. Harry Mairson
Prof. Timothy Hickey
Prof. David Waltz
Dr. Robert H. Halstead, Jr. (Digital Equipment Corporation)

Copyright by Marc Feeley, 1993

Abstract

An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors

A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts, by Marc Feeley.

This thesis describes a high-performance implementation technique for Multilisp's "future" parallelism construct. This method addresses the non-uniform memory access (NUMA) problem inherent in large scale shared-memory multiprocessors. The technique is based on lazy task creation (LTC), a dynamic task partitioning mechanism that dramatically reduces the cost of task creation and consequently makes it possible to exploit fine grain parallelism. In LTC, idle processors get work to do by "stealing" tasks from other processors. A previously proposed implementation of LTC is the shared-memory (SM) protocol. The main disadvantage of the SM protocol is that it requires the stack to be cached suboptimally on cache-incoherent machines. This thesis proposes a new implementation technique for LTC that allows full caching of the stack: the message-passing (MP) protocol. Idle processors ask for work by sending "work request" messages to other processors. After receiving such a message a processor checks its private stack and task queue and sends back a task if one is available. The message-passing protocol has the added benefits of a lower task creation cost and simpler algorithms. Extensive experiments evaluate the performance of both protocols on large shared-memory multiprocessors: a 90 processor GP1000 and a 32 processor TC2000. The results show that the MP protocol is consistently better than the SM protocol. The difference in performance is as high as a factor of two when a cache is available and a factor of 1.2 when a cache is not available. In addition, the thesis shows that the semantics of the Multilisp language does not have to be impoverished to attain good performance. The laziness of LTC can be exploited to support at virtually no cost several programming features including: the Katz-Weise continuation semantics with legitimacy, dynamic scoping, and fairness.

Acknowledgements

This thesis is dedicated to my grandparents Rose and Emile Monna for the love I have for them.

I wish to thank my family, my friends, and colleagues without whom this thesis would not have been possible. Special thanks go to Jim Miller, my thesis advisor, for giving me the freedom to explore my ideas at my own pace. He has gone beyond the call of duty to see me through with my degree. Bert Halstead's words of encouragement gave me the confidence that my ideas were interesting and worth writing about. Thank you Bert.
Sabine Bergler deserves special thanks for taking care of me. To Chris, Mauricio, Harry, Emmanuel, Don, Shyam, Larry, Xiru, Mary and Paulo, thank you for making my stay at Brandeis so enjoyable. Finally, I wish to thank the National Science and Engineering Research Council of Canada and the Université de Montréal for financial support, and Michigan State University, Argonne National Laboratory, Lawrence Livermore National Laboratory, and the MIT AI Laboratory for the use of their computers.

Contents

1 Introduction
  1.1 Motivation
  1.2 Why Multilisp?
  1.3 Fundamental Issues
  1.4 Architecture
      1.4.1 Shared-Memory MIMD Computers
      1.4.2 Non-Uniform Memory Access
      1.4.3 Sharing Data
      1.4.4 Caches
      1.4.5 Memory Consistency
  1.5 The GP1000 and TC2000 Computers
  1.6 Memory Management
  1.7 Dynamic Partitioning
      1.7.1 Eager Task Creation
      1.7.2 Lazy Task Creation
  1.8 Overview

2 Background
  2.1 Scheme's Legacy
  2.2 First-Class Continuations
      2.2.1 Continuation Passing Style
      2.2.2 Programming with Continuations
  2.3 Multilisp's Model of Parallelism
      2.3.1 FUTURE and TOUCH
      2.3.2 Placeholders
      2.3.3 Spawning Trees
  2.4 Types of Parallelism
      2.4.1 Pipeline Parallelism
      2.4.2 Fork-Join Parallelism
      2.4.3 Divide and Conquer Parallelism
  2.5 Implementing Eager Task Creation
      2.5.1 The Work Queue
      2.5.2 FUTURE and TOUCH
      2.5.3 Scheme Encoding
      2.5.4 Chasing vs. No Chasing
      2.5.5 Critical Sections
      2.5.6 Centralized vs. Distributed Work Queue
  2.6 Fairness of Scheduling
  2.7 Dynamic Scoping
  2.8 Continuation Semantics
      2.8.1 Original Semantics
      2.8.2 MultiScheme Semantics
      2.8.3 Katz-Weise Continuations
      2.8.4 Katz-Weise Continuations with Legitimacy
      2.8.5 Implementing Legitimacy
      2.8.6 Speculation Barriers
      2.8.7 The Cost of Supporting Legitimacy
  2.9 Benchmark Programs
      2.9.1 abisort
      2.9.2 allpairs
      2.9.3 fib
      2.9.4 mm
      2.9.5 mst
      2.9.6 poly
      2.9.7 qsort
      2.9.8 queens
      2.9.9 rantree
      2.9.10 scan
      2.9.11 sum
      2.9.12 tridiag
  2.10 The Performance of ETC

3 Lazy Task Creation
  3.1 Overview of LTC Scheduling
      3.1.1 Task Stealing Behavior
      3.1.2 Task Suspension Behavior
  3.2 Continuations for Futures
      3.2.1 Procedure Calling Convention
      3.2.2 Unlimited Extent Continuations
      3.2.3 Continuation Heapification
      3.2.4 Parsing Continuations
      3.2.5 Implementing First-Class Continuations
  3.3 The LTC Mechanism
      3.3.1 The Lazy Task Queue
      3.3.2 Pushing and Popping Lazy Tasks
      3.3.3 Stealing Lazy Tasks
      3.3.4 The Dynamic Environment Queue
      3.3.5 The Problem of Overflow
      3.3.6 The Heavyweight Task Queue
      3.3.7 Supporting Weaker Continuation Semantics
  3.4 Synchronizing Access to the Task Stack
  3.5 The Shared-Memory Protocol
      3.5.1 Avoiding Hardware Locks
      3.5.2 Cost of a Future on GP1000
  3.6 Impact of Memory Hierarchy on Performance
  3.7 The Message-Passing Protocol
      3.7.1 Really Lazy Task Creation
      3.7.2 Communicating Steal Requests
      3.7.3 Potential Problems with the MP Protocol
  3.8 Code Generated for SM and MP Protocols
  3.9 Summary

4 Polling Efficiently
  4.1 The Problem of Procedure Calls
      4.1.1 Code Structure
      4.1.2 Call-Return Polling
  4.2 Short Lived Procedures
  4.3 Balanced Polling
      4.3.1 Subproblem Calls
      4.3.2 Reduction Calls
      4.3.3 Minimal Polling
  4.4 Handling Join Points
  4.5 Polling in Gambit
  4.6 Results
  4.7 Summary

5 Experiments
  5.1 Experimental Setting
  5.2 Overhead of Exposing Parallelism
      5.2.1 Overhead on GP1000
      5.2.2 Overhead on TC2000
  5.3 Speedup Characteristics
      5.3.1 Speedup on GP1000
      5.3.2 Speedup on TC2000
  5.4 Effect of Interrupt Latency
  5.5 Cost of Supporting Legitimacy
  5.6 Summary

6 Conclusion
  6.1 Future Work

A Source Code for Parallel Benchmarks
  A.1 abisort
  A.2 allpairs
  A.3 fib
  A.4 mm
  A.5 mst
  A.6 poly
  A.7 qsort
  A.8 queens
  A.9 rantree
  A.10 scan
  A.11 sum
  A.12 tridiag

B Execution Profiles for Parallel Benchmarks
  B.1 abisort
  B.2 allpairs
  B.3 fib
  B.4 mm
  B.5 mst
  B.6 poly
  B.7 qsort
  B.8 queens
  B.9 rantree
  B.10 scan
  B.11 sum
  B.12 tridiag

List of Tables

1.1 Costs of memory hierarchy for the GP1000 and the TC2000.
2.1 Characteristics of parallel benchmark programs running on GP1000.
3.1 Size of closure for each future in the benchmark programs.
3.2 Cost of operations involved in task stealing.
3.3 Measurements of memory access behavior of benchmark programs.
4.1 Overhead of polling methods on GP1000.
5.1 Performance of SM protocol on GP1000.
5.2 Performance of MP protocol on GP1000.
5.3 Performance of SM protocol on TC2000.
5.4 Performance of MP protocol on TC2000.
5.5 Performance of MP protocol on GP1000 with I = 2.
5.6 Performance of MP protocol on GP1000 with I = 50.
5.7 Overhead of supporting legitimacy, with and without speculation barrier, on GP1000.

List of Figures

1.1 The shared-memory MIMD computer used in this thesis.
2.1 Non-local exit using call/cc.
2.2 Parallel map definition and spawning trees.
2.3 Parallel "vector" map.
2.4 Scheme encoding of Multilisp core.
2.5 Procedures needed to support Multilisp core.
2.6 Exception system based on dynamic scoping and call/cc.
2.7 Implementation of dynamic scoping with tail recursive call/cc.
2.8 MultiScheme's implementation of the future special form.
2.9 A sample use of futures and call/cc.
2.10 A future body's continuation called multiple times.
2.11 Exception processing with futures.
2.12 The Katz-Weise implementation of futures.
2.13 An application of speculation barriers.
2.14 Fork-join algorithms and their legitimacy chain in the absence of chain collapsing.
2.15 General case of legitimacy chain collapsing for fork-join algorithms.
2.16 Fib and a poor variant obtained by unrolling the recursion.
3.1 The task stack.
3.2 Continuation representation and operations.
3.3 Underflow and heapification algorithms.
3.4 Resuming a heavyweight task.
3.5 The LTQ and the steal operation.
3.6 The task stealing mechanism.
3.7 The implementation of dyn-bind.
3.8 The DEQ and its use in recovering a stolen task's dynamic environment.
3.9 Code sequence for a future under the SM protocol.
3.10 Thief side of the SM protocol.
3.11 Victim side of the SM protocol.
3.12 Relative importance of stack and heap accesses of benchmark programs.
3.13 Thief side of the MP protocol.
3.14 Victim side of the MP protocol.
3.15 Assembly code generated for fib.
4.1 The for-each procedure and its corresponding code graph.
4.2 Two instances of short lived procedures.
4.3 The maximal delta method.
4.4 Procedure return invariants in balanced polling.
4.5 Compilation rules for balanced polling.
4.6 Minimal polling for the recursive procedure sum and a tail recursive variant.
5.1 Speedup curves for fib, queens, rantree and mm on GP1000.
5.2 Speedup curves for scan, sum, tridiag and allpairs on GP1000.
5.3 Speedup curves for abisort, mst, qsort and poly on GP1000.
5.4 Speedup curves for fib, queens, rantree and mm on TC2000.
5.5 Speedup curves for scan, sum, tridiag and allpairs on TC2000.
5.6 Speedup curves for abisort, mst, qsort and poly on TC2000.
5.7 Task creation behavior of MP protocol on GP1000.
5.8 Task suspension behavior of MP protocol on GP1000.

Chapter 1

Introduction

This work is about the design of an efficient implementation strategy for Multilisp's "future" parallelism construct on large shared-memory multiprocessors. A strategy known as "lazy task creation" is used as a starting point for this work.
Two implementations of lazy task creation, one based on a shared-memory paradigm and the other based on a message-passing paradigm, are explained and compared by extensive experiments with a large number of benchmarks. The result can be summarized as follows: an implementation of lazy task creation based on a message-passing paradigm is superior to one based on a shared-memory paradigm because it is

- simpler to implement,
- more flexible, and
- more efficient in nearly all situations, because it allows full caching of the stack on machines that lack coherent caches (the difference in performance is as much as a factor of two on the TC2000 multiprocessor).

In addition, this work shows how to efficiently implement two important language features in the presence of futures: dynamic scoping and first-class continuations. An efficient polling method designed to support message passing is also described and evaluated. This thesis provides a detailed account of this result.

1.1 Motivation

As applications become bigger and more demanding, it is hard to resist the seductive qualities associated with parallel processing. All too often, however, application writers are disillusioned when they discover that their carefully rewritten application running on a parallel computer is barely faster, if not slower, than it was when running on a cheaper uniprocessor machine. Poor performance can be caused by a combination of factors. The degree of parallelism in the algorithms is one of the most important factors because it puts a strict upper bound on the performance achievable by the program. Some algorithms have a limited amount of parallelism and thus it is not possible to increase performance beyond a certain size of machine. Moreover, even algorithms that scale up well with the size of the machine, i.e. yield a speedup roughly equal to the number of processors, may still have poor absolute performance if the parallel algorithm's "hidden constant" is large when compared to a sequential algorithm.

Another factor is the "technological lag" from which the hardware of parallel machines often suffers. This is due to the smaller market and longer design times of parallel machines when compared to mainstream uniprocessor machines. This lag can be expected to decrease as parallel systems become more common.

The importance of these two factors can be minimized to some extent by careful algorithm design and coding and the use of state of the art hardware. However, there still remains another hurdle to overcome: the inherent inefficiency of the language implementation. Clearly, the language features needed to support parallelism must be implemented well to exploit the concurrency available in the application. It is just as important, however, for the sequential constructs to be efficient since they account for a high proportion of a program's code. There is little incentive to use a parallel machine with 10 processors if the implementation runs sequential programs on one processor 10 times slower than when a non-parallel language is used. This explains the lack of popularity of interpreter based implementations of Multilisp, which run purely sequential code much slower than compiler based implementations of Lisp. Interestingly, the language implementations with poor absolute performance usually have excellent relative performance (i.e. self-relative speedup).
This is because the aspects of the system that are critical to performance, such as memory latency and task spawning costs, are masked by the huge overhead of interpretation (usually a factor of 10 to 100 times slower than compiled code).

Absolute performance is a major concern in this thesis. For this reason, the Multilisp implementation techniques proposed here are evaluated in the context of a "production quality" implementation. To perform experiments, a highly efficient Scheme compiler called Gambit [Feeley and Miller, 1990] is used as a platform into which the implementation techniques are integrated and tested. This is to ensure that the setting is realistic and that performance-critical issues are not overlooked. Typically the code generated by Gambit for sequential programs is only about 20 percent slower (but sometimes faster) than code generated by optimizing C compilers for equivalent C programs. Multilisp is a sufficiently general programming language to be considered as a substitute for conventional languages for many sequential programming tasks. The results of this thesis will make it even more attractive to choose Multilisp over other languages since it also allows efficient parallel programming.

1.2 Why Multilisp?

Supercomputers have traditionally been employed for scientific purposes, so it isn't surprising that numerical applications have been the focus of most of the parallel processing research. However, the need for high performance is no longer bound exclusively to scientific applications as time-consuming symbolic applications become more widespread. These include applications such as expert systems, databases, simulation, typesetting, compilation, CAD systems and user interfaces. The growing need for high-performance parallel symbolic processing systems is the initial motivation for this work.

Multilisp suggests itself naturally since it is a member of the Lisp family of symbolic processing languages. It was designed by Halstead [Halstead, 1984] as an extension of Scheme with a few additional constructs to deal with parallelism. The most important of these is the future special form, whose origin can be traced back to [Baker and Hewitt, 1978]. From its inception, the purpose of Multilisp has been to provide a testbed for experimentation in the design and implementation of parallel symbolic processing systems. Through the years it has evolved along several distinct paths to accommodate novel uses of the language. The first implementation of Multilisp was "Concert Multilisp", which ran on a custom designed multiprocessor [Halstead, 1987, Halstead et al., 1986]. Multilisp's model of parallel computation has become increasingly popular and some of its features have now been adopted by other parallel Lisp systems. This includes both academic research systems such as QLisp [Gabriel and McCarthy, 1984, Goldman and Gabriel, 1988], MultiScheme [Miller, 1987, Miller, 1988], Mul-T [Kranz et al., 1989], Gambit [Feeley and Miller, 1990], PaiLisp [Ito and Matsui, 1990], Spur Lisp [Zorn et al., 1988], Butterfly portable standard lisp [Swanson et al., 1988] and Concurrent Scheme [Kessler and Swanson, 1990, Kessler et al., 1992], as well as commercially available systems such as BBN Lisp [Steinberg et al., 1986], Allegro Common Lisp [Fra, 1990], and Top Level Common Lisp [Murray, 1990]. The future construct is actually quite general and it has been used in more conventional languages such as C [Callahan and Smith, 1989].
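To give a flavor of the construct before it is treated in detail in Chapter 2, here is a small hypothetical sketch in the style of the fib benchmark (the procedure name pfib is made up, and the code assumes a Multilisp-style system providing future and touch; it is not code taken from this thesis):

    ;; (future expr) immediately returns a placeholder and creates a task
    ;; to evaluate expr; (touch x) waits until the placeholder is
    ;; determined and returns its value.
    (define (pfib n)
      (if (< n 2)
          n
          (let ((left  (future (pfib (- n 1))))   ; evaluated by a spawned task
                (right (pfib (- n 2))))           ; evaluated by the parent
            (+ (touch left) right))))

The parent continues past the future without waiting, and only synchronizes when it touches the placeholder.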
1.3 Fundamental Issues

Assuming that speed of computation is the main objective, the job of a Multilisp implementor can be seen as an optimization problem constrained by three factors:

1. The semantics of the language.
2. The characteristics of the target machine.
3. The expected use of the system (i.e. applications).

Each instance of these factors defines a particular implementation context. It is the task of the designer to devise the most efficient implementation strategies that correctly realize the given language semantics on the target machine. It is also important to consider the target applications because it is through these that the features of the system that are most critical for high performance can be identified. They also form the ultimate measure of success of an implementation as a whole.

To explore the entire spectrum of implementation contexts for Multilisp would be a daunting task well beyond the scope of this work. Rather, contexts that are most likely to be useful in the present or the near future are examined. Emphasis is put on language features, multiprocessor architectures and programming styles that have acquired some popularity. The semantics of Multilisp and applications are discussed in greater depth in Chapter 2.

1.4 Architecture

Inherent limitations of the target machine are inevitable facts of life for the implementor of any language. To adequately address the issue of performance it is crucial to determine the salient features and weaknesses of the target architecture. This is especially true for parallel machines because of the vast disparity in parallel architectures.

1.4.1 Shared-Memory MIMD Computers

The multiple instruction stream, multiple data stream (MIMD) shared-memory multiprocessor computer is used as the target architecture for this work. This choice is fueled by, on the one hand, the popularity and availability of these machines and, on the other, the similarity with the programming model adopted by Multilisp. There are two major architectural requirements imposed by Multilisp. The first is the possibility for processors to act independently from one another. This is needed because Multilisp expresses parallelism through control parallelism, that is, it is possible to express concurrency between heterogeneous computations. Separate instruction streams operating on separate data are thus needed to execute these computations in parallel. The second requirement is the existence of a shared memory. In Multilisp, as in most other Lisps, all objects exist in a single address space that is visible to all parts of the program. There are no a priori restrictions on which procedures or tasks can access a given object.

The shared-memory architecture has been severely criticized by some. The most important objection is that the cost of accessing the shared memory must grow with the size of the machine. Thus, large machines will suffer from high latencies for references to shared memory. This fact is duly acknowledged but must be put in perspective. Programs which offer a limited amount of parallelism only need to be run on machines whose size matches that parallelism. Secondly, the existence of a shared memory does not imply that the programs make heavy use of it. Message-passing paradigms can easily and efficiently be implemented on top of a shared memory (for example, see [LeBlanc and Markatos, 1992]).
However, implementing shared memory on conventional message-passing machines is impractical because shared-memory operations are usually fine grained whereas message-passing operations are typically optimized to manipulate large chunks of data. Programs with irregular and dynamically changing communication patterns have a legitimate need for shared memory. These programs are often found in symbolic processing applications which need to traverse linked data structures such as lists, trees, and graphs. Implementing these programs on a message-passing machine would be prohibitively expensive. Finally, it is expected that scalable caching techniques will hide the high latencies of large shared memory to some extent. Caching issues are explored later in this chapter.

Figure 1.1: The shared-memory MIMD computer used in this thesis. (Each processing node contains a processor, a cache, a private memory and a shared memory; the nodes communicate through an interconnection network.)

1.4.2 Non-Uniform Memory Access

The model of the shared-memory MIMD architecture used in this thesis is shown in Figure 1.1. A machine is composed of a number of processing nodes, each of which has a processor and three forms of memory: cache memory, private memory and shared memory. Each processor has direct access to its own private and shared memory (i.e. local memory) and, through the use of the interconnection network, has access to the shared memory of other processors (i.e. remote memory). The shared memory is physically distributed across the machine while private memory is only visible to its associated processor. This is a non-uniform memory access (NUMA) architecture because the cost of memory references is not constant. The cost depends on the type of memory being referenced and its distance from the processor. A reference to the cache is thus cheaper than a reference to local memory, which in turn is cheaper than a reference to remote memory. The NUMA model is interesting because it reflects realistic properties of the architecture, as explained next.

1.4.3 Sharing Data

An important characteristic of data is the extent to which it must be shared. The following classification will be used for the different types of data:

- Private data is data that does not need to be communicated to other processors. A simple example of private data is temporary values which are produced and used by the same program section.

- Single writer shared data is accessible to more than one processor but it is only mutated by a distinguished processor, the owner of the data.

- Multiple writer shared data is accessible to more than one processor and can be mutated by any of these processors.

These types of data have different storage requirements. Private data is the least restrictive (it could reside in the same storage as shared data) and multiple writer shared data is the most restrictive. These differences are a source of optimization for the architecture, which can implement each type in a different way (and at a different cost). Thus, computers are often designed with various forms of private storage. Since a processor has exclusive access to this storage it can be implemented efficiently because there is no need for an arbitration mechanism or multiple data paths. The processor's registers are an extreme instance of private storage. Shared data is more expensive because it must be stored in a location that is accessible to all processors. Single writer and multiple writer shared data are distinguished because they offer different caching possibilities.
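As a concrete illustration in Multilisp terms (the fragments below are made up for this purpose; sum-of-squares, record-hit! and the vector bound to data are not part of the benchmark programs):

    ;; Private data: loop variables that never escape the task computing them.
    (define (sum-of-squares v)
      (let loop ((i 0) (acc 0))                     ; i and acc are private
        (if (= i (vector-length v))
            acc
            (loop (+ i 1)
                  (+ acc (* (vector-ref v i) (vector-ref v i)))))))

    ;; Single writer shared data: the placeholder returned by a future is
    ;; written once, by the task evaluating the body, and read by any task
    ;; that touches it.
    (define answer (future (sum-of-squares data)))  ; data: some shared vector

    ;; Multiple writer shared data: a counter that any task may mutate.
    (define hits 0)
    (define (record-hit!) (set! hits (+ hits 1)))   ; accesses must be arbitrated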
1.4.4 Caches

Caches are a well known mechanism to enhance the performance of memory. A property shared by almost all programs is that memory references are unevenly distributed: a large proportion of all references are to a small proportion of the data. This observation has led to the design of multilevel memory systems. The idea is to place frequently accessed data in a fast memory, a cache, in order to reduce the average time needed for a reference. If the cache is large enough and the application's reference pattern is well behaved then the cache will service most of the references. A memory hierarchy can have several levels of caches but only a single one will be considered here.

Caches are quickly becoming a necessity to fully harness the power of modern processors. Current RISC processors have a cycle time that is much smaller than that of the fastest memory chips. Processors with a 1 nanosecond cycle time will soon be available but it is unlikely that the speed of large RAM chips will ever be close to that of the processor (for example, DRAM chips currently have a 25 nanosecond cycle time at best). Cache memories are much faster than main memory because, due to their small size, they can be put on the same chip as the processor (or at least close to it) and it is permissible to use faster circuitry even if it is more expensive. The speed difference between these two types of memories varies from system to system but it is not uncommon for cache memory to be 5 to 20 times faster than main memory. Clearly, it is a good idea to design a system so that it maximizes cache usage. The benefits of caching on a range of programs are explored further in Chapter 3.

An important feature of caches is that they operate automatically. The programmer does not have to explicitly state where a particular piece of data should go. The accesses to memory are monitored and a copy of the frequently accessed data is kept in the cache. The first reference to a piece of data that is not in the cache (i.e. a cache miss) actually references the memory, but subsequent references are potentially much faster because a copy has been put in the cache. When space is needed in the cache, older pieces of data are selectively purged from the cache according to a particular replacement policy (e.g. random or least-recently used (LRU)).

The performance of a cache depends on h, the probability of a cache hit (also called the hit rate), and L_cache and L_main, the latency of an access to the cache and to main memory respectively. The average access latency L_mem is given by

    L_mem = h * L_cache + (1 - h) * L_main

Clearly, a high hit rate is advantageous since a value near one makes it appear as though the memory can respond at the speed of the cache. There are many ways to improve the hit rate. The size of the cache can be increased; given the high cost of cache memory this may be a cost effective solution only up to a certain point. Another technique is to reorganize the program so that references to a particular datum are closer in time. The probability of a datum being resident in the cache is higher if it has been referenced recently (and even more so if LRU replacement is used). Finally, it is sometimes preferable to disable the caching of data whose referencing pattern is such that it does not gain much by caching. Caching such data is detrimental because it causes the frequently used data to be purged from the cache, thus decreasing the hit rate.

Two caching strategies have been popular in uniprocessor computers: copy-back and write-through caching. These strategies differ in how writes to memory are handled.

- Copy-back caching handles a write by only modifying the copy in the cache. The memory will eventually receive the correct value when the datum is purged from the cache after a cache miss (this is called a writeback). The expense of writes is thus attributed to cache misses. If there are very few cache misses, writes to memory are essentially the same cost as reads.

- Write-through caching bypasses the cache and performs the write to main memory. However, the state of the cache is modified to reflect the new content of memory. If the address being written to is resident in the cache it is simply updated. Otherwise, the datum is added to the cache, most probably causing an entry to be purged. (The datum could also be disregarded, i.e. not entered in the cache. This might be preferable for applications which rarely read the locations recently written to, such as when initializing or updating a large data structure.)

In addition to h, L_cache and L_main, the performance of write-through caching depends on the read ratio r (the proportion of all memory references which are reads). The average access latency for write-through caching is thus

    L_mem = r * (h * L_cache + (1 - h) * L_main) + (1 - r) * L_main
          = r * h * L_cache + (1 - r * h) * L_main

Note that here h is the hit rate for reads only. The two caching methods have the same performance when r = 1 but write-through caching quickly degrades as the number of writes increases.
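As an illustration, take the TC2000 latencies given later in Table 1.1 (.150 microseconds for the cache, .575 microseconds for local memory) and assume, purely for the sake of the example, a hit rate of h = 0.95 and a read ratio of r = 0.75 (both figures are hypothetical):

    copy-back:      L_mem = 0.95 * 0.150 + 0.05 * 0.575       = 0.17 microseconds (approx.)
    write-through:  L_mem = 0.7125 * 0.150 + 0.2875 * 0.575   = 0.27 microseconds (approx.)

Even with a high hit rate, write-through caching pays the main memory latency on every write, which is why it degrades as the proportion of writes grows.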
1.4.5 Memory Consistency

The notion of a single monolithic shared memory is a convenient abstraction to write and reason about programs. However, caching, if not done properly, may violate this abstraction because memory consistency between processors is not preserved. For private data there is no consistency problem caused by caching since all references go through the cache. For single writer shared data it is possible to maintain consistency by using write-through caching. The processor owning the data uses write-through caching and the readers disable the caching of the data. Consistency is preserved because the memory always has the correct value for the datum and the readers always access the memory when they reference the datum (of course, this means that only the owner of the data benefits from the cache).

Unfortunately, write-through caching by itself is not sufficiently powerful to maintain consistency for multiple writer shared data. The problem is that the perception of the memory state can differ from processor to processor if each one has cached the same datum in its own cache and mutated it in a different way. For example, under copy-back and write-through caching, when two processors A and B read variable x, a copy of x will exist in A's cache and another in B's. If A then mutates x, B still believes that x has the original value.

There are two approaches to the memory consistency problem. The first is to put the responsibility for consistency on the programmer or compiler by providing a less rigid consistency model. At appropriate points in the program special operations must be added to flush or invalidate some of the entries in the caches. In the terminology of [Gharachorloo et al., 1991], the strictest consistency model is sequential consistency.
In this model, memory behaves as though only one access is serviced at a time (i.e. accesses are sequential). Thus any read request returns the last value written. In processor consistency, writes can be delayed an arbitrary (but finite) amount of time as long as the writes from any given processor are performed in the same order as they were issued by that processor (there are no ordering restrictions between processors). This model can be implemented more efficiently than sequential consistency because it allows some form of pipelining and caching of the writes. Machines implementing processor consistency usually have a "write barrier" instruction which waits until the memory has processed all of that processor's writes. The weak consistency and release consistency models [Dubois and Scheurich, 1990] are still weaker and more efficient. They guarantee consistency only at synchronization points in the program. In other words, lock and unlock operations (or similar synchronization operations) are barriers which wait until the memory has processed all pending transactions. In these models, reads and writes can be buffered between synchronization operations.

An orthogonal approach to the consistency problem is to design specialized hardware that maintains consistency between the caches and memory. In the previous example, this would mean that when A mutates x, the new value for x is written to memory (as in write-through caching) and B's cache and any other cache holding a copy of x is notified to either invalidate or update the appropriate entry. This is relatively easy to perform on bus-based architectures because all caches and memory are immediately aware of all transactions (they are directly connected to the shared bus). So called snoopy-caches [Goodman, 1983] are based on this principle. Unfortunately, bus-based architectures do not scale well because the bus has a limited bandwidth. Typically, bus-based machines are designed with just enough processors to match the bandwidth of the bus. For example, the bus in the Encore Multimax can support up to 20 fairly low-power processors (it is interesting to note that even though it uses snoopy-caches the Multimax only implements weak consistency).

Maintaining consistency on scalable architectures is much harder. Currently, most scalable cache designs are based on directories [Censier and Feautrier, 1978]. With each datum is kept a list of the caches that are holding a copy of the datum and that must be notified of any mutation. If n processors are holding a datum in their cache then a mutation by one processor will require at least n - 1 messages to be sent to notify the caches. The moment at which these notifications are sent depends on the consistency model being used. Scalable cache designs usually do not implement strict consistency in order to exploit buffering and pipelining of writes. The main drawbacks of directory based methods are the added memory needed for the directory and the added inter-cache traffic which reduces the effective bandwidth of the interconnection network. Fortunately, it seems that in typical applications most of the shared data is shared by a very small number of processors [Lenoski et al., 1992, O'Krafka and Newton, 1990]. Limited directory caching methods, such as [Chaiken et al., 1991], take advantage of this fact to reduce the space for the directory by only allowing a small number of copies of a datum to exist at any given point in time. However, there are certain forms of sharing that inevitably lead to poor cache performance.
One such case is when two or more processors are very frequently writing to the same memory location (perhaps to implement some kind of fine-grain communication through shared memory). This causes thrashing in directory based methods because a substantial amount of time is spent sending messages between the caches. This poor performance is not surprising since caches are helpful only if there is locality of reference to exploit. If the goal is to exchange data as quickly as possible between the processors, caching is of little use since network latency will be unavoidable. The moral here is that specialized hardware for memory consistency is not the solution to all data sharing problems. Specialized hardware can only help if the program has well behaved data usage patterns. When designing algorithms it is unreasonable to assume an efficient consistent shared memory simply because the machine supports it in hardware. The costs will vary according to how the data needs to be shared. As a general rule, algorithms should be designed to promote locality of reference and rely as little as possible on a strict consistency model and on multiple writer shared data.

1.5 The GP1000 and TC2000 Computers

Data sharing issues play a central role in this thesis. The multilevel memory system of the architectural model chosen here (i.e. Figure 1.1) reflects the importance of data sharing issues by making the costs of sharing explicit. In this model caches do not automatically preserve consistency. It is only by segregating the various types of data and using the appropriate caching policy that consistency is maintained. It is assumed that the caches can operate in copy-back and write-through mode on selected areas of memory. Because private memory always contains private data, it is cached with the most efficient caching policy: copy-back caching. Single writer shared data is cached using write-through caching by the owner of the data and is not cached by the other processors. Finally, multiple writer shared data is not cached in any way.

This model is attractive because building such a machine is relatively inexpensive using current technology yet it has a high potential performance. Each node in the architecture corresponds roughly to a modern uniprocessor computer. The only extra hardware needed to build a complete machine is that for the interconnect and its interface to the processing nodes. The TC2000 computer [BBN, 1990], manufactured by BBN Computers and introduced in 1989, matches this structure very closely. A scalable multistage butterfly network is used for the interconnection network. There is a single local memory per node that is partitioned into shared and private sections by system calls to the operating system. Other system calls allow the selection of the caching policy for each memory block allocated. The GP1000 computer [BBN, 1989], also by BBN, has a very similar architecture but uses older technology (the TC2000 uses M88000 processors rated at 20 MIPS whereas the GP1000 uses M68020 processors rated at roughly 3 MIPS). The GP1000 also suffers from a slower interconnection network (approximately half the bandwidth of the TC2000) and the lack of a data cache (however, each processor has a small instruction cache). These two computers are used throughout the thesis to do measurements and to compare different implementation strategies.
Because scalability is an important issue, large machines were used: a 94 processor GP1000 (at Michigan State University) and a 45 processor TC2000 (at Argonne National Laboratory). To serve as a guide, the costs of the memory hierarchy for these computers are given in Table 1.1. The timings correspond to the latency for referencing a single word at each level of the hierarchy. (These costs were measured with benchmarks specially designed to test the memory. As reported in [BBN, 1990], the timing depends on many parameters such as the caching policy in use, the type of access (read or write), the size of the machine and the contention on the interconnection network. The timings in the table are the average of read and write times; caching was inhibited when measuring local and remote memory costs.)

                 Latency in microseconds       Relative latency
    Machine      Cache    Local    Remote    Cache   Local   Remote
    GP1000         -       .475     5.750      -      1.0     12.1
    TC2000       .150      .575     2.400     1.0     3.8     16.0

    Table 1.1: Costs of memory hierarchy for the GP1000 and the TC2000.

Note that the cache on the TC2000 is faster than local memory by only a factor of 3.8. Many systems currently have caches that perform much better than this. Also note that the latency of a butterfly network grows logarithmically with the number of processors. Machines with several hundred processors would thus have roughly the same relative costs for the memory hierarchy.

1.6 Memory Management

The design of a high-performance Multilisp system is a complex task where many, often conflicting, issues have to be addressed. Clearly an implementor must worry about how best to implement the parallelism constructs themselves, but it is important to realize that the support of parallelism has an impact on the sequential parts of the language as well. High-performance techniques used in uniprocessor implementations of Lisp cannot always be carried over to Multilisp as is, either because they become inefficient in a multiprocessor environment or, even worse, they do not work at all due to the presence of concurrency.

As should be clear from the previous section, one of the most important problems to tackle for a NUMA architecture is that of memory management. Lisp, and symbolic processing in general, relies heavily on the manipulation of data structures and on their dynamic creation. The costs of allocating, referencing and deallocating objects are thus major components of the overall performance of the system. For a language like Multilisp where data is implicitly shared, memory management is tricky to implement efficiently because, in general, data must be accessible to all the processors and be mutable by all the processors.

In order to keep the reference costs low, a memory management policy for a NUMA architecture must strive to physically locate the shared data close to the processor that needs to access the data most frequently. For the TC2000, this means that data should reside in the cache or the local memory of the processor most frequently accessing the data. This is the proximity issue. Another important goal is to arrange the data so that contention is minimized. Contention occurs when more than one processor is trying to access the same shared resource (such as a memory bank or a path in the interconnection network). The resource becomes a bottleneck to performance because requests must be serviced sequentially.
Contention can be inherent in the algorithm (when expressed explicitly as a critical section) but it can also appear insidiously because of some particularity of the language implementation or target machine. For example, a simple allocation strategy for vectors is to reserve the space for all elements in a given memory bank. In such a situation, the references to different elements of the vector are forced to be done sequentially even if they are all logically concurrent. The same problem occurs when unrelated data values are referenced simultaneously and they happen to have been allocated in the same memory bank. Certain shared-memory machines, such as the BBN Monarch [Rettberg et al., 1990] and IBM RP3 [Pfister et al., 1985], avoid some contention problems by using "combining" networks which combine similar requests to the same memory location (e.g. read, clear, add a constant). However, combining networks are ineffective for contention to unrelated data.

A simple and general approach to minimize contention is to scatter the data among all the memory banks. If the referencing pattern is uniformly distributed, the probability that two references are to the same memory bank (out of n memory banks) is 1/n. Unfortunately this strategy compromises proximity because the probability that a reference is to remote memory is (n - 1)/n, which approaches 1 for a large machine.

There are basically two extreme ways in which the proximity and contention issues can be handled: the placement of objects in memory can be left to the user or be done automatically by the implementation. User controlled placement can be expressed in several ways, including declarations and the use of specialized data manipulation operators. Automatic placement has the advantage of preserving the high-level nature of the language, that is, the user does not need to know the details of the target machine. However, there is only so much that can be expected of automatic techniques and, at least for special purpose applications, the user can have knowledge of the memory reference patterns that is next to impossible for the compiler to infer automatically.

It is important to distinguish two classes of data. User data is data explicitly created and referenced by the data manipulation procedures of the language (e.g. cons, car and set-car!). Internal data corresponds to data used internally by the implementation to support the language. Internal data includes:

- Environment frames
- Continuation frames
- Closures
- Cells (for mutable variables)
- Global variables
- Tasks
- Constants
- Program code

Because these data structures are used in well defined ways under the control of the implementation, it is possible to design special purpose memory management policies for them. For instance, local, contention free accesses to the program code and constants are possible if they are copied to the private memory of each processor when the program is loaded. Both user data and internal data are important to optimize in a system. However, this thesis concentrates on the management of internal data, and in particular the data structures that are involved in dynamic partitioning. The placement of user data is not considered here.

1.7 Dynamic Partitioning

One of the most fundamental operations performed by any parallel system is the distribution of work throughout the system. Each processor has to be aware of the computations it is required to do and at what time.
The overall goal is to make the best use of the processing resources, that is, to have the greatest number of processors doing useful things. Partitioning consists of dividing the program's total workload into smaller tasks that can be assigned to the processors for concurrent execution. A prerequisite to partitioning is of course knowing which pieces of the program can be done concurrently. Since in Multilisp concurrency is stated explicitly by the user, it will be assumed here that the only source of concurrency is the future construct. (Thus, in the expression (+ (* x 2) (* y 2)), the concurrency possible in the evaluation of the arguments to + will be disregarded because it is not expressed with a future.)

Partitioning can be done once and for all before the program is run. This static partitioning has the advantage of being simple to conduct when the program naturally decomposes into a fixed number of equal sized tasks. It also permits some compilation optimizations because important information, such as the particular assignment of tasks to processors, the inter-task communication pattern, and the type of communication, can sometimes be known at compile time. Programs with a regular computational structure are good candidates for static partitioning.

Dynamic partitioning relegates the partitioning decisions to when the program is running. This approach is more general because it can be applied to programs with complex concurrency structures and also to programs whose concurrency is dependent on the input data set. This generality is needed for Multilisp because the arbitrary concurrency structures expressible with the future construct cannot be handled by static partitioning methods. Another advantage is that better partitioning decisions can be made because more information is available at run time. The size of the machine (number of processors and memory size) is an important parameter that may not be known at compile time. There are other equally important, but more subtle, partitioning parameters that are only available at run time. For example, the number of active tasks and idle processors at a given point in time are useful indicators of partitioning needs. In a way, dynamic partitioning has the ability to adapt to its execution environment whereas static partitioning is stuck with irreversible compile time decisions that are based on predictions of what the execution environment will be.

Adaptability is crucial to account for the varying computational nature of certain programs. Parallel sort is a good example to illustrate this point. The sort may have more or less concurrency depending on the data set size (i.e. the number of items to sort) and the cost of comparing two items. These parameters can vary in the same program if the sort is called multiple times. Concurrency can also be affected by the initial ordering of the items. The sort algorithm might degenerate to a sequential algorithm for some orderings and be perfectly parallel for others. Large programs add another dimension to the argument. Large programs are typically composed of several smaller independent modules. Concurrency can occur inside a module, between purely sequential modules, and also between internally concurrent modules. It is quite possible that an internally concurrent module, such as parallel sort, has to execute by itself at some point and concurrently with other modules at some other point. The partitioning requirements may vary greatly between these two cases.
At one extreme, no partitioning is needed for the sort if the other modules are doing long sequential computations and there happen to be n-1 of them on an n processor machine.

The main inconvenience of dynamic partitioning is that it adds a run time overhead. Dynamic partitioning is "administrative" work that gets added to the operations strictly required by the program (i.e. the mandatory work). Tasks are created to enable concurrent execution, but each task created adds a cost, in time and space, because its state has to be maintained throughout its life (this includes task creation, activation, suspension and termination). A dynamic partitioning strategy must find some compromise between the benefit of added concurrency and the drawback of added overhead. Some have avoided this problem to some extent by relying on specialized hardware to reduce the cost of managing tasks. Dataflow machines [Srini, 1986, Arvind and Nikhil, 1990] and multithreaded architectures [Halstead and Fujita, 1988, Nikhil et al., 1991, Agarwal, 1991] fall in this category. However, software methods are attractive because they offer portability and low hardware cost. This thesis explores software methods for lowering the cost of task management in the context of the Multilisp language.

In a strict sense, partitioning only refers to the way the program gets divided up into tasks. This definition is not very useful for Multilisp because each evaluation of a future leads to the creation of a new task; there are no partitioning decisions to be made. However, choices are available at another level. There can be several representations for tasks, each having its own set of features and management costs. The appropriate representation for a particular task will depend on many factors but as a general rule it will be best to select the one with the lowest cost that has all the required features. Partitioning has a broad sense in this thesis. It refers to the choice of representation that is used for the tasks in the program and the way that they are managed.

An important parameter affecting the performance of dynamic partitioning is the granularity of parallelism (G) of the program. G is defined as the average duration of a task

    G = Tseq / Ntask

Here Ntask is the total number of tasks created by the program and Tseq is the duration of the program when all task operations are removed (i.e. Tseq is the mandatory work). When the task operations are present, the work required for the program is Tseq plus some task management overhead (Ttask) for each task created

    Tpar = Tseq + Ntask * Ttask

Ttask contains the time to create, start and terminate a task. The total work required to run the program on an n processor machine, Ttotal(n), will be Tpar plus some amount that accounts for all other parallelism overheads including the costs of transferring tasks between processors, synchronizing tasks, sharing user data, and being idle. The run time on n processors is thus Ttotal(n)/n. The efficiency (E) of the processors is the proportion of the time they spend doing mandatory work. G and Ttask are important parameters because they put an upper bound on efficiency

    E = Tseq / Ttotal(n)  <=  Tseq / (Tseq + Ntask * Ttask)  =  1 / (1 + Ttask/G)

This equation suggests that efficiency is a function of the relative size of G with respect to Ttask. Higher efficiency can be obtained either by increasing G or decreasing Ttask.

1.7.1 Eager Task Creation

A well known dynamic partitioning method is eager task creation (ETC).
Its main advantage is simplicity. Only a single representation for tasks exists in ETC: the heavyweight task object. Unfortunately, the task management cost for heavyweight tasks is relatively high (on the order of hundreds of machine instructions). A coarse granularity is thus required to get good performance. For example, the granularity must be at least in the hundreds of machine instructions to achieve better than 50% eciency. This makes the programming task that much more dicult because granularity must be taken into account when designing programs. Moreover, coarse grain programs have less parallelism (fewer tasks) so there is a risk that they will only perform well on small machines. Finally, some programs are hard to express with coarse grain parallelism. 1.7.2 Lazy Task Creation A more ecient partitioning method called lazy task creation (LTC) is explored in this thesis. In addition to the heavyweight task representation, LTC uses a much cheaper lightweight representation. The method is described in detail in Chapter 3 but a general description is given here to explain some of the issues. LTC lowers the average task management cost by creating only as many heavyweight tasks as necessary to keep all processors working. To do this, each processor maintains a local data structure, the lazy task queue (LTQ), that indicates the availability of tasks on that processor. When the program asks for the creation of a task, the LTQ is 1.7. DYNAMIC PARTITIONING 19 updated to indicate the presence of this new task. This operation is ecient because a lightweight task representation is used. A lightweight task preserves enough information to recreate the heavyweight task later on, if needed. Each entry in the LTQ is a pointer into the stack, marking the boundary of that task's stack. The beauty of LTC is that, when the processor becomes idle it can get work from its own LTQ at a low cost and completely avoid the creation of a heavyweight task. When the LTQ is empty, the processor must instead nd a task to resume from some other processor's LTQ. It is only in this case that a high cost is paid to create a heavyweight task and transfer it between processors. Shared-Memory Protocol But how exactly does this interaction take place? The protocol adopted in [Mohr, 1991] uses a shared-memory paradigm. The stack and LTQ of all processors are directly accessible to all processors (i.e. they are shared data). When processor A needs to get work from processor B , it directly manipulates B 's LTQ and stack to extract a task. This approach has unfortunate consequences. First of all, access and mutation of the LTQ must be arbitrated because several processors may be competing for access. This means that the cost of lightweight task creation is higher than might have been expected because synchronization operations are needed to ensure that accesses to the LTQ are mutually exclusive. This may be tolerable in certain contexts since the overhead cost will be high only for parallel programs with ne grain parallelism. The second consequence is much more serious. The protocol assumes that the stack and LTQ are in consistent memory. Therefore, they cannot be cached as eciently as private data. This can have a severe impact on performance because the stack is one of the most intensively used internal data structures. The cost is also unrelated to the use of parallelism; sequential programs will su er just as much as parallel ones. 
It is preferable for the stack to be a private resource so that copy-back caching can be used (as is the case for sequential implementations of Lisp). Message-Passing Protocol The stack and LTQ can be made private by adopting a message-passing protocol for work distribution. When A needs to get work from B , it sends a request for work to B. Upon receiving this message, B checks its LTQ for an available task and, if one is available, sends it back to A. Since the LTQ and stack are only accessed locally there is no need for synchronization operations when updating them. Lightweight task creation 20 CHAPTER 1. INTRODUCTION is thus cheaper than with the shared-memory protocol. This allows very ne grain parallelism to be ecient. Sequential code also bene ts because copy-back caching can now be used for the stack. Although it is promising, the message-passing protocol introduces some new issues. How is the communication mechanism implemented and what is its cost? The latency of the communication is also a factor. Can the processor respond fast enough to minimize the idle time of the requesting processor? 1.8 Overview The thesis is organized in 6 chapters. Chapter 2 gives a description of the Multilisp language and its traditional implementation using ETC. Some ne points of its semantics are discussed to clarify the constraints that must be met by the partitioning methods. Finally, the benchmarks used for later experiments are presented. Chapter 3 provides a detailed description of the shared-memory and message-passing implementations of LTC. It is shown how support for dynamic scoping, continuations and fairness can be added to LTC. This chapter also examines the memory usage characteristics of the benchmark programs to evaluate the bene ts of caching. Chapter 4 concentrates on the communication mechanism required by the messagepassing protocol. An ecient software implementation is described and evaluated. Chapter 5 compares the two LTC protocols. The performance of both protocols is measured on several benchmarks and under numerous conditions. The closing chapter summarizes the results of the thesis and suggests some future lines of research. Chapter 2 Background Before discussing the implementation of the future construct, it is necessary to establish the set of features that must be supported by the implementation. This is particularly important because there is no formal standard for the Multilisp language; nearly every implementation has its own peculiarities. This thesis takes the pragmatic view that Multilisp is de ned by the set of features common to a number of implementations. Choosing the set of supported features is a delicate process that is similar in many ways to language design itself. The set should not be limited to the features that are strictly common to all implementations as this would be ridiculously restrictive. Features that have acquired a certain level of acceptance in the eld should also be included. On the other hand, it is wise to select a small set of features that interact in a coherent, well de ned way in order to provide a programming model with few surprises. The chapter starts o by giving a de nition of the Multilisp semantics targetted by this work. This includes the future construct common to all Multilisp implementations and also two useful features of sequential Lisps which pose special problems in a parallel setting: dynamic scoping and rst-class continuations. The ETC implementation of this semantics is then presented. 
The chapter ends with a description of some Multilisp programs later used to evaluate and compare various implementation strategies. 2.1 Scheme's Legacy Multilisp inherits its sequential programming features from the Scheme dialect of Lisp [IEEE Std 1178-1990, 1991]. Scheme was designed to be a relatively small and simple 21 22 CHAPTER 2. BACKGROUND language with exceptional expressive power. There are few rules and restrictions for forming expressions in Scheme, yet most of the major programming paradigms can conveniently be expressed with it. This is not surprising since the language is based on the theory of the lambda calculus. There are six basic types of expressions in Scheme: constant, variable reference, assignment, conditional, procedure abstraction (lambda-expression) and procedure call. All the other types of expressions can be derived from the basic types and this is in fact how they are de ned in the standard [IEEE Std 1178-1990, 1991, R4RS, 1991]. Being able to reduce a program to the basic expressions is helpful both as an implementation technique and as a means to understand programs and prove some of their properties. It is also a considerable advantage for any extension e ort, such as Multilisp, because the interaction of the extensions with the language can be more carefully analyzed by limiting the study to the basic types of expressions. Scheme o ers a rich set of data types including numbers, symbols, lists, vectors, procedures, characters and strings. There are also several prede ned primitives to operate on these data types including procedures to create, destructure and mutate data. Although Lisp-like languages have a historical inclination towards symbolic processing applications, the elaborate support of numerical types in Scheme makes it a candidate for numerical applications as well. There has been an e ort in Scheme to make the language as uniform as possible. All types of objects in Scheme share some basic properties that make them rst-class values. Any object can be: used as an argument to procedures, returned as the result of procedures, stored in data structures, and assigned to variables. Departing from Lisp tradition, Scheme evaluates the operator position of procedure calls like any other expression and does not impose any particular ordering on the evaluation of arguments to procedures. The let and let* special forms are handy to force a particular ordering when it is needed (this is what is done in the examples). Objects have unlimited extent. They conceptually exist forever after they have been created. In general this means that objects must be allocated in the heap. When there is no space left in the heap, the system automatically invokes the process of garbage collection to reclaim the heap space allocated to objects that are no longer needed for the rest of the computation. In certain circumstances it is possible at compile time to detect that an object is no longer needed past a certain point in the program. The compiler can then use a specialized allocation policy (such as a stack) and explicitly perform the deallocation. This reduces the frequency and cost of garbage collection. 2.1. SCHEME'S LEGACY 23 Scheme relies solely on static scoping as a method to resolve variable names. An identi er refers to the variable with the same name in the innermost block that lexically contains the reference and declares the variable. If no such block exists, the identi er refers to a variable in the global environment. 
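A small example makes this rule concrete (the names make-adder and add5 are chosen here purely for illustration):

    (define x 10)                ; a global variable x

    (define (make-adder x)       ; this x shadows the global x within make-adder
      (lambda (y) (+ x y)))      ; x refers to make-adder's parameter

    (define add5 (make-adder 5))
    (add5 1)                     ; => 6
    (set! x 100)                 ; assigning the global x does not affect add5
    (add5 1)                     ; => 6, the reference was resolved lexically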
This naming rule corresponds to that of block structured languages such as Pascal and Algol 60. Dynamic scoping is an alternative method that has been traditionally used in other Lisps. The identity of variables is not based purely on the lexical characteristics of the program (available at compile time), but rather depends on the control path taken by the program at run time. Although dynamic scoping has its specialized uses (e.g. see Section 2.7), its pervasive use is not generally viewed as promoting modularity. In addition, ecient implementation of dynamic scoping is often based on shallow binding, a strategy that is not well suited for parallel execution. Static scoping permits the use of certain compilation techniques, such as data ow analysis, that are dicult or impossible to perform with dynamically scoped variables because the analysis would have to be done on the entire program. In Scheme, procedures are viewed as rst-class values and thus have the same basic properties as the other data types. With rst-class procedures many programming techniques are easily implemented. Higher order functions, lazy evaluation, streams and object-oriented programming can all be done using rst-class procedures (for example see [Adams and Rees, 1988, Friedman et al., 1992]). Procedures created by lambdaexpressions are usually called closures to distinguish them from prede ned procedures. The static scoping rules require all closures to carry, at least conceptually, the set of variables to which they might refer (the closed variables). Consequently, variables have unlimited extent and cannot generally be allocated in a stack-like fashion as in more conventional languages. Closures pose additional problems in a parallel setting. Because closures are just another data structure, contention may happen if several processors are simultaneously calling the same closure. A typical situation would be the parallel application of a closure to a set of values. Some optimizations can avoid contention in some cases. For example, closures with no closed variables, such as globally de ned procedures, are essentially constant so they can be created and copied to all processors when the program is loaded. Lambda-lifting can also eliminate the need to create closures by explicitly passing the closed variables between procedures. Both of these techniques are used in Gambit. However, the general case remains hard to solve as it is equivalent to the problem of data sharing. For this reason, true closures have been avoided as much as possible in the benchmarks. In accord with the goal of simplicity, the only way to transfer control in Scheme is through the use of procedure calls. All types of recursion, whether they correspond to an iteration or not, are expressed as procedure calls. There are two types of calls. If the CHAPTER 2. BACKGROUND 24 value returned by a call is immediately returned by the procedure containing the call it is a reduction call. Otherwise the call is a subproblem call. All implementations are required to be properly tail recursive. That is, they must guarantee that loops expressed recursively do not cause the program to run out of memory. In implementation terms, this means that reduction calls must not retain the current procedure's activation frame (the local variables and return address) past the actual transfer of control to the called procedure. Scheme is a call by value or applicative order language. 
The evaluation of the program is forced to follow an ordering that evaluates all arguments to a procedure before the procedure is entered. The opposite policy, call by need or normal order evaluation, doesn't evaluate any of the arguments to a procedure when the procedure is called. Evaluation occurs when a strict operator, such as addition, needs the actual value. Data transfer operations such as parameter passing and creation of data-structures are not considered to be strict. Both policies have advantages. Programs using normal order evaluation sometimes terminate when their applicative order counterparts do not. On the other hand, applicative order is often more ecient. In Scheme, it is possible to get the equivalent of normal order evaluation by using the delay special form to delay evaluation and by rede ning the primitive procedures so that they force the evaluation of the arguments in which they are strict1. The future construct is the dual of the delay special form giving eager evaluation instead of lazy evaluation. Scheme supports various avors of side-e ects such as assignment, data structure mutation and input/output operations. Thus, it is considered to be an imperative programming language where sequencing of operations is a necessary concept. Nevertheless, Scheme contains a powerful functional subset which can be used for purely functional programming. Some algorithms are naturally expressed in a functional way, some others are expressed better with the use of side-e ects. In Scheme, both paradigms can appear in the same program and the programmer can choose which best matches his needs at any given point. It is however a good idea to limit the scope of side-e ects by hiding them through abstraction barriers. For example, a sorting procedure can have a functional speci cation even if it uses side-e ects internally. In practice, it seems that Scheme favors a \mostly" functional style of programming where side-e ects are used with discretion. This style of programming lends itself well to parallelism because subproblems are often independent and are thus possible targets for concurrent evaluation. 1 Delay only exists in R4 RS [R4RS, 1991]. 2.2. FIRST-CLASS CONTINUATIONS 25 2.2 First-Class Continuations Perhaps Scheme's most unusual features is the availability of rst-class continuation objects. Continuations have been used in the past to express the denotational semantics of programming languages such as Algol60 and Scheme itself [R3RS, 1986, Clinger, 1984]. Most programming languages use continuations but they are usually hidden whereas in Scheme they can be manipulated explicitly. First-class continuations are useful to implement advanced control structures that would be hard to express otherwise. Intuitively, a continuation represents the state of a suspended computation. The power of continuations stems from the ability to reinstate a computation at any moment and possibly multiple times. It is convenient to think of a continuation as a procedure that restores the corresponding computation when it is called. Often it is necessary to in uence the computation that is being restored. This is done by passing parameters to the continuation. Continuations typically have a single parameter, the return value, but some continuations may take none or more than one parameter. 2.2.1 Continuation Passing Style Continuations are best understood by examining the underlying mechanism of evaluation. 
Each expression in the program is the producer of a value that is to be consumed by some computation: the expression's continuation. For example, in (f x), the procedure f is the consumer of the value produced by the expression x. Each expression can be viewed as being implemented by an \internal" procedure whose purpose is to compute the value of the expression and send it to the consumer computation. Thus, one of the parameters of this internal procedure is a continuation which takes a single argument: the value of the expression. This model of evaluation gives rise to a programming style called continuation passing style , or CPS. CPS was originally used as a compilation technique for Scheme [Steele, 1978] but CPS is equally useful to explain how continuations work. The interest of CPS is that programs written in this style are expressed in a restricted variant of Scheme yet all Scheme programs can be converted to CPS. An important byproduct of CPS conversion is that procedure calls never have to return (they are always reductions) and can thus be viewed as jumps that pass arguments. The CPS conversion process consists of adding a continuation as an extra argu- CHAPTER 2. BACKGROUND 26 (define (map-sqrt lst) (call-with-current-continuation (lambda (cont) (map (lambda (x) (if (negative? x) (cont #f) (sqrt x))) lst)))) Figure 2.1: Non-local exit using call/cc. ment to each procedure call and adding a corresponding parameter to all procedures. Primitive procedures must also be rede ned to obey this protocol. The continuation argument speci es the computation that will consume the result of the procedure being called. For subproblem calls, the continuation argument is a single argument closure representing the computation that remains to be done by the caller when the called procedure logically returns. For reduction calls, the continuation argument is the same as the caller's continuation (thus implementing proper tail recursion). Wherever a procedure would normally return a value other than by a reduction call, a jump to the continuation argument is performed instead. In Scheme, access to the implicit continuation is provided by the prede ned procedure call-with-current-continuation, abbreviated call/cc. A single argument procedure must be passed as the sole argument of call/cc. When it is called, call/cc takes its own implicit continuation, converts it into a Scheme procedure and passes it to its procedure argument. The CPS de nition of call/cc is simply CPS-call/cc  (lambda (k proc) (proc k (lambda (dummy-k x) (k x)))) Note that there are two ways in which the captured continuation k can be invoked. Either proc calls the continuation it was passed as an argument or proc returns normally. 2.2.2 Programming with Continuations Several control constructs can be built around call/cc. A typical application is for nonlocal exit and exception processing, which are normally done in Lisp using the special forms catch and throw. In Scheme, this can be done by saving the current continuation before entering a block of code. An exit from the block occurs either when the block terminates normally or when the saved continuation is called. An example of this is given in Figure 2.1. The procedure map-sqrt returns a list containing the square root 2.2. FIRST-CLASS CONTINUATIONS 27 of every item in a list but only if they are all non-negative. The value #f is returned if any item is negative. To do this, map-sqrt binds its continuation to cont. A call to cont thus corresponds to a return from map-sqrt. 
When a negative value is detected by map-sqrt the processing of the rest of the list is bypassed by the call (cont #f) which immediately causes map-sqrt to return #f.

Call/cc, however, is more versatile than Lisp's catch and throw because it does not restrict the transfer of control to a parent computation. Thus it is possible to directly transfer control between two different branches of the call tree. This characteristic can be exploited to implement specialized control structures such as backtracking [Haynes, 1986], coroutines [Haynes et al., 1984] and multitasking [Wand, 1980]. A less frequent, but possible, use of continuations is to reenter a computation that has already completed (see [Rozas, 1987] for an application).

The generality of first-class continuations comes at a price: a more complex programming model. In many languages, including Lisp, procedure calls have dynamic extent. This means that every entry of a procedure is balanced by a corresponding exit (normal or not). This is not the case in Scheme because the computation performed in a procedure can be restarted multiple times, and thus a procedure can exit more than once even if it is called only once. Because the programmer's intuition often fails when dealing directly with continuations it is sometimes helpful to build abstraction barriers that offer restricted versions of call/cc (for example see [Friedman and Haynes, 1985]).

First-class continuations also cause an implementation problem. If procedures have dynamic extent, continuations can easily be represented by a single stack of control frames (i.e. return addresses). Control frames get allocated when procedures are called and deallocated when procedures return in a last-in first-out (LIFO) fashion. This form of garbage collection is possible because control frames cannot be referenced after the corresponding procedure returns. The unlimited extent of continuations in Scheme means that a more general garbage collection mechanism for control frames must be used because a procedure's control frame might still be needed after the procedure returns. At least in some cases, control frames must be allocated on the heap. A common implementation strategy is to allocate all control frames on the stack as though they had dynamic extent and to move them to the heap only when their extent is no longer known to be purely dynamic (usually at the moment a continuation is captured by a call/cc). This way, the efficiency of stack allocation is obtained for programs that do not make use of first-class continuations. This strategy is described in detail in Section 3.2. The next section examines the problems that arise when continuations are used in a parallel setting.

2.3 Multilisp's Model of Parallelism

Parallel programming languages can be classified according to the level of awareness of parallelism required by the programmer when writing programs. At one end of the scale, there are languages with implicit parallelism that rely exclusively on the ability of the system to detect and exploit the parallelism available in programs. In these languages the compiler must analyze the program to determine what parts can and should be executed concurrently. In general this is a hard task for imperative languages because of the existence of side-effects. Even in the absence of side-effects, the compilation may be difficult if an algorithmic transformation is required to obtain a sufficiently parallel algorithm. Multilisp is at the other end of the scale.
Parallelism is explicitly introduced by the programmer through the use of the \future" construct. The future construct marks the parts of the program where concurrent evaluation is allowed. Of course this style has its price: the burden put on the programmer for specifying concurrency and the possibility of error (i.e. incorrectly specifying concurrency). The advantage of this approach is that it provides more control over the program's execution. The programmer can specify concurrency at places which might escape an automatic analysis and can choose to disregard some forms of concurrency if it is judged that the cost of exploiting the concurrency is greater than what is gained. This level of control is useful for the programmer wanting to experiment with various ways of parallelizing a program. It is also appropriate when Multilisp is considered as the \object code" of a compiler for a higher level parallel language. Such a compiler could be aware of where parallelism is both possible and desirable and emit code with appropriately placed futures ([Gray, 1986] is a good example of this application). 2.3.1 FUTURE and TOUCH Futures are expressed as (FUTURE expr) where expr is called the future's body . The future construct behaves like the identity function in the sense that its value is the value of its body. However, the body is conceptually evaluated concurrently with the future's continuation. The only restriction to this concurrency comes as a result of the ordering dependencies imposed by the strict operations in the program. When the value 2.3. MULTILISP'S MODEL OF PARALLELISM 29 of a future is used in a strict operation, the operation can only be performed after the evaluation of the future's body. For example, in the expression (let ((x (FUTURE (f 1)))) (g (+ x (f 2)))) the evaluation of (f 1) is done concurrently with the evaluation of (f 2)2 . Because + is a strict operation in both of its arguments, the addition and the call of the procedure g can only occur after the evaluation of (f 1) has completed. As long as they respect the temporal ordering imposed by the strict operations, the operations required to compute the body of a future are subject to arbitrary interleaving with the operations performed by the future's continuation. Because Multilisp allows unrestricted side-e ects, it is an indeterminate language. Separate runs of the same program can potentially generate di erent results. As a simple example consider the expression (let ((x 0)) (FUTURE (set! x 1)) x) The evaluation of this expression can either return 0 or 1 depending on whether the reference to x happens to be done before or after the assignment to x3 . In certain circumstances a program needs to impose special control dependencies in addition to those given by the data dependencies of the program. Such control dependencies are only required in imperative parts of the program to enforce a certain ordering of side-e ects. For example, it might be important to guarantee that some restructuring of a database has completed before some other processing of the database is performed. For this purpose, Multilisp provides the primitive procedure TOUCH that behaves like a strict identity function. TOUCH can be viewed as the fundamental \strictness" operation. All other strict operations use TOUCH internally. In order to show clearly where the TOUCH operations are needed, the code examples and benchmark programs that follow include explicit calls to TOUCH. 
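As an illustration, the database scenario mentioned above could be written along the following lines (a sketch only; restructure-db!, other-work and query stand for whatever procedures the application provides):

    (let ((done (FUTURE (restructure-db! db))))  ; restructuring runs concurrently
      (other-work)                               ; work that does not examine the database
      (TOUCH done)                               ; wait for the restructuring to complete
      (query db))                                ; only then is the database read

The value returned by the future's body is irrelevant here; the TOUCH is used purely as a synchronization point, the same idiom used by pvmap! in Section 2.4.3.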
2 To be precise, the steps required to bind x, evaluate g, + and x, and enter the + procedure can also be done concurrently with the evaluation of (f 1).

3 Indeterminacy also exists in Scheme, but at a different level. In a procedure call, arguments and the operator position can be evaluated in any order, but sequentially (that is with no overlap in time). The following expression has 2 possible values: 0 and 1.

    (let ((x 0)) (car (cons x (set! x 1))))

2.3.2 Placeholders

A more traditional description of futures consists of introducing a new type of object, the placeholder, that is used to synchronize the computation of a future's body with the touching of its value [Miller, 1987]. When a future is evaluated it returns a placeholder as a representative of the value of the body. A placeholder can be in one of two states. It is undetermined initially and for as long as the evaluation of the future's body has not completed. When the evaluation of the body is finished, the resulting value is stored in the placeholder object which is then said to be determined. Using placeholder objects, TOUCH has an obvious definition: if the argument is not a placeholder just return it, otherwise, wait until the placeholder is determined and then return its value.

It is important to understand that placeholders are used here as an artifice to explain how futures work. Although placeholders are commonly used in Multilisp systems, an implementation is free to choose any method that gives the same result. Even if placeholders are present in the system, the user can be totally unaware of their existence if the implementation does not provide constructs to manipulate them directly. This is the view adopted by Gambit.

2.3.3 Spawning Trees

It is sometimes useful to represent the effects of evaluating futures and touching placeholders by a diagram, the spawning tree, which shows the state of the concurrent computations as a function of time. A spawning tree resulting from the evaluation of a single future looks like two horizontal time lines, one labelled "Continuation" and one labelled "Body", which diverge at the point where the future is evaluated [diagram omitted]. A computation is represented by a horizontal line whose extent corresponds to its duration. A dashed vertical line marks the evaluation of the future. At that point, a new computation corresponding to the body of the future is started. Arrows are used to express the data dependencies introduced by the TOUCH operation. An arrow links the computation that determined a placeholder with the computation(s) that touch(es) it (a computation can point to several others). The tail of an arrow indicates the point where a placeholder was determined whereas the head indicates the point where the TOUCH was requested. If an undetermined placeholder was touched, the arrow will point backwards in time (indicating that the touching computation had to wait).

A second representation of spawning trees used here is as a rooted tree. Each node of the tree represents a future and the children of a node are the futures dynamically nested in the body of the corresponding future. The root of the tree corresponds to a virtual future in which the program is executed.

2.4 Types of Parallelism

Parallelism comes in many flavors. Control parallelism occurs when different parts of an algorithm can be done simultaneously. Data parallelism occurs when different data values can be processed concurrently. The advantage of data parallelism is that it scales well.
Larger data sets will offer more parallelism and thus provide better opportunities for speedup. In control parallelism the degree of parallelism is in principle limited by the structure of the algorithm. For this reason data parallelism is more useful than control parallelism for large scale computations. The future construct is appealing because it can be used to express several types of parallelism.

2.4.1 Pipeline Parallelism

Pipeline parallelism is a special case of control parallelism where the processing of data is overlapped with the processing of the result. Pipeline parallelism is the primitive form of parallelism provided by the future construct. It enables the production of a value by the future's body to be done concurrently with the consumption of the value by the future's continuation.

Pipeline parallelism is particularly useful when processing a data structure built incrementally (such as a list of values). At any given point in time, the part of the data structure that has been computed by the producer is available for processing by the consumer computation. An example of this is the procedure pmap as defined in Figure 2.2.

    (define (pmap proc lst)
      (if (pair? lst)
          (let ((tail (FUTURE (pmap proc (cdr lst)))))
            (let ((val (proc (car lst))))
              (cons val tail)))
          '()))

    a) basic definition
    b) spawning tree for basic definition [diagram omitted]
    c) spawning tree for variant with (FUTURE (proc (car lst))) [diagram omitted]
    d) spawning tree for variant with (cons val (TOUCH tail)) [diagram omitted]

Figure 2.2: Parallel map definition and spawning trees.

Pmap is a parallel version of map which applies a procedure to each element of a list and returns the list of results. Parallelism has been introduced by allowing the tail of the resulting list to be generated while the first element is computed and used by pmap's caller. Because cons is a non-strict operator, it immediately returns a pair with a placeholder as its tail (after proc has been called on the first element). The first element is thus immediately available for processing by the consumer. It is only when the consumer needs to access the tail that a synchronization must take place, possibly suspending the consumer until the next pair in the list is generated.

A variant of pmap with even more potential for parallelism is obtained by also wrapping a future around the call to proc (a sketch of this variant is given below). This allows the computation of the first element to overlap pmap's continuation. The difference in behavior is best visualized by examining the spawning tree for these two variants of pmap. Figure 2.2 shows the spawning trees for the call (pmap f '(1 2 3)). Parentheses have been added in these diagrams to indicate entry and exit of pmap. As is clear from the two upper spawning trees, the extra future allows more computations to overlap. Whether this added parallelism is actually beneficial will depend on the task granularity, the spawning cost, the number of processors and the way in which pmap's result is used by the continuation. Pmap's parallelism is not easy to classify.
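(For concreteness, the variant with the extra future, which Figure 2.2 shows only as a spawning tree, would read roughly as follows; this is a sketch inferred from the basic definition and the caption of Figure 2.2c.)

    (define (pmap proc lst)
      (if (pair? lst)
          (let ((tail (FUTURE (pmap proc (cdr lst)))))
            (let ((val (FUTURE (proc (car lst)))))  ; extra future around the call to proc
              (cons val tail)))
          '()))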
At rst glance it seems that it is an instance of control parallelism because it expresses concurrency between two di erent computations (the continuation and the application of the procedure to an element of the list). However, this control parallelism is not static. Pmap calls itself recursively so the parallelism varies with the length of the list. When viewed globally, pmap exhibits data parallelism because it expresses the parallel application of a procedure to a set of values. If the task granularity is large enough, the processing of longer lists will o er more parallelism. 2.4.2 Fork-Join Parallelism The above variants of pmap are said to export concurrency because some of the work logically started \inside" pmap may be in progress after the procedure has returned. 4 The shorter de nition (define (pmap proc lst) (if (pair? lst) (cons (proc (car lst)) (FUTURE (pmap proc (cdr lst)))) '())) is not equivalent because the two possible orderings of the evaluation of the arguments to cons do not give the same parallelism behavior. CHAPTER 2. BACKGROUND 34 Exported concurrency is a nuisance for some programming styles. If proc performs some side-e ects on a global state, the computation following pmap cannot assume that they have all been done. Some explicit synchronization is needed to guarantee that all of pmap's futures are done. In the simple case where proc does not itself export any concurrency, this synchronization can be done by walking the resulting list and touching all values that are the result of a future. A more elegant solution is to include the required synchronization inside pmap. This is easily achieved by having the future's extent match that of the procedure's body. In other words, the procedure is written so that each future (the fork) is balanced with a corresponding TOUCH (the join) executed before the procedure returns. This is a trivial change to pmap: a TOUCH is added around the second argument to cons (i.e. (cons val (TOUCH tail))). The spawning tree resulting from this variant of pmap is shown in Figure 2.2 (d). 2.4.3 Divide and Conquer Parallelism An unfortunate characteristic of pmap is that it scales poorly due to the inherently sequential nature of lists. The processing of an n element list requires at least n sequential steps just to traverse the list. No matter how quickly each element can be processed, the time required to process n elements will be (n). This may be of little consequence when task granularity is large and lists are short but massively parallel applications are bound to su er more. For this reason, it is preferable to use scalable data structures such as trees and arrays when lists would create a bottleneck. But this is not the only step to take. As long as futures are started sequentially, such as in a loop, a bottleneck will be present. A divide and conquer paradigm (DAC) can be used to start futures faster, allowing n futures to be started in (log n) time. This is actually the best that can be expected of the future construct because each future splits a thread of computation into two. Pvmap!, shown in Figure 2.3, is a DAC version of pmap that works on vectors. The input elements are stored in a vector which is mutated to construct the result. The vector is divided in two and the mapping is performed recursively on both parts. When a single element is obtained, the mapped procedure is applied to the value and the result is stored back in the vector. 
To avoid allocating new vectors, subvectors are represented by two indices, lo and hi, which denote the subvector's extent. Because it uses a fork-join paradigm, all side-effects will be finished when pvmap! returns. Note also that the TOUCH is used only for synchronization. The actual value of sync is irrelevant.

    (define (pvmap! proc vect)
      (define (map-range! proc lo hi)
        (if (= lo hi)
            (vector-set! vect lo (proc (vector-ref vect lo)))
            (let ((mid (quotient (+ lo hi) 2)))
              (let ((sync (FUTURE (map-range! proc (+ mid 1) hi))))
                (map-range! proc lo mid)
                (TOUCH sync)))))
      (map-range! proc 0 (- (vector-length vect) 1))
      vect)

    a) definition
    b) spawning tree for (pvmap! f v) with v = #(1 2 3 4 5 6 7 8) [diagram omitted]

Figure 2.3: Parallel "vector" map.

Multilisp programs are frequently organized around DAC parallelism. Not only is it a fundamental technique for constructing parallel algorithms [Mou, 1990], it also blends naturally with the recursive algorithms and data structures commonly found in Lisp and symbolic processing. Several of the parallel benchmarks used in this thesis (see Section 2.9) are based on DAC parallelism.

2.5 Implementing Eager Task Creation

This section describes the eager task creation (ETC) implementation of futures. It will serve both as a reference implementation and as a basis on which lazy task creation is built. A few implementation details have been omitted for the sake of clarity. A more elaborate description can be found in [Miller, 1987].

As might be expected, the implementation of a Multilisp system is in many ways similar to that of a multitasking operating system. At the heart of both are utilities to support the management of various processing resources. For the management of the processors, an important concept is that of the task which is an abstract representation of a computation in progress. A program first starts out with a single root task in charge of performing the computation required by the program. Tasks are created and terminated dynamically as the computation progresses, possibly causing the number of tasks to exceed the number of processors in the machine. The task abstraction is supported by the scheduler whose job is to run tasks by assigning them to processors. A task can be in one of three states. It is running when it is being executed by some processor. It is ready or runnable if it is only waiting for the scheduler to assign it to a processor. Finally, it is blocked if some event must occur before it is allowed to run.

Eager task creation (ETC) is a straightforward dynamic partitioning method that has been used in several implementations of Multilisp [Halstead, 1984, Miller, 1988, Swanson et al., 1988, Kranz et al., 1989]. With ETC there is a single representation for tasks: the heavyweight task object5. This is a heap allocated object with a number of fields that describe the state of the computation associated with the task. When the task needs to be started or resumed its state is restored by reading the fields of the corresponding task object. When a task needs to be suspended, the task object is updated to reflect the current state of the task. The most important information retained in a task object is the continuation. It indicates where control must return when the task is resumed. Task continuations differ from first-class continuations in that they do not need to be given a "result" to continue with. They are zero argument procedures. Also, the full generality of first-class continuations is not necessary for task continuations since they are invoked at most once. Other fields can be added to task objects to support special language features but they are not strictly required for implementing futures. In fact, an implementation could simply use continuations to represent tasks. Nevertheless, task objects will be used here to make the algorithms more general.

5 The definition of heavyweight tasks used here is not the same as the common meaning in operating systems (i.e. a process with its own address space). Here heavyweight task simply means a representation that is more expensive than the one used for lazy task creation.

2.5.1 The Work Queue

ETC lends itself well to self scheduling, where each processor is responsible for scheduling tasks to itself. All processors share a global queue, the work queue, that contains the set of runnable tasks. When a processor becomes idle, typically after a task blocks or terminates, it removes a task from the work queue and starts running it. If there are none available, the processor just keeps on trying until one is added to the work queue by some other processor. Self scheduling has the advantage of automatically balancing the load across the processors. As explained in Section 2.5.6, the work queue can be distributed but for now it is assumed to be a single centralized queue.

2.5.2 FUTURE and TOUCH

Tasks are created through the evaluation of futures. When a task, the parent, evaluates (FUTURE expr), it creates a placeholder object to represent the value of expr and then creates a child task whose role is to compute expr and determine the placeholder with the resulting value. The child task is added to the work queue to make it runnable and the placeholder is returned as the result of the future. Thus, the parent task immediately starts working on the continuation using the placeholder as a substitute for the value of expr while the child task waits in the work queue until it can be started by an idle processor.

Placeholder objects can be represented by a structure containing three slots: the state, the value and the waiting queue. The meaning of the state and value slots is obvious. The waiting queue is used to record the tasks that have become blocked because they need to wait until the placeholder has a value. When the placeholder gets determined, the tasks that are in the waiting queue are transferred to the work queue because they are now ready to run. When a task touches an undetermined placeholder it is suspended and added to the placeholder's waiting queue. The processor is now idle and must find a new task to run from the work queue. When the blocked task later resumes (inside the TOUCH), the placeholder's value is fetched and returned.

2.5.3 Scheme Encoding

A Scheme encoding of these algorithms is given in Figure 2.4 and the definition of the support procedures is given in Figure 2.5. Note that the code in Figure 2.4 is schematic and does not address all atomicity issues.

Idle is the procedure that is run by processors in need of work. When the program starts up, all processors call idle, except for the single processor that is running the root task. Idle continually tries to remove a ready task from the work queue.
When a task is found, resume-task is called. The task becomes the "current task" of that processor and it is restarted by calling its associated continuation. To implement TOUCH, each processor must keep track of its currently running task. It is assumed that each processor has a private storage area to store the currently running task. The procedures current-task and current-task-set! access this storage.

The future special form can be thought of as a derived form that expands into a call to make-FUTURE. Its only argument is a nullary procedure (a thunk) that contains the future's body. The expression (FUTURE expr) is really an abbreviation for the procedure call (make-FUTURE (lambda () expr)). Make-FUTURE first creates an undetermined placeholder to represent the body's value and then creates a child task. The child task is set up so that its continuation, when called by resume-task, will compute the value of the body by calling the thunk. The procedure end-body contains the work to be done after the body is computed. End-body calls test-and-determine! to determine the result placeholder with the body's value. Control then goes back to idle. Note that end-body signals an error when a placeholder is determined more than once. This might happen if a continuation captured by a call/cc in the body is invoked after the body has already returned.

Test-and-determine! is an atomic operation similar in spirit to the traditional "test-and-set" operation. It tests if a placeholder is determined and if it isn't, the placeholder gets determined to the second parameter and true is returned to indicate success. Otherwise the placeholder remains as is and false is returned. When a placeholder is determined, the tasks on its waiting queue are transferred to the work queue, thus making them runnable.

    (define (idle)
      (if (queue-empty? (work-queue))
          (idle)
          (resume-task (queue-get! (work-queue)))))

    (define (resume-task task)
      (current-task-set! task)
      ((task-continuation task)))

    (define (make-FUTURE thunk)
      (let ((res-ph (make-ph)))
        (let ((child (make-task (lambda () (end-body res-ph (thunk))))))
          (queue-put! (work-queue) child)
          res-ph)))

    (define (end-body res-ph result)
      (if (test-and-determine! res-ph (TOUCH result)) ; 1
          (idle)
          (error "placeholder previously determined")))

    (define (test-and-determine! ph val)
      (if (ph-determined? ph)
          #f
          (begin (determine! ph val) #t)))

    (define (determine! ph val)
      (ph-value-set! ph val)
      (ph-determined?-set! ph #t)
      (queue-append! (work-queue) (ph-queue ph)))

    (define (TOUCH x)
      (if (ph? x)
          (if (ph-determined? x)
              (ph-value x)
              (TOUCH-undet x)) ; 2
          x))

    (define (TOUCH-undet ph)
      (call-with-current-continuation
        (lambda (cont)
          (let ((task (current-task)))
            (task-continuation-set!
              task
              (lambda () (cont (if (ph? ph) (ph-value ph) ph)))) ; 3
            (queue-put! (ph-queue ph) task)
            (idle)))))

Figure 2.4: Scheme encoding of Multilisp core.

    Operations on queues:
      (queue-empty? q)              Tests if q is empty.
      (queue-get! q)                Removes and returns the item at q's head.
      (queue-put! q x)              Adds x to q's tail.
      (queue-append! q1 q2)         Transfers all items from q2 to q1's tail.

    Operations on placeholders:
      (make-ph)                     Creates and returns an undetermined placeholder.
      (ph? x)                       Tests if x is a placeholder.
      (ph-determined? ph)           Tests the state of ph.
      (ph-determined?-set! ph x)    Sets the state of ph.
      (ph-value ph)                 Returns the value of ph.
      (ph-value-set! ph x)          Sets the value of ph.
      (ph-queue ph)                 Returns the waiting queue of ph.

    Operations on tasks:
      (make-task c)                 Creates and returns a task whose continuation is c.
      (task-continuation t)         Returns t's continuation.
      (task-continuation-set! t c)  Sets t's continuation to c.

    Operations on the processor's local state:
      (current-task)                Returns the task currently running on the processor.
      (current-task-set! t)         Sets the task currently running on the processor to t.

    Other operations:
      (work-queue)                  Returns the work queue.

Figure 2.5: Procedures needed to support Multilisp core.

Touching is implemented by TOUCH and TOUCH-undet. TOUCH-undet handles the case where the value to be touched is an undetermined placeholder. When an undetermined placeholder is being touched, the current task must be suspended and put on the placeholder's waiting queue. This is done by a call to call/cc which captures TOUCH's continuation. Note that since this continuation is guaranteed to be called at most once, a less general but more efficient version of call/cc could be used. The task is then put on the placeholder's waiting queue so that it can later be made runnable by test-and-determine!. As the current task is now blocked, control is transferred to idle to move on to some other piece of work. When the task is resumed, the placeholder's value will be returned to TOUCH's continuation.

2.5.4 Chasing vs. No Chasing

An interesting issue is whether placeholders should be allowed to be determined with other placeholders. If this is permitted, the touching of a placeholder must perform the recursive touching of its value. This chasing process can be expensive if the chain of placeholders is long. This happens in programs where the future bodies often return placeholders and placeholders are touched multiple times. The alternative strict method requires that placeholders be only determined with non-placeholders. The code in Figure 2.4 implements the strict method. A chasing implementation is obtained by removing the TOUCH on line 1, adding a TOUCH around line 2 and replacing line 3 by ph. The drawback of the strict method is that the number of blocked tasks will increase in the cases where chasing would be required. It may also restrict concurrency because it has an additional control dependency. None of these methods is clearly superior to the other in all contexts. Fortunately, both methods can coexist in the same system as long as the two types of placeholders are distinguished and the appropriate touching and determining mechanisms are called. Having two types of placeholders is useful to implement legitimacy (see Section 2.8.4). Unless otherwise noted, the strict method will be assumed because it is conceptually simpler (i.e. determined placeholders are guaranteed to have a non-placeholder value) and it gives a shorter code sequence for inline calls to TOUCH.

2.5.5 Critical Sections

Various implementation details have been omitted from the above description. One problem that must be addressed is the possible race conditions in these algorithms. Several processors may simultaneously attempt to mutate the work queue or a placeholder. To preserve the integrity of these data structures, some operations must appear to be mutually exclusive. This is usually done by introducing locks in the data structures to control access to them. Spin locks are sufficient because the critical sections consist of only a few instructions. The operations that must be protected are

1. Testing and removing a task from the work queue (when a processor is idle).
2. Adding a task to the work queue (when a future is evaluated).
3.
Checking the state of a placeholder and adding a task to a placeholder's waiting queue (when an undetermined placeholder is touched). 4. Changing the state and value of a placeholder (when a placeholder gets determined). Garbage collection adds another complication. If the value of placeholders is assumed to be immutable, it is perfectly valid to replace any reference to a determined placeholder by the placeholder's value. This optimization, called splicing , can in principle be done at any moment but usually it is performed by the garbage collector. The advantage of splicing is that subsequent calls to TOUCH will be faster because the dereferencing of the placeholder is avoided (this is particularly helpful to reduce the cost of chasing). Consequently, the implementation must prevent the splicing of the placeholder currently being manipulated. Several techniques are possible such as temporarily disabling the garbage collector or temporarily marking the placeholder as non-spliceable. The test at line 3 in TOUCH-undet is needed to account for the splicing of the touched placeholder. Aside from this test, the code in Figure 2.4 does not include the operations required to prevent splicing. 2.5.6 Centralized vs. Distributed Work Queue A potential source of ineciency in the scheduler is caused by the centralized work queue accessed by all processors. The contention for the work queue may become an important bottleneck as the number of processors is increased. Each access to the work queue is mutually exclusive so all operations on the work queue get sequentialized. The time it takes to add and remove a task from the work queue puts an upper bound on the rate at which tasks can be created and resumed. Clearly, it would be preferable if this rate scaled up with the number of processors. 2.6. FAIRNESS OF SCHEDULING 43 A common solution is to distribute the work queue. Each processor has its own work queue which it uses to make tasks runnable. These work queues are accessible from all processors. When a processor is looking for work, it rst looks for runnable tasks in its own work queue and goes on to search the work queue of other processors only if its work queue is empty. This reduces contention and remote memory trac and also improves locality since tasks restarted from the local work queue are likely to have been created locally. 2.6 Fairness of Scheduling Another important consideration is fairness of scheduling. In a fair system, a task's computation is guaranteed to progress as long as the task is runnable. In other words, there is a nite amount of time between a task becoming runnable and it actually running on a processor. Fairness can be implemented by preventing a task from running longer than a certain stretch of time (quantum ) without giving all other runnable tasks a chance to run as well. The scheduler e ectively cycles through all runnable tasks giving each of them a quantum of time to advance their computation. At regular time intervals all processors receive a preemption interrupt to signal that the quantum has expired. Upon receiving this interrupt, a processor suspends the currently running task, puts it at the tail of the work queue and then resumes the task at the head. In a system with a centralized work queue at least min(n; r) tasks are resumed every quantum (where n is the number of processors and r is the number of runnable tasks)6. 
2.6 Fairness of Scheduling

Another important consideration is fairness of scheduling. In a fair system, a task's computation is guaranteed to progress as long as the task is runnable. In other words, there is a finite amount of time between a task becoming runnable and it actually running on a processor. Fairness can be implemented by preventing a task from running longer than a certain stretch of time (a quantum) without giving all other runnable tasks a chance to run as well. The scheduler effectively cycles through all runnable tasks, giving each of them a quantum of time to advance its computation. At regular time intervals all processors receive a preemption interrupt to signal that the quantum has expired. Upon receiving this interrupt, a processor suspends the currently running task, puts it at the tail of the work queue and then resumes the task at the head. In a system with a centralized work queue at least min(n, r) tasks are resumed every quantum (where n is the number of processors and r is the number of runnable tasks).⁶ It follows that a task will start running in no more than ⌊r₀/n⌋ quantums, where r₀ is the number of runnable tasks at the time the task was made runnable. If r₀ does not vary much, the tasks will get an even share of the processors (roughly the power of n/r₀ processor per task if r₀ > n). In a system with a distributed work queue at least one task is resumed from every work queue every quantum. A task will thus start running in no more than q + 1 quantums, where q is the length of the local work queue at the time the task was made runnable. Thus, the processing power given to tasks residing on a processor is evenly distributed, but the processing power of tasks residing on different processors may be substantially different.

⁶ It is assumed that the quantum is large enough so that the effects of contention on the work queue are negligible.

The original Multilisp semantics [Halstead, 1985] had a scheduling policy that was fair as long as all tasks were of finite duration. The only guarantee made by the scheduler was that a runnable task would run if there were no other runnable tasks. Under the finite task assumption, this implies that all tasks will eventually run. Finiteness is a reasonable assumption for Multilisp programs since it is common to design parallel programs by annotating terminating sequential programs with futures. In sequential programs, all expressions evaluated correspond to mandatory work that needs to be done to compute the result of the program. Any execution order for the tasks will compute the correct result as long as it respects the basic ordering imposed by the strict operations.

However, there are special situations where true fairness is useful. Programs are sometimes organized around tasks that conceptually never terminate. One example is the client/server model where each task implements a particular service for some clients. Server tasks receive requests from the clients and send back a reply for each request serviced. Each server task is in an infinite receive-compute-respond loop. Without a fair scheduler, a set of server tasks could monopolize all the processors if they continually have requests to service. Other server tasks would never get a chance to run. A multi-user Multilisp system can be viewed as an instance of this model (the clients are the users and the server tasks are the read-eval-print loops).

Another application of fairness is to support speculative computation. A computation is speculative if it is not yet known to contribute to the program's result. Speculative computation arises naturally in search problems where multiple solutions may exist but only one is needed. Several search paths can be explored in parallel and, as soon as a solution is found, the search can be stopped. This form of computation, which Osborne [Osborne, 1989] calls multiple approach speculative computation, is known in parallel logic programming as OR-parallelism. If the likelihood of finding a solution in any given path is fairly similar, then it is reasonable to spend an equal effort searching each path. This is easily approximated by a fair scheduler which timeslices tasks from a centralized work queue. However, the solutions are typically not distributed equally among the search paths. The paths that are likely to lead quickly to a solution should be searched more eagerly than others.
Thus a system aimed at general speculative computation should provide some finer level of control over the scheduler (such as a mechanism to assign priorities to the speculative tasks). Because there is currently no consensus as to which level of control is best, this thesis does not investigate the implementation of such priority mechanisms. Fairness of scheduling plays a minor role in this thesis. Chapter 3 shows that lazy task creation can support fairness.

2.7 Dynamic Scoping

Multilisp uses static scoping as its primary variable management discipline. Static scoping has the advantage of clarity because the identity of a variable only depends on the program's local structure, not its runtime behavior. With the exception of global variables, a variable can only be accessed by an expression textually contained in the binding form that declares the variable.

Static scoping is not well suited for certain applications. Sometimes it is necessary to pass an argument to one or several procedures far down in the call tree (such as the default output port or the exception handler). Such arguments must either be passed in global variables or be passed as explicit arguments from each procedure to the next in the call chain. The first solution is not appropriate in a parallel system because of the possible conflict between tasks. The second solution clearly lacks modularity because each procedure must be aware of the arguments passed from parent procedures to all its descendants.

Dynamic scoping offers an elegant solution. A dynamically scoped variable can be accessed by any computation performed during the evaluation of the body of the binding form that declares the variable. In a sense, dynamic variables are implicit parameters to all procedures. The set of bindings (the dynamic environment) is passed implicitly by each procedure to its children in the call tree. A given binding is thus only visible in the call tree that stems from the binding form, with the exception of the subtrees where the binding is shadowed by a new binding of the same variable.

There are several possible constructs to express dynamic scoping. For the sake of simplicity two special forms are used here.⁷ The form (dyn-bind id val body) introduces a new binding of the dynamic variable id to the value val for the duration of the body. The form (dyn-ref id) returns the value of the dynamic variable id in the current dynamic environment. Note that id is not evaluated and that lexically scoped variables and dynamic variables exist in separate namespaces.

⁷ An obvious extension would be an assignment construct.

Figure 2.6 shows a typical use of dynamic scoping to implement a simple exception system. The dynamic variable EXCEPTION-HANDLER contains a single argument procedure that is called with an error message when an error is detected. The procedure catch-exceptions takes a thunk as argument and calls it in a dynamic environment where EXCEPTION-HANDLER is bound to the continuation of catch-exceptions. Thus, the call to the exception handler in raise-exception will immediately exit from catch-exceptions with the error message as its result (for example, the call (map-sqrt '(1 -2 5)) returns the string "domain error").
(define (catch-exceptions thunk)
  (call-with-current-continuation
    (lambda (abort)
      (dyn-bind EXCEPTION-HANDLER abort
        (thunk)))))

(define (raise-exception msg)
  ((dyn-ref EXCEPTION-HANDLER) msg))

(define (square-root x)
  (if (negative? x)
      (raise-exception "domain error")
      (sqrt x)))

(define (map-sqrt lst)
  (catch-exceptions
    (lambda () (map square-root lst))))

Figure 2.6: Exception system based on dynamic scoping and call/cc.

An implication of the above semantics is that dynamic environments are associated with continuations. All continuations carry with them the dynamic environment that was in effect when they were created (i.e. due to the evaluation of some subproblem call). When a continuation is invoked, the captured dynamic environment becomes the current dynamic environment. Dyn-bind creates a new dynamic environment for the evaluation of the body simply by adding a new binding to the current dynamic environment. This new binding remains in effect only for the duration of the body because the continuation invoked to exit the body (normally dyn-bind's continuation but possibly some continuation captured with call/cc outside the body) will restore the dynamic environment to the appropriate value. In implementation terms, this implies that each subproblem call must save the dynamic environment on the stack prior to the call and restore it upon return. Because the save/restore pair is added to all subproblem calls, this may result in an unacceptably high overhead.

Notice that in normal situations the dynamic environment does not actually change when a continuation is invoked. Only dyn-bind's continuation and continuations captured by call/cc might be invoked from a different dynamic environment. An alternative approach is thus to put the save/restore pair only around the evaluation of dyn-bind's body and around calls to call/cc. This approach offers more efficient subproblem calls but also has the unfortunate consequence that call/cc and dyn-bind are no longer properly tail-recursive. Call/cc's procedure argument and dyn-bind's body are not reductions because their continuation contains a new continuation frame.⁸ The loss of proper tail recursion for dyn-bind is probably not very troublesome (most Lisp systems implement the dynamic binding construct with similar save/restore pairs). However it is harder to justify for call/cc. To preserve call/cc's tail recursive property, call/cc can be redefined as shown in Figure 2.7. It is assumed that the state of the dynamic environment is maintained in a global data structure accessible through the procedures current-dyn-env and current-dyn-env-set!. The implementation exploits the invariant that procedures always invoke their implicit continuation with the same dynamic environment that existed when they were called. Thus a normal return from the call to proc in call/cc invokes the captured continuation with the correct dynamic environment. An abnormal return to cont is only possible by calling the closure passed to proc. This closure explicitly restores the correct dynamic environment before invoking the captured continuation.

⁸ The following procedure will thus run out of memory when it is called: (define (loop) (call-with-current-continuation (lambda (k) (loop))))

Parallel processing raises additional implementation issues. In order for the future construct's semantics to be as non-intrusive as possible, the dynamic environment used for the evaluation of the future's body should be the same as the one in effect when the future itself was evaluated. Consequently, the parent task must save the dynamic environment into the child task and the child task must restore this environment when it starts running. This adds an overhead to task creation, suspension and resumption.
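As a minimal sketch of this save/restore (illustrative only, not the thesis's actual code), the propagation could be layered on top of a make-FUTURE procedure and the current-dyn-env/current-dyn-env-set! operations introduced above; the name make-FUTURE-dyn is hypothetical.

(define (make-FUTURE-dyn thunk)
  (let ((parent-env (current-dyn-env)))       ; parent saves its dynamic environment
    (make-FUTURE
      (lambda ()
        (current-dyn-env-set! parent-env)     ; child restores it before running the body
        (thunk)))))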
Another issue is the representation of dynamic environments. A popular approach in uniprocessor Lisps is shallow binding. The environment is represented as a table of cells. Each cell holds the current value of a dynamic variable. A new binding is introduced by saving the current value of the cell on a stack and assigning the new value to the cell. Upon exit from the binding construct, the previous binding is restored by popping the old value off the stack. Thus dyn-bind and dyn-ref are constant time operations. However, saving the entire dynamic environment (i.e. the operation current-dyn-env) is expensive because it implies a copy of the binding table.

An alternative approach (shown in Figure 2.7) is deep binding. The dynamic environment is represented as a stack of bindings (i.e. an association list). Dyn-bind simply adds a new binding at the head of the list and dyn-ref searches the list for the most recent binding of the variable. Unfortunately the cost of dyn-ref is O(b), where b is the number of bindings in the environment. This may be expensive if b is large and the variables looked up are those that were bound early.⁹ On the other hand, current-dyn-env only requires a single pointer copy so the overhead for call/cc and task operations is minimal. Deep binding is adequate when dynamic variables are referenced infrequently, for example if their main purpose is to support the exception processing system. Yet another approach is to represent environments with 2-3 or AVL search trees, thus permitting O(log n) cost for dyn-bind and dyn-ref, where n is the number of variables bound in the environment, and constant cost for current-dyn-env and current-dyn-env-set!. It isn't clear which of these last two representations is most efficient in practice. The deep binding approach has been used in this work for simplicity, but the implementation strategies explained in the next chapter are equally applicable to the search tree representation.

⁹ Efficiency can be improved somewhat by adding a cache to hold the value of recently accessed variables (for example see [Rozas and Miller, 1991]).

(define (call-with-current-continuation proc)
  (primitive-call-with-current-continuation
    (lambda (cont)
      (proc (let ((env (current-dyn-env)))
              (lambda (val)
                (current-dyn-env-set! env)
                (cont val)))))))

The special forms dyn-ref and dyn-bind expand into:

(dyn-ref id)            →  (current-dyn-env-lookup 'id)

(dyn-bind id val body)  →  (begin
                             (current-dyn-env-push! 'id val)
                             (let ((result body))
                               (current-dyn-env-pop!)
                               result))

Definitions for deep binding:

(define (current-dyn-env-lookup id)
  (cdr (assq id (current-dyn-env))))

(define (current-dyn-env-push! id val)
  (current-dyn-env-set! (cons (cons id val) (current-dyn-env))))

(define (current-dyn-env-pop!)
  (current-dyn-env-set! (cdr (current-dyn-env))))

Figure 2.7: Implementation of dynamic scoping with tail recursive call/cc.

2.8 Continuation Semantics

Continuations also present special problems in a parallel setting. It isn't clear what the terminal continuation of a child task should be. This continuation is the one that is passed to the body of the future. In other words, what should be done with the value returned by the body? This is an important question because the approach chosen will specify the behavior of first-class continuations in the presence of futures.

2.8.1 Original Semantics

Several approaches have been proposed. In the original Multilisp definition [Halstead, 1985] the body's value was used to determine the placeholder created for the future and the task was simply terminated. This is the semantics implemented by the code in Figure 2.4.¹⁰

¹⁰ Multilisp was not designed to support first-class continuations so it isn't surprising that the original semantics does not interact well with them.
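Figure 2.4 is not reproduced at this point, but a plausible sketch of the original semantics, written with the same helpers as the MultiScheme version shown in Figure 2.8 below (make-ph, make-task, queue-put!, work-queue, test-and-determine!, idle), would look roughly as follows. The details are an assumption for illustration, not Halstead's actual code.

(define (make-FUTURE thunk)
  (let ((res-ph (make-ph)))
    ;; The placeholder to determine is embedded in the child's terminal
    ;; continuation (here, the call to end-body), not in the task object.
    (queue-put! (work-queue)
                (make-task (lambda () (end-body res-ph (thunk)))))
    res-ph))

(define (end-body res-ph result)
  (if (test-and-determine! res-ph (TOUCH result))
      (idle)
      (error "placeholder previously determined")))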
2.8.2 MultiScheme Semantics

MultiScheme adopted a subtly different model for continuations. The child task and placeholder created by a future are conceptually linked. The placeholder is called the goal of the task and the task is the placeholder's owner.¹¹ This linkage was introduced to permit the garbage collection of tasks. Finding the value of the future's body is seen as the task's sole reason of existence. Since the goal placeholder is the representative of this value, the owner task can safely be terminated if the placeholder is known to be unnecessary for the rest of the computation.

¹¹ The term "motivated task" was used in [Miller, 1987].

The implementation of this semantics is given in Figure 2.8. Note that the procedure make-task now takes two arguments: the continuation and the goal placeholder. Also note that end-body takes only one argument because the placeholder to determine implicitly comes from the task executing end-body (i.e. the current task).

(define (make-FUTURE thunk)
  (let ((res-ph (make-ph)))
    (let ((child (make-task (lambda () (end-body (thunk))) res-ph)))
      (queue-put! (work-queue) child)
      res-ph)))

(define (end-body result)
  (let ((res-ph (task-goal-ph (current-task))))
    (if (test-and-determine! res-ph (TOUCH result))
        (idle)
        (error "placeholder previously determined"))))

Figure 2.8: MultiScheme's implementation of the future special form.

The goal placeholder is now embedded in the child task instead of in the terminal continuation (as is done in the original semantics). This is an important distinction because a task can replace its current continuation by a completely different one by calling a continuation created by call/cc. However, the goal placeholder never changes. Interestingly, the original and MultiScheme implementations are equivalent in the absence of call/cc. This is because in such a case the only task that can execute a given continuation is the task created with that continuation. Taking the placeholder to determine from the continuation (as in the original semantics) or from the task object (as in MultiScheme) will give the same placeholder because of the one-to-one correspondence between continuations and tasks.

Figure 2.9 gives an example where the two implementations differ.

(define x
  (TOUCH (FUTURE1 (call-with-current-continuation
                    (lambda (k)
                      (+ 1 (TOUCH (FUTURE2 (k 0)))))))))

Figure 2.9: A sample use of futures and call/cc.

Here two tasks (T1 and T2) are involved in addition to the root task. The corresponding placeholders are Ph1 and Ph2. The call to call/cc binds k to T1's continuation. Thus, k corresponds to a call to end-body. With the original implementation of futures, k contains an implicit reference to Ph1. When T2 calls k, Ph1 gets determined to 0. Following this, the root task can return from the first TOUCH and consequently x gets bound to 0. Note that T1 is suspended indefinitely on the second TOUCH because Ph2 never gets determined. With MultiScheme's implementation of futures, a call to k determines the goal placeholder of the current task. Since it is T2 that is calling k, Ph2 gets determined to 0.
T1 then proceeds from the second TOUCH, adds 1 and calls k with 1 (the lambda-expression's body implicitly calls k). This time, it is T1 that is calling k, so Ph1 gets determined to 1. Finally, the root task can return from the first TOUCH, binding x to 1.

2.8.3 Katz-Weise Continuations

A nice feature of futures is that, in typical purely functional programs, they can be added around any expression without changing the result of the program. In other words, futures are equivalent to an identity operator when only the result of the computation is considered. Futures only affect the order of evaluation. This suggests an attractive mode of programming: first write a correct functional program without any futures and then explore various placements of futures to turn the program into an efficient parallel one.

Unfortunately the original and MultiScheme semantics for continuations do not permit this for all purely functional programs because inserting futures in a program that uses call/cc can alter the result computed. For MultiScheme, this should be clear from the previous example. For the original semantics all is fine as long as the future body's continuation is invoked at most once, including the normal return from the body. To explain what happens when the continuation is called multiple times, consider the contrived expression in Figure 2.10. In this expression, the continuation created by call/cc is called exactly twice.

(define x
  (let ((y (TOUCH (FUTURE (call-with-current-continuation
                            (lambda (k) k))))))
    (if (number? y)
        y
        (y 123))))

Figure 2.10: A future body's continuation called multiple times.

Assume for the moment that the TOUCH and FUTURE operations are not present. Y will get bound to the continuation created by call/cc, the continuation that takes a value and binds y to it. Since at this point y is not a number, the continuation is restarted with 123, thus binding y to 123. Since y is now a number it is returned and x gets defined to 123.

When TOUCH and FUTURE are present, an undetermined placeholder will be created and a child task created to evaluate the call/cc. The continuation captured here (i.e. k) corresponds to the task's continuation, that is a call to end-body. The placeholder will get determined to this continuation and, through the TOUCH, y gets bound to it. However, when this continuation is called an attempt is made to determine the placeholder a second time (this time with 123) and then to terminate the current task. This is clearly an error because a placeholder cannot represent more than one value, and deadlock would occur (since all tasks would have terminated).

An interesting implementation of futures that solves this problem was proposed by Katz and Weise [Katz and Weise, 1990]. The idea is to preserve the link between the future body's continuation and the future's continuation. On the first return to the body's continuation, the placeholder gets determined and the task is terminated (as in the original semantics). However, on every other return the body's continuation acts exactly like the future's continuation, as if the future had never existed.

2.8.4 Katz-Weise Continuations with Legitimacy

Unfortunately, this approach does not solve all interaction problems between first-class continuations and futures. It is still possible to write purely functional programs that do not return the same value when futures are added. Consider the program in Figure 2.11, which is a simplified form of exception processing.
If the future special form is not present, a value of 0 is returned because the call (abort 0) is done first, bypassing the body of the let and the binding of dummy. With the future, a child task is created to evaluate (abort 0) and the parent task implicitly returns 1 to abort.

(call-with-current-continuation
  (lambda (abort)
    (let ((dummy (FUTURE (abort 0))))
      1)))

Figure 2.11: Exception processing with futures.

Each task exits the call/cc with its own belief of the result: the parent task with 1 and the child task with 0. In general, this means that multiple tasks may return to the program's root continuation. One of these tasks has the right result (i.e. the same result as a sequential version of the program), but which task? Choosing the first task to arrive at the program's root continuation is not a valid technique because of the race condition involved.

The solution proposed in [Katz and Weise, 1990] introduces the concept of legitimacy. A particular sequence of evaluation steps (a thread) is legitimate if and only if it is executed by the sequential version of the program. Legitimacy is thus a characteristic that depends on the control flow of the program. It can be derived from 1) the fact that the root thread is legitimate and 2) the causality rules inherent in the sequential subset of the language. In particular, if a thread is legitimate and it returns from expr with the value v, then the thread corresponding to the execution of expr's continuation with the value v is also legitimate. This rule naturally extends to the future special form by attaching legitimacy to tasks: after a child task is spawned by (FUTURE expr), the parent task is legitimate if and only if the corresponding placeholder gets determined by a legitimate task. The parent task's legitimacy is thus equal to the legitimacy of the task that gets to determine the placeholder. Note that the child task inherits the legitimacy of its parent at the moment of the task spawn.

As an example, consider the following program which involves three tasks (T1, T2, and the root task Troot)

(let* ((x (FUTURE1 expr1))
       (y (FUTURE2 expr2)))
  expr3)

After spawning the tasks T1 and T2 the root task will evaluate expr3. The root task is legitimate if and only if the first task to return from expr2 is legitimate. This fact can be expressed by the constraint

    Legit(Troot) = Legit(Det(PhT2))

That is, the legitimacy of the root task is equal to the legitimacy of the task that determines the placeholder created for T2. Similarly, task T2 is legitimate if and only if the first task to return from expr1 is legitimate

    Legit(T2) = Legit(Det(PhT1))

In the event that it is T2 that returns first from expr2 (i.e. Det(PhT2) = T2), the root task's legitimacy will become equal to the legitimacy of the first task returning from expr1. That is

    Legit(Troot) = Legit(T2) = Legit(Det(PhT1))

This illustrates that a task's legitimacy at a given point in time is represented by a chain that models the legitimacy dependencies inferred up to that point. Initially the links between tasks are unknown and, as tasks terminate (and determine placeholders), the links get filled in. The gaps in the chain correspond to future bodies that have not yet returned normally. Abnormal exits from the body of a future can create independent chains that never get connected to the legitimate chain. Note that there is at all times exactly one legitimate task in the system.
All other tasks can be viewed as speculative tasks because there is no guarantee that they actually contribute to the computation at hand. At the moment of its death, the legitimate task will turn one of the speculative tasks into the legitimate task.

2.8.5 Implementing Legitimacy

An implementation of the Katz-Weise semantics with legitimacy is shown in Figure 2.12. The legitimacy chain is conveniently implemented with placeholders. Each task has a legitimacy flag represented by a placeholder. The root task is initially legitimate so its legitimacy flag is a non-placeholder. When a child task is created its legitimacy flag is taken from the parent task. Since the parent task is going to invoke the future's continuation, its legitimacy flag is replaced by a newly created undetermined placeholder, leg-ph, which represents the as yet unknown legitimacy of the first task to return from the future's body (which might not be the child). Leg-ph must also be embedded in the body's continuation. When this continuation is returned to, which corresponds to a call to end-body, the result placeholder gets determined and the legitimacy chain is extended by unifying leg-ph with the current task's legitimacy flag.

(define (make-FUTURE thunk)
  (call-with-current-continuation
    (lambda (k)
      (let ((res-ph (make-ph))
            (leg-ph (make-ph))
            (parent (current-task)))
        (let ((child (make-task (lambda () (end-body k res-ph leg-ph (thunk)))
                                (task-legitimacy parent))))
          (task-legitimacy-set! parent leg-ph)
          (queue-put! (work-queue) child)
          res-ph)))))

(define (end-body k res-ph leg-ph result)
  (if (test-and-determine! res-ph (TOUCH result))
      (begin
        (determine! leg-ph (task-legitimacy (current-task))) ; 1
        (idle))
      (k result)))

(define (speculation-barrier)
  (TOUCH (task-legitimacy (current-task))))

Figure 2.12: The Katz-Weise implementation of futures.

2.8.6 Speculation Barriers

A straightforward use of legitimacy is to prevent speculative tasks from terminating the program and only allow the legitimate task to do this. This speculation barrier can be accomplished simply by touching the task's legitimacy flag at the program's terminal continuation. Conceptually, this touch walks down as far as it can in the task's legitimacy chain and blocks until the task is known to be legitimate. Only the legitimate task is allowed to proceed beyond the touch; the other tasks are suspended indefinitely.

Using a speculation barrier at the very tail of a program guarantees that the correct result will be returned, but it does little to prevent speculative tasks from consuming processing resources. It is possible to add speculation barriers at well chosen places in the program to limit the extent of speculative parallelism. Even though this reduces the amount of parallelism in the program, it may yield a more efficient program because a higher proportion of the time will be spent doing mandatory work. A case where this might be useful is given in Figure 2.13. For simplicity, it is assumed that map processes the values from head to tail.¹²

¹² The Scheme language does not impose a particular ordering.

(define (map-sqrt lst)
  (call-with-current-continuation
    (lambda (abort)
      (map (lambda (x)
             (FUTURE (if (negative? x)
                         (abort x)
                         (sqrt x))))
           lst))))

(define (map-sqrt-with-barrier lst)
  (let ((result (map-sqrt lst)))
    (speculation-barrier)
    result))

Figure 2.13: An application of speculation barriers.

For each value in the list, map-sqrt spawns a task to compute the square root of the value and returns a list of the results. In a sequential version of the program (i.e.
if the future is absent), the first negative value is returned by map-sqrt. In the parallel version, the root task and all tasks processing negative values will return from map-sqrt. Map-sqrt-with-barrier obtains the same result as the sequential version by using a speculation barrier after the call to map-sqrt. Only the task processing the first negative value will be legitimate and will cross the barrier. Since this task bypasses the determining of its result placeholder, its parent's legitimacy flag will remain undetermined forever. All the tasks spawned by the parent and its children after the legitimate task will have undetermined legitimacy flags. Consequently, these tasks will get suspended when they reach the barrier.

2.8.7 The Cost of Supporting Legitimacy

The cost of supporting legitimacy is an important issue. Speculation barriers are certainly useful to express some programs, but many programs have no need for them, in particular those that only contain mandatory tasks. Consequently, it is important to evaluate the cost of supporting legitimacy in both contexts.

For programs which contain speculation barriers, one concern is the space occupied by tasks suspended at barriers. A careful study of Figure 2.12 reveals that these tasks are only retained if they might become legitimate. These tasks are suspended on leg-ph, which is only accessible through the child's terminal continuation. In the previous example (Figure 2.13) this continuation was discarded when abort was called by the child. Since leg-ph is unreachable it will eventually get garbage collected along with the tasks suspended on it. On the other hand, if the child's continuation had been saved prior to the call to abort (by calling call/cc and saving the continuation away), it would not be possible to garbage collect the suspended tasks because leg-ph would still be reachable. This is clearly the correct behavior since any number of the suspended tasks could still become legitimate (for example, if the saved continuation is invoked by the legitimate task).

Two other costs are legitimacy testing and propagation. The cost of legitimacy propagation is particularly important because it is paid even by programs that do not use legitimacy (or that use it infrequently). In Figure 2.12, the current task's legitimacy placeholder is propagated directly to the next task in the chain (line 1 in end-body). Legitimacy propagation is thus constant cost but legitimacy testing can be expensive. A program which spawns n mandatory tasks, thus creating a legitimacy chain with n placeholders, will require O(n) time to test legitimacy at the program's termination (the task spawning strategy, whether it is a sequential loop or a DAC loop, is irrelevant).

Another approach is to touch the current task's legitimacy (on line 1) before propagating it to the next task. In other words, the task waits to be legitimate before marking the next task as legitimate. Legitimacy testing is then constant cost but legitimacy propagation is expensive for two reasons: it is inherently sequential and it produces frequent task switches. Because of the touch, a particular legitimacy placeholder in the chain can only be determined after the previous legitimacy placeholder has been determined. This implies that the last task will at best be marked as legitimate Ω(n) time after the first task.
Also, any task terminating before its predecessor in the chain will have to be suspended and eventually resumed, just to set the next legitimacy placeholder.

A better strategy is to shrink the legitimacy chain as the computation progresses. All the links in the chain will have to be followed but this can be done in parallel. The method uses a "collapse" operation that walks a chain of placeholders and returns its tail element (i.e. either an undetermined placeholder or a non-placeholder). This operation is added to line 1 so that the current task propagates its collapsed legitimacy chain to the next task. Nothing is gained if a task terminates before its predecessor but if it terminates afterwards, one or more links in the chain will get removed for the benefit of the successor tasks.

But how frequently will it be possible to collapse the chain? Clearly, the order of task termination has a direct influence on the collapsing of the chain. An important case to consider is fork-join parallel algorithms, which impose a strict termination order on tasks. In fork-join algorithms, a parent task P sequentially spawns a certain number of children (C1 to Ck) and later touches the result of the children before terminating. In the absence of collapsing, the legitimacy chain corresponds to a postfix walk of the spawning tree. Figure 2.14 illustrates this for two fork-join procedures (fj1 and fj2).

(define (fj1 n)
  (if (= n 0)
      1
      (let* ((l (FUTURE (fj1 (- n 1))))
             (r (fj1 (- n 1))))
        (+ (TOUCH l) r))))

(define (fj2 n)
  (if (= n 0)
      1
      (let* ((l (FUTURE (fj2 (- n 1))))
             (r (FUTURE (fj2 (- n 1)))))
        (+ (TOUCH l) (TOUCH r)))))

Figure 2.14: Fork-join algorithms and their legitimacy chain in the absence of chain collapsing. (The spawning-tree diagrams that accompany fj1 and fj2 in the original figure are not reproduced here.)

Each node corresponds to a task in the spawning tree. The nodes are numbered according to a postfix walk of the tree (the left child is spawned first) and the arrows represent links of the legitimacy chain (e.g. task 8 is legitimate if task 7 is legitimate). Note that the link coming out of task i + 1 is only filled in when task i terminates. Due to the fork-join nature of the program, all tasks in the spawning tree rooted at task i will have terminated when task i terminates. This implies that when task i terminates, all links of the legitimacy chain enclosed in task i's spawning tree are known and can be collapsed. In the worst case, this collapsing will stop at Li, the leftmost task in task i's spawning tree. In other words, task i will set task i + 1's legitimacy link to Li. But, as shown in Figure 2.15, if i = Cj (i.e. i is the j-th child of P), then either i + 1 = P or i + 1 = LCj+1. It follows that the collapsing of the links in the legitimacy chain between P and LP takes at most k sequential steps after all children are done. Given that the spawning of the children by P takes Θ(k) time anyway, the cost of propagating legitimacy does not change the complexity of the program. There is only a constant overhead per task created. This overhead is rather low since it amounts to following one link of the legitimacy chain per task spawned. This result holds for any fork-join algorithm regardless of how well balanced the spawning tree is (including the fork-join DAC procedures fj1 and fj2 above as well as the "linear" fork-join procedure pmap in Section 2.4.2).

Figure 2.15: General case of legitimacy chain collapsing for fork-join algorithms. (The diagram is not reproduced here.)
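To make the collapse operation concrete, here is a minimal sketch of one way it could be written. It is an illustration only, not the thesis's actual code, and it assumes the accessors placeholder?, determined? and ph-value for placeholders.

(define (collapse leg)
  ;; Follow the chain of determined placeholders and return its tail element:
  ;; either an undetermined placeholder or a non-placeholder legitimacy value.
  (if (and (placeholder? leg) (determined? leg))
      (collapse (ph-value leg))
      leg))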
2.9 Benchmark Programs

In order to guide the design process and provide a basis for evaluating and comparing the performance of the implementation strategies, it is important to identify the salient characteristics of the target applications. Following common practice, a set of benchmark programs was selected as representative of "typical" applications of Multilisp. These benchmark programs are used throughout the thesis for various evaluation purposes.

The biggest flaw of these benchmarks is their small size. Real applications will probably be much longer and more complex. Characteristics such as locality of reference, paging, task granularity and available parallelism may be substantially different. Small programs are no substitute for the real thing. They can only serve as rough models of real applications. The main advantage of small programs is that they usually stress a well defined part of the system, so the measurements can be interpreted more readily.

Both sequential and parallel benchmarks were used. The sequential benchmarks are mostly taken from the Gabriel suite [Gabriel, 1985] which has traditionally been used to evaluate implementations of Lisp. To these benchmarks were added four sequential benchmarks: compiler (the Gambit compiler), conform (a type checker), earley (a parser) and peval (a partial evaluator). These are sizeable programs that achieve some useful purpose (compiler contains more than 15,000 lines of Scheme code). Note that for some measurements it was not possible to run compiler due to lack of memory.

There are twelve parallel benchmarks. Half of these were originally written in Mul-T by Eric Mohr as part of his PhD thesis work [Mohr, 1991]. To these were added a few classical parallel programs (matrix multiplication, parallel prefix and parallel reduction) and programs based on pipeline parallelism (polynomial multiplication and quicksort). A general description of the parallel benchmarks is given next. None of the benchmarks require the Katz-Weise continuation semantics or legitimacy (Chapter 5 evaluates their cost in another way). Appendix A contains some additional details including the source code and compilation options. Appendix B contains execution profiles for the benchmarks. These indicate the activity of the processors as a function of time, thus allowing a better visualization of the program's behavior.

2.9.1 abisort

This program sorts n = 16384 integers using the adaptive bitonic sort algorithm [Bilardi and Nicolau, 1989]. This algorithm is optimal in the sense that, on the PRAM-EREW¹³ theoretical model, it runs in O((n log n)/p) time, where p is the number of processors and 1 ≤ p ≤ 2⌊log n · log n⌋. To achieve this performance, abisort stores the sequence of elements in a bitonic tree, which is a full binary tree with the property that many elements can be logically exchanged by a small number of pointer exchanges.

¹³ Parallel Random Access Machine with Exclusive Read Exclusive Write memory.
To sort a tree, both subtrees are first sorted recursively in parallel and then they are merged. The advantage of this algorithm over mergesort is that the merging of bitonic trees can be done in parallel. Both the recursive sorting phase and the merging phase are based on parallel fork-join DAC algorithms. Abisort puts high demands on the memory interconnect because it frequently references and mutates the shared bitonic tree data structure.

2.9.2 allpairs

This program computes the shortest paths between all pairs of n = 117 nodes using a parallel version of Floyd's algorithm. The input is a square distance matrix D where Dij is the length of the edge between nodes i and j. The algorithm goes through n steps, each of which updates D in place based on its current state. At the beginning of the kth step, Dij represents the length of the shortest path from i to j that does not go through any node greater than or equal to k. The update operation consists of replacing, for each possible i and j, Dij by Dik + Dkj if that value is smaller. Since Dkk is always 0, neither row k nor column k of D will change during the kth step. Consequently, all update operations of a given step can be done concurrently. Parallelizing both the loop on i and the loop on j would have resulted in an unnecessarily fine task granularity, so only the outermost of the two loops was done in parallel (by a parallel fork-join DAC loop). The computation thus consists of a sequence of 117 steps, each of which contains 117 tasks. The execution profile for this program looks like a "comb" where each "tooth" corresponds to one step of the outer loop. Allpairs has the coarsest task granularity and the highest run time of all the benchmarks.

2.9.3 fib

This program computes F25, the 25th Fibonacci number, using the straightforward (but obviously inefficient) doubly recursive algorithm. It is a very compute intensive benchmark which does not reference any heap allocated data. Fib is interesting to examine because it can serve as a model for fine grain fork-join DAC algorithms. Fib has the finest task granularity of all the benchmarks. The spawning tree is fairly bushy but is not perfectly balanced. The imbalance follows the golden ratio: each subtree has roughly 62% more tasks on the fat branch than on the other branch.

2.9.4 mm

This program multiplies two matrices of integers (50 by 50). The standard algorithm with three nested loops is used. All these loops can be parallelized but only the two outermost loops were turned into parallel fork-join DAC loops. The program thus involves 2500 fairly coarse grain tasks, each of which is in charge of computing one of the entries in the result matrix.

2.9.5 mst

This program computes the minimum spanning tree of an n = 1000 node graph. A parallel version of Prim's algorithm is used. The input is a symmetric distance matrix D where Dij is the length of the edge between node i and node j. The algorithm constructs the minimum spanning tree incrementally in n − 1 steps. It starts with a set of nodes containing a single node and at each step it adds to this set the node not yet in the set that is closest to one of the nodes in the set. In order to find the closest node quickly, each node not yet in the set remembers the shortest edge that connects it to the set. This "shortest connecting edge" must be recomputed when a new node is added to the set.
The kth step is a loop over n − k nodes that first recomputes each node's shortest connecting edge based on the last node added to the set and then finds the shortest of these edges. Mst performs this loop in parallel using a parallel fork-join DAC loop. Note that the degree of parallelism decreases with time (this is clearly visible in the execution profile). The kth step involves n − k tasks.

2.9.6 poly

This program computes the square of a 200 term polynomial of x (with integer coefficients). The resulting polynomial is then evaluated for a certain value of x. This ensures that the computation of all coefficients has finished. Polynomials are represented as a list of coefficients. The product of two polynomials P and Q with coefficients (P1, ..., Pn) and (Q1, ..., Qm) is obtained by first computing the product of P and Q' = (Q2, ..., Qm) and then adding the result, shifted by one position, to P scaled by Q1. [The original diagram, not reproduced here, shows the unfolded recursion for computing R = PQ when n = 4 and m = 3: the rows of partial products P·Q3, P·Q2 and P·Q1 (terms Pi Qj) are added with successive one-position shifts to yield the coefficients R1 to R6.]

This algorithm is coded with two loops. The inner loop does the operations corresponding to a row in the above diagram. It combines the scaling and summing operations in a single multiply-and-add step. The result of the inner loop is the list of coefficients to be added by the next row. Poly exploits the parallelism available in the inner loop in a way similar to the procedure pmap of Figure 2.2. The multiply-and-add step corresponding to Pi Qj is done after spawning a task to process the rest of row j. Consequently, there is one task per multiply-and-add step. Moreover, the processing of the rows is pipelined (the processing of row j + 1 can start before the processing of row j is finished). An alternative algorithm is to spawn a task for each coefficient of R. Task k computes

    Rk = Σ P(k−j+1) Q(j),   where j ranges from max(1, k−m) to min(k, m)

Because it spawns fewer tasks (O(n + m) instead of O(nm)), this algorithm is probably more efficient. However, the first algorithm was chosen because it is more representative of applications with fine grain pipeline parallelism.

2.9.7 qsort

This program sorts a list of 1000 randomly ordered integers using a parallel version of the Quicksort algorithm. The list's head element is used to construct two sublists with the remaining elements: a list of the smaller values and a list of the not smaller values. The two partitions are then sorted in parallel. The partitioning procedure uses a pipeline parallelism technique similar to the procedure pmap. The beginning of the partition is available to the continuation before the rest of the list has been partitioned. This means that the sorting of the partition can start as soon as the first element of the partition is generated. Although there are more efficient parallel sorting algorithms (e.g. abisort), qsort is interesting to consider because it combines pipeline parallelism and DAC parallelism.

2.9.8 queens

This program computes the number of solutions to the n-queens problem, with n = 10. It is based on a recursive procedure which, given a placement of k queens on the first k rows, computes the number of legal ways the remaining n − k queens can be placed (a queen must not be on the same row, column or diagonal as another queen).
For each valid position of a queen on row k + 1, the procedure spawns a task that calls the procedure recursively with the new placement. The number of solutions in each branch is finally summed up. Bit vectors are used to efficiently encode the current placement of queens. As a consequence, queens does not access any heap allocated data structure. The call tree is not well balanced. Most branches of the search tree lead to dead ends quickly. Queens is a good model for combinatorial search problems such as the traveling salesman problem and the searching of game trees.

2.9.9 rantree

This program models the traversal of a random binary tree with on the order of 32768 nodes. The branching factor is 50%. This means that the subnodes of a node are uniformly distributed between the left and right branches. The average length of the paths from the root is 36. Path length roughly follows a normal distribution extending from a length of 1 to a length of 73. Like queens, rantree uses fork-join DAC parallelism, it does not access any heap allocated data and the call tree is not well balanced.

2.9.10 scan

This program computes the parallel prefix sum of a vector of 32768 integers. The vector is modified in place. A given element is replaced by the sum of itself and all preceding elements in the vector. Scan is based on the "odd-even" parallel prefix algorithm, illustrated by the following example (for the vector 0 to 7)

    initial vector:                            0   1   2   3   4   5   6   7
    after the first step:                      0   1   2   5   4   9   6  13
    after the recursive parallel prefix sum
    on the elements with an odd index:         0   1   2   6   4  15   6  28
    after the final step:                      0   1   3   6  10  15  21  28

The first step is to sum every element at an odd index with its immediate predecessor. The parallel prefix algorithm is then applied recursively to the subvector consisting of the elements with an odd index. Finally, every element with an even index is summed with the preceding element (if it exists). When the recursion is unfolded, this algorithm consists of two passes over the vector using tree-like reference patterns. In the Multilisp encoding, the first pass is performed by the combining phase of a parallel fork-join DAC loop whereas the second pass is performed by the dividing phase of a second parallel fork-join DAC loop. These two passes are clearly visible on the execution profile.

2.9.11 sum

This program computes the reduction (using +) of 32768 integers stored in a vector. A parallel fork-join DAC algorithm is used. The vector is logically subdivided in two, both halves are then processed recursively in parallel and finally the two resulting sums are added. Sum is the finest grain program that accesses heap allocated data. It serves as a model for fine grain data parallel computations such as the reduction of a set of values or the mapping of a function on a set of values.
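As an illustration of this fork-join DAC pattern in the FUTURE/TOUCH style used throughout this chapter, such a reduction could be sketched as follows. This is not the benchmark's actual source (which is in Appendix A); the name psum and the range handling are illustrative.

(define (psum v lo hi)
  ;; Sum the elements v[lo..hi-1] by splitting the range in two, spawning a
  ;; task for one half and computing the other half directly.
  (if (< (- hi lo) 2)
      (if (= lo hi) 0 (vector-ref v lo))
      (let* ((mid   (quotient (+ lo hi) 2))
             (left  (FUTURE (psum v lo mid)))
             (right (psum v mid hi)))
        (+ (TOUCH left) right))))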
2.9.12 tridiag

This program solves a tridiagonal system of 32767 equations. The computation proceeds in two sequential phases: the reduction of the system by the method of cyclic reduction [Hockney and Jesshope, 1988] followed by backsubstitution. Cyclic reduction takes a tridiagonal system of order n = 2^k − 1 (i.e. n equations over the variables x0 to x(n−1)) and produces a reduced tridiagonal system of order (n+1)/2 − 1. For each odd numbered equation i, the equations i−1, i and i+1 are combined in such a way as to eliminate variables x(i−1) and x(i+1). The resulting equation only contains variables x(i−2), x(i) and x(i+2), as shown here

Tridiagonal system:

    0              + B0 x0          + C0 x1          = Y0
    A1 x0          + B1 x1          + C1 x2          = Y1
    A2 x1          + B2 x2          + C2 x3          = Y2
    A3 x2          + B3 x3          + C3 x4          = Y3
    A4 x3          + B4 x4          + C4 x5          = Y4
    ...
    A(n-2) x(n-3)  + B(n-2) x(n-2)  + C(n-2) x(n-1)  = Y(n-2)
    A(n-1) x(n-2)  + B(n-1) x(n-1)  + 0              = Y(n-1)

Reduced system:

    0               + B1' x1         + C1' x3        = Y1'
    A3' x1          + B3' x3         + C3' x5        = Y3'
    ...
    A(n-2)' x(n-4)  + B(n-2)' x(n-2) + 0             = Y(n-2)'

The reduction process is applied to the reduced system until a single equation of the form b x((n+1)/2 − 1) = y is obtained (this takes k − 1 reductions). Note that because equation i will not be needed later it can be replaced by the new equation (in other words, the k − 1 reductions produce an equivalent set of n − 1 equations). The solution for x((n+1)/2 − 1) is then backsubstituted to find the value of x((n+1)/4 − 1) and x(3(n+1)/4 − 1) and so on recursively. After k backsubstitutions, the value of all variables is obtained. The backsubstitution is implemented with a single tree-like DAC method. The reductions could be directly parallelized by performing a sequence of k − 1 parallel fork-join DAC loops, but tridiag uses a clever tree-like method that has fewer synchronization constraints.

2.10 The Performance of ETC

The main problem with ETC is the high cost of manipulating heavyweight tasks. This section evaluates the best performance that can be expected of ETC for typical programs. The total work performed by a Multilisp program when run on an n processor machine (i.e. the product of the run time and n) is

    Ttotal(n) = Tseq · Oexpose · Oexploit(n)

Tseq, Oexpose and Oexploit(n) all depend on the program. Tseq corresponds to the run time of a sequential version of the program (the parallel program with futures and touches removed). The overhead of parallelism is split into two components.¹⁴ Oexpose represents the overhead of exposing the parallelism to the system. It reflects the extra work performed by the futures and touches in the program with respect to the sequential version. The product Tseq · Oexpose is thus the run time of the parallel program on one processor (i.e. Tpar). The extra work is the sum of the costs for each future and touch executed by the program

    Oexpose = 1 + ( Σ(i=1..Nfuture) Tfuture_i  +  Σ(i=1..Ntouch) Ttouch_i ) / Tseq

Nfuture and Ntouch are respectively the number of futures and touches evaluated by the program. Tfuture_i and Ttouch_i are respectively the cost of the i-th future and touch operations when only one processor is being used. In general, the costs of these operations are not constant because they depend on several factors including the task scheduling order (which might vary from one run to the next), the compiler's ability to generate special case code for the operation given its particular location in the program, and the complexity of the task to be created, suspended or resumed. For evaluating best case performance, it is useful to define a minimum cost for futures and touches: Tfuture_min and Ttouch_min respectively.

¹⁴ Overheads are expressed as multipliers. An overhead of x indicates that the amount of work (or other measure) is larger by a factor of x. Consequently, an overhead below 1 indicates a decrease. The term "an overhead of x%" is used to denote small overheads. It means an overhead of 1 + x/100.
This leads to the following lower bound on Oexpose (expressed as a function of Tfuture_min and the program's granularity)

    Oexpose ≥ 1 + (Nfuture · Tfuture_min + Ntouch · Ttouch_min) / Tseq ≥ 1 + Tfuture_min / G

G is a measure of the program's granularity. It is the average amount of computation performed by each task (G = Tseq / Nfuture).

The second part of the parallelism overhead, Oexploit(n), indicates how well the program's parallelism is exploited by the system. It corresponds to the additional work performed when running the parallel program on an n processor machine. Oexploit(n) contains the following costs not present in Oexpose: memory interconnect contention and processor starvation (i.e. lack of tasks to run). Processor starvation depends both on the program's degree of parallelism and on the scheduler's speed at assigning runnable tasks to idle processors. In addition, Oexploit(n) reflects the variation in scheduling order which might cause an increase or decrease in the number of tasks suspended and resumed. By definition, Oexploit(1) = 1.

In ETC, Tfuture_min is relatively high. If it is assumed that all tasks created eventually run and terminate, Tfuture_min is the cost of creating, starting and terminating a heavyweight task.¹⁵ The bare minimum work caused by the evaluation of a future corresponds to the following sequence

(1) Creating a closure for the future's body.

In make-FUTURE:
(2) Creating the result placeholder, associated lock and waiting queue.
(3) Creating the child's initial continuation.
(4) Creating the child task object.
(5) Locking the work queue.
(6) Enqueuing the child on the work queue.
(7) Unlocking the work queue.

In idle:
(8) Locking the work queue.
(9) Dequeuing the child from the work queue.
(10) Unlocking the work queue.
(11) Restoring the child's continuation.

In determine!:
(12) Locking the result placeholder.
(13) Setting the placeholder's value and determined? flag.
(14) Checking for suspended tasks to reactivate.
(15) Unlocking the placeholder.

¹⁵ All tasks terminate in programs with mandatory tasks (those that perform all the work of their sequential counterpart). This is the case for all the parallel benchmarks.

This sequence does not include the operations for dynamic scoping, Katz-Weise continuation semantics and legitimacy. A few tricks can be used to improve the efficiency of this sequence. The heap allocations of steps 1 through 4 can be combined to reduce the cost of checking for heap overflow. In fact, nothing prevents the closure, placeholder, task object and initial continuation from being the same physical object. This reduces the effectiveness of garbage collection (all objects are retained for as long as any of them is reachable) but it does lessen the object formatting overhead. The use of local work queues also permits some optimization of the locking and unlocking of the work queue. To simplify steps 2 and 13 and the touch operation, a special value can be assigned to the placeholder's value slot to indicate that it is undetermined. Even with all these optimizations, the sequence and the associated control flow instructions will translate into a moderate number of instructions, probably around 50 to 100 machine instructions.
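As a rough worked example of the bound (using the measurements of Table 2.1 below and assuming, as later in this section, a 3 MIPS processor, so that 75 instructions correspond to about 25 μs):

    For fib:  G = Tseq / Nfuture = 0.819 s / 121392 ≈ 6.7 μs
              Oexpose ≥ 1 + Tfuture_min / G ≈ 1 + 25/6.7 ≈ 4.7

which matches the value listed in the 25 μs column of the table.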
The performance of previous implementations of ETC seems to confirm this lower bound. The Mul-T system was carefully designed to minimize the cost of ETC [Kranz et al., 1989]. When run on an Encore Multimax, Mul-T requires roughly 130 machine instructions to implement the sequence (the actual cost depends on the number of closed variables, their location, etc.). Other compiler based systems require even more instructions. Portable Standard Lisp on the GP1000 [Swanson et al., 1988] takes 480 μsecs (about 1440 instructions given that each processor delivers 3 MIPS) and QLisp on an Alliant FX/8 [Goldman and Gabriel, 1988] takes 1400 instructions.

With this lower bound on Tfuture_min it is possible to get a lower bound on Oexpose from the value of G. The left part of Table 2.1 gives the value of G, Tseq, Nfuture and Ntouch measured for the benchmark programs when run on the GP1000 with a single processor. The benchmarks have been ordered by increasing granularity. Note that the number of futures is equal to the number of touches for all benchmarks based on fork-join parallelism (all benchmarks except qsort and poly). The right part of the table gives the lower bound on Oexpose computed from G and various values of Tfuture_min.

                 G in   Tseq in                      Lower bound on Oexpose when Tfuture_min in μsec is
    Program      μsecs    secs   Nfuture   Ntouch        1      5     25     125     625
    fib              7    .819    121392   121392     1.15   1.74   4.71   19.54   93.69
    sum             12    .392     32767    32767     1.08   1.42   3.09   11.45   53.24
    qsort           16    .210     13318    27637     1.06   1.32   2.59    8.94   40.71
    scan            16   1.061     65534    65534     1.06   1.31   2.54    8.72   39.60
    queens          31   1.092     34814    34814     1.03   1.16   1.80    4.99   20.93
    rantree         42    .394      9487     9487     1.02   1.12   1.60    4.01   16.05
    abisort         44   4.734    106496   106496     1.02   1.11   1.56    3.81   15.06
    poly            58   2.308     39801    40200     1.02   1.09   1.43    3.16   11.82
    mst             94  23.414    249001   249001     1.01   1.05   1.27    2.33    7.65
    tridiag        161   3.958     24574    24574     1.01   1.03   1.16    1.78    4.88
    mm             624   1.558      2499     2499     1.00   1.01   1.04    1.20    2.00
    allpairs      1831  24.852     13572    13572     1.00   1.00   1.01    1.07    1.34

    Table 2.1: Characteristics of parallel benchmark programs running on GP1000.

According to this table, an optimized version of ETC (i.e. one with Tfuture_min = 25 μsecs = 75 machine instructions) will have an overhead that spans a range from essentially nonexistent to fairly sizeable. As the granularity decreases, the overhead increases and almost reaches a factor of 5 for fine grain programs. This overhead is a conservative estimate. Mul-T's implementation of ETC gives a measured value of Oexpose = 8.9 for fib [Mohr, 1991].

Whether this is an acceptable overhead or not for "typical" programs is of course a subjective matter. However, it is clear that a high overhead for fine grain programs will have an impact on the style of programming adopted by users. There will be a high incentive to design programs with coarse grain parallelism even if there exists a natural fine grain solution. Frequently it is possible to manually transform a fine grain program into a coarser grain program by grouping several small tasks into a single one that executes them sequentially (this is akin to unrolling loops by hand in sequential languages to reduce the loop management overhead). This type of transformation has several drawbacks. If the task grouping is artificial, the program
BACKGROUND 70 (define (fib n) (if (< n 2) n (+ (fib2 (- n 1)) (fib2 (- n 2))))) (define (fib n) (if (< n 2) n (let* ((x (FUTURE (fib (- n 1)))) (y (fib (- n 2)))) (+ (TOUCH x) y)))) File: "fib.elog" (define (fib2 n) (if (< n 2) n (let* ((x (FUTURE (fib (- n 1)))) (y (fib (- n 2)))) (+ (TOUCH x) y)))) Processors: 32 File: "fib-unroll.elog" Processors: 32 100 100 80 80 60 60 % % 40 40 20 20 0 0 0 5 10 10 20 interrupt 15 30 working 20 40 25 50 idle 60 touch 30 70 35 80 determine 40 90 0 msec 100 stealing % 0 0 10 20 10 30 20 interrupt 40 50 30 working 60 70 80 40 50 idle 90 100 110 60 touch 120 70 130 140 80 determine 150 90 160 msec 100 % stealing Figure 2.16: Fib and a poor variant obtained by unrolling the recursion. becomes more complex and harder to maintain. An overhead cost must also be expected if task grouping is managed dynamically by user code (as is the case for the depth and height cuto methods proposed for tree-like computations by Weening [Weening, 1989]). The transformation is also error prone. Logical bugs as well as performance problems can be introduced by the user. For example, the recursion of fib can be unrolled once as shown in Figure 2.16 to double the task granularity. One might expect the program to be more ecient because of the lower task management overhead but in reality it performs poorly because a sequential dependency has been introduced (this can be seen clearly in the execution pro les). Finally, the program will be less portable because the selection of an appropriate granularity depends on several parameters of the run time environment (number of processors, task operation costs, shared memory costs, etc). The problem with a high task management cost is not so much that it prevents the user from attaining good performance. The problem is that the language cannot realistically be viewed as a high-level language because the user must program at a low-level to attain good performance. Selecting the right granularity for a program can quickly become the user's overriding concern. The next chapter explores a more ecient approach to task management called lazy task creation . The cost of evaluating a future with this approach is very small (Tfuture min on the order of 1 sec on the GP1000). Table 2.1 can be used to approximate 2.10. THE PERFORMANCE OF ETC 71 the overhead of this approach. The nest grain program (i.e. fib) should have a value of Oexpose close to 15%. Note that the table gives a lower bound and that the actual overhead will be somewhat larger. Chapter 5 contains the measured value of Oexpose for the benchmarks. With such a small overhead, the user has virtually no incentive to avoid ne grain tasks and thus has added liberty in the programming styles that can be used. 72 CHAPTER 2. BACKGROUND Chapter 3 Lazy Task Creation Several plausible semantics for Multilisp were compared in the preceding chapter. The Katz-Weise semantics with legitimacy is attractive because it provides an elegant interaction between futures and continuations. In addition, dynamic scoping and fairness of scheduling are desirable features. Unfortunately, ETC is not an adequate implementation of futures because its performance is poor on ne grain programs. This chapter explores lazy task creation (LTC), an alternative task creation mechanism that is more ecient than ETC; especially for ne grain programs. The LTC mechanism described here supports the Multilisp semantics given above. Two variants of LTC are examined: one that assumes an ecient shared memory and one that does not. 
As con rmed in Chapter 5, both variants have roughly the same performance when consistent shared memory is ecient but when this is not the case, for example on large scale multiprocessors, the later variant permits a more ecient execution (faster by as much as a factor of 2 on the TC2000). In this chapter, algorithms are given in pseudo-C. Assembly code is also used to explain the details of the code sequences generated by the compiler. 3.1 Overview of LTC Scheduling This section explains the scheduling policy adopted by LTC and its bene ts. Task execution order has a direct impact on performance. The implementation must choose an ordering that minimizes the task management overheads. There are 73 CHAPTER 3. LAZY TASK CREATION 74 four places where an implementation has liberty as to which task to run next     Task spawning. Task termination. Task suspension. Preemption interruption. Only the rst two situations are examined here (the last two are discussed in later sections). Any runnable task can be run next in these four situations. However, only the subsets of runnable tasks that are most promising are considered in the following discussion. In particular, the task to run next is preferentially selected from the local work queue because this will promote locality and reduce contention. When the local work queue is empty a task must be stolen from another processor's work queue. Task stealing is the only way for work to get distributed between processors. The two processors involved in a task steal are the thief processor and the victim processor. When a task is spawned, one of two tasks can be run next by the spawning processor: the child task or the parent task. The ETC implementation described in the preceding chapter uses parent rst scheduling. When a future is evaluated the child task is made to wait for an available processor whereas the parent task immediately starts executing the future's continuation. LTC uses the reverse scheduling order, child rst scheduling. The child's execution is started immediately by the spawning processor and the parent is delayed until a processor is ready to run it. The use of child rst scheduling in Multilisp has important advantages. First, it tends to reduce the number of task suspensions caused by touches. The child is computing a value that is used by the future's continuation. Since the parent gets delayed with respect to the child there is a higher likelihood that the child will have completed when its result is rst touched by the parent or one of its other descendants. When a task terminates however, there is no incentive to delay its parent any further. In fact, now that the task's result is known, it makes sense to execute the parent next. Since the parent consumes the value just computed, it is less likely that it will get suspended. This policy will be called parent next scheduling. Child rst scheduling combines naturally with parent next scheduling to give an ecient stack-like scheduling policy: LIFO scheduling. The set of runnable tasks on a processor is kept in a stack, the task stack , associated with that processor (see Figure 3.1). The main operations available on the task stack are: task push, task pop, and 3.1. OVERVIEW OF LTC SCHEDULING 75 PUSH 6 ?POP Youngest task Oldest task Figure 3.1: The task stack. task steal. When a task is spawned, the parent is simply pushed onto the task stack and control goes to the child. 
When a task terminates, the parent is necessarily on top of the task stack if it hasn't been run yet (this assumes that processors can steal but cannot push a task onto another processor's task stack). If the parent is still there, it gets popped from the task stack and executed by the same processor that pushed it. LIFO scheduling yields a task execution order very similar to that of the program with futures removed. In fact, the execution order is identical when no task is ever stolen from the task stack. This happens for example when the machine has a single processor or when all processors have enough local work to keep them busy. In this situation, there are no task suspensions because the only computation that might touch the task's placeholder (i.e. the continuation) necessarily follows the termination of the task. 3.1.1 Task Stealing Behavior Under LIFO scheduling, tasks could be stolen from either end of the task stack. Tasks are always stolen from the task stack's bottom in LTC. It is interesting to see why this bottom stealing is preferable to top stealing . Top stealing might seem better for the same reason as child rst scheduling. Favoring the execution of younger tasks should reduce the likelihood of suspension in older tasks. However, this analysis does not take into account that older tasks generally run longer before termination or suspension than younger tasks. For DAC programs with balanced spawning trees, the task size will decrease geometrically with the task stack depth. When a child task is pushed onto the task stack, the amount of work it contains is a fraction (f ) of the amount remaining in the parent1. Thus, in a DAC program, the 1 The amount of work remaining in a task is all the work remaining before its termination including the work contained in the tasks that it will spawn. In a well balanced binary DAC program, such as 76 CHAPTER 3. LAZY TASK CREATION ith removed child from a task has f i times the work of that task and collectively, a task d+1 P d and the d descendants below it on the task stack have i=0 f i = 1 1f f the amount of work. This means that the amount of work in the oldest task is approximately equal to that of its youngest d0 = d + 1 logf (1 f ) descendants2 . Consequently, the amount of work Toldest remaining in the oldest task is equal to the work in all other tasks on the task stack except a constant number of the oldest tasks. The task stealing overhead will be higher for top stealing because it requires at least d0 times more task steals than bottom stealing to distribute Toldest units of work. In reality, the number of steals will be higher than d0 because the victim is continuously replenishing the task stack with small tasks as the thief is stealing them. The probability of stealing a task close to the leaves of the spawning tree is relatively high. Individual task steals are also faster with bottom stealing because there are two nearly independent ways to access the task stack. A processor can push or pop a task from its local task stack while some other processor is simultaneously stealing a task. This parallelism, which is no more than a degree of 2, enables tasks to be created and started faster. In addition, better caching of the task stack top is possible because it is single writer shared data (as opposed to multiple writer shared data for top stealing). Mohr [Mohr, 1991] has analyzed the task stealing behavior of bottom stealing for tree-like DAC parallel programs. 
He has derived an upper bound of p2 h task steals for programs with binary spawning trees of height h running on a machine with p processors. This upper bound relies on the use of polite stealing . In polite stealing a processor whose last steal was from victim V must try to steal from all other processors before stealing again from V . An outline of Mohr's proof follows. At any given point in time, a processor i is either idle (and is trying to steal a task) or is in charge of running the tasks in some subtree of the spawning tree. Call hi the height of processor i's subtree (hi = 0 when it is idle) and H the maximum height of all subtrees (H = maxpi=1 hi ). After a task is stolen from processor i, both the victim and the thief will be in charge of subtrees of height hi 1. Note that to decrease H by one it is necessary to steal a task from all processors i with hi = H . Polite stealing guarantees that all these processors will have been tried by a given processor in no more than p steals (or steal attempts). Because up to p processors might be attempting to steal tasks, it will take no more than p2 steals to steal at least one task from each processor with hi = H . When H reaches zero no tasks are left to steal. Consequently, no more sum, f will be close to 21 . For fib, which has an imbalanced spawning tree, f is about :618. An f close P to 1 approximates loop based parallel algorithms such as pmap. 1 f d+1 1 f k+1 = f k+1 2 This result is obtained by solving for d0 = d k in 1 = d i 1 f 1 i=k+1 f = 1 f f d+1 . f 3.1. OVERVIEW OF LTC SCHEDULING 77 than p2 h steals can occur. In the absence of polite stealing O(2h ) steals can occur (potentially all tasks are stolen). Although polite stealing insures the upper bound of p2h steals it isn't clear that this makes a di erence in practice. Mohr ran programs with and without polite stealing for a wide range of values of h and p. The number of steals was comparable (usually within 10% to 30%) and only in extreme cases was there a noticeable advantage to use polite stealing (a factor of 2 to 3 for high h and p). Gambit uses polite stealing with the particularity that each processor has a probing order generated randomly when the system is loaded. This was done in an e ort to reduce interference between competing thief processors. With a sequential probing order there is a potential loss of parallelism because several thieves might become synchronized, following each other in lockstep. 3.1.2 Task Suspension Behavior Bottom stealing also leads to fewer task suspensions. To simplify the analysis, it is assumed that tasks touch the value of their children just before termination and that there are only two processors. When bottom stealing, Toldest time units will elapse before the rst touch that might cause a suspension. The d0 youngest tasks are not a ected by the steal so in this time period they will have a suspension-free execution. When f  21 there is necessarily no task suspension because all the descendants have terminated when the touch is performed. A single suspension occurs when f > 12 and the steal happened not too late after the rst descendant was spawned. When top stealing, there are d0 tasks (at least) that might suspend in the same time period. The likelihood of suspension increases with the depth of the task due to a combination of two factors. First, deeper tasks have less work and second, it is faster to remove tasks from the local task stack than to steal them from other processors (the costs are respectively Tlocal and Tsteal ). 
Let Ttask be the amount of work remaining in the stolen task and Tchild the work remaining in its currently running child. The stolen task will terminate (or get suspended) in Tsteal + Ttask time whereas its parent will touch its value in Tchild + Tlocal + Ttask f time (the processor will nish executing the child and then locally resume the stolen task's parent). A suspension occurs in either of the following cases 1. Tsteal + Ttask < Tchild ) stolen task gets suspended T task 2. Tsteal + Ttask > Tchild + Tlocal + f ) stolen task's parent gets suspended CHAPTER 3. LAZY TASK CREATION 78 The second case is highly likely for ne grain DAC programs because, as the depth of the task increases, Ttask and Tchild become negligible when compared to Tsteal and it is always the task closest to the leaves of the spawning tree that is being stolen. 3.2 Continuations for Futures Continuations play a central role in the implementation of futures. A task's state is mostly composed of a continuation. In addition, the Katz-Weise semantics as de ned in Figure 2.12 requires that the future's continuation be captured and shared between the child and parent tasks. Consequently, the eciency of continuation operations and futures are intimately tied. This section describes the implementation of continuations on top of which LTC will be implemented. Conceptually, a continuation is a chain of frames. Each frame corresponds to some subproblem call that is currently pending completion. A frame contains the context required to perform the computation that follows the corresponding subproblem call. The frame includes temporary values and variables (or alternatively an environment pointer) and also contains a parent continuation. The parent continuation is used when the procedure containing the subproblem call exits (by a normal return or a reduction call). This link is what gives the stack structure to continuations. Note that in some situations the parent continuation is never used and could be removed from the frame by a smart compiler3 . For simplicity, it is assumed that the parent continuation is always present in the frame. The oldest frame's parent is the root continuation which is special in that it has no parent. The root continuation symbolizes the end of the program. Several strategies for implementing continuations have been described and compared by [Clinger et al., 1988]. Their results suggest that the incremental stack /heap strategy is more ecient than the other strategies in most cases and not noticeably slower than the other strategies in extreme cases. With the exception of a few details, this is the strategy used by Gambit. 3 This is permissible if the subproblem call is done inside an in nite loop. For example, in the following de nition, the frame for the subproblem call to g need not contain f's continuation because f never returns. (define (f) (g) (f)) 3.2. CONTINUATIONS FOR FUTURES 79 3.2.1 Procedure Calling Convention Since continuations are manipulated at every procedure call and return, it is important to have ecient support for these common operations. The incremental stack/heap strategy puts very few constraints on procedure calling conventions. This means that the presence of unlimited extent continuations in the language does not impose a special runtime overhead4. Parameters can be passed in any location (typically in registers and/or on the stack) and a procedure can return simply by jumping to the return address passed to the procedure by the caller. 
Within a procedure, the stack can be used freely to allocate temporary values and local variable bindings. Continuation frames, created at subproblem calls, are always allocated from the run time stack (as is normally done for other languages). The procedure that allocated a frame is responsible for its deallocation from the stack. Deallocation occurs at some point before the procedure is exited (by a normal return or a reduction call). This insures that at the subproblem call's return point, the continuation frame created for the call is still topmost on the stack. A procedure's continuation is thus a combination of two values: the return address and the value of the stack pointer. Note that the return address passed to a procedure is always contained in any continuation frame it creates. 3.2.2 Unlimited Extent Continuations This implementation can be extended to support unlimited extent continuations. The continuation is split into two parts. The most recently created frames of a continuation are on the stack and the oldest frames reside in the heap. This situation is depicted in Figure 3.2 (where frame i is created by procedure pi and reti is the return address into pi). The implicit continuation passed to a procedure is represented by a triplet: (SP,RET,UNDERFLOW_CONT). The stack pointer SP points to the topmost frame on the stack and the return register RET contains the return address5 . UNDERFLOW_CONT corresponds to the heap continuation and it contains two elds: link (a pointer to the 4 Note that the semantics of continuations in Scheme require that there be only one instance of any variable allocated. To support this, it is common to create a cell in the heap for each mutable variable. The extra dereference needed to access mutable variables adds an overhead whose importance will depend on the program. However, there is no overhead for functional programs. 5 RET could also be passed on the stack but it is simpler to think of it as being contained in a dedicated register. Gambit actually dedicates a register for the return address. CHAPTER 3. LAZY TASK CREATION 80 SP HEAPIFICATION RET @ R @ SP ret5 A 6 5 .. ? ret4 4 .. ? ret3 STACK ..... ..... ... (p1) (p2) (p3) (p4) (p5) (p6) ... (p2) ... (p3) ... (p4) ... (p5) ... (p6) ...) ...) ...) ...) ...) ...) ..... ret5 @ @ R 6 5 HEAP ret4 ret1 ? 6 HEAP ? q ... .... @ @ ret2 2 q q ? UNDERFLOW CONT @ @..... @ @ .... q q q 6STACK ? ? underflow @ R @ A U A UNDERFLOW CONT 3 (define (define (define (define (define (define RET under flow ? Continuation on entry to p6 . SP RET ret2 @ @ R 6 2 STACK underflow ? UNDERFLOW CONT ret1 @ @ R q q q UNDERFLOW Figure 3.2: Continuation representation and operations. 6HEAP ? 3.2. CONTINUATIONS FOR FUTURES 81 topmost heap frame) and ret (the return address for the topmost heap frame). Note that the stack frames are only linked conceptually; in reality they are allocated contiguously on the stack. On the other hand, heap frames are independent objects in a format suitable for garbage collection and explicit links between them are maintained. The link between the stack frames and the heap frames is preserved in a special way. This link is traversed when a procedure returns to its continuation and the stack is empty. This is called a stack under ow . When the stack under ows, the topmost heap frame must be copied back to the stack so that the return point can access the content of the continuation frame in a normal manner. This is the only frame that is immediately needed. 
The older heap frames get restored one at a time by subsequent under ows. A special mechanism is used to avoid having to check explicitly for stack underow at every procedure return. The return address logically attached to the oldest stack frame is stored in UNDERFLOW_CONT.ret. In its place, the continuation frame contains a pointer to the under ow handler . This handler consequently gets called by the normal procedure return mechanism when the stack under ows. The handler performs the following sequence of steps: the correct return address is extracted from UNDERFLOW_CONT.ret, the topmost heap frame is copied to the stack, UNDERFLOW_CONT is updated to represent the parent heap frame, the return address in the stack frame is replaced by the under ow handler to prepare it for under ow, and nally control is returned to the correct return address. The cost for an under ow is thus dependent on the frame size which in typical cases is fairly small. For example, the largest frame size for the parallel benchmarks is 10 slots and the average, measured statically, is just below 4. An under ow should thus be fairly cheap for these programs (between 10 and 20 instructions if the under ow handler and heap frame format are chosen carefully). 3.2.3 Continuation Heapi cation Heap continuations are created by the process of heapi cation . Heapi cation transforms the current continuation into one that only contains heap frames. The stack frames are transferred one by one to the heap with the appropriate links between them. The oldest stack frame must be handled specially. When it is copied, its return address is rst recovered from UNDERFLOW_CONT.ret and its parent link is obtained from UNDERFLOW_CONT.link. Finally, the stack is cleared by resetting SP to the bottom of stack, and RET and UNDERFLOW_CONT are updated to re ect the new location of the continuation. The current continuation before and after heapi cation are logically equiva- CHAPTER 3. LAZY TASK CREATION 82 lent; only the representation changes. 3.2.4 Parsing Continuations One complication with the under ow and heapi cation mechanisms is that it must be possible to parse the stack to know where each frame begins and ends, and also which frame slot contains the return address6 . One way to achieve this is to associate the description of a frame's layout (length and return address location) with the return address of the subproblem call that created the frame. The frame descriptor can for example be stored just before the return point, as is done in [Hieb et al., 1990]. RET can then be used to get the size of the topmost stack frame and the location of its return address. The return address in this frame in turn gives the size of the next frame and so on. The heapi cation and under ow mechanisms can now be described in detail. The algorithms are given in Figure 3.3. In these algorithms two functions are used to parse the continuation: frame_size(r) and ret_adr_offs(r) return respectively the size and return address o set of the continuation frame associated with return address r. It is assumed that all data structures grow towards higher addresses and that, in all drawings, addresses grow towards the top of the page. 3.2.5 Implementing First-Class Continuations First-class continuations can easily be implemented with the heapi cation mechanism. Call/cc rst heapi es its implicit continuation and then packages up UNDERFLOW_CONT in a new closure. 
When called, this closure discards the current continuation by resetting SP to the bottom of stack, restores the new continuation by setting UNDERFLOW_CONT to the saved value, and then jumps to the under ow handler to transfer control to the return point. Support for dynamic scoping is a simple addition to this mechanism. The current dynamic environment is saved in the closure at the moment of the call/cc and is restored just before jumping to the under ow handler. Heapi cation might seem to be doing more work than strictly required by call/cc. By leaving the stack in its original state after its content is copied to the heap, some returns would become cheaper because the restoration of the frames by the under ow mechanism would be avoided. However, new costs in space and time would be introduced 6 The ability pro lers. to parse the stack is also useful to implement introspective tools such as debuggers and 3.2. CONTINUATIONS FOR FUTURES typedef struct frm { struct frm *link; value slots[]; } frame; 83 /* heap frame format /* parent frame pointer /* content of frame */ */ */ value *SP; instr *RET; struct { frame *link; instr *ret; } UNDERFLOW_CONT; underflow() { frame *f = UNDERFLOW_CONT.link; instr *r = UNDERFLOW_CONT.ret; for (i=0; islots[i]; } */ */ SP += frame_size(r); /* update stack pointer */ jump_to( r ); /* jump to return point */ /* check for empty stack */ /* clear stack */ SP = bottom_of_stack; RET = underflow; heapify_frame( s, r ) value *s; instr *r; { value *b = s - frame_size(r); frame *f = alloc( frame_size(r) ); instr *p = b[ret_adr_offs(r)]; } /* copy frame to stack UNDERFLOW_CONT.link = f->link; /* prepare for underflow UNDERFLOW_CONT.ret = SP[ret_adr_offs(r)]; SP[ret_adr_offs(r)] = underflow; heapification() { if (RET != underflow) heapify_frame( SP, RET ); } /* get topmost heap frame */ /* get return address */ /* compute frame's base /* allocate heap frame /* get parent ret adr */ */ */ if (p == underflow) /* oldest frame? b[ret_adr_offs(r)] = UNDERFLOW_CONT.ret; else heapify_frame( b, p ); */ for (i=0; islots[i] = b[i]; f->link = UNDERFLOW_CONT.link; /* copy frame content */ /* link frame to parent */ UNDERFLOW_CONT.link = f; UNDERFLOW_CONT.ret = r; /* update UNDERFLOW_CONT */ Figure 3.3: Under ow and heapi cation algorithms. CHAPTER 3. LAZY TASK CREATION 84 since there could now be multiple copies of the same stack frame. This occurs when multiple continuations which share the same tail are captured. Programs with nested calls to call/cc, such as those typically found in backtracking algorithms and exception processing, exhibit this behavior. As an example, consider this de nition for f (define (f n) (if (zero? n) 0 (+ 1 (call-with-current-continuation (lambda (cont) (f (- n 1))))))) Note that the call (f n) calls call/cc n times. If there are k stack frames in the continuation for the call (f n), n(k + n+1 2 ) heap frames will be created. The sharing properties of heapi cation are much better because there is at most one heap copy of any continuation frame. In the example, only k + n heap frames will be created (a savings of a factor of O(n)). The same reasoning holds for nested futures when they are implemented with call/cc (as is the case for the implementation of the Katz-Weise semantics shown in Figure 2.12). 3.3 The LTC Mechanism An important bene t of combining LIFO scheduling and bottom stealing is that it promotes stack-like execution. 
For fork-join DAC programs, entire subtrees of the spawning tree get executed in an uninterrupted stack-like fashion because it is the older tasks that get stolen (those closer to the spawning tree's root). Since the tasks in these subtrees are exactly those that are not stolen, they will be called non-stolen tasks. Stack-like execution stops only when the oldest non-stolen task terminates (the one at the non-stolen subtree's root). LTC presupposes that this stack-like execution is the predominant execution order. In other words, LTC speculates that most tasks are not stolen. Several task spawning steps are only required if the task is stolen. Referring to Figure 2.12, these steps include: the heapi cation of the parent continuation (the call to call/cc), and the creation and manipulation of the task's result and legitimacy placeholders (the calls to make-ph). LTC postpones these steps until it is known that the task is stolen (this explains the name \lazy task creation"). In summary, non-stolen tasks completely avoid these steps whereas stolen tasks perform these steps when the task is stolen. To achieve this, LTC uses a lightweight task representation. When a future is 3.3. THE LTC MECHANISM 85 evaluated, a lightweight task representation of the parent task is pushed on the task stack. The task stack push and pop operations, which are the only operations needed for a purely stack-like execution, can be implemented at a very low cost with this representation. Moreover, there is enough information in a lightweight task to recreate the corresponding heavyweight task object if the task is ever stolen from the task stack. The rest of this section is a more detailed description of the LTC mechanism. The important issue of synchronization between the thief and victim is discussed in the section that follows. 3.3.1 The Lazy Task Queue The task stack is represented by a group of three stack-like data structures: the run time stack, the lazy task queue (LTQ), and the dynamic environment queue (DEQ). The same terminology as [Mohr, 1991] has been used when possible for consistency. The term lazy task refers to a task in the lightweight representation (i.e. a task contained in the task stack). These three data structures are really double ended queues which are mostly used as stacks. Items can be pushed and popped from the tail of these queues. Items can also be removed from the head. For eciency, the entries are laid out contiguously in memory. For the LTQ and DEQ, two pointers indicate the extent of the queue (the head and tail). The run time stack contains the continuation frames of all the tasks in the task stack. The LTQ and DEQ contain pointers to continuation frames in the run time stack. The DEQ, which is only needed to support dynamic scoping, is explained in Section 3.3.4. The purpose of the LTQ is to keep track of each lazy task's continuation. For each lazy task in the task stack there is exactly one pointer on the LTQ. Each pointer points to the rst continuation frame of the corresponding future's continuation. The \before" part of Figure 3.5 shows a possible state of the LTQ and run time stack on entry to procedure p9 after a call to procedure p1 (define (define (define (define (define (define (define (define (define (p1) (p2) (p3) (p4) (p5) (p6) (p7) (p8) (p9) ... (p2) ...) ... (p3) ...) ... (FUTURE (p4)) ...) ... (p5) ...) ... (FUTURE (p6)) ...) ... (p7) ...) ... (FUTURE (p8)) ...) ... (p9) ...) ...) 
The LTQ's TAIL points to the youngest entry on the LTQ whereas HEAD points just below the oldest entry. Thus, the LTQ is non-empty if and only if HEAD < TAIL. Otherwise, 86 CHAPTER 3. LAZY TASK CREATION the LTQ is empty and HEAD = TAIL. The same is true for the DEQ with the pointers DEQHEAD and DEQTAIL. 3.3.2 Pushing and Popping Lazy Tasks The task stack's push and pop operations translate into a small number of steps. When a future is evaluated, the thunk representing the future's body is called as a subproblem. The continuation frame created on the run time stack for this call corresponds to the rst frame of the parent task's continuation. To indicate the presence of the parent task on the task stack, a pointer to the continuation frame (i.e. SP) is pushed on the LTQ (thereby incrementing TAIL) upon entering the thunk. This pointer is used by the steal operation to recreate the parent task. The processor has e ectively queued the parent on the task stack and is now running the child. When the thunk returns, the LTQ is either empty (indicating that the parent was stolen), or not (indicating that the parent is still on the LTQ). If the LTQ is not empty, the parent task gets resumed in parent next fashion. Note that at this point both SP and the topmost pointer on the LTQ point to the parent's continuation frame. To pop the parent task it is sucient to place an instruction that decrements TAIL at the subproblem call's return point. After decrementing TAIL, the processor has e ectively terminated the child and resumed the parent. The body's result has been transferred from the child to the parent without having to create a placeholder. Moreover, legitimacy propagation cost nothing because the parent task's legitimacy before and after executing the child are identical. A single legitimacy ag, CURRENT_LEGITIMACY, is needed per processor. It logically corresponds to the legitimacy of the task currently running on that processor. Similarly, each processor has a CURRENT_DYNAMIC_ENV variable that is always bound to the dynamic environment of the currently running task. There is no need to change this variable when a lazy task is pushed or popped from the task stack. The handling of a stolen parent is explained in the next section. It would seem that most of the work to push a task on the task stack goes into two operations: the creation of the closure for the body and the creation of the continuation frame. However, these operations do not really constitute an important overhead with respect to a purely sequential execution of the program. Firstly, it isn't necessary to heap allocate the closure because its single call site is known. It is more ecient to lambda-lift the closure so that the closed variables are passed to the body as parameters. Frequently, these variables are already in registers so they can be left as is for the body to use. As shown in Table 3.1, most of the 3.3. THE LTC MECHANISM Program abisort allpairs fib mm mst poly qsort queens rantree scan sum tridiag 87 Number of closed variables for each future and number copied 3 (0), 1 (0) 7 (3) 1 (0) 7 (3), 6 (2) 5 (1) 3 (0) 3 (0) 6 (2) 3 (0) 5 (1), 4 (0) 4 (0) 4 (0), 3 (0) Table 3.1: Size of closure for each future in the benchmark programs. benchmarks require little or no work to setup the closed variables for the body because they are already in registers (Gambit does a good job at allocating variables to registers). 
A system could be designed to avoid any copying by directly accessing the closed variables in the parent continuation frame. However, this would create dependencies between frames which are hard to manage (in particular, heapi cation would become more complex and expensive because the frames can't be separated). Secondly, the continuation frame created by the future can be reused by the future's body. Futures are typically subproblems and have a procedure call as their body (all the futures in the benchmarks are like this). A sequential version of the program would create a continuation frame for the call, just before the procedure is invoked. The same continuation frame is created by the future but there is no need to create another frame for the call in the body since it is now a reduction call. The only di erence is that the frame is created before the arguments to the procedure are evaluated rather than afterwards but the cost will be the same. CHAPTER 3. LAZY TASK CREATION 88 resume_task( t ) task *t; { CURRENT_TASK = t; UNDERFLOW_CONT.link = CURRENT_TASK->cont_link; UNDERFLOW_CONT.ret = CURRENT_TASK->cont_ret; CURRENT_DYNAMIC_ENV = CURRENT_TASK->cont_denv; result_location = CURRENT_TASK->cont_val; CURRENT_LEGITIMACY = CURRENT_TASK->leg_flag; SP = bottom_of_stack; TAIL = bottom_of_LTQ; HEAD = bottom_of_LTQ; DEQTAIL = bottom_of_DEQ; DEQHEAD = bottom_of_DEQ; underflow(); } Figure 3.4: Resuming a heavyweight task. 3.3.3 Stealing Lazy Tasks When a thief processor steals a lazy task from a victim processor's task stack, it removes the oldest entry on the LTQ (thereby incrementing HEAD) and then must do three things: recreate the parent task as a heavyweight task object, notify the victim so that it knows the oldest lazy task is no longer on the task stack, and nally resume the parent task. A heavyweight task is represented with a structure containing ve elds      cont_link cont_ret cont_denv cont_val leg_flag The rst four elds describe the task's continuation. Cont_link is a pointer to the continuation frames in the heap, cont_ret is the continuation's return address, cont_denv is the continuation's dynamic environment, and cont_val is the value passed to the continuation when the task is resumed. The fth eld, leg_flag, is the task's legitimacy ag. Resuming a heavyweight task is performed by the steps in Figure 3.4. Note that variables are local to the processor unless explicitly marked otherwise (the notation P ->v , where P is a processor, will be used to denote P 's local variable v ). Thus, resume_task rst sets the processor's current task and, after initializing the task stack, uses the under ow mechanism to restore the task's continuation. The value in 3.3. THE LTC MECHANISM 89 is passed to the continuation by setting result_location. It is assumed that all continuations, including those for futures, receive their result in this location (result_location is a machine register in Gambit). This restriction could be lifted by parameterizing the result location by the return point, that is UNDERFLOW_CONT.ret (this would require adding a eld to the frame descriptor). cont_val Figure 3.5 will help illustrate the e ect of a steal on the LTQ and run time stack. The pointer p removed from the victim's LTQ points to the rst continuation frame of the corresponding task (frame 3 in the gure). To ease its manipulation, the task's continuation is rst heapi ed from this continuation frame down to the next frame having the under ow handler as its return address. 
This is achieved by the call heapify_frame( p, r ) where r corresponds to the return address associated with frame p (i.e. ret3 in the example). In addition, r must be replaced by a pointer to the under ow handler so that the child invokes UNDERFLOW_CONT when it is done. An important issue is how to locate r from p but for now this operation will be hidden in the procedure swap_child_ret_adr_with_underflow(p) that sets r to underflow and returns its previous value. The victim's current continuation is now logically the same as before; only the representation has changed. After being heapi ed, the future body's continuation is in UNDERFLOW_CONT. Note that UNDERFLOW_CONT.ret contains the address of the subproblem's return point. The rst instruction at this address is the one which decrements TAIL. The only purpose of this instruction is to pop the parent task on a \parent next" transition and it shouldn't be executed in any other case. The future's continuation is reconstructed by adjusting UNDERFLOW_CONT.ret so that it points to the following instruction7 (i.e. ret03 in the example). At this point UNDERFLOW_CONT corresponds to the parent task's continuation (k in Figure 2.12). The thief can now use this continuation to create a heavyweight task representation of the parent. The cont_link and cont_ret elds are initialized directly from UNDERFLOW_CONT. An undetermined placeholder, res_ph, is also created to represent the result of the future. Res_ph is stored in the eld cont_val so that it will get passed to the parent's continuation. To represent the parent task's legitimacy, another undetermined placeholder, leg_ph, is created and stored in the eld leg_flag. The eld cont_denv is initialized to the dynamic environment in e ect when the task was pushed on the task stack (the next section explains how this is done). 7 This may not be this simple because all return addresses must be parsable. Gambit always generates a secondary return point along with each future body return point (at a constant distance from it). The secondary return point contains a jump to the instruction that follows the popping of the parent task. CHAPTER 3. LAZY TASK CREATION 90 SP RET @ R @ 8 .. ? ret7     TAIL-    6   ?   - HEAD- SP ret8 6 .. ? 6 ret5 5 .. ? ret4 4 .. ? 3 .. ? 2 underflow q q q BEFORE 6 7 .. ? ret6 6 .. ? ret5 TAILHEAD- . . . . . . . . . .-.  5 .. ? ret4 STACK 4 p .. ? underflow 3 .. ? ret2 LTQ 2 ? UNDERFLOW CONT ret1 @ R @         6  ?   STACK ret3 .. ? ret7 7 ret2 LTQ @ @ R 8 ret6 .. ? RET ret8 underflow ? UNDERFLOW CONT endbody 6 HEAP ? parent leg flag cont val cont denv cont link cont ret ret3 @ @ R end frame - (_?_)  ( ^) __)  H j (^  Z (^?^) ~ Z 0 @Z @ q q q AFTER Figure 3.5: The LTQ and the steal operation. leg ph res ph ret3 0 ? 3 ret2 ? 2 ret1 ? q q q 6 HEAP ? 3.3. 
THE LTC MECHANISM 91 task *steal_task( p ) value *p; { instr *r = swap_child_ret_adr_with_underflow(p); heapify_frame( p, r ); /* update child's ret adr /* heapify parent's cont { task *parent frame *end_frame = alloc_task(); = alloc_frame(3); /* allocate heavyweight task */ /* allocate end_frame */ parent->cont_link parent->cont_ret parent->cont_denv parent->cont_val parent->leg_flag = = = = = UNDERFLOW_CONT.link; future_secondary_ret_adr(r); recover_dyn_env(p); alloc_ph(); alloc_ph(); /* /* /* /* /* end_frame->link end_frame->slots[0] end_frame->slots[1] end_frame->slots[2] = = = = parent->cont_link; parent->cont_ret; parent->cont_val; parent->leg_flag; /* setup end_frame */ /* setup UNDERFLOW_CONT */ UNDERFLOW_CONT.link = end_frame; UNDERFLOW_CONT.ret = end_body; setup parent's cont (using secondary ret adr) setup task's dynamic env allocate result ph allocate legitimacy ph */ */ */ */ */ */ */ return parent; }} Figure 3.6: The task stealing mechanism. The thief will resume the parent task by a call to resume_task. Before doing this however, the victim's under ow continuation must be changed so that it will take the appropriate action when it returns from the child. Note that this new continuation will be invoked with the result of the future's body. Consequently, this continuation must logically correspond to procedure end-body of Figure 2.12. The rst time it is called, end-body uses the result it is passed to determine the placeholder res_ph and the task is terminated after propagating the task's legitimacy (i.e. CURRENT_LEGITIMACY) to leg_ph. Subsequently, the result is simply passed on to the parent continuation. This functionality is obtained by pushing a new continuation frame, end_frame, to the front of the continuation in UNDERFLOW_CONT. End_frame corresponds to the continuation frame created for the call to thunk in Figure 2.12. Thus, UNDERFLOW_CONT.ret is set to that call's return address (which is essentially a call to procedure end-body). End_frame contains the following values needed by end-body: the parent task's continuation and the placeholders res_ph and leg_ph. The \after" part of Figure 3.5 shows the system's state just before the thief resumes the parent task. Figure 3.6 gives the complete task stealing mechanism (except for removing p from the LTQ). 92 CHAPTER 3. LAZY TASK CREATION 3.3.4 The Dynamic Environment Queue For every task that is stolen, it is necessary to know what the dynamic environment was when the task was pushed on the task stack. When the recreated task is resumed by the thief, CURRENT_DYNAMIC_ENV will be set to that dynamic environment, thus restoring it to its previous state. A straightforward solution is to store the value of the dynamic environment in the future's continuation frame. In other words, CURRENT_DYNAMIC_ENV is pushed on the stack on entry to the future body's thunk. Unfortunately, this adds an overhead to all futures independently of how heavily dynamic scoping is actually used, if at all. It would be preferable if the cost of supporting dynamic scoping was only related to how heavily it is used. This can be achieved by a lazy mechanism that recreates a task's dynamic environment when it is stolen. It is assumed that the dynamic binding construct, dyn-bind, creates a new continuation for the evaluation of its body (as in Figure 2.7). The continuation frame contains prev_env, the dynamic environment that was in e ect when dyn-bind's evaluation was started. 
Since a change of the dynamic environment is always indicated by one of these frames, the following invariants will hold  The dynamic environment Ef associated with a continuation frame f is equal to the prev_env eld of the rst dynamic binding continuation frame above f on the stack.  If there is no dynamic binding continuation frame above f then Ef is equal to CURRENT_DYNAMIC_ENV. The DEQ provides an ecient mechanism to nd the rst dynamic binding continuation frame above the stolen task's continuation frame. For each dynamic binding continuation frame on the stack there is exactly one entry in the DEQ; a pointer to the frame. The pointer is pushed onto the DEQ just before evaluating the body and is popped after the body as shown in Figure 3.7 (this code uses the association list representation of dynamic environments but the search tree representation could also be used). A stolen task's dynamic environment is easily recovered with the DEQ. If the frame pointer removed from the LTQ is p, a linear or binary search can locate the lowest pointer on the DEQ that is larger than p. Figure 3.8 shows how this is done. Note that a linear search, as shown, is acceptable because its cost is of the same order as the cost 3.3. THE LTC MECHANISM 93 dyn_bind( id, val, body ) value id, val; instr *body; { *++SP = RET; /* create continuation frame *++SP = CURRENT_DYNAMIC_ENV; /* setup prev_env *++DEQTAIL = SP; /* push frame pointer onto DEQ CURRENT_DYNAMIC_ENV = /* install new dynamic env cons( cons( id, val ), CURRENT_DYNAMIC_ENV ); RET = env_restore; /* execute body jump_to( body ); } env_restore() { if (DEQTAIL > DEQHEAD) DEQTAIL--; CURRENT_DYNAMIC_ENV = *SP--; RET = *SP--; jump_to( RET ); } */ */ */ */ */ /* pop frame pointer from DEQ */ /* restore dyn env to prev_env */ /* return from dyn_bind */ Figure 3.7: The implementation of dyn-bind. of heapifying the stolen task's continuation (i.e. there are no more entries skipped on the DEQ as there are frames heapi ed). The cost of supporting dynamic scoping can be attributed entirely to the use of dyn-bind (i.e. the cost is O(n) where n is the number of dyn-bind's evaluated). For each dyn-bind evaluated, a few instructions in dyn-bind are needed to maintain the DEQ and a few more instructions are needed in recover_dyn_env to skip its entry on the DEQ if it is part of a stolen task's continuation (a DEQ entry is never skipped more than once). 3.3.5 The Problem of Over ow Because the LTQ, DEQ, and run time stack are of nite size, an important concern is the detection and handling of over ows. A useful invariant of these structures is that the combined number of entries in the LTQ and DEQ is never more than the number of frames in the stack. Since each frame contains at least one slot for the return address, the space occupied by the LTQ and DEQ is never more than the space occupied by the stack. If these structures are allocated in two equal sized areas, one for the LTQ and DEQ growing towards each other and one for the stack, then the stack will always over ow before the LTQ and DEQ. Thus, it is only necessary to check for stack over ow. Chapter 4 explains how stack over ows can be detected eciently. CHAPTER 3. LAZY TASK CREATION 94 (define (define (define (define (define (define (define (p2) (p3) (p4) (p5) (p6) (p7) (p8) ... (dyn-bind y 2 (p3)) ...) ... (FUTURE (p4)) ...) ... (p5) ...) ... (FUTURE (dyn-bind z 3 (p5))) ...) ... (FUTURE (p7)) ...) ... (p8) ...) ...) SP RET @ R @ 7 ret7 CURRENT DYNAMIC ENV - - z 3 - ? - y 2 ? - x 1 .. ? ret6 TAIL -   6   ?   
HEAD -       .. ? prev env .. ? ret5 5 .. ? ret4 4 - p LTQ 6 env restore .. ? ret3 3 .. ? env restore prev env .. ? ret2 2 underflow S o S S Ep S S S S 6 S ? S S       DEQ   DEQTAIL DEQHEAD - ? q q q STACK HEAP value recover_dyn_env( p ) value *p; { while ((DEQHEAD < DEQTAIL) && (DEQHEAD[1] < p)) DEQHEAD++; } if (DEQHEAD == DEQTAIL) return CURRENT_DYNAMIC_ENV; else return DEQHEAD[1][0]; /* get frame's prev_env */ Figure 3.8: The DEQ and its use in recovering a stolen task's dynamic environment. 3.3. THE LTC MECHANISM 95 A stack over ow could simply cause the program to signal an error (or to terminate). This approach puts a strict limit on the depth of the call chain so it is inappropriate for a language like Lisp where recursion is used liberally. A more elegant approach that removes this restriction is to heapify the current continuation and then clear the stack, LTQ, and DEQ. Note that because the stack might contain lazy tasks this heapi cation is special (as discussed in the next section). Subsequent computation will reuse the stack and possibly cause some other stack over ows. The continuation thus migrates to the heap incrementally and it is only when there is no space left in the heap that an error is signalled. 3.3.6 The Heavyweight Task Queue In general, the current continuation might contain lazy tasks when it is heapi ed. The four situations where this happens are 1. Task suspension (for touching an undetermined placeholder) 2. Task switch (caused by a preemption interrupt) 3. Stack over ow 4. call/cc In these situations, something has to be done with the lazy tasks currently on the stack so that they remain runnable and independent. Since the lightweight representation is no longer adequate for these tasks, they are converted to the heavyweight representation and added to the processor's heavyweight task queue (HTQ). This queue contains all the heavyweight tasks runnable on that processor. It is in this queue that suspended tasks are put when the placeholder they are waiting on gets determined. Before heapifying the current continuation, the processor will in essence steal all lazy tasks on its own task stack (by calling steal_task( *++HEAD ) while HEADgoal_ph = CURRENT_TASK->goal_ph; CURRENT_TASK->goal_ph = res_ph; The steps avoided by the weaker continuation semantics do not amount to much; perhaps a saving of the order of 10 to 20 machine instructions per steal. A more promising source of saving is the handling of the parent continuation. Since only the parent task needs this continuation and it is immediately going to be restored by the thief, it seems useless to heapify the continuation. The steal operation could transfer the continuation frames from the victim's stack to the thief's stack in a single block (with a \block transfer" or similar operation). When heapifying the continuation, two copies of the frames are done: once to the heap (for heapi cation) and once to the stack (because of under ow). Moreover, these copies are more complex to perform than a block transfer of the stack because of the frame formatting and under ow handler overheads. Upon closer examination, neither method is clearly superior to the other. Firstly, communication between the thief and victim processors is more important than the complexity of the algorithms. Assuming the thief actually returns through all the continuation frames, the frames only need to be transferred once between the processors in either method. 
When using heapi cation, one of the transfers will be between processors 10 To preserve the format of frames and avoid a special case in the under frames contain a dummy parent continuation. ow handler, it is best if these 98 CHAPTER 3. LAZY TASK CREATION and one between local memory and the cache (assuming the stack lives mostly in the cache). Since interprocessor communication is an order of magnitude more expensive than local memory accesses, both methods will have roughly similar performance. Secondly, the thief might not use all of the parent continuation frames. In such a case a block transfer will do more work than strictly required. When using heapi cation, only the frames which are needed are transferred (since frames are restored on demand). This can make a big di erence in some programs, in particular when a given task spawns several children deep in some recursion. To explain this case, consider the following variant of pmap (define (pmap proc lst) (if (pair? lst) (let ((val (FUTURE (proc (car lst))))) (let ((tail (pmap proc (cdr lst)))) (cons (TOUCH val) tail))) '())) Assume the root task calls pmap with a continuation containing k stack frames. Note that the continuation of the ith evaluation of the future contains k + i frames. Also note that the only task that ever gets stolen with LTC is the root task. If the list is of P length n and there are n steals, a total of ni=1 k + i = n(k + n+1 2 ) frames are transferred between processors when using the block transfer method. The cost is lower by a factor of O(n) when the parent continuation is heapi ed on every steal. On the rst steal, k + 1 frames are heapi ed and the topmost is transferred and restored by the thief. Subsequent steals will heapify two frames (one for the recursive call to pmap and one for the call to the future's thunk) and a single frame will be transferred and restored. Finally, in the unwinding of the recursive calls to pmap, n frames will be transferred and restored. The total is: 2n + k + 1 heapi ed frames, 2n restored frames, and 2n frames transferred between processors. 3.4 Synchronizing Access to the Task Stack In the above description of LTC a critical issue was not addressed: the synchronization of the processors. This is an issue because multiple processors, including the victim, might try to simultaneously remove the same task from the task stack. Some synchronization is needed to resolve this race condition. The case of multiple thieves can be prevented by associating a \steal" lock with every processor. A processor wanting to steal from a victim rst acquires the victim's 3.5. THE SHARED-MEMORY PROTOCOL 99 steal lock before attempting to steal a task. The lock is released when the attempt is nished so there in never more than one thief trying to steal from a given victim. The only remaining race condition occurs when the victim's task stack contains a single task and the thief tries to steal the task while the victim is trying to pop the task. The term protocol refers to how the thief and victim processors interact to avoid con icts when accessing the task stack. Two protocols are explored here: the shared-memory (SM) protocol and the message-passing (MP) protocol. 3.5 The Shared-Memory Protocol The SM protocol tries to maximize concurrency between the thief and victim by minimizing the interference of the thief on the victim's current execution. 
The victim does not cooperate with the thief but rather the responsibility of stealing falls entirely on the thief (a cute analogy is that the thief is behaving like a pickpocket trying to stay unnoticed by its victim). Thus, it is the thief that executes the steps in Figure 3.6. The problems with this approach are explained throughout the description of the SM protocol that follows. The rst problem is that, at the moment of a steal, the thief has no way of knowing where the child's return address r is because the victim could be in any of several states (this problem shows up in swap_child_ret_adr_with_underflow(p)). The return address is only on the victim's stack if the child is in the process of executing a subproblem call. Even if the procedure calling convention required that r be passed on the stack in a predetermined slot (e.g. the rst), there would be a problem because when r is invoked to return from the future's body, r will rst get popped from the stack before the parent task is popped. This race condition between the thief mutating r and the victim invoking r can be handled in the following way. Instead of having the thief mutate r to bring the victim to call underflow when it returns from the child, the detection of a stolen parent task is done explicitly by the victim at the future's return point. The test at the return point will cause a branch to the under ow handler if the parent was stolen. Nevertheless, the thief must still know the value of r to reconstruct the parent's continuation. A simple solution is to save the value of r inside the future's continuation frame (just before pushing the lazy task on the LTQ). Thus, the thief can get the value of r by indirecting p. Before stealing a task, the thief must rst verify that one is present, that is check if HEADHEAD, X>TAIL if and only if X[0]=NULL. The thief can thus replace the test HEADHEAD[1] == NULL) return NULL; /* nothing to steal if LTQ empty */ acquire_lock( V->HEAD_LOCK ); V->HEAD++; p = *V->HEAD; if (p != NULL) { task *parent = steal_task( V, p ); release_lock( V->HEAD_LOCK ); return parent; } V->HEAD--; release_lock( V->HEAD_LOCK ); return NULL; /* /* /* /* get right increment get entry check for to increment HEAD HEAD from LTQ conflict */ */ */ */ /* won race... recreate parent /* done with HEAD /* indicate success */ */ */ /* lost race... undo increment /* done with HEAD /* indicate failure */ */ */ Figure 3.10: Thief side of the SM protocol. SM_attempt_pop() { 5 *TAIL-- = NULL; 6 if (HEAD > TAIL) { boolean thief_won; /* remove topmost LTQ entry /* check for possible conflict 7 8 acquire_lock( HEAD_LOCK ); thief_won = (HEAD > TAIL); release_lock( HEAD_LOCK ); /* prevent thief from mutating HEAD */ /* definitive conflict check */ if (thief_won) { *TAIL++ = SP; underflow(); } /* if thief won race... */ /* restore LTQ top /* jump to end-body */ */ } } Figure 3.11: Victim side of the SM protocol. */ */ 3.5. THE SHARED-MEMORY PROTOCOL 103 most frequently executed instructions). In SM_attempt_steal, steal_task needs to know which task stack to access so it is called with the victim processor as an extra argument. Also note that the operation swap_child_ret_adr_with_underflow(p) used by steal_task is equivalent to *p (the child's return address is not mutated). Clearly there is no possible con ict between the thief and victim when the task stack contains more than one task. 
The thief can increment HEAD and take the lowest entry on the LTQ at the same time that the victim voids the topmost entry (by writing NULL) and decrements TAIL. A con ict can only occur if calls to SM_attempt_steal and SM_attempt_pop overlap in time and the task stack contains a single task, that is HEAD=TAIL 1. The idea is to let the thief and victim blindly access the LTQ as though there was no con ict (thereby adjusting HEAD and TAIL) and only then check to see if there is a con ict (that is check if HEAD=TAIL+1 or equivalently HEAD>TAIL). When a con ict is detected, one of the two processors is selected as the \winner" of the race for the task and it returns success. The other processor undoes its mutation of the LTQ and returns failure. The thief detects success very simply: it is the winner if and only if the entry it reads from the LTQ at line 3 is not NULL. This entry can only become NULL if the victim voids it by executing line 5 . The two possible orderings of these lines are considered next. 1. Thief executes line 3 before the victim executes line 5 The thief has won the race. It will recreate the parent task and returns it from SM_attempt_steal. Note that from this point on, HEAD will never point lower than the entry that was removed (HEAD can only increase). When the victim eventually executes line 5 with TAIL pointing to the removed entry, it will decrement TAIL to below HEAD and consequently line 6 will detect the con ict. Line 8 will nd the same result so the victim will conclude that the parent was stolen and will jump to end-body. 2. Victim executes line 5 before thief executes line 3 The thief will lose the race because it will read NULL at line 3 . Consequently, the thief will restore HEAD to its previous value (at line 4 ). There are two subcases depending on what the thief is doing when the victim executes line 6 . (a) Thief is not between lines 2 and 4 when victim executes line 6 The thief has either not yet tried to remove the entry or has restored HEAD to the value it had just before line 2 . Thus, HEAD=TAIL when line 6 is executed. The victim sees no con ict and declares success by returning from SM_attempt_pop. CHAPTER 3. LAZY TASK CREATION 104 (b) Thief is between lines 2 and 4 when victim executes line 6 The thief has not yet restored HEAD to its original value so HEAD=TAIL+1. The victim thus detects a possible con ict at line 6 . The reason for acquiring HEAD_LOCK at line 8 is to make sure that the thief is not between lines 2 and 4 when the test at line 8 is executed. At that point the thief will have restored HEAD and will not mutate HEAD again (because HEAD_LOCK is locked). Line 8 thus sees HEAD=TAIL, causing SM_attempt_pop to return successfully. The role of line 1 is to ensure that the victim eventually acquires the lock at line 7 in systems where locks are not fair. It prevents new thieves from crossing line 1 , so eventually the victim will be the only processor trying to lock HEAD_LOCK. It also avoids the overhead of attempting to steal from a processor with an empty task stack. Thus, the SM protocol satis es the following correctness criteria  Safety | Either the thief or the victim, but not both, will remove a given entry from the LTQ.  Liveness | An attempt to remove an entry will eventually indicate failure or success (i.e. deadlock and livelock are impossible). 
3.5.2 Cost of a Future on GP1000 This section describes the details of the GP1000 implementation of the SM protocol and evaluates the costs related to the evaluation of a future on that machine. As explained above, the cost of a future depends on many parameters but mostly on whether the corresponding parent task is stolen or not. Parent Task is not Stolen If the parent is not stolen, the cost is simply that of pushing and popping a lazy task. Pushing a lazy task requires four steps: setting up the body's return address, setting up the arguments to the body (the closed variables), pushing the return address to the stack, and pushing the stack pointer to the LTQ. The rst step typically replaces the same step that would be required in a sequential version of the program to evaluate the body (assuming it is a procedure call) so it won't be counted as overhead. Often the second step requires no instructions because the arguments are already in a location 3.5. THE SHARED-MEMORY PROTOCOL 105 accessible to the body (e.g. in the registers). Only the last two steps are necessary extra work with respect to a sequential version of the program. Popping a lazy task takes two steps: popping and voiding the topmost entry on the LTQ, and checking for a con ict. The popping of the return address from the stack has no cost because it can be combined with the deallocation of the continuation frame by the future's continuation. To get a concise code sequence on the GP1000, some of the special addressing modes of the M68020 processor were used, in particular predecrement and postincrement indirect addressing. TAIL, SP, and RET are all kept in address registers (a4, sp, and a0 respectively). The two required steps in the lazy task push translate into two instructions and a lazy task pop translates into three instructions as shown below. movl a0,sp@; push return address to stack movl sp,a4@+ ; push stack pointer to LTQ . . code for future's body . ret_point: clrl a4@; pop and void entry on LTQ cmpl HEAD,a4 ; compare head and tail bcs conflict ; jump to handler if conflict secondary_ret_point: . . code for future's continuation . Note that the stack grows downward on the M68020. Of the ve instructions, three are writes to shared memory. The sequence accounts for a run time of roughly 2 secs. The assembly code generated for the SM protocol when compiling the fib benchmark is given in Section 3.8. Parent Task is Stolen To the above cost must be added the extra work performed as a consequence of the steal. Assuming that there is always a single return from the future's body, the thief and victim will perform the following operations Thief (1) Heapify the parent continuation (2) Find the parent's dynamic environment (3) Allocate new objects | This includes the allocation and initialization of the parent task, result and legitimacy placeholders and end_frame. CHAPTER 3. LAZY TASK CREATION 106 Operation steal_task (excluding heapify_frame and recover_dyn_env) heapify_frame recover_dyn_env resume_task (excluding underflow) underflow determine! idle (only accounts for search) Instruction count 75 6 + 34f + 2s 8 + 2b 10 50 + 2s0 28 37 + 6w 15 34 + 8n (w = 0) (otherwise) (n = 0) (otherwise) Table 3.2: Cost of operations involved in task stealing. (4) Resume the parent task | Note that only the rst continuation frame needs to be restored. Victim (5) Invoke end-body | This is performed by the under ow handler. (6) Terminate the child | The result and legitimacy placeholders get determined and then control goes to idle. 
(7) Find new work | The victim must nd a runnable task to resume. The task either comes from the victim's HTQ or is stolen from another processor. In addition, there is a cost for restoring the other frames of the parent continuation heapi ed in (1). This is done at least in part by the thief but maybe also by some other processors (if the parent task migrates to other processors). Table 3.2 gives the cost of the operations involved in task stealing (the costs correspond to the number of machine instructions executed in Gambit's encoding of the algorithms). In this table, f is the number of frames heapi ed (which is the number of frames separating the future from the enclosing future), s is the number of values on the stack, b is the number of dynamic variable bindings that were added to the dynamic environment since the enclosing future, s0 is the size of the continuation frame to restore, w is the number of tasks on the placeholder's waiting queue, and n is the number of processors that were considered in the search for a runnable task (n = 0 when the task is found in the local HTQ). Note that these costs do not account for the location (i.e. local vs. remote memory) of the data being accessed. 3.6. IMPACT OF MEMORY HIERARCHY ON PERFORMANCE 107 From the table can be derived the approximate costs associated with the victim (Tvictim ), the thief (Tthief ), and the processors that restore the parent's continuation (Tunder ow ). Tvictim = (50 + 2  3) + 28 + 28 + 15 = 127 Tthief = 75 + (6 + 34f + 2s) + (8 + 2b) + 10 = 99 + 34f + 2s + 2b Tunder ow = 50f + 2s 226 + 84f + 4s + 2b The minimal cost corresponds to f = 1, s = 2, b = 0, w = 0, and n = 0. This gives a total cost of 318 instructions (106 secs). In a more realistic situation, the frames will be larger and more numerous so the cost of heapi cation and under ow will increase. Assuming s = 8 and f = 2, the total cost will be 426 instructions (142 secs). 3.6 Impact of Memory Hierarchy on Performance An unfortunate requirement of the SM protocol is that all processors must have access to the task stack's data structures; in particular the runtime stack and LTQ. Making these structures accessible to all processors has a cost because it precludes the use of the more ecient caching policies. The runtime stack and the LTQ are read and written by the victim but are only read by thief processors; thus, they are single writer shared data and can be cached by the victim using the write-through caching policy (as explained in Section 1.4.5). This however is not as ecient as the copy-back caching policy normally used in single processor implementations of Lisp. For typical Lisp programs, caching of the stack will likely be an important factor since the stack is one of the most intensely accessed data structures. Caching of the LTQ will also be an important factor for parallel programs with small task granularity because each evaluation of a future causes a few memory writes to the LTQ and stack (three in the SM protocol). Although this may not seem like much at rst sight, the cost of a memory write to a write-through cached location on modern processors (such as the M88000 processors in the TC2000) is 5 to 20 times larger than the cost of a non-memory instruction or a cache hit (read or write) to a copy-back cached location. Note that this is not an issue on the GP1000 which lacks a data cache. But how large is the performance loss due to a suboptimal caching policy? 
To better understand the importance of caching on performance, it is useful to analyze the memory access behavior of typical programs. The run time of a Lisp program can be broken down into the time spent accessing data in memory and the time spent on 108 CHAPTER 3. LAZY TASK CREATION \pure computation". Memory accesses can further be broken down into two categories: accesses to the stack and accesses to the heap. Thus, a program is described by the three parameters S (stack), H (heap), and C (pure computation) which represent the proportion of total run time spent on each category of instructions (S + H + C = 1). For reference purposes, these parameters are de ned with respect to an implementation where the stack and heap are not cached (i.e. all accesses go to local memory). Some experiments were conducted to measure the value of S , H , and C for each benchmark program on both the GP1000 and TC2000. All these programs were run on a single processor as sequential programs (futures and touches were removed from the parallel benchmarks). The run time of each program was measured in three di erent settings. The rst run was with the stack and heap located in non-cached local memory. The second run was with the stack located in remote memory (on another processor) so that each access to the stack would cost more. The nal run was with the heap in remote memory. The three run times are respectively T , TS , and TH . Now since the relative cost R of a remote access with respect to a local access is known (R = 12:1 on the GP1000 and R = 4:2 on the TC2000), a system of three linear equations is obtained S +H +C=1 SR + H + C = TS =T S + HR + C = TH =T This system can easily be solved to nd the value of S , H , and C . Note that this model does not take into account factors such as the pipelining of instructions by the processor and the di erence in costs between reads and writes. Also note that the values are dependent on the quality of the code generated by the compiler, but because an optimizing compiler was used, the measurements are representative of a high-performance system. As a sanity check, the values of S , H , and C obtained on the TC2000 were used to predict the run time of the program when the stack is cached with the copy-back policy. Assuming that the cache hit ratio for the stack is close to 1 (which is reasonable due to the high locality of stack accesses), the run time should be T ( KS + H + C ) where K = 3:8 is the relative cost of a local memory access with respect to a cache access. For most programs (21 out of 27), the prediction was within 5% of the actual run time. Only 3 programs had a di erence above 10%: fib with 12%, mm with 13%, and sum with 15%. This suggests that the values obtained for S , H , and C are reasonably close to reality. 3.6. 
IMPACT OF MEMORY HIERARCHY ON PERFORMANCE GP1000 Program boyer browse cpstak dderiv deriv destruct div puzzle tak takl traverse triangle compiler conform earley peval abisort allpairs fib mm mst poly qsort queens rantree scan sum tridiag TC2000 109 Stack Caching S H C ORemHeap S H C ORemHeap ONone OWT .08 .15 .11 .14 .14 .01 .08 .09 .53 .16 .35 .20 .17 .15 .25 .17 .19 .56 .41 .43 .29 .09 .25 .41 .15 .49 .38 .35 .08 .09 .25 .06 .06 .14 .25 .21 .00 .27 .14 .13 .11 .10 .06 .14 .32 .12 .00 .10 .13 .03 .26 .00 .00 .05 .05 .10 .84 .75 .64 .80 .80 .85 .67 .70 .47 .57 .52 .67 .72 .74 .69 .69 .49 .33 .59 .47 .58 .88 .49 .59 .85 .46 .57 .55 1.84 2.05 3.79 1.70 1.64 2.52 3.82 3.37 1.00 4.01 2.53 2.45 2.20 2.16 1.67 2.54 4.56 2.30 1.00 2.08 2.44 1.32 3.94 1.00 1.00 1.54 1.56 2.13 .32 .25 .24 .28 .26 .03 .19 .17 .83 .32 .56 .38 | .25 .58 .35 .31 .73 .70 .71 .59 .33 .37 .74 .40 .70 .73 .65 .16 .15 .49 .13 .12 .41 .47 .35 .00 .45 .17 .19 | .10 .10 .26 .49 .14 .00 .14 .23 .12 .26 .00 .00 .09 .08 .18 .53 .60 .27 .59 .61 .56 .34 .48 .17 .23 .27 .43 | .65 .32 .38 .20 .13 .30 .15 .19 .55 .37 .26 .60 .21 .19 .17 1.64 1.58 2.95 1.51 1.49 2.32 2.71 2.23 1.00 2.82 1.90 1.81 | 1.40 1.58 2.13 3.03 2.01 1.00 1.99 2.27 1.49 2.14 1.00 1.00 1.55 1.44 2.13 1.29 1.20 1.24 1.23 1.23 1.02 1.16 1.13 2.45 1.28 1.63 1.29 | 1.20 1.88 1.34 1.31 2.27 1.81 2.36 1.76 1.28 1.38 2.15 1.36 1.96 1.85 1.99 1.15 1.10 1.10 1.11 1.11 1.01 1.07 1.04 1.55 1.13 1.33 1.16 | 1.09 1.43 1.14 1.13 1.54 1.34 1.58 1.30 1.14 1.18 1.48 1.17 1.35 1.34 1.48 Table 3.3: Measurements of memory access behavior of benchmark programs. CHAPTER 3. LAZY TASK CREATION 110 H6 . .... .... . . . . .... .... . .3 . . . takl qsort...... .. . . div . . .... cpstak .... . . . . puzzle .... .... . . . .2 .... .... . . . . .... .... . peval . traverse . .. destruct mst .... . . . triangle . tridiag .... compiler conform .... . .1 . . . browse .... boyer .... dderiv earley . sum . . . . deriv . . . . . . . .... poly .... . rantree queens . . .... abisort t t t t t t t t t t t t t t allpairs t mm t t t t t t t .0.0 t .1 scan t .2 .3 .4 GP1000 t fib t t .5 tak t .6 - S H6 .. ... . . abisort . .. .5 .. . . cpstak div ... ... . takl . .. destruct ... . . .4 ... .. . . . puzzle .. ... . . . .. .. . . .3 ctak ... ... peval . qsort . . mst ... . . . .. . . triangle . .2 ... traverse . . . browse .. . boyer . ... dderiv poly ... deriv . . earley . . . .1 . conform . . . ... .. . . . .. rantree .. ... s s s s s s s s s s s s s s s s s s .0.0 .1 .2 s .3 .4 s TC2000 .5 .6 s tridiag mm s s allpairs scan s s fib .7 s sum queens tak s s - .8 S Figure 3.12: Relative importance of stack and heap accesses of benchmark programs. 3.6. IMPACT OF MEMORY HIERARCHY ON PERFORMANCE 111 These additional measurements were also taken  ORemHeap , the overhead of locating the heap in remote memory rather than local memory when the stack is cached optimally (i.e. no caching on GP1000 and copyback caching on TC2000). This value is a good indicator of the overhead that will appear due to the sharing of user data if the program is run in parallel (assuming user data gets distributed uniformly to all processors, the number of processors is large, and there is little contention).  ONone (TC2000 only), the overhead of not caching the stack rather than using copy-back caching.  OWT (TC2000 only), the overhead of caching the stack with write-through caching rather than with copy-back caching. 
The measurements are given in Table 3.3 and Figure 3.12 presents this data in a more readable form (plots in S -H space). A few observations can be made from Figure 3.12. Firstly, most of the programs access the stack more often than the heap (i.e. all the programs below the S = H line). This tendency is even more pronounced for the parallel benchmarks (i.e. the boxed names in the plots). This is to be expected since the majority of the parallel benchmarks are based on recursive (DAC) algorithms. Secondly, the importance of memory accesses is greater on the TC2000 than on the GP1000 (i.e. the position of a given program on the S -H plane is further from the origin). This is in agreement with the well known fact that modern processors need caches and a high hit rate to keep them going at peak speed. Most of the programs actually spend more time accessing memory than doing pure computation when run on the TC2000 (C is below 21 ). As indicated by column ONone of Table 3.3, copy-back caching the stack provides an important performance gain. This gain is in some cases higher than a factor of 2. However, the median gain is 1.34 and the average is 1.56. The last column in the table, OWT , is of special interest because it re ects the cost of suboptimally caching the stack to support the SM protocol. The overhead of using write-through caching rather than copy-back caching is as high as 1.58. The sequential benchmarks have a median overhead of 1.11 (average of 1.17) whereas the median overhead for the parallel benchmarks is 1.34 (average of 1.34). Note also that the cache on the TC2000 is not very fast (only a factor of 3.8 faster than local memory). Some machines have caches which operate several times faster, with a corresponding increase in OWT . The objective of the MP protocol is to avoid this overhead altogether. 112 CHAPTER 3. LAZY TASK CREATION 3.7 The Message-Passing Protocol If the role of the thief in the SM protocol is analogous to a pickpocket, in the MP protocol stealing a task is analogous to a holdup because the victim actively cooperates with the thief. To initiate a task steal, the thief sends a steal request message to the victim and starts waiting for a reply. The victim eventually interrupts its current execution and calls a steal request handler routine to process the message. This handler checks the task stack and, if a lazy task is available, recreates the oldest task and sends it back to the thief. Otherwise a failure message is sent back to the thief which must then try stealing from some other processor. The victim then resumes the interrupted computation. There are several advantages to this protocol. Firstly, it relies less on an ecient shared memory. All the data structures comprising the task stack are private to each processor. The stack, LTQ, DEQ, and associated pointers can all be cached with copyback caching. All programs which use the stack and/or dynamic scoping will thus bene t, whether they are sequential or parallel. Parallel programs will in addition bene t from the caching of the LTQ which reduces the cost of pushing and popping lazy tasks. Secondly, it is possible to handle the race condition more eciently than the SM protocol because all task removals from the task stack are performed by its owner. Preventing the race condition between task steals and task pops is as simple as inhibiting interrupts for the duration of the task pop. 
This can be achieved by adding a pair of instructions around the task popping sequence to disable and then reenable interrupts to the processor. The method used by Gambit is to detect interrupts via polling and never check for interrupts inside the popping sequence (ecient polling is explained in Chapter 4). There are other methods that have no direct overhead. For example, in the instruction interpretation method [Appel, 1989] the hardware interrupt handler checks to see if the interrupted instruction is in an \uninterruptible" section (i.e. a popping sequence). If it is, the rest of the section is interpreted by the interrupt handler before the interrupt is serviced. Other zero cost techniques are described in [Feeley, 1993]. Thirdly, the operation swap_child_ret_adr_with_underflow(p) can be implemented according to its original speci cation (i.e. an actual mutation of the child's return address), thus avoiding the push of the body's return address to the stack and the explicit check for under ow at the future's return point. The sequence generated for a future only has to push an entry to the LTQ before evaluating the body and to decrement TAIL at the future's return point. Doing this in the SM protocol was not possible because the thief could not know where the victim had stored the return address r. In 3.7. THE MESSAGE-PASSING PROTOCOL 113 the MP protocol r can be located in several ways.  Scanning the stack downward from the top | The system can be designed so that the steal request handler is always called in the same way as a subproblem call. This is fairly easy to do when the system detects interrupts through polling because the call to the handler is a subproblem call. For a system that uses hardware interrupts it is more complex but still possible11 . Thus, when the handler is executed, SP and RET can be used to parse the content of the stack. The handler can walk back through the frames until the frame directly above p is found. At this point the format of this frame is known, so r can be accessed directly. This approach may be expensive since there can be an arbitrary number of frames above p at the moment the steal request is received.  Scanning the stack upward from p | Assuming the handler is always called as a subproblem, either r has been saved to the stack by the child's outermost subproblem call or it has been saved in the continuation frame for the call to the handler. Thus, when the handler is executed, r will necessarily be the rst return address above p on the stack (i.e. the return address in the frame directly above p). An upward search of the stack starting from p and stopping at the rst return address will locate r. It is assumed here that the values on the stack are tagged, at least to the extent of allowing return addresses to be distinguished from other values. It is also assumed that return addresses are not rst-class objects and that return addresses are never saved to more than one location. Achieving this might require a close coupling of the steal request handler, interrupt system, and compiler. The cost of nding r with this method is O(n) where n is the size of the frame above p. This method is used by Gambit. Gambit makes an e ort to lessen the cost of the search by using heuristics that favor the saving of the return address in the lower end of continuation frames. Finally, in the MP protocol it is the victim that is in charge of creating the parent task, its continuation, and related structures. 
By allocating these structures in the victim's local memory steal_task avoids remote memory accesses and thus completes faster than in the SM protocol. Remote memory accesses are performed by the thief when it resumes the task but strictly on demand. The parent task may actually start 11 For example, a table could be setup with a description of the register allocation for every instruction in the program. This description indicates among other things where the parent return address is located when the instruction is executed. This table is used by the handler to build a correctly formatted continuation frame for the return to the interrupted code. CHAPTER 3. LAZY TASK CREATION 114 executing sooner than with the SM protocol because only the parent task object and its rst continuation frame need to be transferred from victim to thief. The total number of remote memory accesses may also be smaller if the parent's continuation is not used fully by the thief (for example if the parent task migrates to another processor). The disadvantages of the MP protocol are explained in Section 3.7.3. 3.7.1 Really Lazy Task Creation The basic idea of LTC is to defer the creation of heavyweight tasks to the moment they are known to be required, that is when they are stolen. This usually saves a lot of work because non-stolen tasks are handled at very low cost and the cost of stealing a task is roughly the same as creating a heavyweight task in the rst place. In the MP protocol, the cost of a non-stolen task is two instructions. This cost can actually be removed completely by doing more work when the task is stolen. Notice that the only purpose of the LTQ is to facilitate the reverse parsing of the stack (i.e. from bottom to top) to nd the task continuation boundary of the lowest task. Finding the task continuation boundaries can however be done by parsing the stack from top to bottom and checking for return addresses to future return points. As explained previously, this parsing can be done by the steal request handler. The problem with this method is that the cost of stealing is not bounded since all the stack must be parsed. Fine grain programs with shallow recursions may nevertheless perform better with this method if most tasks are not stolen. Due to its worst-case behavior and the fact that it saves only two inexpensive instructions, this method is not very appealing for general use. 3.7.2 Communicating Steal Requests The algorithms for the thief and victim sides of the MP protocol are shown in Figures 3.13 and 3.14 respectively. Even though they are based on a message-passing paradigm, these algorithms implement the communication using shared variables: THIEF and REPLY. In addition, the parent task is also communicated through shared memory. The victim's THIEF variable is set by the thief so that the victim can tell which processor has sent the steal request. It is also used to indicate the presence of a steal request (when there is a steal request THIEF6=NULL). A thief's REPLY variable is set by the victim in response to a steal request. After the thief has sent a request, it busy-waits until the victim responds by setting the REPLY variable to the task that was stolen or 3.7. 
THE MESSAGE-PASSING PROTOCOL task *MP_attempt_steal( V ) processor *V; { REPLY = NONE_YET; 1 V->THIEF = CURRENT_PROCESSOR; 2 raise_interrupt( V ); 3 while (REPLY == NONE_YET) ; return REPLY; } 115 /* V is victim processor */ /* /* /* /* */ */ */ */ initialize with special marker tell victim who the thief is get victim to process the request busy-wait until victim replies Figure 3.13: Thief side of the MP protocol. interrupt_handler() { if (THIEF != NULL) { /* the steal request handler: */ 4 5 /* check for a steal request processor *T = THIEF; /* THIEF = NULL; /* if (HEAD < TAIL) /* T->REPLY = steal_task( *++HEAD ); /* else T->REPLY = NULL; /* } } . . /* check other sources of interrupts */ . */ get pointer to thief set it up for next request anything on the task stack? send oldest task to thief */ */ */ */ indicate failure to thief */ Figure 3.14: Victim side of the MP protocol. to NULL if the victim had an empty task stack12. Note that the interrupt handler can get invoked for other reasons than the call to raise_interrupt at line 2 (assuming all types of interrupts go through interrupt_handler). This means that the victim might detect the steal request at line 4 as soon as line 1 is executed. Consequently, it is important for the thief to initialize REPLY before line 1 . THIEF must also be reset (line 5 ) before the reply is sent back. In the reverse order a deadlock might occur if a second steal attempt executes line 1 before THIEF is reset. The victim would be unaware of the second request and would never send a reply back to the thief (the thief would thus busy-wait forever). The implementation of raise_interrupt will depend on the interrupt handling mechanism. If polling is used, then raise_interrupt can simply raise the victim's interrupt ag (the cost is that of a remote memory access). Sometime after this, the 12 The advantage of having REPLY in the thief's local memory is that the busy-waiting does not create any trac on the memory interconnect. 116 CHAPTER 3. LAZY TASK CREATION victim will detect the interrupt and call interrupt_handler. Note that this requires the interrupt ag to be multiple writer shared data so it can't be cached by the victim (or any other processor). Other systems send interrupts to other processors through dedicated hardware in the interconnect (the CM-5 for example). Sending an interrupt on these systems might require a system call. Clearly the cost will vary according to the features of the machine and operating system. 3.7.3 Potential Problems with the MP Protocol The MP protocol has a number of characteristics that enhance performance but also some others that degrade it. This section examines the detrimental aspects and brie y discusses their severity. An important question is whether the performance gains are more important than the losses. This question will not be answered fully here because there are too many performance related parameters to consider. Chapter 5 will instead evaluate the performance of the MP and SM protocols experimentally. Busy-Waiting The most obvious problem with the MP protocol is that the busy-wait for the reply wastes processing resources. The total time wasted by the thief is the time it takes before the victim sends back the reply. This is the steal latency . The steal latency is the sum of the time needed by the victim to detect the steal request (Tdetect ) and the time to process the request (Tprocess ). 
If the request is successful, Tprocess is roughly the time required to call steal_task (Tsteal task); otherwise Tprocess = 0. The time wasted by the busy-wait must be put in context. If the steal is successful, the thief receives a task after wasting Tdetect + Tsteal task of its time and taking Tsteal task time away from the victim, so the total amount of work expended to get the task is Tdetect + 2Tsteal task. If Twork is the time the thief spends running the stolen task before another task needs to be stolen, the overhead costs for stealing the task in the MP and SM protocols are Tsteal task OMP = 1 + Tdetect +2Twork task OSM = 1 + Tsteal Twork OSM and OMP are hard to compare because Tsteal task for the SM protocol is larger than for the MP protocol due to the additional remote memory accesses. If the penalty of a remote memory access is suciently low, OSM will be lower than OMP . However, 3.7. THE MESSAGE-PASSING PROTOCOL 117 the di erence will be small when Twork is large relative to Tsteal task and Tdetect . This is helped by the fact that LTC tends to increase the e ective granularity of programs (i.e. the granularity of heavyweight tasks) and Twork is directly related to the e ective granularity. However, an increase in the number of processors tends to decrease the e ective granularity, thus increasing the importance of OMP relative to OSM . Speed of Work Distribution The speed at which work gets distributed to the processors is dependent on the steal latency. Distributing work quickly is crucial to fully exploit the machine's parallelism. It is especially important at the beginning of the program13 because all processors are idle except one. Reducing the steal latency not only gets processors working sooner but also allows these processors to generate new tasks sooner for other processors. The MP protocol has a potentially smaller steal latency than the SM protocol, but only if Tdetect is kept small. Unfortunately, minimizing Tdetect may increase the cost of other parts of the system thus creating a trade-o situation. As explained in the next chapter, polling will become more expensive because interrupts need to be checked more frequently. Interrupt Overhead Finally, the cost of failed steal requests is a concern because the victim pays a high price for getting interrupted but this serves no useful purpose. The victim might get requests at such a high rate that it does nothing else but process steal requests. For example, a continuous stream of steal requests will be received by the victim if it is executing sequential code and all other processors are idle. The problem here is that processors are too \secretive". No information about the task stack is shared with other processors so the only way for a thief to know if the victim has some work is to send it a steal request. A simple solution is to have each processor regularly save out HEAD and TAIL in a predetermined shared-memory location. Before attempting a steal, the thief checks the copy of HEAD and TAIL in shared memory to see if a task might be available. For thief processors this snapshot only re ects a previous state of the task stack but, if it is updated frequently enough, its correlation to the current state will be high. If the snapshot indicates a non-empty task stack it is thus likely that the steal attempt will be successful. Gambit always keeps HEAD in shared memory so it does not need to be saved 13 Or more precisely a transition from sequential to parallel execution. CHAPTER 3. 
LAZY TASK CREATION 118 out (this does not a ect performance because the victim accesses HEAD infrequently). TAIL is saved out on every interrupt check. Unfortunately, this strategy reduces the speed of work distribution because thieves can only become aware of a task's presence at the next interrupt check. Performance is not a ected if the task stack was not empty at the last interrupt check. However, if the task stack was empty, the newly created task can at best be stolen at the second following interrupt check. The rst interrupt check will announce the task's presence to the thieves and the steal request will be handled at best at the second interrupt check. Since a processor's task stack is empty immediately after it has stolen a task, it is important to have a low interrupt check latency so that work can spread quickly to idle processors. 3.8 Code Generated for SM and MP Protocols This section compares the code generated for a small program when using the SM and MP protocols on the GP1000. The program used here is the benchmark fib. Figure 3.15 shows the M68020 assembly code generated for fib for each protocol. The following information will be useful to understand the code. Integer objects are 8 times their value because the three lower bits are used for the type tag. Fib's entry point is label L1. When fib is called, the return address is passed in register a0 and parameter n is passed in register d1. Register d1 is also used to return fib's result. The following registers have a dedicated role: a4 contains TAIL, a5 is a pointer to the interrupt ag and processor local data, d6 is a mask to test for placeholder objects, d5 is a private counter to perform interrupt checks intermittently (this counter is explained in the next chapter). The boxed parts contain the instructions that relate to polling and the parallelization of fib. The rest of the code is identical in both protocols14. A sequential version of fib is obtained by removing the boxed parts from the code. One parallelization cost common to both protocols is the touch operation. Of its three instructions, only the rst two are executed when a non-placeholder is touched (the run time for this case was measured at roughly .7 secs). The most important di erence between the protocols is in the lazy task push and pop operations. These operations take two instructions in the MP protocol. The run time for these instructions was measured at roughly .7 secs 14 Except for the instruction at optimizations. L7 which is di erent due to one of the compiler's stack allocation 3.8. 
CODE GENERATED FOR SM AND MP PROTOCOLS 119 (define (fib n) (if (< n 2) n (let* ((f1 (FUTURE (fib (- n 1)))) (f2 (fib (- n 2)))) (+ (TOUCH f1) f2)))) Shared-Memory Protocol L1: moveq cmpl ble bra L2: movl movl subql moveq cmpl bgt L3: movl movl lea dbra moveq cmpl bcc jsr bra L4: jsr L5: bra L6: clrl cmpl bcs L7: movl movl moveq addl lea moveq cmpl ble L8: jmp L9: movl btst bne jsr L10: addl dbra moveq cmpl bcc jsr #16,d0 d1,d0 L3 L8 lazy task push a0,sp@sp,a4@+ #8,d1 #16,d0 d1,d0 L8 a0,sp@d1,sp@L6,a0 d5,L2 #9,d5 a5@,sp L2 intr_handler Interrupt check L2 conflict con ict handler L7 a4@a5@(8),a4 L4 lazy task pop d1,sp@ sp@(4),d1 #-16,d0 d0,d1 L9,a0 #16,d0 d1,d0 L3 a0@ sp@+,d2 d2,d6 L10 touch_undet d2,d1 d5,L11 #9,d5 a5@,sp L11 intr_handler L11: addql #4,sp rts TOUCH Interrupt check Message-Passing Protocol L1: moveq cmpl ble bra L2: movl subql moveq cmpl bgt L3: movl movl lea dbra movl moveq cmpl bcc jsr bra #16,d0 d1,d0 L3 L8 sp,a4@+ #8,d1 #16,d0 d1,d0 L8 a0,sp@d1,sp@L6,a0 d5,L2 a4,a5@(4) #9,d5 a5@,sp L2 intr_handler L2 L5: bra L7 L6: subql #4,a4 L7: movl movl moveq addl lea moveq cmpl ble L8: jmp L9: movl btst bne jsr L10: addl dbra movl moveq cmpl bcc jsr L11: addql rts lazy task push Interrupt check lazy task pop d1,sp@sp@(4),d1 #-16,d0 d0,d1 L9,a0 #16,d0 d1,d0 L3 a0@ sp@+,d2 d2,d6 L10 touch_undet d2,d1 d5,L11 a4,a5@(4) #9,d5 a5@,sp L11 intr_handler #4,sp Figure 3.15: Assembly code generated for fib. TOUCH Interrupt check 120 CHAPTER 3. LAZY TASK CREATION (compared to 2 secs for the ve instructions required in the SM protocol). Notice that in both protocols, label L6 is the future's return point and L5 is the secondary return point (which jumps past the popping sequence). The frame description information has been removed from the code for clarity. The other di erence is in the interrupt check sequence. The code for the MP protocol has one more instruction to save out TAIL. However, this instruction is in the body of the interrupt check sequence which is executed once out of 10 times. The only accesses to shared memory in the MP protocol are in the body of the interrupt check sequence (a test of the interrupt ag and the saving of TAIL). 3.9 Summary ETC is not an adequate implementation of futures because the overhead of creating a heavyweight task for each future is too high for ne grain programs. LTC postpones the creation of the heavyweight task until it is known to be required. This only happens when another processor needs work (or there is a task suspension, a preemption interrupt, a stack over ow, or a call to call/cc). To do this, LTC uses a lightweight task representation that contains enough information to recreate the corresponding heavyweight task. Lightweight tasks are put in a local task stack that is accessed by three operations: push, pop, and steal. A future translates to pushing the parent task onto the task stack, evaluating the future's body, and then popping the parent task to resume it (assuming it is still on the task stack). Since a task is essentially a continuation, a future is nothing more than a special procedure call. The task stack is the runtime stack and a table (LTQ) that indicates the extent of each continuation on the stack. In principle, the push and pop operations are only one instruction apiece. The KatzWeise continuation semantics and dynamic scoping have no cost for non-stolen tasks because the associated support operations (i.e. copying the future's continuation and the dynamic environment) can also be postponed to the time of the steal. 
Thief processors access the task stack from the bottom (the older task is stolen rst). In divide-and-conquer algorithms this has the advantage of reducing the number of task steals required because the task containing the most work is transferred between processors. A critical issue is which processor extracts the task from the task stack at the time of a steal. In the shared-memory (SM) protocol, the thief accesses the victim's stack and LTQ directly to steal the task. Careful synchronization between the thief and victim is needed to avoid a steal and pop of the same task. An unfortunate consequence of 3.9. SUMMARY 121 the SM protocol is that the stack and LTQ must be accessible to all processors, so they can't be cached optimally on a machine such as the TC2000. This suboptimal caching of the stack causes a sizeable overhead because the stack is one of the most frequently accessed data structures. In the message-passing (MP) protocol, the stack and LTQ are only accessed by the owner processor so they can be fully cached. The thief sends a work request message to the victim which sends back a task from its task stack if one is available. One of the important issues for the MP protocol is the interrupt latency. If it is too large then the thief will lose precious time busy-waiting and it will hinder the exploitation of the machine's parallelism because work distribution will be slow. 122 CHAPTER 3. LAZY TASK CREATION Chapter 4 Polling Eciently The message-passing implementation of LTC relies on a mechanism to communicate messages asynchronously from one processor to another. Such a mechanism must have the ability to interrupt a processor at any time. Conceivably, this could be done using some special feature of the hardware (e.g. interrupt lines of the processor) or the operating system (e.g. the Unix \signal" system). Unfortunately, these solutions are not very portable and a suitable performance cannot be guaranteed across a range of machines. Instead, it is better to consider software methods that are portable and provide a ner control of performance. The idea behind software methods is rather simple. Each processor has a ag in shared memory that indicates whether or not that particular processor has a pending interrupt. The processor periodically checks (i.e. polls ) this ag and traps to an interrupt handling procedure when it discovers that the ag has been raised. The interrupt check code necessary for polling the ag is added by the compiler to the normal stream of instructions required for the program. This unfortunately means that there is an overhead cost for any program, even if interrupts never occur. Minimizing this overhead is thus an important goal. In theory, the compiler could arbitrarily reduce the polling overhead (Opoll ) by decreasing the proportion of executed interrupt checks with respect to the normal instructions executed by the program. If all instructions take unit time then Opoll = Npoll =Ninstr , where Npoll is the number of interrupt checks executed and Ninstr is the number of non interrupt check instructions executed. This strategy lowers the frequency of interrupt checking and consequently increases the time between an interrupt request and the actual acknowledgement by the processor. Average latency (L) and polling over123 124 CHAPTER 4. POLLING EFFICIENTLY Ninstr 1 . 
Note that interrupt latency head are inversely related by L = NpollN+poll = 1 + Opoll here refers to the time interval between interrupt checks and not the time between an interrupt request and its acknowledgement. Here latency is expressed in number of instructions. To account for non-unit time instructions, latency can be expressed in units of time or number of machine cycles. This leads to the de nitions Opoll = Tpoll =Tinstr Tinstr where Tpoll is the total time spent on interrupt checks and Tinstr and L = TpollN+poll the time spent on other instructions. If an interrupt check takes k units of time on av1 ). To simplify the discussion, all instructions will be assumed erage then L = k(1+ Opoll to take unit time. As explained in the previous chapter, increasing the interrupt latency is detrimental to parallel programs because it will take longer to respond to steal requests. This limits the rate at which work can get distributed to other processors. Thus, there is a trade-o between overhead and latency. High latency is preferable for sequential code because the polling overhead is low and low latency is best for parallel code because parallelism can be exploited better. The importance of latency is actually more subtle than this simple statement suggests. A high latency may be appropriate for applications where tasks often suspend on undetermined placeholders. Tasks that become ready following a determine! are made available to other processors by placing them on the HTQ. The HTQ is conveniently accessed through shared memory making it impervious to interrupt latency. If most of the tasks migrate in this fashion to the HTQ, a low latency may not signi cantly improve the rate of work distribution. An \optimal" latency for all programs does not exist because the ratio of sequential to parallel code di ers from program to program. The compiler could select a latency that suits the needs of the particular program, or procedure, being compiled. Even if the compiler had enough information to make such a decision, this strategy is still questionable. Latency requirements vary at runtime as the program switches back and forth between a sequential and parallel mode of execution. A procedure might be called both when latency requirements are low and high, and so a xed polling frequency will give suboptimal performance. One could imagine having multiple versions of each procedure with varying polling frequencies, but this introduces new problems. Instead of further exploring such ad hoc strategies, this chapter addresses the problem of eciently achieving a particular latency with the use of polling. It will be assumed that code duplication is not permitted. The next chapter explores the e ect of interrupt latency on the performance of the parallel benchmark programs. The results indicate that a particular choice of latency performs well for a wide range of programs. 4.1. THE PROBLEM OF PROCEDURE CALLS 125 4.1 The Problem of Procedure Calls Although polling seems simple enough to implement, there is a complication. Normally, programs are not composed of a single stream of instructions. If this were the case the compiler could simply count the instructions it emits and insert an interrupt check after every so many instructions. Branches and procedure calls can alter the ow of control in unpredictable ways and so, it isn't clear how the compiler can achieve a constant number of instructions between interrupt checks. 
A reasonable compromise is to ask of the compiler to emit interrupt checks such that a given latency (Lmax ) is never exceeded. 4.1.1 Code Structure To explore the problem further, it is convenient to introduce a formalism to describe the structure of a procedure's code. In general, the code of a procedure can be viewed as a graph of basic blocks of instructions. There are two special types of basic blocks: entry points and return points . There is a single entry point per procedure and one return point for each procedure call in subproblem position. The only place where branches are allowed is as the last instruction of a basic block. There are four types of branches: local branches (possibly conditional) to other basic blocks of the same procedure, tail calls to procedures (i.e. reductions), non-tail calls to procedures (i.e. subproblems) and returns from procedures. Local branches and non-tail calls are not allowed to form cycles and thus they impose a DAG structure to the code. Loops can only be expressed with tail calls. Note that subproblem and reduction calls always jump to entry points and that procedure returns always jump to return points. These restrictions are important because they simplify the analysis of a program's control ow. Figure 4.1 gives the graph for the procedure for-each which contains all four types of branches. Returns and tail calls have been represented with dotted lines because they do not correspond to DAG edges. Solid lines are used for subproblem calls to highlight the fact that, just like direct branches, it is known where control continues after the procedure returns (if it returns at all). The generality of the DAG is only needed to express the sharing of code. For the moment, it is sucient to make the simplifying assumption that the DAG has been converted into a tree by duplicating each shared branch. The handling of shared code is described in Section 4.4. A necessary condition for any polling strategy is that an inline sequence of more CHAPTER 4. POLLING EFFICIENTLY 126 for-each .... . . S . . S w S . . (car l) . . . . . . ? . (cdr l) . . . . . . (null? l) (define (for-each f l) (if (null? l) #f (begin (f (car l)) (for-each f (cdr l))))) /   #f . . . . . . . . . . . . f . ? Figure 4.1: The for-each procedure and its corresponding code graph. than Lmax instructions is never generated without an intervening interrupt check. The compiler can exploit the code structure for this purpose. A locally connected section is any subset of the basic blocks that is connected by local branches only (for example, the three basic blocks at the top of Figure 4.1 or the bottom one). For any instruction i in a locally connected section, it is easy to determine what instructions are on the path to i from the section's root. These instructions are exactly those that are executed at runtime before i. Thus, for any instruction in a locally connected section, the compiler can tell how far back the last interrupt check occurred (assuming there is one on the same path from that section's root). The number of instructions that separate an instruction from the previous interrupt check is called the instruction's delta 1 . When the delta is Lmax , an interrupt check is inserted by the compiler before the instruction. 4.1.2 Call-Return Polling Polling strategies di er in how the transition between locally connected sections is handled. Call-return polling is a simple polling strategy that consists of putting an interrupt check as the very rst instruction of each section's root. 
Since the root of a section is either the entry point of the procedure or the return point of a subproblem call, this corresponds to polling on procedure call and return. 1 For instructions that are not preceded by an interrupt check in the same section, the de nition of delta will vary according to the polling strategy. 4.2. SHORT LIVED PROCEDURES (define (define (define (define (make-person name (person-name x) (person-age x) (person-gender x) age gender) (vector-ref (vector-ref (vector-ref 127 (vector name age gender)) x 0)) x 1)) x 2)) (define (sum vect l h) ; sum vector from 'l' to 'h' (if (= l h) (vector-ref vect l) (let* ((mid (quotient (+ l h) 2)) (lo (sum vect l mid)) (hi (sum vect (+ mid 1) h))) (+ lo hi)))) Figure 4.2: Two instances of short lived procedures. There are several variations on this theme. The interrupt check at the return point can be removed if checks are put on all return branches. Similarly, the interrupt check at the entry point can be replaced by checks on branches to procedures (both tail calls and non-tail calls). The four possible variations give equivalent dynamic behavior (i.e. same number of interrupt checks executed) but one may be preferable to the others if it yields more compact code. This depends on the particular code generation techniques used by the compiler and the programs being compiled. Compactness of code is not a big issue here so it won't be considered further. 4.2 Short Lived Procedures Unfortunately, call-return polling can break down in certain circumstances. The worst case occurs when procedures are short lived, that is they return shortly after being called. At least two interrupt checks are performed per procedure call in subproblem position (once on entry and once on exit) and one if it is a reduction. This is a signi cant overhead if the procedure contains few instructions. This would not be a serious problem in languages that promote the use of large procedures, but in Lisp it is common to arrange programs into many short procedures. Two instances of this style, typi ed in Figure 4.2, are the implementation of data abstractions and divide and conquer algorithms. This later situation is especially relevant because in Multilisp, parallelism is frequently expressed using divide and conquer algorithms. In binary divide and conquer algorithms, at least half of the recursive calls CHAPTER 4. POLLING EFFICIENTLY 128 6 m MBB B B .. .... ..z . R P B B Interrupt checks  .. .... . . . 9  3  ? 6 Lmax ? Figure 4.3: The maximal delta method. correspond to the base case. If the algorithm is ne grained, such as the procedure sum, the overhead of polling will be noticeable because all the leaf calls are short lived. Putting an interrupt check at every section's root is a very conservative method that doesn't take the structure of the program into account. If it is known that a procedure P is always called when delta is equal to n 1, then the compiler could infer that the rst instruction in P has a delta of n. This would introduce a \grace period" of Lmax n instructions at P 's entry point during which interrupt checks are not needed. A similar statement holds for return points. Note that this yields a perfect placement of interrupt checks if it is carried out at all procedure entry and return points. Interrupt checks occur exactly every Lmax instructions. A more realistic solution is needed to handle the case where procedures and return points are called in di erent contexts (i.e. from call sites with di erent deltas). 
A simple extension to the previous method is to use m instead of n, where m is the maximum delta of all call sites to P (and similarly for return points). This maximal delta method is illustrated in Figure 4.3, where dark rectangles are used to represent interrupt check instructions. Note that delta now represents an upper bound on the number of non interrupt check instructions preceding an instruction. The maximal delta method is not an ideal solution for two reasons. First, it forces all control paths through P to have an early interrupt check (in P) if just one call site to P has a high delta. It would be much better if each procedure call "paid its own way", meaning that polling should be put on the call sites with high deltas. Not only would this improve P's grace period, it would put the interrupt check where it causes the least overhead, because a high delta at a call site is a sign of a high number of normal instructions preceding it (for simplicity, it is assumed here that all paths to P are equiprobable).

A second shortcoming of this method is that the source and destination of procedure calls have to be known at compile time. In Scheme this information is not generally available, although one could reasonably argue that with the use of programmer annotations and/or control flow analysis the destination of most procedure calls could be inferred by the compiler for typical programs. However, the destination of returns is harder to determine because it would require a full data flow analysis of the program, and in general there are multiple return points for each procedure. The existence of higher order functions is another source of difficulty.

4.3 Balanced Polling

This section presents a general solution that does not rely on any knowledge of the control flow of the program. The method could be extended with appropriate rules, such as maximal delta, to better handle the cases where control flow information is available, but this is not considered here. The idea is to define polling state invariants for procedure entry and exit. The polling strategy expects these invariants to be true at the entry and return points of all procedures and consequently must arrange for them to be true at procedure calls and returns. Specifically, the invariant at procedure entry is that interrupts have been checked at most Lmax − E instructions ago. Here E is the grace period at entry points and is constant for all procedures. In other words, delta is defined to be Lmax − E at entry points. The invariant at procedure return is more complex: either delta is less than E, or the path from the entry point to the return instruction is at most E instructions. These invariants are represented in Figure 4.4. Procedure P has two branches that illustrate the two cases for procedure return. Note that a procedure can be exited by a procedure return as well as a reduction call. For now, reduction calls will be ignored to simplify the discussion.

4.3.1 Subproblem Calls

These invariants have important implications. To begin with, short lived procedures are handled well because there is no need to check interrupts on any path that returns quickly without a call to another procedure (i.e. with less than E non-call instructions). This corresponds to the rightmost path in Figure 4.4.

[Figure 4.4: Procedure return invariants in balanced polling. The figure shows the call sites to P, each at most Lmax − E instructions after the last interrupt check, the entry point of P, and two paths through P illustrating the two cases for procedure return: one in which an interrupt check occurs at most E instructions before the return, and one in which the whole path from the entry point to the return is at most E instructions.]

Moreover, the delta at return points can be defined as E plus the delta for the corresponding call point. This can be confirmed by considering the two possible cases. Assume procedure P1 does a subproblem call to procedure P2 which eventually returns back to P1 via a procedure return in P2, i.e.

    P1  --subproblem call-->  P2  --procedure return-->  P1

Either the last interrupt check was in P2, so by definition delta at the return point (in P1) is less than E. Alternatively, P2 was short lived and didn't check interrupts, so there are at most E instructions that separate the call site (in P1) from the return point (in P1). As far as polling is concerned, a procedure called in subproblem position can be viewed as an interrupt check free sequence of E instructions. The compilation rule here is that if delta at a call point exceeds Lmax − E then an interrupt check is inserted at the call. This rule means that up to ⌊Lmax/E⌋ subproblem procedure calls can be done in sequence without any interrupt checking. To see why, consider the scenario where the first call is immediately preceded by an interrupt check. At the return point, delta is equal to E. If the instructions for argument setup and branch are ignored, delta at the nth return point is n × E. Only when this reaches Lmax is an interrupt check needed.

4.3.2 Reduction Calls

As described, the polling strategy does not handle reduction procedure calls (tail calls) very gracefully. The case to consider here is when a subproblem call is to a procedure which exits via a series of tail calls, finally ending in a procedure return, i.e.

    P1  --subproblem call-->  P2  --reduction call-->  P3  --> ... -->  Pn−1  --reduction call-->  Pn  --procedure return-->  P1

An interrupt check must always be put at a reduction call point to guard against the case where the called procedure returns quickly without checking interrupts (as in Pn−1 calling Pn). Note that the return point in P1 can have a delta as low as E. Note also that Pn might execute as many as E non interrupt check instructions before returning to the return point in P1. Thus, it is not valid for Pn−1 to jump to Pn with a delta greater than 0 because this would violate the polling invariant at the return point in P1. The treatment of reductions can be improved by introducing a new parameter (R) and adjusting the polling invariants to support it. R is defined as the largest admissible delta at a reduction call. Thus, an interrupt check is put on any reduction call whose delta would otherwise be greater than R. Note that the same polling behavior as before is obtained by setting R to 0. The polling constraints for reduction calls can be relaxed by increasing the value of R. R can be as high as Lmax − E because a reduction call might be to a procedure that doesn't check interrupts for as many as E instructions. A new invariant for return points has to be formulated to accommodate R. The delta at return points must now be at least E + R to account for the case explained previously (a chain of reduction calls from P2 to Pn ending in a procedure return to P1). That is, on return to P1 there could be up to E instructions in Pn plus as many as R instructions at the tail of Pn−1 since the last interrupt check.
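To make these invariants concrete, take the values that will be used for the experiments of Chapter 5, E = R = 15 and Lmax = 90 (purely as an illustration; any values with R ≤ Lmax − E would do). A return point must then assume that as many as E + R = 30 instructions have executed since the last interrupt check: up to R = 15 at the tail of Pn−1 after its last check, plus up to E = 15 check-free instructions in Pn. Consequently, in a run of subproblem calls that starts right after an interrupt check, delta is at least 30 at the first return point, 45 at the second, and so on, so five calls can be made in sequence before a sixth would push delta past Lmax = 90 and force a new check.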
When the compiler encounters a subproblem procedure call it sets the delta at the return point to E plus the larger of R and the delta at the corresponding call point. If this value is greater than Lmax an interrupt check is first put at the call site and the delta at the return point is set to E + R. The introduction of R also makes it possible to relax the invariant for procedure returns. Since the delta for return points is at least E + R, a delta as high as E + R can be tolerated at procedure returns without requiring an interrupt check. With these new invariants, there can be up to ⌊(Lmax − R)/E⌋ subproblem procedure calls in sequence without interrupt checks. This polling strategy will be called balanced polling. A summary of the compilation rules for balanced polling is given in Figure 4.5.

    Location                 Action by compiler
    Entry point              delta <- Lmax - E
    Non-branch instruction   if (delta >= Lmax - 1) then { add interrupt check; delta <- 0 };
                             delta <- delta + 1 (for the next instruction)
    Subproblem call          if (delta > Lmax - E) then { add interrupt check; delta <- 0 };
                             delta <- E + max(R, delta) (for the return point)
    Reduction call           if (delta > R) then add interrupt check
    Procedure return         if (delta > E + R) and there is an interrupt check on the path
                             from the procedure's entry point then add interrupt check

    Figure 4.5: Compilation rules for balanced polling.

The two constants E and R must be chosen carefully to achieve good performance. Small values for E and R increase the number of interrupt checks for short lived procedures and tail recursive procedures respectively. On the other hand, high values increase the number of interrupt checks in code with many subproblem procedure calls (e.g. recursive procedures). Choosing E = R = ⌊Lmax/k⌋ is a reasonable compromise and a value of k = 6 gives good performance in practice. This suggests that there are typically fewer than 6 subproblem procedure calls per procedure in the benchmark programs (see Section 4.6).

4.3.3 Minimal Polling

The choice of Lmax is also an issue. A high Lmax will give a low polling overhead. However, it is important to realize that there is a limit to how low the polling overhead can be made by increasing the value of Lmax. This is due to the conservative nature of the strategy. Whatever the values of Lmax, E and R are, at least one interrupt check is generated between the entry point and the first procedure call. Delta is Lmax − E on entry to a procedure, so clearly the first call (reduction or subproblem) must be preceded by an interrupt check. Similarly, there is at least one interrupt check between any return point and the exit of the procedure (return or reduction call) because delta at any return point is at least E + R. These two types of paths are the only ones that are a necessary part of any unbounded length path. Thus, it is sufficient to have one interrupt check on each of these paths to guarantee that all possible control paths have a bounded number of instructions between interrupt checks. This minimal polling strategy is useful because its overhead is a lower bound that can be used to evaluate other techniques. An example of minimal polling for the procedure sum and the tail recursive variant tr-sum is presented in Figure 4.6. For the call (sum v l h) there are exactly 2 × (h − l) interrupt checks executed, or nearly one interrupt check per procedure call (assuming h − l + 1 is a power of two). By comparison, checking interrupts at procedure entry and exit would execute twice as many interrupt checks (two per procedure call).
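Before turning to the tail recursive variant, it may help to see that the rules of Figure 4.5 amount to a very simple pass over the code. The following is only a sketch under simplifying assumptions (one linear control path, no join points, hypothetical instruction and helper names), not Gambit's actual implementation: it threads delta through a list of abstract instructions and inserts the symbol poll wherever an interrupt check is required.

    ;; E, R and Lmax are the polling parameters defined in the text.
    (define E 15)
    (define R 15)
    (define Lmax 90)

    ;; path is a list of the symbols: simple (non-branch instruction),
    ;; call (subproblem call), jump (reduction call) and return.
    (define (balance path)
      (let loop ((path path)
                 (delta (- Lmax E))   ; entry point invariant
                 (polled? #f)         ; has a check been emitted since entry?
                 (acc '()))
        (if (null? path)
            (reverse acc)
            (let* ((i (car path))
                   (poll?              ; does this instruction need a preceding check?
                    (case i
                      ((simple) (>= delta (- Lmax 1)))
                      ((call)   (> (+ E (max R delta)) Lmax))
                      ((jump)   (> delta R))
                      ((return) (and (> delta (+ E R)) polled?))
                      (else #f)))
                   (d (if poll? 0 delta))       ; delta just before this instruction
                   (next                        ; delta for the following instruction
                    (case i
                      ((simple) (+ d 1))
                      ((call)   (+ E (max R d)))  ; the return point invariant
                      (else 0))))                 ; jump and return end the path
              (loop (cdr path)
                    next
                    (or polled? poll?)
                    (cons i (if poll? (cons 'poll acc) acc)))))))

With these values, (balance '(simple simple call call call return)) yields (simple simple poll call call call poll return): the first call must be preceded by a check because delta has already reached 77 at that point, the next two calls are free, and the return is preceded by a check because its delta (60) exceeds E + R = 30.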
For the tail recursive procedure tr-sum, on the other hand, both methods are essentially equivalent, with one interrupt check per iteration.

[Figure 4.6: Minimal polling for the recursive procedure sum and a tail recursive variant. The figure shows the flowgraph of sum, which sums the elements of a vector range by divide and conquer, and that of the tail recursive variant

    (define (tr-sum vect s i)
      (if (< i 0)
          s
          (tr-sum vect (+ s (vector-ref vect i)) (- i 1))))

with the position of each interrupt check marked.]

It is interesting to note that balanced polling is more general than minimal polling and call-return polling. These can be emulated by judiciously choosing E, R and Lmax. Minimal polling is obtained when 0 ≪ E = R ≪ Lmax (i.e. E and R are arbitrarily large and Lmax is arbitrarily larger). An interrupt check is put at the first call and another one is put at the return or reduction call that follows the last return point. Call-return polling occurs when E = 0 and R = Lmax. This places interrupt checks at all entry points and return points.

4.4 Handling Join Points

It has been assumed that the code of procedures is in the form of a tree. However, the compilation of conditionals (e.g. and, or, if and cond) in subproblem position introduces join points that give a DAG structure to the code. Certain optimization techniques, such as common code elimination, can also produce join points to express the sharing of identical code branches. A simple approach for join points is to use the maximal delta method. That is, the delta at the join point is the maximum delta of all branches to the join point. Although this is not an optimal strategy, its performance on the benchmark programs seems sufficiently good to be content with it.

4.5 Polling in Gambit

Polling is a general mechanism that can serve many purposes. In Gambit, polling is used for

- Stack overflow detection
- Inter-processor communication (for stealing work)
- Preemption interruption (for multitasking)
- Inter-task communication (for interrupting tasks)
- Barrier synchronization (e.g. for synchronizing all processors for a garbage collection and to copy objects to the private memory of every processor)

A special technique is used to check all these cases with a single test. The interrupt flag in shared memory is really a pointer that is normally set to point to the end of the area available for the stack. An interrupt check consists of comparing the flag to the current stack pointer, and of jumping to an out of line handler when the stack pointer exceeds this limit. A processor can be interrupted by setting the flag to a value that forces this situation (e.g. 0). The interrupt handler can then use some other flags to discriminate between the possible sources of interrupt. Although it can be done with a single test, the interrupt check may still be relatively expensive due to the reference to shared memory. Increasing Lmax is not a viable solution because the polling frequency can't be lowered beyond a certain point. To provide a finer level of control, interrupts can be checked intermittently. Polling instructions generated by the compiler represent "virtual" interrupt check points and an actual interrupt check occurs only every so many virtual checks. This new parameter is the intermittency factor and is called I.
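The single-test check and the intermittency factor can be pictured with the following schematic model. It is only a sketch with hypothetical names: in Gambit the flag lives in shared memory and the comparison is a couple of machine instructions against the real stack pointer, whereas here everything is simulated with Scheme variables.

    ;; Schematic model of the interrupt check, not Gambit's actual code.
    (define stack-limit 10000)            ; end of the area reserved for the stack
    (define interrupt-flag stack-limit)   ; shared; normally equal to the stack limit
    (define I 10)                         ; intermittency factor
    (define countdown I)                  ; private counter of virtual checks

    (define (virtual-check! stack-pointer)
      ;; Emitted by the compiler at every polling point; only one virtual
      ;; check in I looks at the shared flag.
      (set! countdown (- countdown 1))
      (if (= countdown 0)
          (begin
            (set! countdown I)
            (if (>= stack-pointer interrupt-flag)    ; the single test
                (handle-interrupt stack-pointer)))))

    (define (handle-interrupt stack-pointer)
      ;; Out of line handler: other flags would be consulted here to tell a
      ;; stack overflow from a steal request, a preemption, a barrier, etc.
      (if (>= stack-pointer stack-limit)
          (error "stack overflow")
          (set! interrupt-flag stack-limit)))   ; service the interrupt, re-arm the flag

    (define (interrupt-processor!)
      ;; Done (conceptually) by another processor: force the next real check
      ;; to enter the handler whatever the stack pointer is.
      (set! interrupt-flag 0))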
Intermittent checking is easily implemented by a private counter that is decremented at every virtual check. When it reaches zero it is reset to I and the interrupt check is performed. The average cost of an interrupt check will thus be the cost of updating and checking the counter plus 1=I th the cost of checking the interrupt ag. An interesting optimization occurs here. Balanced polling has a tendency to put the interrupt checks at branch points. An interrupt check itself involves a branch instruction so in many cases it is possible to combine the two branches into a single one. Moreover, several machines have a combined \decrement and branch" instruction that helps reduce the cost even further. All these ideas are implemented in Gambit. 4.6 Results To have a better idea of the polling overhead that can be expected from these polling methods, it is important to measure the overhead on actual programs. Two situations are especially interesting to evaluate: the overhead on typical programs and on pathological programs that are meant to exhibit the best and worst performance. Several programs and polling methods were tested. The programs were run on the GP1000 using a single processor. Each program was compiled in four di erent ways: with no interrupt checks, with minimal polling, with call-return polling and balanced polling. For balanced polling, Lmax was set to values from 10 to 90 and E and R were set at bLmax =6c. A value of I = 10 was used as the intermittency factor. The average run time on ten runs was taken for each situation. The polling overhead of minimal polling over the program compiled with no interrupt checks is reported in the rst column of Table 4.1. The overhead for the other polling methods is expressed relatively to the overhead of minimal polling. Thus a relative overhead of 2 means that the 4.6. RESULTS 137 overhead is twice that of minimal polling. Overheads lower than one can be explained by a combination of factors: timing inaccuracies and degradation of instruction cache performance (due to the di erent loading location of the programs). The table also gives the average latency obtained with minimal polling and balanced polling (at Lmax = 10 and Lmax = 90). The latency for compiler is not shown because the number of interrupt checks executed was not available (to measure it, the program must be compiled with a statistics gathering option which increases the size of the code so much that it can not t anymore on the GP1000!). The program tight, shown below, was designed to exhibit worst-case behavior. (define (tight n) (if (> n 0) (tight (- n 1)))) It is a tight loop that doesn't do anything except update a loop counter. There are only two instructions executed on every iteration: an increment and a conditional branch. Interrupt checks will clearly add a high overhead to this. For most polling methods the overhead is about 80%. In the case of balanced polling with Lmax = 10 the overhead is roughly twice that because two interrupt checks get added to every loop (because E = R = 1). The program unfolded is the same loop as tight but unfolded 80 times. Thus, it is a long inline sequence of 80 decrements followed by one conditional branch instruction. The polling methods do well on this program (about 6% for minimal and call-return polling) because procedure calls are relatively infrequent and it is easy to handle the inline sequence of instructions. As expected for balanced polling, increasing Lmax decreased the overhead, down to about 14%. 
Lmax would have to be higher than 486 (i.e. 6  81) to reduce the overhead to that of minimal polling (at Lmax = 90 there are two interrupt checks per loop). The other programs are from the standard set of benchmarks. The parallel programs were compiled as sequential programs (i.e. with futures and touches removed) to factor out the overhead of supporting parallelism. The results for these programs indicate that minimal polling outperforms call-return polling in nearly all cases. Sometimes by as much as a factor of four, but by a factor closer to 1.7 on average. The largest di erences occur for ne grain recursive programs (e.g. tak and fib) and programs with a profusion of data abstraction procedures (e.g. conform). The performance of balanced polling is rather poor for small values of Lmax , two to three times the overhead of minimal polling when Lmax = 10. However, 138 CHAPTER 4. POLLING EFFICIENTLY CallMinimal polling return Balanced polling L (secs) Opoll L polling Rel. ov. when E = R = bLmax =6c and Lmax is for Lmax = Program (%) (secs) Rel. ov. 10 20 30 40 50 60 70 80 90 10 90 tight 83.9 13 1.0 2.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 11 13 unfolded 6.1 154 0.9 10.8 6.5 4.2 3.5 3.7 3.9 2.3 2.3 2.3 22 83 boyer 21.5 58 1.4 1.7 1.1 1.0 1.0 1.1 1.0 0.9 0.9 0.9 36 57 browse 14.7 90 1.1 1.6 1.1 0.8 1.0 1.7 1.2 1.0 1.0 0.9 46 88 cpstak 10.9 108 1.2 1.9 1.5 1.2 1.0 1.1 1.0 1.0 1.0 1.1 59 110 dderiv 9.0 93 1.6 2.1 1.4 1.6 1.2 1.0 1.3 1.3 1.2 1.3 53 95 deriv 8.1 114 1.4 1.8 1.4 1.8 1.1 0.9 1.0 1.0 1.1 1.2 62 115 destruct 21.3 34 1.1 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 20 34 div 14.1 49 1.0 1.3 0.9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 38 49 puzzle 14.5 58 0.9 2.1 1.7 1.2 1.0 1.0 1.0 0.9 0.9 0.9 28 57 tak 8.7 71 4.6 3.9 1.4 1.8 1.2 1.0 1.0 1.0 1.0 1.0 25 71 takl 29.3 38 0.9 1.5 1.0 1.1 1.0 0.9 0.9 0.9 0.9 0.9 21 36 traverse 16.9 36 1.5 2.5 1.3 0.9 0.9 0.9 0.9 0.9 0.9 0.9 27 35 triangle 3.9 63 3.7 6.0 6.0 3.2 3.8 2.4 2.1 2.3 1.0 2.0 38 65 compiler 14.4 | 1.8 2.3 1.3 1.1 1.0 1.0 1.0 1.1 1.0 1.0 | | conform 10.5 34 2.5 2.8 1.7 1.3 1.1 1.2 1.4 1.3 1.4 1.2 18 34 earley 6.4 120 1.5 2.3 1.6 1.5 1.0 1.1 2.1 0.8 1.1 1.2 59 122 peval 9.7 98 1.7 2.2 1.5 1.0 1.1 1.1 1.3 1.0 1.0 1.1 50 98 abisort 11.4 72 1.3 2.5 1.7 1.4 1.4 1.0 1.1 1.1 1.0 1.0 36 72 allpairs 4.4 149 1.0 3.9 2.6 2.0 2.0 2.0 0.5 1.8 1.0 1.0 56 149 fib 18.7 36 2.1 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 21 36 mm 4.7 115 1.1 3.0 2.7 3.0 1.6 2.2 2.2 0.8 0.9 0.9 61 114 mst 10.5 122 1.6 2.2 1.5 2.1 1.0 1.2 1.0 0.8 1.1 1.0 53 122 poly 23.4 101 0.6 1.2 0.7 0.7 0.3 0.4 0.8 0.6 0.4 0.6 39 93 qsort 12.3 64 1.3 1.9 1.3 1.0 1.3 1.0 1.0 1.0 1.0 1.0 44 63 queens 15.2 48 1.4 3.0 1.5 1.5 1.5 1.4 1.3 1.3 1.2 1.3 26 50 rantree 11.4 101 2.5 2.2 1.2 0.9 1.4 1.1 1.3 1.0 1.0 0.9 41 100 scan 6.6 84 2.4 3.5 2.0 0.8 0.8 1.2 1.0 1.0 1.0 1.0 39 84 sum 11.8 66 1.8 2.5 1.4 0.7 0.5 0.9 1.0 0.8 0.9 0.8 30 65 tridiag 1.6 364 2.7 7.9 4.5 4.3 4.2 3.4 3.7 3.9 3.0 3.6 103 259 Table 4.1: Overhead of polling methods on GP1000. 4.7. SUMMARY 139 balanced polling gives performance close to minimal polling when Lmax is high. With Lmax = 90 the polling overhead ranges from 5% to 25%. The highest overheads are for ne grain recursive programs. The average overhead for balanced polling is about 12% for values of Lmax higher than 50. 4.7 Summary Interrupts can be detected by the processor's hardware interrupt system or by polling. Polling has the advantage of simplicity and portability. A common claim is that polling is not appropriate for a high-performance system because it has a high overhead. 
This chapter described the balanced polling method whose overhead is almost half that of the more straightforward call-return polling method. Balanced polling as implemented on the GP1000 has a 12% overhead on average. This overhead still seems rather high but this can be explained by the high quality of code generated by Gambit and the poor instruction set of the M68020 processors on the GP1000. Systems with a compiler that generates less tight code or with a processor that permits a lower cost code sequence for an interrupt check (for example, a fast \compare and trap on condition" instruction) would have a correspondingly lower overhead for polling. Clearly, the processor's hardware interrupt system should be used to implement the MP protocol if the interrupt latency and overhead are low enough and the state of the processor at the time of interrupt can be recovered conveniently. If not, polling is at least a viable alternative. 140 CHAPTER 4. POLLING EFFICIENTLY Chapter 5 Experiments Performance is the main design objective of the implementation strategies presented in this thesis. In most cases a purely theoretical performance analysis is not satisfying because it must abstract away many real issues to make the analysis manageable. The goal of this chapter is to evaluate performance using experiments. Concrete evidence for the following claims is given 1. Exposing parallelism with LTC is relatively inexpensive when the MP protocol is used. The worst-case overhead (when programs are very ne grain) is about 20%. 2. In the absence of a cache, the overhead of exposing parallelism with the SM protocol is about twice that of the MP protocol (i.e. the worst-case overhead is about 40%). When a cache is available, the overhead for the SM protocol can be higher than a factor of two. 3. LTC scales well to large shared-memory multiprocessors. The two protocols have very similar speedup characteristics when a cache is not present. 4. The MP protocol has speedup characteristics that are consistently better than the SM protocol on multiprocessors with caches. The di erence in performance when using a large number of processors is as high as a factor of two on the TC2000. 5. The steal request latency can be relatively large without adversely a ecting the MP protocol's performance. 6. Supporting the Katz-Weise semantics and legitimacy generally has a negligible impact on performance. 141 CHAPTER 5. EXPERIMENTS 142 5.1 Experimental Setting Several experiments were conducted to evaluate and compare the various implementation strategies. The experiments consisted of running each benchmark program in a particular context and measuring some of its characteristics. The context was dependent on the following parameters.  Machine and compiler | The experiments were performed on the GP1000 and TC2000 multiprocessors. Each of the M68020 processors on the GP1000 delivers roughly 3 MIPS and each M88000 processor on the TC2000 delivers roughly 20 MIPS. Only the TC2000 has a data cache. Each machine has its own version of the compiler (but the front-ends are the same). The back-end for the GP1000 generates highly optimized native code, whereas the version for the TC2000 generates portable C code which must be subsequently compiled with a C compiler. The price to pay for this portability is a slowdown of a factor of 1.5 to 3 over native code depending on the program. The slowdown is a result of extra \pure computation" instructions. 
The number of memory accesses would however be the same in a native code implementation. This means that the importance of the TC2000's memory hierarchy is lower than it would be if the back-end generated native code. Consequently, the results obtained with the GP1000 are more representative of a high-performance compiler and the results obtained on the TC2000 are more representative of a modern multiprocessor with a low cost memory hierarchy. A severe handicap of these machines is the small size of physical memory. The local memory on each processor is only 4 Mbytes on the GP1000 and 8 Mbytes on the TC2000. Since this memory holds the operating system's code and data structures and the program's code, little space is left for the program's heap (only about 2 Mbytes on the GP1000). Allocating virtual memory is not a solution because it adversely a ects the performance of garbage collection and also because it doesn't scale well (page faults are handled by a small set of processors dedicated for this purpose). To minimize these problems, the benchmarks were chosen so that the data they allocate ts in the heap without causing any garbage collection. In an e ort to reduce the number of page faults, the benchmarks perform a few \dry runs" before the run actually measured. Nevertheless, some memory intensive programs, allpairs and poly in particular, consistently caused page faults due to their poor locality of reference.  Number of processors | One of the goals of this thesis is to show that LTC scales well to large shared-memory multiprocessors. For this reason, the experi- 5.1. EXPERIMENTAL SETTING 143 ments were conducted on the largest machines that were accessible: a 94 processor GP1000 (at Michigan State University) and a 45 processor TC2000 (at Argonne National Laboratory). These are multi-user machines where processors are dynamically allocated into partitions at the time the program is launched by the user. The program is only aware of the processors in its partition but, because the memory interconnect is a butter y network shared by all the partitions, the contention on the network depends on the other programs running on the machine. To minimize this e ect, experiments were performed at \o -peak" hours and the average of several runs (usually 10) was taken. However, it was dicult to nd times where large partitions could be allocated, so it was necessary to limit the number of experiments and number of runs for the larger partitions (this explains, at least in part, the greater variations in the results on large partitions). The largest partition used on the GP1000 was 90 processors; on the TC2000 it was 32 processors. Another problem aicts large partitions. Each processor on the GP1000 and TC2000 has a limited size TLB (translation lookaside bu er) for holding the mapping information that is used to translate virtual addresses to physical addresses. The TLB is managed like a cache and has roughly 60 entries. Each entry maps a page of the program's virtual address space. When a memory reference is to a page not currently mapped by the TLB, a translation fault occurs and the operating system must load the appropriate mapping information into the TLB from a table in memory. Translation faults must be avoided because they are handled in software and are relatively expensive. Programs with poor locality of reference and that have more than 60 or so pages in their working set will cause frequent translation faults. 
Unfortunately, several of the benchmarks have poor locality because they distribute user data evenly across the machine to reduce contention. The working set of these programs increases with the number of processors and thrashing occurs when the working set exceeds 60 pages (this typically starts happening somewhere between 32 and 64 processors but the exact point depends on the program). Moreover, poor locality is inherent in the search for a task to steal, which possibly "flushes" several entries from the TLB that are part of the stolen task's working set. The importance of this factor will increase with the number of processors and the scarcity of tasks to steal.

- Polling parameters | Balanced polling with E = R = 15 and Lmax = 90 was used for all experiments. The steal request latency was controlled by changing the polling intermittency factor I. Unless otherwise indicated, I was set to 10 (the same value used in the previous chapter to evaluate the polling methods).

- Stealing protocol | Both the SM and MP protocols were tested.

- Continuation semantics | Two continuation semantics were used: the original Multilisp semantics and the Katz-Weise semantics. On the GP1000, the original semantics was used with the SM protocol and the Katz-Weise semantics was used with the MP protocol. The TC2000 used the original semantics for both protocols. For the original semantics the transfer of the stolen task's continuation was performed with a single block transfer operation. The Katz-Weise semantics was implemented with heapification.

- Legitimacy | Unless otherwise indicated, legitimacy was not supported.

5.2 Overhead of Exposing Parallelism

Oexpose corresponds to the cost of exposing the parallelism to the system. Part of this cost comes from the futures and touches added to the sequential program to parallelize it. The other part of the cost is a consequence of the less efficient caching policy that is needed for the SM protocol. Recall that Tseq is the run time of a sequential version of the program (the parallel program with futures and touches removed) and Tpar is the run time of the parallel program on one processor. Tpar, Tseq, and Oexpose are related by the equation

    Oexpose = Tpar / Tseq

To evaluate Oexpose, the run time was measured on a single processor partition with the program compiled with and without futures and touches (giving Tpar and Tseq respectively). Tpar and Oexpose are given on the left side of Tables 5.1 through 5.4. The first two tables are for the SM and MP protocols on the GP1000 and the last two tables are for the SM and MP protocols on the TC2000. On the TC2000, the stack was write-through cached for measuring the SM protocol's Tpar and the stack was copy-back cached for measuring Tseq and the MP protocol's Tpar. Notice that for nearly all programs, the SM protocol has an Oexpose larger than the MP protocol. The only exceptions are the programs mm and abisort on the GP1000.

5.2.1 Overhead on GP1000

On the GP1000, Oexpose is closely dependent on G, the task granularity, and n, the number of closed variables that must be copied for the future's body (Tables 2.1 and 3.1 give the value of G and n for each benchmark). Oexpose is approximately equal to 1 + (2.7 + 1.6n) μsecs / G when using the SM protocol and 1 + (1.4 + 1.6n) μsecs / G when using the MP protocol.
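As a quick illustration of this approximation (the granularity and number of closed variables below are made-up values, not measurements from the tables):

    ;; Estimated Oexpose from the approximation above; g is the task
    ;; granularity in usecs and n the number of closed variables copied
    ;; for the future's body.  The constants 2.7, 1.4 and 1.6 usecs come
    ;; from the task creation and touch costs quoted in the text.
    (define (estimated-expose-overhead protocol g n)
      (let ((fixed (case protocol ((sm) 2.7) ((mp) 1.4))))
        (+ 1 (/ (+ fixed (* 1.6 n)) g))))

    ;; With the purely illustrative values g = 10 usecs and n = 1:
    ;;   (estimated-expose-overhead 'sm 10 1)  =>  1.43   (i.e. 43% overhead)
    ;;   (estimated-expose-overhead 'mp 10 1)  =>  1.30   (i.e. 30% overhead)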
This is consistent with the costs measured in Chapter 3 for the lightweight task push and pop sequence, 2 secs for the SM protocol and .7 sec for the MP protocol, and the .7 sec cost for a touch (most programs have the same number of touches and futures). For the SM protocol, Oexpose is at its lowest value (.3%) for allpairs, the program with the largest granularity. The highest overhead (37.5%) is for fib, the program with the smallest granularity. For the MP protocol, allpairs and fib also yield the lowest and highest overheads (.2% and 20.8%). This is about half the overhead of the SM protocol. 5.2.2 Overhead on TC2000 On the TC2000, Oexpose for the MP protocol ranges from 2.3% to 20.9%, which is essentially the same range as for the GP1000. However, Oexpose for the SM protocol is much larger, ranging from 27.1% to 127.8%. The highest overhead is for fib, which runs a factor of 2.278 slower than the sequential version of the program. For the MP protocol, the overhead for fib is only 15.7%. The large di erence in overheads is mostly due to the SM protocol's use of write-through caching for the stack and LTQ. According to column OWT of Table 3.3, write-through caching the stack accounts for an overhead of 1.34 on sequential fib. Thus, the additional overhead of the parallel version (to go from 1.34 to 2.278) is attributable mostly to the three stack and LTQ writes performed for each future. On the other hand, the overhead of coarse grain programs is closer to OWT . For example, allpairs has an OWT of 54% and an Oexpose of 55%. 5.3 Speedup Characteristics The right side of Tables 5.1 through 5.4 provides some information on the parallel behavior of the programs. The programs were run on increasingly large partitions (up to 90 processors on the GP1000 and 32 processors on the TC2000) to see how well they exploit parallelism. For the GP1000, three measurements were taken: the run time of the program, the number of heavyweight tasks created, and the number of task CHAPTER 5. 
EXPERIMENTS 146 Program Tpar Oexpose fib 1.1300 37.5% queens 1.3080 19.3% rantree .4550 14.9% mm 1.5760 1.1% scan 1.2960 21.8% sum .4820 22.6% tridiag 4.0320 1.7% allpairs 24.9530 .3% abisort 5.0710 6.9% 25.1160 7.0% mst qsort poly .2630 25.8% 2.4340 6.3% S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= 2 1.45 .0000 .0000 1.66 .0003 .0000 1.68 .0012 .0004 1.20 .0005 .0001 1.26 .0001 .0000 1.22 .0001 .0000 1.20 .0004 .0001 | | | .62 .0001 .0000 | | | | | | | | | Speedup, TC and TS when number of processors is 4 8 16 32 64 90 2.82 5.47 10.33 17.79 27.04 31.37 .0002 .0005 .0008 .0019 .0039 .0042 .0001 .0001 .0003 .0006 .0012 .0014 3.16 5.70 9.75 15.10 19.16 18.21 .0015 .0042 .0083 .0152 .0305 .0404 .0003 .0008 .0019 .0036 .0076 .0117 3.18 5.41 8.84 11.86 14.04 13.38 .0025 .0085 .0190 .0346 .0593 .0722 .0011 .0039 .0071 .0121 .0204 .0259 1.88 3.24 5.90 10.14 15.94 18.34 .0091 .0238 .0491 .1048 .1830 .2408 .0018 .0056 .0099 .0214 .0416 .0598 2.13 3.61 6.43 10.21 13.54 13.57 .0009 .0022 .0045 .0083 .0155 .0201 .0001 .0002 .0005 .0008 .0015 .0022 2.09 3.72 6.55 10.23 11.77 12.20 .0009 .0019 .0041 .0075 .0133 .0171 .0001 .0002 .0004 .0008 .0013 .0021 1.78 2.93 5.18 8.60 12.58 16.51 .0021 .0055 .0126 .0238 .0454 .0631 .0001 .0005 .0014 .0026 .0055 .0073 | | | | | | | | | | | | | | | | | | .76 1.20 2.16 3.60 5.63 6.95 .0013 .0030 .0072 .0167 .0386 .0563 .0001 .0003 .0008 .0019 .0046 .0064 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Table 5.1: Performance of SM protocol on GP1000. 5.3. SPEEDUP CHARACTERISTICS Program fib queens rantree Tpar Oexpose .9930 20.8% 1.2550 14.5% .4460 12.6% mm 1.5830 1.5% scan 1.1900 11.8% sum .4460 13.5% tridiag 3.9880 .6% allpairs 24.9400 .2% abisort mst qsort poly 5.2800 11.3% 24.7990 5.7% .2480 18.7% 2.3580 3.0% S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= 2 1.64 .0000 .0000 1.73 .0003 .0000 1.72 .0012 .0004 1.21 .0006 .0002 1.35 .0001 .0000 1.34 .0001 .0000 1.21 .0004 .0001 1.10 .0104 .0082 .61 .0001 .0000 .93 .0116 .0037 1.33 .0017 .0009 .98 .1510 .0120 147 Speedup, TC and TS when number of processors is 4 8 16 32 64 90 3.22 6.14 11.27 19.21 27.96 32.88 .0002 .0005 .0010 .0020 .0041 .0051 .0001 .0002 .0003 .0006 .0012 .0016 3.27 6.00 10.26 15.39 20.60 22.28 .0016 .0039 .0081 .0171 .0308 .0396 .0003 .0009 .0019 .0041 .0078 .0098 3.24 5.53 8.53 11.48 13.56 14.04 .0031 .0087 .0195 .0371 .0635 .0747 .0014 .0039 .0078 .0141 .0236 .0277 1.85 3.26 5.78 10.23 15.78 18.87 .0094 .0270 .0465 .1017 .1744 .2219 .0016 .0046 .0086 .0190 .0376 .0507 2.23 3.91 6.54 10.27 14.22 14.78 .0008 .0021 .0043 .0081 .0140 .0189 .0000 .0001 .0003 .0006 .0009 .0012 2.23 3.92 6.46 10.13 12.68 13.28 .0010 .0021 .0043 .0076 .0134 .0178 .0000 .0001 .0003 .0006 .0010 .0012 1.79 2.96 5.04 8.49 12.13 15.80 .0023 .0046 .0108 .0221 .0366 .0453 .0001 .0004 .0009 .0018 .0035 .0034 1.58 2.62 4.20 6.27 7.88 7.16 .1254 .2553 .4478 .6910 .8898 .9853 .0194 .0642 .1020 .1321 .1532 .2094 .75 1.23 2.09 3.58 5.57 6.85 .0014 .0030 .0071 .0162 .0347 .0487 .0001 .0002 .0006 .0013 .0024 .0029 1.06 1.38 1.53 1.59 1.37 1.25 .0454 .0889 .1523 .2434 .3294 .3578 .0029 .0055 .0083 .0113 .0147 .0150 1.63 1.60 1.42 1.22 1.02 1.13 .0101 .0399 .1073 .2394 .4334 .5290 .0061 .0232 .0611 .1313 .2245 .1611 1.08 1.27 1.44 1.41 1.22 .76 .3083 .5142 .7066 .8179 .8458 .6338 .0771 .1520 .1626 .1346 .1763 
.0741 Table 5.2: Performance of MP protocol on GP1000. CHAPTER 5. EXPERIMENTS 148 Program fib queens rantree mm scan sum tridiag allpairs abisort mst qsort poly Tpar Oexpose .6763 127.8% .6338 117.3% .1827 39.0% .6576 60.2% .7156 81.3% .2471 94.8% 1.6559 56.0% 12.2866 55.0% 2.9351 27.1% 9.5555 43.2% .1740 52.8% .7275 30.2% Speedup when number of processors is 2 4 8 16 32 .88 1.74 3.23 6.03 11.18 .91 1.77 3.24 5.95 9.49 1.41 2.70 4.60 7.44 10.79 .93 1.64 3.01 5.64 10.07 .94 1.72 3.01 5.51 8.97 .90 1.67 2.86 4.85 7.82 .92 1.58 2.82 5.02 8.49 .95 1.63 2.85 4.85 7.61 .84 1.36 2.41 4.36 7.79 .85 1.21 1.55 1.81 1.85 1.17 1.67 1.75 1.80 1.71 .61 .79 1.01 1.19 1.20 Table 5.3: Performance of SM protocol on TC2000. Program fib queens rantree mm scan sum tridiag allpairs abisort mst qsort poly Tpar Oexpose .3435 .3525 .1391 .4198 .4558 .1430 1.0907 8.1841 2.4107 6.9101 .1294 .5759 15.7% 20.9% 5.9% 2.3% 15.5% 12.7% 2.7% 3.2% 4.4% 3.6% 13.6% 3.1% 2 1.72 1.63 1.82 1.29 1.35 1.40 1.21 1.22 .95 1.04 1.48 .77 Speedup when number of processors is 4 8 16 32 3.37 6.52 11.97 20.46 3.07 5.58 9.11 13.51 3.46 5.76 9.07 11.99 2.18 3.88 7.09 11.40 2.40 4.29 7.57 12.20 2.49 4.43 7.56 11.88 1.98 3.35 5.83 9.44 2.03 3.43 5.56 7.80 1.48 2.49 4.51 7.93 1.34 1.76 1.93 1.92 2.07 2.05 1.95 1.62 .92 1.19 1.43 1.29 Table 5.4: Performance of MP protocol on TC2000. 5.3. SPEEDUP CHARACTERISTICS 149 suspensions that occurred. Each entry in the table contains three values computed from these measurements:  S | This is the program's speedup over the sequential version of the program (i.e. that has futures and touches removed and that is run with copy-back caching of the stack on the TC2000). S= runTseqtime  TC | This is the proportion of lightweight tasks that were transformed into heavyweight tasks. tasks TC= heavyweight Nfuture created  TS | This is the number of task suspensions expressed relatively to the number of lightweight tasks. suspensions TS= number of Ntask future Note that a few of the benchmarks (allpairs, mst, poly, and qsort) did not run properly with the SM protocol on the GP10001. The tables for the TC2000 only contain the speedup. The speedup data is reproduced as speedup curves in Figures 5.1 through 5.6. The speedup curves for the GP1000 also contain data for runs of the MP protocol with higher and lower intermittency factors. For now, only the curves marked I = 10 are considered. TC and TS for the MP protocol on GP1000 are also plotted as a function of the number of processors in Figures 5.7 and 5.8. The benchmark programs can be roughly classi ed in three groups, according to the shape of their speedup curves. 1. Parallel and compute bound: fib, queens, rantree. These programs do not access memory. The speedup curve is initially close to linear speedup, and gradually diverges from it as the number of processors increases (in other words the rst derivative of the curve starts at 1 and the second derivative is negative). The attening out of the curve as the number of processors increases is explained by Amdahl's law (i.e. each program has a maximal speedup). 1 The bug has stumped me to this day. I suspect that it is a race condition I introduced in the assembly language encoding of the algorithms (Gambit's kernel contains about 5000 lines of hand optimized assembly code). After obtaining a working version of the SM protocol on the TC2000 (written in C), I convinced myself that the problem was not algorithmic. The problem may also be related to a known bug in the parallel garbage collection algorithm. 
[Figure 5.1: Speedup curves for fib, queens, rantree and mm on GP1000. Each plot gives speedup as a function of the number of processors, both on logarithmic scales, for the SM protocol at I=10 and the MP protocol at I=50, I=10 and I=2; the legend gives the average interrupt latency L of each MP setting (fib: 198, 41 and 10 μsecs; queens: 281, 58 and 13; rantree: 453, 93 and 20; mm: 579, 117 and 25).]

[Figure 5.2: Speedup curves for scan, sum, tridiag and allpairs on GP1000 (scan: L = 449, 91 and 20 μsecs; sum: 332, 68 and 15; tridiag: 1265, 256 and 53; allpairs, MP protocol only: 755, 152 and 32).]

[Figure 5.3: Speedup curves for abisort, mst, qsort and poly on GP1000 (abisort: L = 359, 74 and 17 μsecs; mst, MP only: 694, 142 and 30; qsort, MP only: 294, 61 and 14; poly, MP only: 1426, 291 and 59).]

[Figure 5.4: Speedup curves for fib, queens, rantree and mm on TC2000, for the SM and MP protocols.]

[Figure 5.5: Speedup curves for scan, sum, tridiag and allpairs on TC2000, for the SM and MP protocols.]
[Figure 5.6: Speedup curves for abisort, mst, qsort and poly on TC2000, for the SM and MP protocols.]

[Figure 5.7: Task creation behavior of MP protocol on GP1000 (TC as a function of the number of processors for each benchmark).]

[Figure 5.8: Task suspension behavior of MP protocol on GP1000 (TS as a function of the number of processors for each benchmark).]

2. Parallel and memory accessing: abisort, allpairs, mm, scan, sum, tridiag. These programs access memory to various extents. The speedup curves for these programs are "S" like (i.e. the second derivative is initially positive and then negative). A good example is abisort. The initial bend in the curve is explained by the increase in cost for accessing shared user data which is distributed evenly across the machine. A memory access has a probability of (n − 1)/n of being to remote memory (where n is the number of processors), so the average cost of an access to shared user data is (L + R(n − 1))/n, where R is the cost of a remote memory access and L is the cost of a local memory access (a brief numeric illustration is given below). The bend in the curve is consequently more pronounced for programs which spend a high proportion of their time accessing the heap (e.g. abisort, allpairs, and mm).

3. Poorly parallel: mst, poly, qsort. These are programs whose algorithms do not contain much parallelism or that contain a form of parallelism that is not well suited for LTC. The speedup curves for these programs are mostly flat because little of the parallelism is exploited. Generally, the curve starts going down after a certain number of processors because no more parallelism can be exploited but other costs, such as contention and memory interconnect traffic, increase.

5.3.1 Speedup on GP1000

On the GP1000, it is striking how similar the tables and speedup curves are for the SM and MP protocols. The speedup, number of tasks created and number of task suspensions are normally within a few percent of each other. Nevertheless, the MP protocol typically has a slightly higher speedup, especially for the fine grain programs. This can be explained by the fact that the difference in Oexpose between protocols is larger for fine grain programs. Recall that on the GP1000, the SM protocol is using the original continuation semantics and the MP protocol is using the Katz-Weise semantics without legitimacy support.
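The access cost formula used in the classification above is easy to make concrete. The sketch below uses illustrative round numbers for the local and remote access costs; they are not measurements of the GP1000.

    ;; Average cost of an access to shared user data that is distributed
    ;; evenly across n processors: a fraction (n-1)/n of the accesses are
    ;; remote.  The costs here are hypothetical, for illustration only.
    (define (average-access-cost local remote n)
      (/ (+ local (* remote (- n 1))) n))

    ;; With a local access costing 1 unit and a remote access 7 units:
    ;;   (average-access-cost 1.0 7.0 2)   =>  4.0
    ;;   (average-access-cost 1.0 7.0 8)   =>  6.25
    ;;   (average-access-cost 1.0 7.0 90)  =>  6.93
    ;; The cost rises quickly for the first few processors and then levels
    ;; off near the remote cost, which is what produces the initial bend in
    ;; the speedup curves of the memory intensive programs.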
Since the speedup characteristics for both protocols are so similar, it follows that the additional work needed to support the Katz-Weise semantics, mostly that of heapi cation, is globally negligible. The cost of supporting legitimacy is examined in a later section. For both protocols, the number of heavyweight tasks created by most programs is a small fraction of what ETC would have created. When fib is run on 90 processors, only about .5% of the lightweight tasks are transformed to heavyweight tasks. As suggested by the curves in Figure 5.7, above 4 processors TC increases roughly linearly with the 158 CHAPTER 5. EXPERIMENTS number of processors. The notable exceptions are allpairs, mst and poly whose TC levels o as it nears 1 and qsort whose TC rst goes up roughly as the square of the number of processors before leveling o as it nears 1. All programs have TC<8% on 90 processors, except mm (22-24%), allpairs (99%), mst (36%), poly (63%), and qsort (53%). The high TC of these programs can be explained by their coarse granularity and low degree of parallelism (except qsort which is explained later). These programs create relatively few lightweight tasks so proportionately more of them need to be stolen to keep the processors working. An extreme example is allpairs which on each iteration creates only 116 lightweight tasks (i.e. the maximum parallelism is 117). It isn't surprising that on a 90 processor partition nearly all of the tasks get stolen to balance the load across the machine. The reason why TC is high for qsort (and also poly), is that most of the stolen tasks perform very little work (i.e. Twork is only a few instructions). Most of qsort's stolen tasks perform a single call to cons before they terminate. A handful of similarly simple operations are performed by poly's stolen tasks. Thieves that have just stolen a task will soon be looking for new tasks to steal so the lightweight tasks that are created are likely to get stolen. Qsort's poor speedup is explained by its high TC and low Twork combined with its ne granularity (G = 16 secs) and heavy remote memory usage (ORemHeap = 3.94). Similarly Figure 5.8 suggests that, above 4 processors, the number of task suspensions increases fairly linearly with the number of processors for most programs. The notable exceptions are allpairs, mst and poly which have a fairly constant TS above 8 processors. 5.3.2 Speedup on TC2000 On the TC2000, the speedup curves for the MP protocol have a similar shape to those for the MP protocol on the GP1000. The actual speedup is however slightly higher for the TC2000. This is probably due to the TC2000's faster memory system combined with the lower quality of code generated by the compiler (which makes the memory system appear even faster). These factors reduce the relative importance of task management operations and memory accesses. Consequently, a native code implementation on the TC2000 would have a lower speedup (but higher absolute performance!). The SM protocol however has a consistently lower speedup than that of the MP protocol. For each protocol, the speedup curve starts o at 1=Oexpose on 1 processor (for their respective Oexpose ) and as the number of processors increases the curves tend 5.4. EFFECT OF INTERRUPT LATENCY 159 to get closer. Programs with good speedup characteristics (e.g. fib and sum) maintain a roughly constant distance between the speedup curves. In other words, the ratio of their run time stays close to the ratio of their Oexpose . 
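As a concrete reading of the previous paragraph, consider fib on the TC2000: Oexpose is 2.278 for the SM protocol and roughly 1.157 for the MP protocol, so the two speedup curves start near 1/2.278 ≈ 0.44 and 1/1.157 ≈ 0.86 on one processor, and since fib scales well they remain close to a factor of two apart over the whole range of partition sizes measured.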
On the other hand, programs with poor speedup characteristics (e.g. mst and qsort) have speedup curves that become colinear at a high number of processors. This can be explained by the progressive decrease of mandatory work being performed by the program. The main cause of the overhead Oexpose , that is suboptimally caching the stack and task stack, mostly a ects the performance of the mandatory work. The relative importance of suboptimally caching the stack will thus decrease as the programs spend more and more time being idle and/or accessing remote memory. The only point where the speedup curves cross is for qsort at 32 processors. However, the same thing should be expected for other benchmarks on larger partitions because, as the number of processors increases, the bene ts of caching decrease whereas the speed of work distribution becomes more critical to performance. Since the SM protocol has a lower steal latency, it will likely outperform the MP protocol on very large partitions. Note however that this might happen at the point where the eciency (i.e. the ratio of speedup and the number of processors) is so low that it is not cost e ective. For instance, the best speedup attained by qsort is 2.07 at 4 processors using the MP protocol whereas the best speedup for the SM protocol is 1.8 at 16 processors. 5.4 E ect of Interrupt Latency In order to study the e ect of the interrupt latency on the performance of the MP protocol, the programs were tested on the GP1000 with lower and higher intermittency factors. The previous experiments were performed with I = 10 and a new set of measurements were taken with I = 2 and I = 50. These changes in I cause the interrupt latency to vary roughly by a factor of 5 (decrease and increase respectively). Tables 5.5 and 5.6 contain for each program the value of Tpar , Oexpose , and for each partition size: S, TC, and TS. Figures 5.1 through 5.3 contain the speedup curves for each setting of I and also give L, the average interrupt latency (L is Tpar divided by the number of interrupt checks executed). Note that the average time before an interrupt is detected (Tdetect ) is L=2. The settings for I were chosen so that Tdetect would be roughly comparable to Tsteal task, the cost of stealing a task. Experimental measurements put Tsteal task at between 120 and 180 secs (depending on the program). When I = 2, Tdetect is normally a fraction of Tsteal task and when I = 50, it is normally larger. CHAPTER 5. 
EXPERIMENTS 160 Program Tpar Oexpose fib 1.1620 41.4% queens 1.4180 29.4% rantree .4880 23.2% mm 1.6840 8.0% scan 1.2850 20.8% sum .4870 23.9% tridiag 4.1510 4.7% allpairs 26.1640 5.1% abisort mst qsort poly 5.9390 25.2% 26.3310 12.2% .2780 33.0% 2.3990 4.8% S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= 2 1.40 .0001 .0000 1.53 .0003 .0000 1.57 .0012 .0004 1.17 .0009 .0003 1.27 .0001 .0000 1.25 .0001 .0000 1.17 .0005 .0000 1.07 .0105 .0080 .58 .0001 .0000 .91 .0120 .0038 1.25 .0011 .0006 .97 .1524 .0116 Speedup, TC and TS when number of processors is 4 8 16 32 64 90 2.76 5.31 9.82 16.69 26.18 30.44 .0002 .0005 .0011 .0022 .0041 .0057 .0001 .0002 .0004 .0007 .0013 .0017 2.92 5.37 9.18 14.31 19.09 21.49 .0016 .0043 .0092 .0182 .0325 .0427 .0003 .0009 .0022 .0047 .0084 .0108 2.97 5.14 8.20 11.02 13.56 14.14 .0031 .0089 .0192 .0395 .0644 .0843 .0014 .0040 .0078 .0157 .0257 .0327 1.82 3.21 5.66 10.02 15.84 18.87 .0122 .0246 .0562 .1081 .1979 .2561 .0026 .0043 .0120 .0237 .0433 .0582 2.13 3.76 6.39 10.02 13.75 14.94 .0011 .0023 .0049 .0092 .0159 .0213 .0001 .0002 .0003 .0007 .0012 .0015 2.11 3.74 6.28 9.73 12.84 13.65 .0010 .0021 .0046 .0089 .0152 .0199 .0001 .0002 .0003 .0008 .0013 .0014 1.75 2.95 5.07 8.44 12.27 16.18 .0025 .0058 .0126 .0229 .0437 .0587 .0001 .0004 .0012 .0020 .0040 .0051 1.56 2.60 4.21 6.59 8.62 7.67 .1253 .2752 .4920 .7422 .9464 .9970 .0219 .0658 .1079 .1422 .1741 .2438 .74 1.21 2.07 3.59 5.64 6.95 .0014 .0030 .0068 .0168 .0361 .0509 .0001 .0002 .0007 .0016 .0036 .0049 1.03 1.27 1.62 1.55 1.53 1.38 .0473 .1028 .1828 .2912 .4097 .4511 .0033 .0059 .0094 .0144 .0221 .0249 1.56 1.53 1.35 1.13 .99 1.10 .0105 .0431 .1213 .2718 .5112 .6093 .0064 .0250 .0679 .1407 .2402 .2049 1.05 1.27 1.51 1.51 1.42 .64 .3249 .5235 .7161 .8340 .8846 .5555 .0667 .1504 .1655 .1372 .1715 .0565 Table 5.5: Performance of MP protocol on GP1000 with I = 2. 5.4. 
EFFECT OF INTERRUPT LATENCY Program fib queens Tpar Oexpose .9610 16.9% 1.2140 10.8% .4350 9.8% mm 1.5630 .3% scan 1.1780 10.7% sum .4350 10.7% rantree tridiag 3.9370 -.7% allpairs 24.8150 -.3% abisort 5.1110 7.7% 24.3180 3.6% mst qsort poly .2410 15.3% 2.3110 .9% S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= S= TC= TS= 2 1.70 .0000 .0000 1.78 .0003 .0000 1.75 .0012 .0004 1.21 .0010 .0000 1.35 .0001 .0000 1.37 .0001 .0000 1.21 .0003 .0001 1.10 .0093 .0074 .61 .0001 .0000 .92 .0090 .0034 1.42 .0009 .0003 .97 .1510 .0115 161 Speedup, TC and TS when number of processors is 4 8 16 32 64 90 3.33 6.34 11.56 19.85 29.36 33.15 .0002 .0004 .0009 .0020 .0033 .0043 .0000 .0001 .0003 .0006 .0009 .0012 3.38 6.17 10.29 15.48 20.15 22.28 .0014 .0034 .0079 .0150 .0275 .0340 .0003 .0007 .0018 .0037 .0066 .0083 3.29 5.47 8.41 10.77 12.61 12.86 .0028 .0081 .0162 .0293 .0466 .0537 .0013 .0036 .0064 .0109 .0174 .0195 1.88 3.24 5.62 9.52 13.89 16.04 .0102 .0261 .0538 .0892 .1501 .1862 .0011 .0045 .0086 .0137 .0211 .0260 2.24 3.84 6.29 9.60 12.49 13.33 .0009 .0018 .0034 .0063 .0107 .0132 .0000 .0001 .0002 .0004 .0006 .0006 2.25 3.91 6.38 9.27 11.56 11.77 .0010 .0018 .0034 .0062 .0107 .0139 .0000 .0001 .0002 .0002 .0004 .0005 1.78 2.93 4.89 7.89 11.18 12.80 .0019 .0044 .0085 .0157 .0241 .0288 .0001 .0003 .0007 .0013 .0018 .0020 1.56 2.44 3.61 4.68 5.01 4.59 .1238 .2419 .4133 .6207 .8236 .9005 .0155 .0562 .0926 .1298 .1631 .1574 .75 1.21 2.04 3.42 5.07 5.69 .0011 .0024 .0057 .0127 .0247 .0310 .0000 .0002 .0004 .0007 .0009 .0009 1.09 1.32 1.46 1.37 1.12 .99 .0292 .0530 .0791 .0935 .0922 .0911 .0023 .0042 .0053 .0055 .0055 .0054 1.69 1.63 1.42 1.17 .93 .93 .0088 .0335 .0853 .1830 .2196 .3396 .0052 .0211 .0534 .1078 .1272 .0997 .96 1.05 1.04 .69 .30 .26 .2819 .4460 .5883 .5713 .3588 .4754 .0892 .1845 .1723 .1206 .0722 .0629 Table 5.6: Performance of MP protocol on GP1000 with I = 50. 162 CHAPTER 5. EXPERIMENTS Overall, the speedup curves indicate that the setting of I does not signi cantly a ect performance. For small partitions, the speedup curves for I = 50 are consistently better (but only slightly) than smaller values of I . This is simply due to the slightly lower polling overhead for I = 50. As the number of processors increases and the program's work distribution requirements become more critical, the performance for the lower values of I improves and eventually surpasses the performance for I = 50. The only exception is fib which at 90 processors is still a little faster with I = 50. On large partitions, most programs perform better with a setting of I = 2 but the performance of I = 10 is very close. The di erence in performance between I = 2 and I = 10 at 90 processors is less than 3%, with the exception of allpairs (7%) and mst (10%). It is interesting to note however that good performance is obtained for all settings of I such that L is less than Tsteal task (allpairs and mst with I = 10 are on the border with L equal to 152 and 142 secs respectively). 5.5 Cost of Supporting Legitimacy The previous experiments were performed with a version of the MP protocol that did not contain support for legitimacy. To evaluate the cost of supporting legitimacy, the appropriate operations were added to the task management algorithms (i.e. the creation of the legitimacy placeholder, its installation in the stolen task and end_frame, and the legitimacy propagation and chain collapsing in end-body). 
The programs were run on the GP1000 with increasingly large partitions (up to 16 processors). Two runs were performed: one with and one without a speculation barrier at the end of the program. The run time was measured and compared to the run time of the version lacking legitimacy support. The overhead (the ratio of run times) is shown in Table 5.7. The results clearly show that for all programs based on fork-join algorithms, the cost of supporting legitimacy is negligible. In fact, it can hardly be measured at all (the cost is below the noise level of 2%). The collapsing of the legitimacy chain appears to be working out as expected for fork-join algorithms. Only the programs qsort and poly, which are based on pipeline parallelism, have measurable overheads. The overheads increase with the number of processors, indicating that the legitimacy chain is getting longer and its collapsing is getting more expensive. The highest overhead is 10% for poly at 16 processors when a speculation barrier is present. On 16 processors, the overhead is a little lower (by 2 to 3%) when there is no speculation barrier. 5.6. SUMMARY Program fib queens mm scan rantree sum tridiag allpairs abisort mst poly qsort 163 1 2 Number of Processors 4 8 16 without with without with without with without with without with 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .98 1.00 1.00 1.00 1.01 .99 1.00 1.00 1.02 .99 1.00 1.00 1.00 1.01 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 .99 1.00 1.01 1.00 1.00 1.00 1.00 1.01 1.00 1.01 1.00 1.01 1.00 1.00 1.01 1.00 1.00 1.01 1.01 1.01 .99 1.00 1.00 1.00 1.00 1.00 1.03 1.00 .99 1.00 .98 1.00 .99 1.02 1.01 1.00 1.00 1.06 1.05 .99 1.01 1.00 1.01 1.00 1.01 .99 1.01 1.00 1.00 1.05 1.06 1.01 1.00 1.00 1.01 .98 1.02 1.00 1.00 1.00 1.00 1.03 1.07 1.00 .99 1.00 1.00 .99 1.02 .99 1.00 1.00 1.00 1.05 1.10 Table 5.7: Overhead of supporting legitimacy, with and without speculation barrier on GP1000. 5.6 Summary This chapter has evaluated the performance of the SM and MP protocol implementations of LTC on large shared-memory multiprocessors (up to 90 processors). Experiments were conducted with several benchmark programs on the GP1000 multiprocessor (which lacks a data cache) and the TC2000 (which has a data cache). The results show that  The parallelization cost is low | The overhead of parallelizing a sequential program (by adding futures and touches) is typically less than 20% when using the MP protocol. For the SM protocol, the overhead is twice as large when a cache is not available. However, when a cache is available the overhead is much more important (up to a factor of two on the TC2000) because the SM protocol must cache the stack and LTQ suboptimally.  LTC scales well | Programs with a high degree of parallelism have fairly linear speedup with respect to the sequential version of the program. The SM and MP protocols have almost identical speedup curves when a cache is not available. When a cache is available, the speedup curve for the MP protocol is consistently better due to the di erence in caching policy. However, this di erence gradually 164 CHAPTER 5. EXPERIMENTS decreases as the number of processors increases because the caching policy becomes less important (the caching policy has no in uence on the idle time and remote memory access time which increase with the number of processors).  
Interrupt latency can be relatively high | For the MP protocol, an interrupt latency as high as the time to steal a task provides adequate performance. On a 90 processor GP1000, the run time is usually within 3% of the run time for the best latency.  Supporting the Katz-Weise semantics and legitimacy generally has a negligible impact on performance | There was no noticeable performance di erence between a version of the system that supported the Katz-Weise semantics and one that did not. This indicates that the additional cost of heapi cation is low relatively to the other costs of stealing (in particular, the remote memory references needed to transfer the task between processors). The cost of legitimacy propagation and testing is also very low. The overhead for fork-join programs is too low to measure. However, programs with a less restrictive task termination order exhibit a measurable but small overhead (no more than 10% on 16 processors). Chapter 6 Conclusion The initial goal of this work was the implementation of a high-performance Multilisp system. Earlier implementations of Multilisp, such as Concert Multilisp [Halstead, 1984] and MultiScheme [Miller, 1987], gave interesting self relative speedups but because they were based on interpreters it was not clear that the same speedups would apply to a \production quality" system. As a rst step of this work, a highly optimizing compiler for Scheme was developed to provide a realistic setting for exploring new implementation strategies for Multilisp and evaluating their performance. This e ort resulted in Gambit [Feeley and Miller, 1990], currently the best Scheme compiler in terms of performance of the code generated. The system was ported to the GP1000 and TC2000 multiprocessors, and support for Multilisp's parallelism constructs added to the compiler. Initially the eager task creation (ETC) method was used to implement futures but it was soon clear that the overhead of task creation would be too high for ne grain programs (as explained in Chapter 2). Work on the lazy task creation (LTC) mechanism was triggered by a comment on \lazy futures" in [Kranz et al., 1989]. LTC postpones the creation of a task until it needs to be transferred to another processor, the \thief". Consequently, the overhead of task creation is mostly dependent on the work distribution needs of the program and not so much the program's granularity. For divide-and-conquer programs, LTC has the nice property of transferring large pieces of work and roughly balancing the work between the thief and victim processors. This helps reduce the number of task transfers needed to keep processors busy. Most tasks end up being executed locally at low cost. Eric Mohr independently explored the LTC mechanism with the Mul-T system on the Encore Multimax multiprocessor (a UMA computer with up to 20 processors) and 165 166 CHAPTER 6. CONCLUSION ended up using a version of the shared-memory (SM) protocol very similar to the one used here [Mohr, 1991]. In the SM protocol, thief processors directly access the stack of other processors to \steal" tasks. This thesis extends his results in several ways:  Experience on large machines | Experiments on a 90 processor GP1000 with a wide range of benchmarks provide concrete evidence that LTC scales well to large machines and that good speedup is possible for realistic programs.  Support of a rich semantics | The semantics of the Multilisp language does not have to be impoverished to attain good performance. 
In fact, the laziness of LTC can be exploited to implement several programming features at low cost. These include { The Katz-Weise continuation semantics with legitimacy; which provides an elegant semantics for rst-class continuations. { Dynamic scoping. { Fairness.  Better implementation of the SM protocol | A slightly faster implemen- tation of the SM protocol was developed. It requires fewer instructions, fewer memory references, and is simpler to prove correct.  The message-passing (MP) protocol | The main problem with the SM pro- tocol is that all processors must have access to the runtime stack. On machines lacking coherent-caches, such as the TC2000, the stack can only be cached in writethrough mode instead of the more ecient copy-back mode. This a ects the speed of computation in general (parallel and sequential parts of the programs su er). A study of several benchmarks in Chapter 3 shows that the stack is one of the most frequently accessed data structures and that the di erence in caching policy can account for an important di erence in performance (as high as a factor of two on the TC2000). In the MP protocol the stack is a private data structure that can be cached optimally. To obtain a task to run, a thief processor sends a work request message to the \victim" processor. When the request is serviced, the victim accesses its own stack to remove a lazy task and packages it in a heavyweight task that is sent back to the thief. This approach would appear to depend on a low latency interrupt mechanism, such as polling, but in fact the experiments indicate that performance is close to optimal when the interrupt latency is comparable to the time required to perform the task steal. 6.1. FUTURE WORK 167 6.1 Future Work The results of this thesis suggest that task partitioning can be done eciently on machines that lack an ecient shared memory. Coherent-caches are not really required, as shown by the MP protocol implementation of LTC. There is thus hope that, at least for some problems, Multilisp can run eciently on distributed-memory machines. A machine like the Thinking Machine's CM-5, which lacks a shared memory but provides a fast message-passing system, would be an ideal candidate. One of the shortcomings of LTC as implemented here is that it does not address the data partitioning problem. The scheduling algorithm makes no attempt to run a task on (or close to) the processor that contains the data it accesses. As shown in Chapter 3, a substantial performance loss is attributable to the remote memory accesses to user data (up to a factor of 5 on the GP1000 and a factor of 3 on the TC2000). Coherentcaches may help reduce this problem on shared-memory machines but the penalty on distributed-memory machines will be much higher. Another problem is the overhead of touching. Contrary to Multilisp's original speci cation, this work has assumed that touches are inserted explicitly by the user. This is hard to do for programs with complex data dependencies. It would be more convenient for the user if touches were inserted automatically by the compiler. Adding a touch on each strict operation is a poor solution because it causes a high overhead. On the GP1000 the overhead is roughly a factor of 2 on typical programs (but a lower overhead may be possible on modern processors which are optimized for register operations). A better solution would be for the compiler to do a dataow analysis of the program to identify all the strict operations that might be passed a placeholder. 
Control- ow and data- ow analysis techniques such as [Shivers, 1988, Shivers, 1991] would be a good starting point. 168 CHAPTER 6. CONCLUSION Appendix A Source Code for Parallel Benchmarks This appendix contains the source code for the parallel benchmark programs used in chapters 2, chapters 3, chapters 4, and 5. A general description of these programs is given in section 2.9. Half of the programs were originally written in Mul-T by Eric Mohr as part of his PhD thesis work [Mohr, 1991]. These programs were translated to Scheme with super cial changes to suit Gambit's particular features. These changes include  Macro de nitions (Gambit uses the non-standard construct ##define-macro).  The de nition of record structures (Gambit does not have a prede ned construct for de ning structures; plain vectors were used instead).  The performance of abisort, allpairs and mst was improved by partial evaluating the programs by hand. The algorithms are the same but some of the procedure abstractions were removed by replacing procedure de nitions by macro de nitions.  The programs abisort, rantree and tridiag originally had a few uses of a non- standard construct to return multiple values. Since Gambit does not have such a feature, the multiple returns were reformulated in standard Scheme. This only a ects rantree's performance because the two other programs used multiple value returns exclusively in the initialization phase (which is not measured).  Tridiag, which solves a set of equations, uses only half as many equations (i.e. 32767). This data set just barely ts in the memory available on a single processor node of the GP1000. About 2 Mbytes per processor (out of a total of 4 Mbytes) are 169 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS 170 available for the heap after Gambit has started. This makes it possible to evaluate the program in a uniprocessor con guration (which is useful to generate speedup curves). All other programs were run with the same data set size in order to make direct comparisons easier. The new programs fall into two main classes. The programs mm (matrix multiplication), scan (parallel pre x operation on a vector), and sum (parallel reduction operation on a vector) are based on divide and conquer algorithms. The program poly (polynomial multiplication) implements a form of pipeline parallelism and qsort (quicksort) is a combination of pipeline and divide and conquer parallelism. The programs were modi ed in certain places to address shared-memory problems. To lessen contention to shared data in vectors, the non-standard procedures make-cvector and cvector-ref were used instead of the corresponding standard vector operations. A cvector is a vector with immutable elements (i.e. a \constant vector"). When a cvector is created, it is copied to the local memory of each processor. Access to a cvector is thus both contention free and fast (as fast as a local memory reference). However, access to the elements of a cvector may still exhibit some contention and remote memory reference latency if the elements are memory allocated structures (as is the case in tridiag, the only program that uses cvectors). When the shared data was in mutable vectors (i.e. the programs allpairs, mm, and sum), the non-standard procedures make-dvector, dvector-ref and dvector-set! were used instead of the corresponding standard vector operations. A dvector is a vector whose entries are evenly allocated across the machine (i.e. a \distributed vector"). 
If entry i is in the local memory of processor j , then entry i + 1 is on processor j + 1 (modulo the number of processors). On an n processor machine, a reference to the vector will correspond to a local memory reference with probability n1 and to a remote reference with probability n n 1 . This means that the average cost of an access to a dvector increases with the number of processors, quickly approaching the cost of a remote reference. Dvectors have good contention characteristics because during a given cycle there can be as many accesses to dvectors as there are processors. The average number of contention free accesses will be lower, but this is more of an academic question since in general, processors do not all access memory at the same moment. mst, scan Record structures were similarly distributed where possible (i.e. the programs abisort, mst and tridiag). This was done with a call to the procedure make-vector-chain which builds a chain of xed size vectors that are evenly distributed across the machine. 171 The creation of all these special data structures happens once and for all in the initialization phase of the programs. Thus, it doesn't contribute to the measurements. Memory allocation in the main part of the program only occurs for qsort and poly and is done with the standard cons procedure. This means that space is allocated in the local memory of the processor doing the allocation. The programs were all compiled with special declarations meant to improve performance. All references to prede ned variables, such as cons and car, were assumed to be to the corresponding primitive procedure. This essentially means that inline code was generated for calls to simple prede ned procedures. All arithmetic operations were assumed to be on small integers ( xnums), except for the program poly which uses generic arithmetic. In the code that follows, FUTURE and TOUCH have been underlined to make them stand out. The last line of each program is a call to the macro benchmark, which starts the run. The subforms passed to benchmark are in order: the name of the program, the expression used to initialize the input data and the expression that starts the part of the program being measured. A brief description is included with each program. 172 A.1 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS abisort This program sorts 16384 integers using the adaptive bitonic sort algorithm described in [Bilardi and Nicolau, 1989]. (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (make-node) `(make-vector 3 #f)) (node-left x) `(vector-ref ,x 0)) (node-value x) `(vector-ref ,x 1)) (node-right x) `(vector-ref ,x 2)) (node-left-set! x v) `(vector-set! ,x 0 ,v)) (node-value-set! x v) `(vector-set! ,x 1 ,v)) (node-right-set! x v) `(vector-set! ,x 2 ,v)) (##define-macro (swap-left l r) `(let ((temp (node-left ,l))) (node-left-set! ,l (node-left ,r)) (node-left-set! ,r temp))) (##define-macro (swap-right l r) `(let ((temp (node-right ,l))) (node-right-set! ,l (node-right ,r)) (node-right-set! ,r temp))) (##define-macro (fixup-tree-1 root up?) `(let loop ((pl (node-left ,root)) (pr (node-right ,root))) (if pl (compare-and-swap pl pr ,up? ;swap right subtrees, search path goes left (begin (swap-right pl pr) (loop (node-left pl) (node-left pr))) ;search path goes right (loop (node-right pl) (node-right pr)))))) (##define-macro (fixup-tree-2 root up?) `(let loop ((pl (node-left ,root)) (pr (node-right ,root))) (if pl (compare-and-swap pl pr ,up? 
;swap left subtrees, search path goes right (begin (swap-left pl pr) (loop (node-right pl) (node-right pr))) ;search path goes left (loop (node-left pl) (node-left pr)))))) A.1. ABISORT (##define-macro (compare-and-swap node1 node2 up? true false) `(let ((v1 (node-value ,node1)) (v2 (node-value ,node2))) (cond ((,(if up? `>= `<) v1 v2) (node-value-set! ,node1 v2) (node-value-set! ,node2 v1) ,true) (else ,false)))) (##define-macro (pbimerge root spare up?) `(let loop ((root ,root) (spare ,spare)) (compare-and-swap root spare ,up? (fixup-tree-1 root ,up?) (fixup-tree-2 root ,up?)) (cond ((node-left root) (let ((left-half (FUTURE (loop (node-left root) root)))) (loop (node-right root) spare) (TOUCH left-half)))))) (define (pbisort-up root spare) (let ((left (node-left root))) (if left (let ((left-half (FUTURE (pbisort-up left root)))) (pbisort-down (node-right root) spare) (TOUCH left-half) (pbimerge root spare #t))) (compare-and-swap root spare #t #t #f))) (define (pbisort-down root spare) (let ((left (node-left root))) (if left (let ((left-half (FUTURE (pbisort-down left root)))) (pbisort-up (node-right root) spare) (TOUCH left-half) (pbimerge root spare #f))) (compare-and-swap root spare #f #t #f))) (define (new-node l r v) (let ((node (make-node*))) (node-left-set! node l) (node-right-set! node r) (node-value-set! node v) node)) 173 174 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS (define node-chain #f) (define (init-node-chain n) ; make a chain of 3 element vects (set! node-chain (make-vector-chain n 3))) (define (make-node*) (let ((node node-chain)) (set! node-chain (vector-ref node 0)) node)) (define (make-inorder-tree depth) (let loop ((i 0) (depth depth)) (if (= depth 1) (cons (new-node #f #f i) i) (let* ((x (loop i (- depth 1))) (l-tree (car x)) (l-imax (cdr x))) (let* ((y (loop (+ l-imax 2) (- depth 1))) (r-tree (car y)) (r-imax (cdr y))) (cons (new-node l-tree r-tree (+ l-imax 1)) r-imax)))))) (define r #f) (define s #f) (define k 14) (define (init) (init-node-chain (expt 2 k)) (let* ((x (make-inorder-tree k)) (root (car x)) (imax (cdr x))) (let ((spare (new-node #f #f (+ imax 1)))) (set! r root) (set! s spare)))) (benchmark ABISORT (init) (pbisort-up r s)) A.2. A.2 ALLPAIRS 175 allpairs This program computes the shortest paths between all pairs of 117 nodes using a parallel version of Floyd's algorithm. (##define-macro (do-all var lo hi . body) `(let loop ((,var ,lo) (hi ,hi)) (if (= ,var hi) (let () ,@body) (let* ((mid (quotient (+ ,var hi) 2)) (lo-half (FUTURE (loop ,var mid)))) (loop (+ mid 1) hi) (TOUCH lo-half))))) (define (apsp/par a n) (let ((n-1 (- n 1))) (do ((k 0 (+ k 1))) ((= k n)) (let ((k*n (* k n))) (do-all i 0 n-1 (let* ((i*n (* i n)) (i*n+k (+ i*n k))) (do ((j 0 (+ j 1))) ((= j n)) (let* ((kpath (+ (dvector-ref a i*n+k) (dvector-ref a (+ k*n j)))) (i*n+j (+ i*n j))) (if (< kpath (dvector-ref a i*n+j)) (dvector-set! a i*n+j kpath)))))))))) (define (make-linear-adjacency-matrix n) (let ((a (make-dvector (* n n) (quotient most-positive-fixnum 2)))) (dvector-set! a 0 0) (do ((i 1 (+ i 1))) ((= i n)) (dvector-set! a (+ (* i n) i) 0) (dvector-set! a (+ (* (- i 1) n) i) 1) (dvector-set! a (+ (* i n) (- i 1)) 1)) a)) (define a #f) (define n 117) (define (init) (set! a (make-linear-adjacency-matrix n))) (benchmark ALLPAIRS (init) (apsp/par a n)) APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS 176 A.3 fib This program computes F25, the 25th bonacci number, using the \standard" doubly recursive algorithm. 
(define (pfib n) (let fib ((n n)) (if (< n 2) n (let* ((f1 (FUTURE (fib (- n 1)))) (f2 (fib (- n 2)))) (+ (TOUCH f1) f2))))) (benchmark FIB #f (pfib 25)) A.4. 177 MM A.4 mm This program multiplies two matrices of integers (50 by 50). (define (mm m1 m2 m3) ; m1 * m2 -> m3 (define (compute-entry row col) ; loop to compute inner product (let loop ((i (+ row (- n 1))) (j (+ (* n (- n 1)) col)) (sum 0)) (if (>= j 0) (loop (- i 1) (- j n) (+ sum (* (dvector-ref m1 i) (dvector-ref m2 j)))) (dvector-set! m3 (+ (+ i 1) col) sum)))) (define (compute-cols-between row i j) ; DAC over columns (if (= i j) (compute-entry row i) (let ((mid (quotient (+ i j) 2))) (let* ((half1 (FUTURE (compute-cols-between row i mid))) (half2 (compute-cols-between row (+ mid 1) j))) (TOUCH half1))))) (define (compute-rows-between i j) ; DAC over rows (if (= i j) (compute-cols-between (* i n) 0 (- n 1)) (let ((mid (quotient (+ i j) 2))) (let* ((half1 (FUTURE (compute-rows-between i mid))) (half2 (compute-rows-between (+ mid 1) j))) (TOUCH half1))))) (compute-rows-between 0 (- n 1))) (define m1 #f) (define m2 #f) (define m3 #f) (define n 50) (define (set! (set! (set! (init) m1 (make-dvector (* n n) 2)) m2 (make-dvector (* n n) 2)) m3 (make-dvector (* n n) #f))) (benchmark MM (init) (mm m1 m2 m3)) APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS 178 A.5 mst This program computes the minimum spanning tree of a 1000 node graph. A parallel version of Prim's algorithm is used. (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (make-city) `(make-vector 4 #f)) (city-x x) `(vector-ref ,x 0)) (city-y x) `(vector-ref ,x 1)) (city-closest x) `(vector-ref ,x 2)) (city-distance x) `(vector-ref ,x 3)) (city-x-set! x v) `(vector-set! (city-y-set! x v) `(vector-set! (city-closest-set! x v) `(vector-set! (city-distance-set! x v) `(vector-set! ,x ,x ,x ,x 0 1 2 3 ,v)) ,v)) ,v)) ,v)) (define (new-city x y closest distance) (let ((city (make-city*))) (city-x-set! city x) (city-y-set! city y) (city-closest-set! city closest) (city-distance-set! city distance) city)) (define (prim cities ncities find-closest-city) (let* ((max-i (- ncities 1)) (target0 (dvector-ref cities max-i))) (city-closest-set! target0 target0) ;; makes drawing easier (let loop ((max-i (- max-i 1)) (target target0)) (if (= max-i 0) (add-last-city (dvector-ref cities 0) target) (let* ((closest-i (find-closest-city cities max-i target)) (newcity (dvector-ref cities closest-i))) (dvector-set! cities closest-i (dvector-ref cities max-i)) (dvector-set! cities max-i newcity) (loop (- max-i 1) newcity)))))) (define (add-last-city city newcity) (let* ((newdist (distance city newcity)) (olddist (city-distance city))) (cond ((< newdist olddist) (city-distance-set! city newdist) (city-closest-set! city newcity))))) A.5. MST (define (distance c1 c2) (let ((dx (- (city-x c1) (city-x c2))) (dy (- (city-y c1) (city-y c2)))) (+ (* dx dx) (* dy dy)))) (##define-macro (combine-interval/ptree lo hi f combine) `(let ((lo ,lo) (hi ,hi)) (let* ((n (+ (- hi lo) 1)) (adjust (- lo 1)) (first-leaf (quotient (+ n 1) 2)) (treeval (let loop ((i 1)) (cond ((< i first-leaf) (let* ((left (FUTURE (loop (* i 2)))) (right (,combine (loop (+ (* i 2) 1)) (,f (+ i adjust))))) (,combine right (TOUCH left)))) (else (,f (+ i adjust))))))) (if (even? 
n) (,combine treeval (,f hi)) treeval)))) (define (find-closest-city/ptree cities max-i newcity) (combine-interval/ptree 0 max-i (lambda (i) (update-city i cities newcity)) (lambda (i1 i2) (if (< (city-distance (dvector-ref cities i1)) (city-distance (dvector-ref cities i2))) i1 i2)))) (define (update-city i cities newcity) (let* ((city (dvector-ref cities i)) (newdist (distance city newcity)) (olddist (city-distance city))) (cond ((< newdist olddist) (city-distance-set! city newdist) (city-closest-set! city newcity))) i)) 179 180 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS (define city-chain #f) (define (init-city-chain n) ; make a chain of 4 element vects (set! city-chain (make-vector-chain n 4))) (define (make-city*) (let ((city city-chain)) (set! city-chain (vector-ref city 0)) city)) (define random (make-random 3434534)) (define random-range 1000) (define (make-random-vector-of-cities n) (let ((cities (make-dvector n))) (do ((i 0 (+ i 1))) ((>= i n) cities) (dvector-set! cities i (new-city (modulo (random) random-range) (modulo (random) random-range) '() most-positive-fixnum ))) cities)) (define c #f) (define n 1000) (define (init) (init-city-chain n) (set! c (make-random-vector-of-cities n))) (benchmark MST (init) (prim c n find-closest-city/ptree)) A.6. 181 POLY A.6 poly This program computes the square of a 200 term polynomial of x (with integer coecients) and evaluates the resulting polynomial for a certain value of x. (##declare (generic)) ; use generic arithmetic (define (poly* p1 p2) ; compute p1*p2 (if (or (null? p1) (null? p2)) '() (poly+*k (cons 0 (poly* p1 (cdr p2))) p1 (car p2)))) (define (poly+*k p1 p2 k) (if (null? p2) p1 (if (null? p1) (let ((rest (FUTURE (cons (* (car p2) (let ((rest (FUTURE (cons (+ (car p1) ; compute p1+p2*k (poly+*k '() (cdr p2) k)))) k) rest)) (poly+*k (TOUCH (cdr p1)) (cdr p2) k)))) (* (car p2) k)) rest))))) (define (poly-eval p x) ; compute value of p at x (let loop ((p p) (y 1) (sum 0)) (if (pair? p) (loop (TOUCH (cdr p)) (* x y) (+ sum (* (car p) y))) sum))) (define '(1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 p 1 1 1 1 1 1 1 1 ; 1 1 1 1 1 1 1 1 200 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 terms 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (benchmark POLY #f (poly-eval (poly* p p) 1)) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)) 182 A.7 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS qsort This program sorts a list of 1000 integers using a parallel version of the Quicksort algorithm. (define (qsort lst) (##define-macro (filter keep? lst) `(let loop ((lst ,lst)) (let ((lst (TOUCH lst))) (if (pair? lst) (let ((head (car lst))) (if (,keep? head) (cons head (FUTURE (loop (cdr lst)))) (loop (cdr lst)))) '())))) (define (qs lst tail) (if (pair? lst) (let ((pivot (car lst)) (other (cdr lst))) (let ((sorted-larger (FUTURE (qs (filter (lambda (x) (not (< x pivot))) other) tail)))) (qs (filter (lambda (x) (< x pivot)) other) (cons pivot sorted-larger)))) tail)) (qs lst '())) (define (walk lst) (let loop ((lst lst)) (let ((lst (TOUCH lst))) (if (pair? lst) (loop (cdr lst))))) lst) (define l ; randomized `(34 313 852 803 941 928 472 247 808 88 ... 
361 762 53 664 892 791 68 293 827 529 list of numbers 0 to 999 931 63 581 309 569 62 561 602 572 353 253 815 869 698 315 152 58 465 881 888 652 312 47 69 279 418 768 778 685 190 52 665 289 558 188 455 408 381 805 301 825 357 202 365 955 746 449 927 823)) (benchmark QSORT #f (walk (qsort l))) A.8. A.8 QUEENS 183 queens This program computes the number of solutions to the n-queens problem, with n = 10. (define (queens n) (let try ((rows-left n) (free-diag1 -1) ;all bits set (free-diag2 -1) (free-cols (- (ash 1 n) 1))) ;bits 0 to n-1 set (let ((free (logand free-cols (logand free-diag1 free-diag2)))) (let loop ((col 1)) (cond ((> col free) 0) ((= (logand col free) 0) (loop (* col 2))) ((= rows-left 1) (+ 1 (loop (* col 2)))) (else (let* ((sub-solns (FUTURE (try (- rows-left 1) (+ (ash (- free-diag1 col) 1) 1) (ash (- free-diag2 col) -1) (- free-cols col)))) (other-solns (loop (* col 2)))) (+ (TOUCH sub-solns) other-solns)))))))) (benchmark QUEENS #f (queens 10)) 184 A.9 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS rantree This program models the traversal of a random binary tree with on the order of 32768 nodes. The branching factor is 50%. (define (lehmer-left seed) (+ 1 (* seed #xface475))) (define (lehmer-right seed) (+ 1 (* seed #x283feed))) (define (pseudo-random-tree n) (let loop ((n n) (seed 1)) (cond ((<= n 2) n) ((> seed 0) (let* ((ln (+ 1 (modulo seed (- n 2)))) (rn (- (- n 1) ln)) (left (FUTURE (loop ln (lehmer-left seed)))) (right (loop rn (lehmer-right seed)))) (+ (TOUCH left) (+ right 1)))) (else (+ 1 (loop (- n 1) (lehmer-left seed))))))) (benchmark RANTREE #f (pseudo-random-tree 32768)) A.10. A.10 SCAN 185 scan This program computes the parallel pre x sum of a vector of 32768 integers. The vector is modi ed in place. A given element is replaced by the sum of itself and all preceding elements. (##define-macro (scan f c v) `(let ((c ,c) (v ,v)) (let ((n (dvector-length v))) (define (pass1 i j) (if (< i j) (let* ((m (quotient (+ i j) 2)) (left (FUTURE (pass1 i m))) (right (pass1 (+ m 1) j)) (result (,f (TOUCH left) right))) (dvector-set! v j result) result) (dvector-ref v j))) (define (pass2 i j c) (if (< i j) (let* ((m (quotient (+ i j) 2)) (left (FUTURE (pass2 i m c))) (cc (,f c (dvector-ref v m))) (right (pass2 (+ m 1) j cc))) (dvector-set! v m cc) (TOUCH left)))) (if (> n 0) (let ((j (- n 1))) (pass1 0 j) (pass2 0 j c) (dvector-set! v j (,f c (dvector-ref v j)))))))) (define (scan+ c v) (scan + c v)) (define v #f) (define n 32768) (define (init) (set! v (make-dvector n 0))) (benchmark SCAN (init) (scan+ 0 v)) APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS 186 A.11 sum This program computes the sum of a vector of 32768 integers. (define (sum vect l h) ; sum (if (= l h) (dvector-ref vect l) (let* ((mid (quotient (+ (lo (FUTURE (sum (hi (sum vect (+ (+ (TOUCH lo) hi)))) vector from 'l' to 'h' l h) 2)) vect l mid))) mid 1) h))) (define v #f) (define n 32768) (define (init) (set! v (make-dvector n 1))) (benchmark SUM (init) (sum v 0 (- n 1))) A.12. A.12 187 TRIDIAG tridiag This program solves a tridiagonal system of 32767 equations. (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (##define-macro (a obj) (b obj) (c obj) (y obj) (x obj) (a-set! (b-set! (c-set! (y-set! (x-set! `(vector-ref ,obj 0)) `(vector-ref ,obj 1)) `(vector-ref ,obj 2)) `(vector-ref ,obj 3)) `(vector-ref ,obj 4)) obj v) `(vector-set! ,obj obj v) `(vector-set! ,obj obj v) `(vector-set! ,obj obj v) `(vector-set! 
,obj obj v) `(vector-set! ,obj 0 1 2 3 4 ,v)) ,v)) ,v)) ,v)) ,v)) (define (reduce/par equ imid) (define (reduce-equation i delta) (let* ((equ-ileft (cvector-ref equ (- i delta))) (equ-iright (cvector-ref equ (+ i delta))) (equ-i (cvector-ref equ i)) (e (- (quotient (a equ-i) (b equ-ileft)))) (f (- (quotient (c equ-i) (b equ-iright))))) (a-set! equ-i (* e (a equ-ileft))) (c-set! equ-i (* f (c equ-iright))) (b-set! equ-i (+ (b equ-i) (+ (* e (c equ-ileft)) (* f (a equ-iright))))) (y-set! equ-i (+ (y equ-i) (+ (* e (y equ-ileft)) (* f (y equ-iright))))))) (let do-branch ((i imid) (delta (quotient imid 2))) (if (= delta 1) (reduce-equation i delta) (let* ((ileft (- i delta)) (iright (+ i delta)) (l (FUTURE (do-branch ileft (quotient delta 2))))) (do-branch iright (quotient delta 2)) (TOUCH l) (do ((d 1 (* d 2))) ((> d delta)) (reduce-equation i d)))))) 188 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS (define (backsolve/par equ imid) (let loop ((i imid) (delta imid)) (let ((equ-i (cvector-ref equ i))) (x-set! equ-i (quotient (- (y equ-i) (+ (* (a equ-i) (x (cvector-ref equ (- i delta)))) (* (c equ-i) (x (cvector-ref equ (+ i delta)))))) (b equ-i))) (if (> delta 1) (let* ((new-delta (quotient delta 2)) (l (FUTURE (loop (- i new-delta) new-delta)))) (loop (+ i new-delta) new-delta) (TOUCH l)))))) (define abcyx-chain #f) (define (init-abcyx-chain n) ; make a chain of 5 element vects (set! abcyx-chain (make-vector-chain n 5))) (define (make-abcyx*) (let ((node abcyx-chain)) (set! abcyx-chain (vector-ref node 0)) node)) A.12. TRIDIAG (define n #f) (define imid #f) (define equ #f) (define k 15) (define (init1) (let ((n+1 (expt 2 k))) (set! n (- n+1 1)) (set! imid (quotient n+1 2)) (init-abcyx-chain (+ n 2)) (set! equ (make-cvector (+ n 2) make-abcyx*)))) (define (init2) (do ((i (+ n 1) (- i 1))) ((< i 0)) (let ((equ-i (cvector-ref equ i))) (a-set! equ-i 1) (b-set! equ-i 1) (c-set! equ-i 1) (y-set! equ-i 3) (x-set! equ-i 0))) (let ((equ-1 (cvector-ref equ 1))) (a-set! equ-1 0) (b-set! equ-1 1) (c-set! equ-1 1) (y-set! equ-1 2)) (let ((equ-n (cvector-ref equ n))) (a-set! equ-n 1) (b-set! equ-n 1) (c-set! equ-n 0) (y-set! equ-n 2))) (define (run) (reduce/par equ imid) (backsolve/par equ imid)) (benchmark TRIDIAG (begin (init1) (init2)) (run)) 189 190 APPENDIX A. SOURCE CODE FOR PARALLEL BENCHMARKS Appendix B Execution Pro les for Parallel Benchmarks This Appendix contains \execution pro les" for each of the parallel benchmarks of Appendix A. An execution pro le is a plot representing the activity of the processors as a function of time. Pro les are useful to visualize the behavior of parallel programs. They are also an invaluable tool to detect performance related problems with algorithms and the language implementation. To generate the pro les, the programs were compiled with the default polling settings with an intermittency factor of 10. The message-passing protocol supporting the KatzWeise continuation semantics and legitimacy was used but fairness was disabled. The programs were run on the GP1000 with 64 processors. Processors can be in one of six distinctive states in the message-passing protocol 1. Interrupt | The processor is servicing a steal request. This state accounts for heapifying the parent continuation, creating the task, the result and legitimacy placeholders, and responding to the thief. 2. Working | The processor is running the main body of the program (i.e. \user code"). 
This accounts not only for all the work that is strictly required by a sequential version of a program, but also includes the following extra work needed to support parallelism: pushing and popping lazy tasks, checking for placeholders (as part of TOUCH), waiting for references to remote memory and restoring continuations1 . 1 Measuring all these cases independently would be useful; unfortunately, it is impossible to do in an 191 192 APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 3. Idle | The processor is looking for work but hasn't yet found an available task in a work queue or a victim processor to interrupt. 4. Touching an undetermined placeholder | An undetermined placeholder was touched. This state indicates the suspension of a task. 5. Determine | A placeholder is being determined prior to the termination of a task. 6. Stealing | The processor has found a victim processor, sent a steal request and is waiting for a response. The cost of restarting the task is also included except for restoring the task's continuation. Only certain transitions between these states are possible, as de ned by the following diagram 1 stealing PP   q P idle P iP   P )  }  determine 6Z = ? Z touch undet  interrupt working 6 ? interrupt Note that it is possible to go directly from the idle state to the working state. This happens when a task is taken from a processor's HTQ. Also, note that interrupts can only be serviced in the idle state and in the working state. For the pro les to be signi cant, it is important to minimize the impact of monitoring on the behavior of the system. The pro les were obtained by having each processor log an event in a table in local memory whenever there was a state transition. The extra code needed to do this is con ned to the runtime system, user code is not changed in any way. Each event indicates the state being entered and the current time taken from a real time clock with a 62.5 secs resolution. These tables were then dumped to disk for later processing by the analysis program generating the pro les. The cost of logging an event in this way is about 6 secs. This is relatively small compared to the typical duration of states (usually much more that 100 secs). A pro le is divided into three sections. The top part displays the instantaneous activity of the machine. That is, what proportion of all the processors are in each state as a function of time (time is always expressed in milliseconds). Below this is the unintrusive way. This is why all these di erent cases were grouped together in one state. Time spent in the \working" state can only serve as an approximation of the work required by a sequential version of the program. 193 global activity chart. It indicates what percentage of the run time is spent in each of the states (in other words it gives the area covered by each state in the instantaneous activity chart). The bottom section consists of state duration histograms for every state. Each histogram indicates the distribution of state durations and also the average duration2 . Note that each state is represented by a di erent shade of gray. To help distinguish the shades, the states are always in the same order; from bottom to top in the instantaneous activity chart and from left to right in the global activity chart. For each benchmark two pro les are given. The rst is for the complete run and the second is a close-up of the beginning of the run. 2 The states. 
time spent servicing interrupts is ignored to compute the duration of the working and idle APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 194 B.1 abisort File: "abisort-mp.elog" Processors: 64 100 80 60 % 40 20 0 100 0 0 200 10 300 20 77% 30 40 48% 1 0 500 50 56% 100 200 300 0 .168 400 interrupt 10 20 700 70 .5 .0 idle msec 90 100 % 9% 1 0 .315 800 80 60% .958 working File: "abisort-mp.elog" 60 50% 0 12.603 600 touch_undet 1 0 .159 2 3 msec .523 determine stealing Processors: 64 100 80 60 % 40 20 0 1 2 0 10 20 32% 3 4 30 40 6% .0 .5 .119 interrupt 5 6 7 8 9 50 60 70 80 90 5% 0 5 33% 0 5 4.402 working 4.500 idle 0 10 msec 100 % 10% .0 .5 .260 determine 0 1 .640 stealing 2 msec B.2. 195 ALLPAIRS B.2 allpairs File: "allpairs-mp.elog" Processors: 64 100 80 60 % 40 20 0 500 0 0 10 1000 20 60% 30 19% 0 1 2 3 1500 40 50 17% 0 .206 10 20 0 working File: "allpairs-mp.elog" 10 20 70 1 2 3 4 idle 90 msec 100 % 16% 0 .369 3000 80 62% 0 5.746 2500 60 37% 7.311 interrupt 2000 1 2 3 touch_undet 5 0 .168 msec .518 determine stealing Processors: 64 100 80 60 % 40 20 5 0 0 10 10 40% 15 20 .5 .165 interrupt 25 30 10% .0 20 30 40 50 15% 0 10 7.619 working 35 40 60 70 34% 0 10 6.481 idle 20 45 50 80 58% 0 1 .391 touch_undet 0 60 msec 55 90 100 % 13% 0 1 2 .190 0 1 2 3 4 .530 determine stealing msec APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 196 B.3 fib File: "fib-mp.elog" Processors: 64 100 80 60 % 40 20 1 0 2 0 3 4 5 6 10 7 8 20 63% 9 1 0 11 30 30% 13 40 10 interrupt 14 15 16 50 17 18 19 60 5 21 22 23 .5 .0 idle 25 26 27 90 0 28 msec 100 % 50% 1 0 .251 24 80 55% .912 working 20 70 59% 0 2.053 File: "fib-mp.elog" 12 56% 0 .147 10 touch_undet 1 0 .138 msec .205 determine stealing Processors: 64 100 80 60 % 40 20 .5 .0 0 10 1.0 20 1.5 30 35% 40 50 11% .0 .5 .107 interrupt 2.0 60 70 11% 0 1 2 80 90 100 % 68% 0 1 1.066 working 0 3.0 msec 2.5 2 1.766 idle .0 .5 .178 stealing msec B.4. 197 MM B.4 mm File: "mm-mp.elog" Processors: 64 100 80 60 % 40 20 0 0 10 0 10 50% 20 30 20 30 1 0 50 0 interrupt 60 50 64% 70 60 10 1 0 90 msec 100 % 14% .5 .0 .344 idle 90 80 66% 1.776 working 80 70 46% 0 12.403 File: "mm-mp.elog" 50 40 23% .173 40 touch_undet 1 0 .153 msec .414 determine stealing Processors: 64 100 80 60 % 40 20 .0 .5 1.0 0 10 20 29% 1.5 2.0 2.5 3.0 3.5 30 40 50 60 70 7% .0 .5 .127 interrupt 7% 0 1 2 3 4 5 6 4.5 80 90 0 5.0 msec 100 % 15% 0 1 1.546 working 4.0 2 3 3.011 idle 4 .0 .5 .445 stealing 1.0 msec APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 198 B.5 mst File: "mst-mp.elog" Processors: 64 100 80 60 % 40 20 0 1000 0 0 2000 3000 10 4000 5000 20 80% 30 8% 5 0 40 10 interrupt 8000 9000 50 10000 100 200 working 12000 70 13000 5 .319 idle 14000 80 75% 0 .354 11000 60 64% 0 .913 File: "mst-mp.elog" 7000 54% 0 .204 6000 15000 90 msec 100 % 20% 10 0 .165 touch_undet 10 0 msec .618 determine stealing Processors: 64 100 80 60 % 40 20 0 5 0 0 10 10 52% 15 20 30 7% 0 1 .173 40 0 10 working 25 50 23% 3.356 interrupt 20 30 60 70 43% 0 10 3.498 idle 35 80 60% .0 .5 .335 touch_undet 40 90 msec 100 % 16% .0 .5 .154 determine 0 msec .831 stealing B.6. 
199 POLY B.6 poly File: "poly-mp.elog" Processors: 64 100 80 60 % 40 20 0 100 0 200 0 300 400 10 500 20 62% 600 30 95% 0 1 2 3 4 800 900 40 100 .559 interrupt 1100 1200 0 100 200 1400 70 1500 5 idle 1700 1800 90 msec 100 % 60% 10 0 .543 1600 80 91% 0 2.518 1300 60 28% working File: "poly-mp.elog" 1000 50 90% 0 .181 700 20 50 0 .242 msec .790 touch_undet determine stealing Processors: 64 100 80 60 % 40 20 0 10 0 10 24% 20 30 20 30 69% 0 1 .163 50 60 40 50 60 20% 0 50 1.936 interrupt 40 25% 0 100 28.542 working idle 70 80 70 80 34% 0 5 2.437 touch_undet 90 0 100 msec 90 100 % 33% 0 1 2 3 4 .510 0 5 .619 determine stealing msec APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 200 B.7 qsort File: "qsort-mp.elog" Processors: 64 100 80 60 % 40 20 0 10 0 20 0 30 40 10 50 60 20 62% 30 93% 0 1 80 90 40 10 20 30 .270 interrupt 110 120 130 0 10 20 30 150 70 160 1 2 idle 180 190 90 msec 100 % 13% 0 .320 170 80 44% 0 .658 140 60 35% working File: "qsort-mp.elog" 100 50 62% 0 .157 70 1 2 0 .198 touch_undet 1 2 3 msec .580 determine stealing Processors: 64 100 80 60 % 40 20 0 1 0 10 30% 2 3 20 30 47% 0 1 .130 5 6 40 50 60 12% 0 10 .361 interrupt 4 24% 0 10 2.066 working idle 7 8 70 80 44% 0 0 10 msec 9 90 100 % 22% .0 0 .388 .195 touch_undet determine 1 2 .459 stealing msec B.8. 201 QUEENS B.8 queens File: "queens-mp.elog" Processors: 64 100 80 60 % 40 20 0 5 0 0 10 10 15 20 20 63% 30 23% 0 1 40 10 interrupt 40 60 10 45 70 1 2 idle 55 90 msec 100 % 42% 1 0 .283 50 80 54% 0 .810 working File: "queens-mp.elog" 35 57% 0 1.362 30 50 57% 0 .179 25 touch_undet 1 0 .137 msec .232 determine stealing Processors: 64 100 80 60 % 40 20 .0 .5 0 10 34% 1.0 1.5 20 30 7% .0 .5 .138 interrupt 2.0 2.5 3.0 40 50 60 15% 0 1 2 .806 working 53% 0 1 2 3 3.5 4.0 70 80 55% .0 .260 idle touch_undet 90 100 % 36% .0 1.532 0 5.0 msec 4.5 .0 .134 determine .5 .203 stealing msec APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS 202 B.9 rantree File: "rantree-mp.elog" Processors: 64 100 80 60 % 40 20 0 1 0 2 0 3 4 5 6 10 7 8 9 20 65% .5 12 13 40 interrupt 16 17 18 19 20 60 10 21 22 70 23 1 idle 25 26 27 28 90 msec 100 % 36% 1 0 .280 24 80 55% 0 1.278 working File: "rantree-mp.elog" 15 42% 0 .969 14 50 68% 0 .157 11 30 21% .0 10 touch_undet 1 0 .144 msec .235 determine stealing Processors: 64 100 80 60 % 40 20 .0 .5 0 10 37% 1.0 1.5 20 30 13% .0 .5 .135 interrupt 2.0 2.5 3.0 40 50 60 21% 0 1 2 .761 working 44% 0 1 2 1.099 idle 3 3.5 4.0 70 80 54% .0 .5 .253 touch_undet 0 5.0 msec 4.5 90 100 % 36% .0 .5 .139 determine 0 1 .226 stealing msec B.10. 203 SCAN B.10 scan File: "scan-mp.elog" Processors: 64 100 80 60 % 40 20 0 5 0 0 10 15 10 67% 20 25 20 30 14% .5 .0 40 10 20 40 50 interrupt 50 55 60 70 10 60 .5 .0 idle 70 msec 90 100 % 24% .5 .0 .286 65 80 66% 1.244 working 45 60% 0 3.248 File: "scan-mp.elog" 35 58% 0 .170 30 touch_undet 1 0 .151 2 msec .288 determine stealing Processors: 64 100 80 60 % 40 20 .0 .5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 0 10 20 30 40 50 60 70 80 90 30% 9% .0 .5 .118 interrupt 7% 0 1 2 3 71% 0 1 1.832 working 2 3 2.601 idle .0 0 5.0 msec 100 % 29% 0 .1 .138 determine 1 .303 stealing msec APPENDIX B. 
EXECUTION PROFILES FOR PARALLEL BENCHMARKS 204 B.11 sum File: "sum-mp.elog" Processors: 64 100 80 60 % 40 20 0 1 0 2 0 3 4 5 6 10 7 8 9 20 64% 1 0 11 30 10% 13 14 40 10 interrupt 15 50 16 17 18 19 22 23 70 .5 24 25 80 26 27 28 90 msec 100 % 33% 0 .293 idle 21 64% .0 1.120 working 20 60 65% 0 2.460 File: "sum-mp.elog" 12 44% 0 .158 10 1 touch_undet 1 0 .152 msec .254 determine stealing Processors: 64 100 80 60 % 40 20 .0 .5 0 10 34% 1.0 1.5 20 30 10% .0 .5 .117 interrupt 2.0 40 8% 0 1 2 3 2.108 working 4 2.5 3.0 50 60 100% 0 1 2 2.245 idle 3 3.5 4.0 70 80 44% .0 touch_undet 90 100 % 28% .0 .218 0 5.0 msec 4.5 .0 .135 determine .5 .246 stealing msec B.12. 205 TRIDIAG B.12 tridiag File: "tridiag-mp.elog" Processors: 64 100 80 60 % 40 20 0 50 0 0 100 10 66% 20 30 17% 0 1 .170 150 40 50 81% 100 0 interrupt 60 70 50 .5 .0 msec 90 100 % 13% .5 .0 .325 idle 300 80 57% 4.276 working File: "tridiag-mp.elog" 250 42% 0 17.283 200 touch_undet 5 0 .155 msec .687 determine stealing Processors: 64 100 80 60 % 40 20 0 1 2 0 10 20 35% 3 4 5 30 40 50 7% .0 .5 .118 interrupt 6 7 60 70 7% 0 5 10 9 80 90 0 10 msec 100 % 11% 0 1 2 4.425 working 8 3 4 5 4.856 idle 6 0 1 .817 stealing msec 206 APPENDIX B. EXECUTION PROFILES FOR PARALLEL BENCHMARKS Bibliography [Adams and Rees, 1988] N. Adams and J. Rees. Object-oriented programming in Scheme. In Conference Record of the 1988 ACM Conference on Lisp and Functional Programming, pages 277{288, August 1988. [Agarwal, 1991] A. Agarwal. Performance tradeo s in multithreaded processors. Technical Report MIT/LCS/TR-501, Massachusetts Institute of Technology, Cambridge, MA, April 1991. [Appel, 1989] A. W. Appel. Allocation without locking. Software Practice and Experience, 19(7):703{705, July 1989. [Arvind and Nikhil, 1990] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token data ow architecture. IEEE Transactions on Computers, 39(3):300{ 318, March 1990. [Baker and Hewitt, 1978] H. Baker and C. Hewitt. The incremental garbage collection of processes. Technical Report AI Memo 454, Mass. Inst. of Technology, Arti cial Intelligence Laboratory, March 1978. [BBN, 1989] BBN Advanced Computers Inc., Cambridge, MA. Inside the GP1000, 1989. [BBN, 1990] BBN Advanced Computers Inc., Cambridge, MA. Inside the TC2000 Computer, 1990. [Bilardi and Nicolau, 1989] G. Bilardi and A. Nicolau. Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM Journal of Computing, 12(2):216{228, April 1989. [Callahan and Smith, 1989] D. Callahan and B. Smith. A future-based parallel language for a general-purpose highly-parallel computer. In Papers from the Second Workshop 207 208 BIBLIOGRAPHY on Languages and Compilers for Parallel Computing, pages 95{113. University of Illinois at Urbana-Champaign, 1989. [Censier and Feautrier, 1978] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, pages 1112{1118, December 1978. [Chaiken et al., 1991] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS directories: A scalable cache coherence scheme. In ASPLOS IV: Architectural Support for Programming Languages and Operating Systems, pages 224{234, 1991. [Clinger et al., 1988] W. Clinger, A. Hartheimer, and E. Ost. Implementation strategies for continuations. In Conference Record of the 1988 ACM Conference on Lisp and Functional Programming, pages 124{131, Snowbird, UT., July 1988. [Clinger, 1984] W. Clinger. The Scheme 311 compiler: an exercise in denotational semantics. 
In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 356{364, 1984. [Dijkstra, 1968] E. W. Dijkstra. Cooperating sequential processes. In Programming Languages, pages 43{112. Academic Press, 1968. [Dubois and Scheurich, 1990] M. Dubois and C. Scheurich. Memory access dependencies in shared-memory multiprocessors. IEEE Transactions on Software Engineering, 16(6):660{673, June 1990. [Feeley and Miller, 1990] M. Feeley and J. S. Miller. A parallel virtual machine for ecient Scheme compilation. In Proceedings of the 1990 ACM Conference on Lisp and Functional Programming, Nice, France, June 1990. [Feeley, 1993] M. Feeley. Polling eciently on stock hardware. In Proceedings of the 1993 ACM Conference on Functional Programming Languages and Computer Architecture, 1993. [Fra, 1990] Franz Inc., Berkeley, CA. Allegro CL User Manual, 1990. [Friedman and Haynes, 1985] D. P. Friedman and C. T. Haynes. Constraining control. In Proceedings of the Twelfth Annual Symposium on Principles of Programming Languages, pages 245{254, New Orleans, LA., January 1985. ACM. [Friedman et al., 1992] D. P. Friedman, M. Wand, and C. T. Haynes. Essentials of Programming Languages. MIT Press and McGraw-Hill, 1992. BIBLIOGRAPHY 209 [Gabriel and McCarthy, 1984] R. P. Gabriel and J. McCarthy. Queue-based multiprocessing Lisp. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 25{44, Austin, TX., August 1984. [Gabriel, 1985] R. P. Gabriel. Performance and Evaluation of Lisp Systems. Research Reports and Notes, Computer Systems Series. MIT Press, Cambridge, MA, 1985. [Gharachorloo et al., 1991] K. Gharachorloo, A. Gupta, and J. Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 245{257. ACM, April 1991. [Goldman and Gabriel, 1988] R. Goldman and R. P. Gabriel. Preliminary results with the initial implementation of Qlisp. In Conference Record of the 1988 ACM Conference on Lisp and Functional Programming, pages 143{152, Snowbird, UT., July 1988. [Goodman, 1983] J. R. Goodman. Using cache memory to reduce processor-memory trac. Proceedings of the 10th International Symposium on Computer Architecture, pages 124{131, June 1983. [Gray, 1986] S. L. Gray. Using futures to exploit parallelism in Lisp. Master's thesis, Mass. Inst. of Technology, 1986. [Halstead and Fujita, 1988] R. Halstead and T. Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443{451, 1988. [Halstead et al., 1986] R. Halstead, T. Anderson, R. Osborne, and T. Sterling. Concert: Design of a multiprocessor development system. In Int'l. Symp. on Computer Architecture, volume 13, pages 40{48, June 1986. [Halstead, 1984] R. Halstead. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 9{17, Austin, TX., August 1984. [Halstead, 1985] R. Halstead. Multilisp: A language for concurrent symbolic computation. In ACM Trans. on Prog. Languages and Systems, pages 501{538, October 1985. [Halstead, 1987] R. Halstead. Overview of concert Multilisp: A multiprocessor symbolic computing system. ACM Computer Architecture News, 15(1):5{14, March 1987. 210 BIBLIOGRAPHY [Haynes et al., 1984] C. T. Haynes, D. P. Friedman, and M. 
Wand. Continuations and coroutines. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 293{298, Austin, TX., 1984. [Haynes, 1986] Christopher T. Haynes. Logic continuations. In Proceedings of the Third International Conference on Logic Programming, pages 671{685. Springer-Verlag, July 1986. [Hieb et al., 1990] Robert Hieb, R. Kent Dybvig, and Carl Bruggeman. Representing control in the presence of rst-class continuations. In ACM SIGPLAN '89 Conf. on Programming Language Design and Implementation, pages 66{77, White Plains, New York, June 1990. [Hockney and Jesshope, 1988] R. W. Hockney and C. R. Jesshope. Parallel Computers 2. Adam Hilger, Bristol and Philadelphia, 1988. [IEEE Std 1178-1990, 1991] IEEE Std 1178-1990. IEEE Standard for the Scheme Programming Language. Institute of Electrical and Electronic Engineers, Inc., New York, NY, 1991. [Ito and Matsui, 1990] T. Ito and M. Matsui. A parallel Lisp language PaiLisp and its kernel speci cation. In Parallel Lisp: Languages and Systems, pages 58{100. Springer-Verlag, 1990. [Katz and Weise, 1990] M. Katz and D. Weise. Continuing into the future: on the interaction of futures and rst-class continuations. In Proceedings of the 1990 ACM Conference on Lisp and Functional Programming, Nice, France, June 1990. [Kessler and Swanson, 1990] R. Kessler and M. Swanson. Concurrent Scheme. In Parallel Lisp: Languages and Systems, pages 200{234. Springer-Verlag, 1990. [Kessler et al., 1992] R. Kessler, H. Carr, L. Stroller, and M. Swanson. Implementing concurrent Scheme for the May y distributed parallel processing system. Lisp and Symbolic Computation: An International Journal, 5(1/2):73{93, 1992. [Kranz et al., 1989] D. Kranz, R. Halstead, and E. Mohr. Mul-T: A high-performance parallel Lisp. In ACM SIGPLAN '89 Conf. on Programming Language Design and Implementation, pages 81{90, June 1989. [LeBlanc and Markatos, 1992] T. J. LeBlanc and E. P. Markatos. Shared memory vs. message passing in shared-memory multiprocessors. Technical report, University of Rochester, April 1992. BIBLIOGRAPHY 211 [Lenoski et al., 1992] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash multiprocessor. IEEE Computer, 25(3):63{79, March 1992. [Miller, 1987] J. S. Miller. MultiScheme: A Parallel Processing System Based on MIT Scheme. PhD thesis, Mass. Inst. of Technology, August 1987. Available as MIT LCS/TR/402. [Miller, 1988] J. S. Miller. Implementing a Scheme-based parallel processing system. International Journal of Parallel Processing, 17(5), October 1988. [Mohr, 1991] E. Mohr. Dynamic Partitioning of Parallel Lisp Programs. PhD thesis, Yale University Department of Computer Science, October 1991. [Mou, 1990] Z. G. Mou. A formal model of divide-and-conquer and its parallel realization. Computer science research report #795 (PhD dissertation), Yale University, 1990. [Murray, 1990] K. Murray. The future of Common Lisp: Higher performance through parallelism. In The rst European Conference on the Practical Application of Lisp, Cambridge, UK, March 1990. [Nikhil et al., 1991] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. Technical Report Computations Structures Group Memo 325{1, Mass. Inst. of Technology, Laboratory for Computer Science, Cambridge, MA, November 1991. [O'Krafka and Newton, 1990] B. W. O'Krafka and A. R. Newton. An empirical evaluation of two memory-ecient directory methods. 
In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 138{147. ACM, May 1990. [Osborne, 1989] R. Osborne. Speculative Computation in Multilisp. PhD thesis, Mass. Inst. of Technology, 1989. Available as MIT LCS/TR/464. [Peterson, 1981] G. L. Peterson. Myths about the mutual exclusion problem. Information Processing Letters, 12(3):115{116, 1981. [P ster et al., 1985] G. F. P ster, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuli e, E. A. Melton, V. A. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and architecture. International Conference on Parallel Processing, pages 764{771, 1985. [R3RS, 1986] Revised3 report on the algorithmic language Scheme. ACM Sigplan Notices, 21(12), December 1986. 212 BIBLIOGRAPHY [R4RS, 1991] Revised4 report on the algorithmic language Scheme. Technical Report MIT AI Memo 848b, Mass. Inst. of Technology, Cambridge, Mass., November 1991. [Rettberg et al., 1990] R. D. Rettberg, W. R. Crowther, P. P. Carvey, and R. S. Tomlinson. The Monarch parallel processor hardware design. IEEE Computer, 23(4):18{30, April 1990. [Rozas and Miller, 1991] G. Rozas and J. S. Miller. Free variables and rst-class environments. Lisp and Symbolic Computation: An International Journal, 3(4):107{141, 1991. [Rozas, 1987] G. Rozas. A computational model for observation in quantum mechanics. Master's thesis, Mass. Inst. of Technology, 1987. Available as MIT AI/TR/925. [Shivers, 1988] O. Shivers. Control ow analysis in Scheme. In ACM SIGPLAN '88 Conf. on Programming Language Design and Implementation, pages 164{174, Atlanta, Georgia, June 1988. [Shivers, 1991] O. Shivers. Data- ow analysis and type recovery in Scheme. In Peter Lee, editor, Topics in Advanced Language Implementation. The MIT Press, Cambridge, Mass., 1991. [Srini, 1986] V. P. Srini. An architectural comparison of data ow systems. IEEE Computer, 19(3):68{88, March 1986. [Steele, 1978] G. L. Steele. Rabbit: a compiler for Scheme. MIT AI Memo 474, Massachusetts Institute of Technology, Cambridge, Mass., May 1978. [Steinberg et al., 1986] S. Steinberg, D. Allen, L. Bagnall, and C. Scott. The Butter y Lisp system. In Proc. 1986 AAAI, volume 2, Philadelphia, PA, August 1986. [Swanson et al., 1988] M. Swanson, R. Kessler, and G. Lindstrom. An implementation of portable standard Lisp on the BBN Butter y. In Conference Record of the 1988 ACM Conference on Lisp and Functional Programming, pages 132{141, Snowbird, UT., July 1988. [Wand, 1980] M. Wand. Continuation-based program transformation strategies. Journal of the ACM, 27(1):164{180, 1980. [Weening, 1989] J. S. Weening. Parallel Execution of Lisp Programs. PhD thesis, Stanford University, Department of Computer Science, 1989. Available as STAN-CS-891265. BIBLIOGRAPHY 213 [Zorn et al., 1988] B. Zorn, P. Hil nger, K. Ho, J. Larus, and L. Semenzato. Features for multiprocessing in SPUR Lisp. Technical Report Report UCB/CSD 88/406, University of California, Computer Science Division (EECS), March 1988.