Copyright © 2008, 2009 Sun Microsystems, Inc. (“Sun”). All rights are reserved by Sun except as expressly stated as follows. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers, or to redistribute to lists, requires prior specific written permission of Sun.

1  The Big Messages

• Effective parallelism uses trees.
• Associative combining operators are good.
• MapReduce is good. Catamorphisms are good.
• There are systematic strategies for parallelizing superficially sequential code.
• We must lose the “accumulator” paradigm and emphasize “divide-and-conquer.”

2  This Talk Is about Performance

The bag of programming tricks that has served us so well for the last 50 years is the wrong way to think going forward and must be thrown out.

3  Why?

• Good sequential code minimizes the total number of operations.
  > Clever tricks to reuse previously computed results.
  > Good parallel code often performs redundant operations to reduce communication.
• Good sequential algorithms minimize space usage.
  > Clever tricks to reuse storage.
  > Good parallel code often requires extra space to permit temporal decoupling.
• Sequential idioms stress linear problem decomposition.
  > Process one thing at a time and accumulate results.
  > Good parallel code usually requires multiway problem decomposition and multiway aggregation of results.

4–5  Let’s Add a Bunch of Numbers

  SUM = 0                  !Oops!
  DO I = 1, 1000000
    SUM = SUM + X(I)
  END DO

Can it be parallelized? This is already bad! Clever compilers have to undo this.

6  What Does a Mathematician Say?

  ∑_{i=1}^{1000000} x_i      or maybe just      ∑ x

Compare Fortran 90: SUM(X). What, not how. No commitment yet as to strategy. This is good.

7  Sequential Computation Tree

  SUM = 0
  DO I = 1, 1000000
    SUM = SUM + X(I)
  END DO

8  Atomic Update Computation Tree (a)

  SUM = 0
  PARALLEL DO I = 1, 1000000
    SUM = SUM + X(I)        !Wrong!
  END DO

Race condition can cause updates to be lost.

9  Atomic Update Computation Tree (b)

  SUM = 0
  PARALLEL DO I = 1, 1000000
    ATOMIC SUM = SUM + X(I)
  END DO

10  Parallel Computation Tree

What sort of code should we write to get a computation tree of this shape? What sort of code would we like to write?

11  Finding the Length of a LISP List

Recursive:

  (define (length list)
    (cond ((null? list) 0)
          (else (+ 1 (length (rest list))))))

Total work: Θ(n). Delay: Ω(n).

12  Linear versus Multiway Decomposition

• Linearly linked lists are inherently sequential.
  > Compare Peano arithmetic: 5 = ((((0+1)+1)+1)+1)+1
  > Binary arithmetic is much more efficient than unary!
• We need a multiway decomposition paradigm:

    length [ ]      = 0
    length [a]      = 1
    length (a ++ b) = (length a) + (length b)

This is just a summation problem: adding up a bunch of 1’s! Total work: Θ(n). Delay: Ω(log n), O(n), depending on how a ++ b is split; even worse if splitting has worse than constant cost.
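Before turning to conc lists, it may help to see the multiway idea as executable code. Here is a minimal Scheme sketch (mine, not from the slides; the name sum-split is hypothetical): summing a vector by splitting the index range in half, in the spirit of length (a ++ b) = length a + length b. Because + is associative, the two recursive calls are independent and could run in parallel.

  ;; Sum the elements v[lo..hi) by divide-and-conquer.
  (define (sum-split v lo hi)
    (cond ((= lo hi) 0)                       ; empty range: identity of +
          ((= (+ lo 1) hi) (vector-ref v lo)) ; singleton range
          (else
            (let ((mid (quotient (+ lo hi) 2)))
              ;; The two halves share no state: a parallel runtime may
              ;; evaluate them simultaneously and combine with +.
              (+ (sum-split v lo mid)
                 (sum-split v mid hi))))))

  (sum-split (vector 1 2 3 4 5) 0 5)  ; => 15

Contrast this with the accumulating DO loop of slide 4: there, every addition depends on the one before it; here, the dependence chain is only as deep as the tree.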
13  Conc Lists

Three kinds of conc list:

• the empty list:  ⟨ ⟩
• a singleton:  ⟨23⟩
• a concatenation:  a ∥ b

14  Conc List for ⟨23, 47, 18, 11⟩ (1)

  (⟨23⟩ ∥ ⟨47⟩) ∥ (⟨18⟩ ∥ ⟨11⟩)

15–17  Conc Lists for ⟨23, 47, 18, 11⟩ (2–4)

[Diagrams: several other tree shapes for the same list, including unbalanced parenthesizations such as ((⟨23⟩ ∥ ⟨47⟩) ∥ ⟨18⟩) ∥ ⟨11⟩ and shapes whose subtrees include the empty list ⟨ ⟩. Many distinct trees represent the same conc list.]

18  Primitives on Lists (1)

  cons lists:
    constructors:  ’(), (cons a ys)
    predicates:    null?
    accessors:     car, cdr
    law:           (cons (car xs) (cdr xs)) = xs

  conc lists:
    constructors:  ’(), (list a), (conc ys zs)
    predicates:    null?, singleton?
    accessors:     item, left, right
    laws:          (list (item s)) = s
                   (conc (left xs) (right xs)) = xs

19  Primitives on Lists (2)

The same, but with the accessors left and right replaced by split, which hands both halves to a continuation:

  conc lists:
    constructors:  ’(), (list a), (conc ys zs)
    predicates:    null?, singleton?
    accessors:     item, split
    laws:          (list (item s)) = s
                   (split xs (λ (ys zs) (conc ys zs))) = xs

20  Primitives on Lists (3)

The split law can be stated more compactly:

  (split xs conc) = xs

21–22  Defining Lists Using car cdr cons

  (define (first x)                 ;Constant time
    (cond ((null? x) ’())
          (else (car x))))

  (define (rest x)                  ;Constant time
    (cond ((null? x) ’())
          (else (cdr x))))

  (define (append xs ys)            ;Linear in (length xs)
    (cond ((null? xs) ys)
          (else (cons (car xs) (append (cdr xs) ys)))))

  (define (addleft a xs)            ;Constant time
    (cons a xs))

  (define (addright xs a)           ;Linear in (length xs)
    (cond ((null? xs) (list a))
          (else (cons (car xs) (addright (cdr xs) a)))))

23  Defining Lists Using item list split conc (1)

  (define (first xs)                ;Depth of left path
    (cond ((null? xs) ’())
          ((singleton? xs) (item xs))
          (else (split xs (λ (ys zs) (first ys))))))

  (define (rest xs)                 ;Depth of left path
    (cond ((null? xs) ’())
          ((singleton? xs) ’())
          (else (split xs (λ (ys zs) (append (rest ys) zs))))))

  (define (append xs ys)            ;Constant time
    (cond ((null? xs) ys)
          ((null? ys) xs)
          (else (conc xs ys))))

24  Defining Lists Using item list split conc (2)

The same, except that append rebalances the result:

  (define (append xs ys)            ;???
    (cond ((null? xs) ys)
          ((null? ys) xs)
          (else (REBALANCE (conc xs ys)))))

25  Defining Lists Using item list split conc (3)

  (define (addleft a xs)
    (cond ((null? xs) (list a))
          ((singleton? xs) (append (list a) xs))
          (else (split xs (λ (ys zs) (append (addleft a ys) zs))))))

  (define (addright xs a)
    (cond ((null? xs) (list a))
          ((singleton? xs) (append xs (list a)))
          (else (split xs (λ (ys zs) (append ys (addright zs a)))))))
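To experiment with these definitions, one needs a concrete representation of conc lists. Here is one possible Scheme encoding (my own; the talk deliberately leaves the representation abstract). The names single and first-cl are mine, chosen to avoid shadowing Scheme’s built-in list and the slide’s first; first-cl is otherwise a direct transcription of slide 23.

  ;; A toy conc-list representation as tagged pairs.
  (define (single a)     (list 'leaf a))      ; the talk's `list` constructor
  (define (conc ys zs)   (list 'node ys zs))
  (define (singleton? x) (and (pair? x) (eq? (car x) 'leaf)))
  (define (item s)       (cadr s))
  (define (split xs k)   (k (cadr xs) (caddr xs)))  ; pass both halves to k

  ;; `first` from slide 23: walk down the left spine.
  (define (first-cl xs)
    (cond ((null? xs) '())
          ((singleton? xs) (item xs))
          (else (split xs (lambda (ys zs) (first-cl ys))))))

  (first-cl (conc (conc (single 23) (single 47))
                  (conc (single 18) (single 11))))  ; => 23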
26  Defining Lists Using item list split conc (4)

Or, more simply, in terms of constant-time append:

  (define (addleft a xs)
    (append (list a) xs))

  (define (addright xs a)
    (append xs (list a)))

27  map reduce mapreduce Using car cdr cons

  (map (λ (x) (* x x)) ’(1 2 3))            =>  (1 4 9)
  (reduce + 0 ’(1 4 9))                     =>  14
  (mapreduce (λ (x) (* x x)) + 0 ’(1 2 3))  =>  14

  (define (map f xs)                ;Linear in (length xs)
    (cond ((null? xs) ’())
          (else (cons (f (car xs)) (map f (cdr xs))))))

  (define (reduce g id xs)          ;Linear in (length xs)
    (cond ((null? xs) id)
          (else (g (car xs) (reduce g id (cdr xs))))))

  (define (mapreduce f g id xs)     ;Linear in (length xs)
    (cond ((null? xs) id)
          (else (g (f (car xs)) (mapreduce f g id (cdr xs))))))

28  length filter Using car cdr cons

  (define (length xs)               ;Linear in (length xs)
    (mapreduce (λ (q) 1) + 0 xs))

  (define (filter p xs)             ;Linear in (length xs)
    (cond ((null? xs) ’())
          ((p (car xs)) (cons (car xs) (filter p (cdr xs))))
          (else (filter p (cdr xs)))))

  (define (filter p xs)             ;Linear in (length xs)??
    (apply append (map (λ (x) (if (p x) (list x) ’())) xs)))

  (define (filter p xs)             ;Linear in (length xs)!!
    (mapreduce (λ (x) (if (p x) (list x) ’())) append ’() xs))

The latter analysis depends on a crucial fact: in this situation, each call to append will require constant, not linear, time!

29  reverse Using car cdr cons

  (define (reverse xs)              ;QUADRATIC in (length xs)
    (cond ((null? xs) ’())
          (else (addright (reverse (cdr xs)) (car xs)))))

  (define (revappend xs ys)         ;Linear in (length xs)
    (cond ((null? xs) ys)
          (else (revappend (cdr xs) (cons (car xs) ys)))))

  (define (reverse xs)              ;Linear in (length xs)
    (revappend xs ’()))

Structural recursion on cons lists produces poor performance for reverse. An accumulation trick gets it down to linear time.

30  Parallel map reduce mapreduce Using item list split conc

  (define (mapreduce f g id xs)     ;Logarithmic in (length xs)??
    (cond ((null? xs) id)
          ((singleton? xs) (f (item xs)))
          (else (split xs (λ (ys zs)
                  (g (mapreduce f g id ys)        ;Opportunity for
                     (mapreduce f g id zs)))))))  ; parallelism

  (define (map f xs)
    (mapreduce (λ (x) (list (f x))) append ’() xs))   ;or conc

  (define (reduce g id xs)
    (mapreduce (λ (x) x) g id xs))

31  Parallel length filter reverse Using item list split conc

  (define (length xs)               ;Logarithmic in (length xs)??
    (mapreduce (λ (q) 1) + 0 xs))

  (define (filter p xs)             ;Logarithmic in (length xs)??
    (mapreduce (λ (x) (if (p x) (list x) ’())) append ’() xs))

  (define (reverse xs)              ;Logarithmic in (length xs)??
    (mapreduce list (λ (ys zs) (append zs ys)) ’() xs))

32  Exercise: Write Mergesort and Quicksort in This Binary-split Style

• Quicksort: structural induction on output
  > Carefully split the input into lower and upper halves (tricky)
  > Recursively sort the two halves
  > Cheaply append the two sorted sublists
• Mergesort: structural induction on input
  > Cheaply split the input in half
  > Recursively sort the two halves
  > Carefully merge the two sorted sublists (tricky)

33  Filters in Fortress (1)

  sequentialFilter⟦E⟧(p: E → Boolean, xs: List⟦E⟧): List⟦E⟧ = do
    result: List⟦E⟧ := ⟨ ⟩
    for a ← seq(xs) do
      if p(a) then result := result.addRight(a) end
    end
    result
  end

So what language is this? Fortress.

Example of use:

  odd(x: Z) = ((x MOD 2) ≠ 0)
  sequentialFilter(odd, ⟨1, 4, 7, 2, 5, 3⟩)   produces   ⟨1, 7, 5, 3⟩
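The conc-list mapreduce of slides 30–31 runs nicely on the toy encoding sketched earlier. Here is a self-contained Scheme version (again mine, with hypothetical names filter-mr and single):

  ;; Self-contained: rebuild the toy conc-list constructors.
  (define (single a)   (list 'leaf a))
  (define (conc ys zs) (list 'node ys zs))

  ;; mapreduce: g must be associative with identity id; the two
  ;; recursive calls are independent, hence parallelizable.
  (define (mapreduce f g id xs)
    (cond ((null? xs) id)
          ((eq? (car xs) 'leaf) (f (cadr xs)))
          (else (g (mapreduce f g id (cadr xs))
                   (mapreduce f g id (caddr xs))))))

  ;; filter via mapreduce, reducing to ordinary Scheme lists for easy
  ;; inspection. (With append on cons lists the combining cost grows;
  ;; reducing to conc lists would keep each combine constant-time.)
  (define (filter-mr p xs)
    (mapreduce (lambda (x) (if (p x) (list x) '())) append '() xs))

  (filter-mr odd? (conc (single 1) (conc (single 4) (single 7))))  ; => (1 7)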
34  Filters in Fortress (2)

  recursiveFilter⟦E⟧(p: E → Boolean, xs: List⟦E⟧): List⟦E⟧ =
    if xs.isEmpty() then ⟨ ⟩
    else
      (first, rest) = xs.extractLeft().get()
      rest′ = recursiveFilter(p, rest)
      if p(first) then rest′.addLeft(first) else rest′ end
    end

Still linear-time delay.

35–36  Filters in Fortress (3a, 3b)

  parallelFilter⟦E⟧(p: E → Boolean, xs: List⟦E⟧): List⟦E⟧ =
    if |xs| = 0 then ⟨ ⟩
    elif |xs| = 1 then
      (x, _) = xs.extractLeft().get()
      if p(x) then ⟨x⟩ else ⟨ ⟩ end
    else
      (ys, zs) = xs.split()
      parallelFilter(p, ys) ∥ parallelFilter(p, zs)
    end

37  Filters in Fortress (3c)

The same parallelFilter, restated as a big reduction:

  reductionFilter⟦E⟧(p: E → Boolean, xs: List⟦E⟧): List⟦E⟧ =
    ∥[x←xs] (if p(x) then ⟨x⟩ else ⟨ ⟩ end)

38  Filters in Fortress (4)

Actually, filters are so useful that they are built into the Fortress comprehension notation in the usual way:

  comprehensionFilter⟦E⟧(p: E → Boolean, xs: List⟦E⟧): List⟦E⟧ =
    ⟨ x | x ← xs, p(x) ⟩

Oh, yes, and the big operators:

  ∑[i←1:1000000] x_i  and  MAX[i←1:1000000] x_i
  or maybe  ∑[a←x] a  and  MAX[a←x] a
  or maybe just  ∑ x  and  MAX x

39  Point of Order

For filter, unlike summation, we rely on maintaining the original order of the elements in the input list. (Both ∥ and + are associative, but only + is commutative.)

Do not confuse the ordering of elements in the result list (which is a spatial order) with the order in which they are computed (which is a temporal order). Sequential programming often ties the one to the other. Good parallel programming removes this unnecessary coupling. This strategy for parallelism relies only on associativity, not commutativity.

40  Conjugate Transforms

A very simple but very powerful idea. Instead of mapping input items directly to the output data type T:

• Map inputs (maybe by way of T) to a richer data type U.
• Perform computations in this richer space U (chosen to make the computation simpler or faster).
• Finally, project the result from U back into T.

41  The Three-way Unshuffle Problem (1)

• Goal: deal a deck of cards into three piles.
• Example: from ⟨a, b, c, d, e, f, g, h, i, j, k⟩, produce
  (⟨a, d, g, j⟩, ⟨b, e, h, k⟩, ⟨c, f, i⟩).
• Base cases:
  > ⟨ ⟩ yields (⟨ ⟩, ⟨ ⟩, ⟨ ⟩)
  > ⟨a⟩ yields (⟨a⟩, ⟨ ⟩, ⟨ ⟩)
• Combining: let’s consider input ⟨a, b, c, d⟩
  > (⟨a⟩, ⟨ ⟩, ⟨ ⟩) plus (⟨b⟩, ⟨ ⟩, ⟨ ⟩) yields (⟨a⟩, ⟨b⟩, ⟨ ⟩)
  > (⟨c⟩, ⟨ ⟩, ⟨ ⟩) plus (⟨d⟩, ⟨ ⟩, ⟨ ⟩) yields (⟨c⟩, ⟨d⟩, ⟨ ⟩)
  > (⟨a⟩, ⟨b⟩, ⟨ ⟩) plus (⟨c⟩, ⟨d⟩, ⟨ ⟩) yields (⟨a, d⟩, ⟨b⟩, ⟨c⟩)

We always perform three concatenations; we just need to pair them up correctly. How?

42  The Three-way Unshuffle Problem (2)

  unshuffle(xs: List⟦Z⟧): (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧) =
    if |xs| = 0 then (⟨ ⟩, ⟨ ⟩, ⟨ ⟩)
    elif |xs| = 1 then (xs, ⟨ ⟩, ⟨ ⟩)
    else
      (ys, zs) = xs.split()
      ((a, b, c), (d, e, f)) = (unshuffle ys, unshuffle zs)
      if |c| = |a| then (a ∥ d, b ∥ e, c ∥ f)
      elif |b| = |a| then (a ∥ e, b ∥ f, c ∥ d)
      else (a ∥ f, b ∥ d, c ∥ e) end
    end

Unfortunately, the tests |c| = |a| and |b| = |a| are slow.
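A direct Scheme transcription (mine, over ordinary lists, with take and drop defined inline) shows the combining rule doing its job; the length tests that slide 43 eliminates are still present here:

  (define (take xs n) (if (= n 0) '() (cons (car xs) (take (cdr xs) (- n 1)))))
  (define (drop xs n) (if (= n 0) xs (drop (cdr xs) (- n 1))))

  (define (unshuffle xs)
    (cond ((null? xs) (list '() '() '()))
          ((null? (cdr xs)) (list xs '() '()))
          (else
            (let* ((n (quotient (length xs) 2))
                   (l (unshuffle (take xs n)))   ; piles from the left half
                   (r (unshuffle (drop xs n)))   ; piles from the right half
                   (a (car l)) (b (cadr l)) (c (caddr l))
                   (d (car r)) (e (cadr r)) (f (caddr r)))
              ;; Pair the three concatenations according to pile sizes.
              (cond ((= (length c) (length a))   ; left half length = 0 mod 3
                     (list (append a d) (append b e) (append c f)))
                    ((= (length b) (length a))   ; left half length = 2 mod 3
                     (list (append a e) (append b f) (append c d)))
                    (else                        ; left half length = 1 mod 3
                     (list (append a f) (append b d) (append c e))))))))

  (unshuffle '(a b c d e f g h i j k))
  ;; => ((a d g j) (b e h k) (c f i))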
43  The Three-way Unshuffle Problem (3)

Solution: project the inputs into a space of 4-tuples, not 3-tuples; the fourth component carries the length of the segment processed so far, MOD 3.

  unshuffle′(xs: List⟦Z⟧): (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧, Z) =
    if |xs| = 0 then (⟨ ⟩, ⟨ ⟩, ⟨ ⟩, 0)
    elif |xs| = 1 then (xs, ⟨ ⟩, ⟨ ⟩, 1)
    else
      (ys, zs) = xs.split()
      ((a, b, c, j), (d, e, f, k)) = (unshuffle′ ys, unshuffle′ zs)
      case j of
        0 ⇒ (a ∥ d, b ∥ e, c ∥ f, (j + k) MOD 3)
        1 ⇒ (a ∥ f, b ∥ d, c ∥ e, (j + k) MOD 3)
        2 ⇒ (a ∥ e, b ∥ f, c ∥ d, (j + k) MOD 3)
      end
    end

Now project the result 4-tuple back to the desired 3-tuple:

  unshuffle(xs: List⟦Z⟧): (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧) = do
    (as, bs, cs, m) = unshuffle′ xs
    (as, bs, cs)
  end

44  The Three-way Unshuffle Problem (4)

Abstract the combining operator as ⊙ and then use “big ⊙”:

  opr ⊙(p: (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧, Z),
        q: (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧, Z)) = do
    ((a, b, c, j), (d, e, f, k)) = (p, q)
    case j of
      0 ⇒ (a ∥ d, b ∥ e, c ∥ f, (j + k) MOD 3)
      1 ⇒ (a ∥ f, b ∥ d, c ∥ e, (j + k) MOD 3)
      2 ⇒ (a ∥ e, b ∥ f, c ∥ d, (j + k) MOD 3)
    end
  end

  unshuffle(xs: List⟦Z⟧): (List⟦Z⟧, List⟦Z⟧, List⟦Z⟧) = do
    (as, bs, cs, m) = ⊙[x←xs] (⟨x⟩, ⟨ ⟩, ⟨ ⟩, 1)
    (as, bs, cs)
  end

45  Mergesort Is a Catamorphism

Abstract merging as the operator MERGE and then use “big MERGE”:

  opr MERGE(p: List⟦Z⟧, q: List⟦Z⟧) = do
    (* Merge two sorted lists into a new sorted list *)
    ...
  end

(The MERGE operator is associative and has identity ⟨ ⟩; thus lists under the MERGE operator form a monoid.)

  mergesort(xs: List⟦Z⟧): List⟦Z⟧ = MERGE[x←xs] ⟨x⟩

46  The Parallel Prefix Problem (1)

• Goal: compute the running totals and final sum of a sequence.
• Example: from ⟨a, b, c, d, e⟩, produce
  (⟨0, a, a+b, a+b+c, a+b+c+d⟩, a+b+c+d+e).
• Example: from ⟨1, 2, 3, 6, 3, 4, −5, 2, 9⟩, produce
  (⟨0, 1, 3, 6, 12, 15, 19, 14, 16⟩, 25).
• This is not the only possible formulation, but it is convenient for our purposes here.
• Parallel prefix is useful with any monoid; for simplicity we will restrict our attention to addition on integers.

47  The Parallel Prefix Problem (2)

  ppsum(xs: List⟦Z⟧): (List⟦Z⟧, Z) =
    ( ⟨ ∑ xs[0 : k−1] | k ← 0 : |xs|−1 ⟩, ∑ xs )

Unfortunately, the total work here is quadratic in |xs|.

48  The Parallel Prefix Problem (3)

  ppsum(xs: List⟦Z⟧): (List⟦Z⟧, Z) =
    if |xs| = 0 then (⟨ ⟩, 0)
    elif |xs| = 1 then (⟨0⟩, xs₀)
    else
      (ys, zs) = xs.split()
      ((ps, a), (qs, b)) = (ppsum ys, ppsum zs)
      (ps ∥ ⟨ a + q | q ← qs ⟩, a + b)
    end

Now the total work is Θ(n log n), much better than quadratic. Unfortunately, the dependency on a results in delay Ω((log n)²).

49  The Parallel Prefix Problem (4)

  ppsum(xs: List⟦Z⟧): (List⟦Z⟧, Z) = ppsum′(0, xs)

  ppsum′(k: Z, xs: List⟦Z⟧): (List⟦Z⟧, Z) =
    if |xs| = 0 then (⟨ ⟩, k)
    elif |xs| = 1 then (⟨k⟩, k + xs₀)
    else
      (ys, zs) = xs.split()
      (ps, a) = ppsum′(k, ys)
      (qs, b) = ppsum′(a, zs)
      (ps ∥ qs, b)
    end

One pass, and the total work is now linear in |xs|. Unfortunately, the dependency on a results in delay Ω(n).

50  The Parallel Prefix Problem (5)

A solution in NESL from Guy Blelloch’s 2009 PPoPP talk:

  function scan(A, op) =
    if (#A <= 1) then [0]
    else
      let sums  = {op(A[2*i], A[2*i+1]) : i in [0:#A/2]};
          evens = scan(sums, op);
          odds  = {op(evens[i], A[2*i]) : i in [0:#A/2]};
      in interleave(evens, odds);

See slide 11 of http://www.cs.cmu.edu/~blelloch/papers/PPoPP09.pdf

Using a tree representation: total work Θ(n), delay Ω((log n)²). Can we do better?

51  Monoid-cached Trees

[Diagrams: conc trees in which every node carries a cached monoid value, built from Empty, Leaf, and Node constructors: one tree caching subtree sums, another caching subtree lengths.]

See Hinze, Ralf, and Paterson, Ross. “Finger Trees: A Simple General-purpose Data Structure.” Journal of Functional Programming 16 (2): 2006, 197–217.
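Before the Fortress declaration on the next slide, here is the idea in miniature Scheme form (my encoding, with hypothetical names leaf, node, and val): every node stores the monoid product of its subtree, computed once at construction time.

  ;; A sum-cached tree: each node caches + over its leaves, so the
  ;; total of any subtree can be read off in constant time.
  (define (leaf x)   (vector 'leaf x x))    ; item and its cached sum
  (define (node l r) (vector 'node l r (+ (val l) (val r))))
  (define (val t)
    (if (eq? (vector-ref t 0) 'leaf) (vector-ref t 2) (vector-ref t 3)))

  (define t (node (leaf 3) (node (leaf 4) (leaf 5))))
  (val t)  ; => 12, without re-walking the tree

For an arbitrary monoid, pass the combining operation and its identity as parameters instead of hard-wiring + and 0.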
52  Declaring Monoid-cached Trees in Fortress

  trait ValueTree⟦T, V⟧
      comprises { Empty⟦T, V⟧, Leaf⟦T, V⟧, Node⟦T, V⟧ }
    val: V
  end

  object Empty⟦T, V⟧(val: V) extends ValueTree⟦T, V⟧ end

  object Leaf⟦T, V⟧(item: T, val: V) extends ValueTree⟦T, V⟧ end

  object Node⟦T, V⟧(left: ValueTree⟦T, V⟧, val: V,
                    right: ValueTree⟦T, V⟧) extends ValueTree⟦T, V⟧ end

53  The Parallel Prefix Problem (6)

  ppsum(xs: List⟦Z⟧): (List⟦Z⟧, Z) = do
    c = sumcache xs
    (ppfinish(0, c), c.val)
  end

  sumcache(xs: List⟦Z⟧): ValueTree⟦Z, Z⟧ =
    if |xs| = 0 then Empty⟦Z, Z⟧(0)
    elif |xs| = 1 then Leaf⟦Z, Z⟧(xs₀, xs₀)
    else
      (ys, zs) = xs.split()
      (p, q) = (sumcache ys, sumcache zs)
      Node⟦Z, Z⟧(p, p.val + q.val, q)
    end

  ppfinish(k: Z, _: Empty⟦Z, Z⟧) = ⟨ ⟩
  ppfinish(k: Z, _: Leaf⟦Z, Z⟧) = ⟨k⟩
  ppfinish(k: Z, n: Node⟦Z, Z⟧) =
    ppfinish(k, n.left) ∥ ppfinish(k + n.left.val, n.right)

Total work Θ(n); delay Ω(log n).

It would be nice to have a simple facility to cache any monoid in a tree. It is straightforward to cache more than one monoid, because the cross-product of two monoids is a monoid. Deforestation of monoid-cached trees may turn out to be an important optimization.

54  MapScanZip (cf. MapReduce)

  MapScanZip op id f xs = zip xs (scan op id (map f xs))

where scan is the parallel prefix operation, parameterized by an associative operation op and its identity id. Monoid-cached trees provide a fast implementation. Note that zip is difficult in the general case, because the shapes of the two trees might not match, but this case is easy.

55  Splitting a String into Words (1)

• Given: a string
• Result: a list of strings, the words separated by spaces
  > Words must be nonempty
  > Words may be separated by more than one space
  > The string may or may not begin (or end) with spaces

56  Splitting a String into Words (2)

• Tests:

    println words(“This is a sample”)
    println words(“  Here is  another   sample  ”)
    println words(“JustOneWord”)
    println words(“   ”)
    println words(“”)

• Expected output:

    ⟨This, is, a, sample⟩
    ⟨Here, is, another, sample⟩
    ⟨JustOneWord⟩
    ⟨ ⟩
    ⟨ ⟩

57  Splitting a String into Words (3)

  words(s: String) = do
    result: List⟦String⟧ := ⟨ ⟩
    word: String := “”
    for c ← seq(s) do
      if (c = ‘ ’) then
        if (word ≠ “”) then result := result ∥ ⟨word⟩ end
        word := “”
      else
        word := word ∥ c
      end
    end
    if (word ≠ “”) then result := result ∥ ⟨word⟩ end
    result
  end

58–59  Splitting a String into Words (4a, 4b)

[Diagrams: a string split at an arbitrary point; each half is summarized either as a chunk of non-space characters or as a segment (complete words flanked by two partial words), and the two summaries are then combined.]

60  Splitting a String into Words (5)

  maybeWord(s: String): List⟦String⟧ =
    if s = “” then ⟨ ⟩ else ⟨s⟩ end

  trait WordState
      extends { Associative⟦WordState, ⊕⟧ }
      comprises { Chunk, Segment }
    opr ⊕(self, other: WordState): WordState
  end

61  Splitting a String into Words (6)

  object Chunk(s: String) extends WordState
    opr ⊕(self, other: Chunk): WordState = Chunk(s ∥ other.s)
    opr ⊕(self, other: Segment): WordState =
      Segment(s ∥ other.l, other.A, other.r)
  end

62  Splitting a String into Words (7)

  object Segment(l: String, A: List⟦String⟧, r: String)
      extends WordState
    opr ⊕(self, other: Chunk): WordState = Segment(l, A, r ∥ other.s)
    opr ⊕(self, other: Segment): WordState =
      Segment(l, A ∥ maybeWord(r ∥ other.l) ∥ other.A, other.r)
  end

63  Splitting a String into Words (8)

  processChar(c: Character): WordState =
    if (c = ‘ ’) then Segment(“”, ⟨ ⟩, “”)
    else Chunk(c) end

  words(s: String) = do
    g = ⊕[c←s] processChar(c)   (* All the parallelism happens here *)
    typecase g of
      Chunk   ⇒ maybeWord(g.s)
      Segment ⇒ maybeWord(g.l) ∥ g.A ∥ maybeWord(g.r)
    end
  end
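The Chunk/Segment algebra of slides 60–63 is small enough to test directly. A Scheme sketch (my encoding: a Chunk is a bare string; a Segment is a tagged triple of left fringe, complete words, right fringe); a sequential fold stands in here for the parallel BIG ⊕:

  (define (seg l ws r)   (list 'seg l ws r))
  (define (seg? x)       (and (pair? x) (eq? (car x) 'seg)))
  (define (maybe-word s) (if (string=? s "") '() (list s)))

  ;; The associative combining operator (slides 61-62).
  (define (oplus x y)
    (cond ((and (not (seg? x)) (not (seg? y)))     ; Chunk (+) Chunk
           (string-append x y))
          ((not (seg? x))                          ; Chunk (+) Segment
           (seg (string-append x (cadr y)) (caddr y) (cadddr y)))
          ((not (seg? y))                          ; Segment (+) Chunk
           (seg (cadr x) (caddr x) (string-append (cadddr x) y)))
          (else                                    ; Segment (+) Segment:
           (seg (cadr x)                           ; a word may be completed
                (append (caddr x)
                        (maybe-word (string-append (cadddr x) (cadr y)))
                        (caddr y))
                (cadddr y)))))

  (define (process-char c)
    (if (char=? c #\space) (seg "" '() "") (string c)))

  (define (words s)
    (let loop ((cs (string->list s)) (g ""))       ; "" is the identity Chunk
      (if (null? cs)
          (if (seg? g)
              (append (maybe-word (cadr g)) (caddr g) (maybe-word (cadddr g)))
              (maybe-word g))
          (loop (cdr cs) (oplus g (process-char (car cs)))))))

  (words "  Here is  another   sample  ")  ; => ("Here" "is" "another" "sample")

Because oplus is associative, the fold could equally well be replaced by a balanced treewise reduction, which is exactly what the Fortress BIG ⊕ does.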
64  Splitting a String into Words (9)

  (* The mechanics of BIG ⊕ *)
  opr BIG ⊕⟦T⟧(g: (Reduction⟦WordState⟧, T → WordState) → WordState): WordState =
    g(GlomReduction, identity⟦WordState⟧)

  object GlomReduction extends Reduction⟦WordState⟧
    getter toString() = “GlomReduction”
    empty(): WordState = Chunk(“”)   (* Identity value *)
    join(a: WordState, b: WordState): WordState = a ⊕ b
  end

65  To Summarize: A Big Idea

• Summations and list constructors and loops are alike!

    ∑[i←1:1000000] x_i²
    ⟨ x_i² | i ← 1:1000000 ⟩
    for i ← 1:1000000 do x_i := x_i² end

  > Generate an abstract collection
  > The body computes a function of each item
  > Combine the results (or just synchronize)
  > In other words: generate-and-reduce
• Whether to be sequential or parallel is a separable question
  > That’s why they are especially good abstractions!
  > Make the decision on the fly, to use available resources

66  Another Big Idea

• Formulate a sequential loop (or finite-state machine) as successive applications of state transformation functions f_i
• Find an efficient way to compute and represent compositions of such functions (this step requires ingenuity)
• Instead of computing

    s := s0; for i ← seq(1:1000000) do s := f_i(s) end

  compute

    s := (∘[i←1:1000000] f_i) s0

• Because function composition is associative, the latter has a parallel strategy
• If you need intermediate results, use parallel prefix on the function compositions; then map down the result, applying each to s0

See, for example, Hillis, W. D., and Steele, G. L. “Data parallel algorithms.” Comm. ACM 29, 12 (Dec. 1986), 1170–1183.

67  Splitting a String into Words (3, again)

  words(s: String) = do
    result: List⟦String⟧ := ⟨ ⟩
    word: String := “”
    for k ← seq(0 # length(s)) do
      char = substring(s, k, k + 1)
      if (char = “ ”) then
        if (word ≠ “”) then result := result ∥ ⟨word⟩ end
        word := “”
      else
        word := word ∥ char
      end
    end
    if (word ≠ “”) then result := result ∥ ⟨word⟩ end
    result
  end

68  Automatic Construction of Parallel Code

If you can construct two sequential versions of a function that is a homomorphism on lists, one that operates left-to-right and one right-to-left, then there is a technique for constructing a parallel version automatically. Just derive a weak right inverse function and then apply the Third Homomorphism Theorem. See, it’s easy!

Morita, K., Morihata, A., Matsuzaki, K., Hu, Z., and Takeichi, M. “Automatic inversion generates divide-and-conquer parallel programs.” Proc. 2007 ACM SIGPLAN PLDI, 146–155.

There is an analogous result for tree homomorphisms:

Morihata, A., Matsuzaki, K., Hu, Z., and Takeichi, M. “The third homomorphism theorem on trees: Downward and upward lead to divide-and-conquer.” Proc. 2009 ACM SIGPLAN-SIGACT POPL, 177–185.

Full disclosure: the authors of these papers were members of a research group at the University of Tokyo that has had a collaborative research agreement with the Programming Language Research group at Sun Microsystems Laboratories.

69  We Need a New Mindset for Multicores

• DO loops are so 1950s! (Literally: Fortran is now 50 years old.)
• So are linear linked lists! (Literally: Lisp is now 50 years old.)
• Java™-style iterators are so last millennium!
• Even arrays are suspect! Ultimately, it’s all trees.
• As soon as you say “first, SUM = 0” you are hosed. Accumulators are BAD for parallelism. Note that foldl and foldr, though functional, are fundamentally accumulative.
• If you say, “process subproblems in order,” you lose.
• The great tricks of the sequential past DON’T WORK.
• The programming idioms that have become second nature to us as everyday tools DON’T WORK.
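To see the foldl complaint concretely, compare an accumulator loop with a treewise reduction over the same list (a Scheme sketch, mine; reduce-tree is a hypothetical name). Both compute the same answer for an associative op, but the fold’s dependence chain has length n, while the tree’s has depth log n:

  ;; Accumulator style: step i cannot start until step i-1 finishes.
  (define (fold-left op acc xs)
    (if (null? xs) acc (fold-left op (op acc (car xs)) (cdr xs))))

  ;; Treewise style: split, recurse on independent halves, combine.
  (define (take xs n) (if (= n 0) '() (cons (car xs) (take (cdr xs) (- n 1)))))
  (define (drop xs n) (if (= n 0) xs (drop (cdr xs) (- n 1))))
  (define (reduce-tree op id xs)
    (cond ((null? xs) id)
          ((null? (cdr xs)) (car xs))
          (else
            (let ((n (quotient (length xs) 2)))
              (op (reduce-tree op id (take xs n))
                  (reduce-tree op id (drop xs n)))))))

  (fold-left   + 0 '(1 2 3 4))  ; => 10, as (((0+1)+2)+3)+4
  (reduce-tree + 0 '(1 2 3 4))  ; => 10, as (1+2)+(3+4)

Of course, on a plain cons list the take/drop splitting is itself linear; the conc representation earlier in the talk is what makes the split cheap.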
70  The Parallel Future

• We need parallel strategies for problem decomposition, data structure design, and algorithmic organization:
  > The top-down view: Don’t split a problem into “the first” and “the rest.” Instead, split a problem into roughly equal pieces; recursively solve the subproblems, then combine the subsolutions.
  > The bottom-up view: Don’t create a null solution and then successively update it. Instead, map inputs independently to singleton solutions, then merge the subsolutions treewise.
  > Combining subsolutions is usually trickier than incremental update of a single solution.

71  MapReduce Is a Big Deal!

• Associative combining operators are a VERY BIG DEAL!
  > Google MapReduce requires that combining operators also be commutative.
  > There are ways around that.
• Inventing new combining operators is a very, very big deal.
  > Creative catamorphisms!
  > We need programming languages that encourage this.
  > We need assistance in proving them associative.

72  The Fully Engineered Story

In practice, there are many optimizations:

• Optimized representations of singleton lists.
• Use tree branching factors larger than 2. (Example: Rich Hickey’s Clojure is a JVM™-based Lisp that represents lists as 32-ary trees.)
• Use self-balancing trees (2-3, red-black, finger trees, ...).
• Use sequential techniques near the leaves.
• Have arrays at the leaves. Decide dynamically whether to process them sequentially or by parallel recursive subdivision.
• When iterating over an integer range, decide dynamically whether to process it sequentially or by parallel recursive subdivision.

Linear lists must be processed sequentially. A tree can be processed breadth-first (parallel) or depth-first (sequential), or both. Therefore we should use both parallel and sequential strategies, and perhaps derive one from the other.

73  Conclusion

• Programs and data structures organized according to linear problem decomposition principles can be hard to parallelize.
• Programs and data structures organized according to parallel problem decomposition principles are easily processed either in parallel or sequentially, according to available resources.
• This parallel strategy has costs and overheads. They will be reduced over time but will not disappear.
• In a world of parallel computers of wildly varying sizes, this is our only hope for program portability in the future.
• Better language design can encourage better parallel programming.

74  Get rid of cons!