EÆcient Hash-Consing of Recursive Types
Je rey Considine
jconsidi@cs.bu.edu
January 29, 2000

Abstract
EÆcient storage of types within a compiler is necessary to avoid large blowups in space during compilation. Recursive types in particular are important to consider, as naive representations of recursive types
may be arbitrarily larger than necessary through unfolding. Hash-consing has been used to eÆciently
store non-recursive types [7]. Deterministic nite automata techniques have been used to eÆciently
perform various operations on recursive types [4]. We present a new system for storing recursive types
combining hash-consing and deterministic nite automata techniques. The space requirements are linear
in the number of distinct types. Both update and lookup operations take polynomial time and linear
space and type equality can be checked in constant time once both types are in the system.

1 Introduction
A recent trend in compilers is the use of typed intermediate languages while compiling to generate safer and
more optimized code. One disadvantage to this approach is that saving type information can easily cause a
large blowup in the space used by a compiler. Types of simple expressions can blowup exponentially when
represented using a tree data structure. Recursive types allow arbitrary expansion through simple unfolding.
We use the idea of hash-consing to maintain a set of types in which all equivalent types share the same
representation. [7] reports sucessful applications of these techniques. We take the idea one step farther and
add DFA techniques to allow the same bene ts to be extended to recursive types.
DFA techniques have been used for various operations on recursive types [4]. We use them to minimize
recursive type representations and avoid duplicate representations. We also present a canonical ordering
algorithm to avoid the \isomorphic under permutations of states" clause in many presentations of DFA
minimization algorithms. Use of canonical orderings allows linear time DFA comparisons and simpli es
comparing recursive types.
Given a set S of types and a type  , our system allows eÆcient update and lookup operations to be
performed in two steps: a pre-processing step taking O(  2) time and a step actually accessing S using
O (  log S ) time. The space required to represent S is linear in the number of reachable distinct types so
it is asymptotically optimal in space.
j

j

j

j

j

j

This work was done during the period of October 1999 to December 1999. In the mean time, Laurent
Mauborgne defended his PhD thesis on representing sets of in nite trees [5] and submitted an excerpt [6] for
publication in ESOP 2000. The core algorithm for eÆciently constructing in nite trees is virtually identical
to the algorithm given here for representing recursive types, though the presentation is drastically di erent.
The key di erence between the two representations is the space utilization - ignoring logarithmic factors, the
size used per distinct type is constant in the representation presented here and dependent on the number of
mutually recursive trees in the representation of Mauborgne.

1

2 Recursive Types
2.1 Syntax
The common syntax of recursive types is
::= b   v: v
where b is a base type and v is a type variable. We generalize this syntax to include arbitrary type constructors, in addition to \ ".
Let B be a nite set of base types and C a nite set of type constructors. B and C are disjoint. Let V
be the set of type variables.
 ::= b c(; : : :) v:
v
where b B , c C , and v V .


j

!

j

j

!

j

2

2

j

j

2

2.2 Equivalence
Informally, we de ne two types to be equivalent if their corresponding labeled in nite trees are the same.
We will formally de ne the labeled in nite tree corresponding to a recursive type in 3.3.

3 In nite Trees
In describing in nite trees, we will use the notation that ~i, ~j , and ~k are strings of positive integers and i, j ,
and k are individual positive integers. Let P be the set of positive integers.  is the empty string and P  is

the set nof all strings of positive
o integers. is an overloaded string concatenation operator. Given I; J P ,
I J = ~i ~
j (~i
I ) (~
j
J) .


j

2

^

2

2

3.1 Trees

is a tree if the following are true:
is a non-empty set of strings built from positive integers 



~i
is pre x-closed - ~i; ~j ~i ~j T
T
.

De nition 3.1 (Trees)

1.
2.

T
T

T

8


2

!



; 

T



P

.

2









~i j
3. T is leftward-closed - ~i; j; k (j k) ~i k T
T
.
We call the strings in T the paths of T . Note that the rst two properties imply that  is a path of T . In
words,  identi es the root node of T and T is a potentially in nite set of nite paths.
De nition 3.2 (Subtrees) Let T be a tree and ~t a path in T . The subtree of T reached by ~t is de ned as
follows:

o
 n
~
subtree T ; ~
t = ~i
t ~i
T
8



!


2

!

j


2

2

Lemma 3.3 Let T be a tree and ~t a path of T . Then, subtree(T ; ~t) is a tree.
Proof Sketch:

If ~s is a witness that subtree(T ; ~t) is not a tree, ~t ~s is similarly a witness that T is not a tree.
De nition 3.4 (Degree) Let T be a tree. The degree of T is de ned as follows:
degree(T ) = T
P
In words, degree(T ) is the number of children of the root node of T .


j

2

\

j

3.2 Labeled Trees

De nition 3.5 (Labeled Trees) Let arity be a function from C to P returning the arity of its input and
let k be the maximum value returned by arity. We de ne labeled trees as follows - T = (T ; l) is a tree if the

following are true:

1. T is a tree.
2. l is a function from P  to B
((t
t((t

 8t

2

 8

2

)
T)

T

!
!

(l(t)
(l(t)

2
2

)
C)

B

[

such that

C

!
!

(degree(subtree(T ; t)) = 0)
(degree(subtree(T ; t)) = arity(l(t)))

De nition 3.6 (Labeled Subtrees) Let T = (T ; l) be a labeled tree and ~t a path of T . The labeled subtree

of

T

reached by ~t is de ned as follows:


labeledsubtree T ; ~
t

n

=

~i


j

~
t  ~i

o
2

T

; x:l ~
tx



Lemma 3.7 Let T = (T ; l) be a labeled tree and ~t a path of T . Then, labeledsubtree
De nition 3.8 (Labeled Paths) Let

follows:

T




~
t; x

~
t2T

j

De nition 3.9 (Labeled Tree Isomorphism) Let

and

T2

are isomorphic i

T1

= T2 and


;~
t

= (T ; l) be a labeled tree. The labeled paths of

T

( )=

paths

T

8~
t

~
t 2 T1

!

T1


^

is a labeled tree.
T

is de ned as

(x = l(t))

= (T1 ; l1 )and
= l2 ~t .

l1 ~
t

T2

= (T2 ; l2 ) be labeled trees.

T1

3.3 Labeled Tree Construction from Recursive Types
De nition 3.10 (unfold) Let  be a recursive type. We de ne the function unfold as follows:1

]if  = t:
unfold ( ) = [t:=t
otherwise
De nition 3.11 (rec unfold) Let  be a recursive type. We de ne the function rec unfold as follows:2
rec unfold ( ) =

8
b
>
>
<

if  = b B
if  = c(1 ; : : :)
if  = v V
unfold (t:rec unfold ()) if  = t:
2

(rec unfold (1); : : :)

c
>
> v

:

2

De nition 3.12 (Finite Trees of Recursive Types) Let  be a recursive type. We de ne the nite tree

of  as follows:

nite tree ( ) =

8
>
>
<
>
>
:

fg
[i

(i nite tree ( ))


i

fg

nite tree ()

if  B
if  = c(1 ; : : :)
if  V
if  = t:
2

[ fg

2

In words, nite tree ( ) is the nite set of all nite paths traversing  without unfolding. Later, a similar
but potentially in nite set will be described.


1 The common axiom t: = [t:=t] ,
2 As with unfold ,  = rec unfold .



 unfold ( ).
where 
= denotes equivalence, allows one to prove  =

3

Lemma 3.13 Let  be a recursive type. Then, nite tree ( ) is a tree.
Proof Sketch:

This lemma may be proven using structural induction over  .
Lemma 3.14 Let  be a recursive type. Then, nite tree ( ) nite tree (rec unfold ( )).
De nition 3.15 (In nite Trees of Recursive Types) Let  be a recursive type. Let  = rec unfold ( ),
the result of applying rec unfold to  i times. Let p = nite tree ( ). The in nite tree of  is de ned as
follows:
in nite tree ( ) = p
In words, in nite tree ( ) is the potentially in nite set of all nite paths traversing  unfolded an in nite
number of types. If  is not recursive, in nite tree ( ) = nite tree ( ), a nite set.
Lemma 3.16 Let  be a recursive type. Then, in nite tree ( ) is a tree.


i

i

i

i

[i

i

Proof Sketch:

This lemma follows from lemmas 3.13 and 3.14.

Lemma 3.17 Let  be a recursive type. Then, in nite tree ( ) is decidable.
Proof Sketch:

Note that ~t in nite tree ( ) i ~t nite tree (rec unfold j j ( )).
Let be a distinguished symbol which stands for an unde ned result.
De nition 3.18 (Finite Labeling of Recursive Types) Let  be a recursive type and x a path. We
de ne the nite labeling of  as follows:
8
b
if  = b B and x = 
>
>
<
nite
labeling
(
 )(x0 ) if  = c(1 ; : : :) and x = i x0
nite labeling ( )(x) =
if  V
>
>
:
nite labeling ()(x) if  = t:
De nition 3.19 (In nite Labeling of Recursive Types) Let  be a recursive type and x be a path.
The in nite labeling of  is de ned as follows:

in nite labeling ( )(x) = y if there exists i such that nite labeling (rec unfold ( ))(x) = y
otherwise
Note in nite labeling ( ) has an in nite domain i in nite tree ( ) is in nite.
Lemma 3.20 Let  be a recursive type. Then, in nite labeling ( ) is a well de ned and total computable
2

~
t

2

2


i

2

i

function.

Proof Sketch:
Note nite labeling ( )(x) = y implies nite labeling (rec unfold ( ))(x) = y and the number of applications of rec unfold necessary is bounded by the length of x.
Lemma 3.21 Let  be a recursive type. Let
T = (T ; l ) is a labeled tree.

T

= in nite tree ( ) and l = in nite labeling ( ). Then,

De nition 3.22 (Recursive Type Labeled In nite Trees) Let  be a recursive type. Let T = in nite tree ( )
and l = in nite labeling ( ). The labeled in nite tree corresponding to  is de ned to be = (T ; l).
T

4

4 Deterministic Finite Automata
Other work has been done using deterministic nite automata (DFA's) to eÆciently implement operations
such as subtyping [4]. We start with DFA's for processing strings and modify them to establish a correspondence with types in our system. We can then use adaptations of standard DFA algorithms to minimize the
size of the types and traditional graph algorithms to analyze relations between types.

4.1 Standard DFA's
Traditionally, a string processing DFA can be considered as a directed multi-graph with each edge labeled
with a member of the input alphabet. Each node corresponds to a state of the DFA and is labeled nal or
non- nal. Strictly speaking, each node has exactly one outgoing edge for each member of the input alphabet.
However, it is common practive to leave out edges to \error" states, non- nal states in which all edges are
self-loops, since once an error state is reached, it is impossible to reach a nal state [3].
When processing strings with standard DFA's, one node of the graph is designated as the start state of
the DFA. Input to the DFA is read one character at a time and the corresponding edge is followed. When
the end of input is reached, the sDFA accepts the input string i last state of the traversal is a nal state.
We avoid a formal de nition of DFA's which can be found in many textbooks on automata theory. Unless
otherwise speci ed, we follow the terminology of [3].
In contrast to the modi ed DFA's of 4.2, we will refer to standard DFA's as sDFA's.
4.2 Modi ed DFA's
Instead of using sDFA's, we use a modi ed DFA (mDFA) with a more general labeling scheme beyond
nal/non- nal. mDFA's accept strings built from members of 1; : : : ; k . Each state is labeled with the
identi er of either a type constructor or a base type. In 4.3, it will be seen that type constructor labels and
base type labels will be distinguishable by the number of outgoing edges. The presence of a sink state is
implied but not used; all missing edges go to this sink state.
String processing with mDFA's is similar to that with sDFA's. Traversal of the mDFA graph is the same
as with a DFA, but the output is di erent. If the last state reached is a sink state, the mDFA rejects.
Otherwise, it accepts and outputs the state's label.
De nition 4.1 (mDFA Output) Let A be an mDFA. The output of A is the set of pairs of strings accepted
by A and the associated output.
f

( ) = (~x; y)

output A

f

j

A

g

accepts ~x and A outputs y given ~x

g

4.3 mDFA Construction from Recursive Types
De nition 4.2 (mDFA/Recursive Type Equivalence) Let  be a recursive type and A be an mDFA.
Let be the labeled tree corresponding to type  . We de ne A and  to be equivalent i output(A) =
paths( ).
To convert a recursive type  into an mDFA A, the tree structure of  is essentially copied into an mDFA
graph and back references are added to replace bound type variables.
T

T

Lemma 4.3 (mDFA Construction from Recursive Types Algorithm) Let  be a recursive type. There
exists an algorithm running in O(j j log j j) time and O(j j) space outputting mDFA A such that A is equivalent to  and if A has n states, n 2 O(j j).
Proof Sketch:

A can be built using two traversals of the tree representation of  . The rst pass assigns states to
components of  and the second builds A.

5

The rst pass annotates the tree representation of  with state numbers in a bottom up fashion. Fresh
states are assigned to base types and applications of type constructors and passed back up the tree. Type
bindings are annotated with the state passed up by the body of binding and this state is passed up the tree
again. This pass takes O(  ) time. Bound type variables are not assigned states.
The second pass actually generates the adjacency list representation of the mDFA graph.  is traversed
in a depth rst manner passing a binding environment down to each child. When a base type is reached, an
appropriately labeled state with no outgoing edges is created. When an application of a type constructor
is reached, an appropriately labeled state with a numbered edge to each child is created; if the child was
not assigned a state in the rst pass, it is a bound type variable and looked up in the binding environment
(if it is not in the binding environment,  is invalid). When a type binding is reached, a binding of the
type variable to the assigned state is added to the environment that will be passed to the body of the type
binding. Again, no action is taken when a bound type variable is reached. This pass takes O(  log  ) time
using a balanced tree representation for the environment.
This process preserves the tree structure of  except when type variables are bound and used. Ignoring
uses of bound type variables for a moment, the semantics of a type binding give it the same structure as
its body so sharing the state of its body is semantically correct. Considering uses of bound type variables
again, the semantics of using a bound type variable are that it is equivalent to the inner most binding of
that type variable so using the state the type variable is currently bound too is also correct.
j j

j

j

j

j

4.4 Minimization of mDFA's

De nition 4.4 (mDFA Equivalence) Let A and B be mDFA's. Then, A and B are de ned to be equiv-

alent i

( ) = output(B ).

output A

Lemma 4.5 (mDFA Minimization Algorithm) Let A be an mDFA with n states. There exists an algorithm running in O(n log n) and O(n) space outputing an mDFA B such that A and B are equivalent and
B is minimal. That is, for all mDFAs C , if A and C are equivalent and C has m states, then n  m.
Proof Sketch:

An sDFA is usually minimized by calculating the equivalence relation or equivalence classes of states [8].3
Once the equivalence classes of an sDFA A are known, it is straightforward to create a minimal equivalent
sDFA B using a homomorphism mapping each equivalence class of A to a distinct state in B .
The rst approximation of the equivalence classes is a partitionning of the states according to whether
they are labeled nal or not. This approximation is re ned further by splitting equivalence classes according
to the states various inputs bring them to. If there are n states in A , at most n splits are necessary, and
there are many sDFA minimization algorithms running in O(n2) time. Most of these sDFA minimization
algorithms also run in O(n) space.4
Adapting these algorithms to mDFA's simply consists of replacing the original partitionning with one
according to the mDFA labels. There may be as many sets in the partition as there are type constructors and
base types, as opposed to merely two with sDFA's. This partitionning is exactly that produced by checking
0-distinguishability.5 Since this is the criteria used for proving the correctness of the original partitionning in
sDFA algorithms, the correctness proof for the sDFA algorithm can be trivially transformed into a correctness
proof for the mDFA algorithm. This gives many algorithms for transforming an mDFA A into a minimal
equivalent mDFA B in O(n2) time and O(n) space. Hopcroft's O(n log n) time algorithm can be transformed
in the same fashion.
j

j

3 A notable exception is the algorithm in [3] which calculates the di erence relation from which it is trivial to calculate the
equivalence relation.
4 The algorithm in [3] is an exception again since it uses O (n2 ) space to calculate the distinguishability relation.
5 A and B are k -distinguishable if there is a string of length k that is a counter example to A being equivalent to B .

6

4.5 Reconstruction of Recursive Types from mDFA's
Once  has been converted into an mDFA A and minimized to get another mDFA B , we may want to
reconstruct a minimal tree representation of a type  which is equivalent to  .
Lemma 4.6 (Reconstruction of Recursive Types from mDFA's) Let A be an mDFA with n states.
There exists an algorithm outputing a recursive type  such that A and  are equivalent,  is minimal, and
the algorithm runs in O(j j log n) and O(j j) space.
Proof Sketch:

The tree structure of  can be generated by a depth rst traversal of B allowing nodes to be visited more
than once. An environment containing all ancestors in the traversal is passed down the traversal to detect
back references. Subtrees and the set of all states referenced is passed back up the tree.
At each state visited, if the state is an ancestor of itself, the state is converted into a bound variable and
passed back up the tree with the singleton set of that state being the only referenced state. If the state is not
an ancestor of itself and has a base type label, the base type is passed back with an empty set of referenced
states. If the state is not an ancestor of itself and has a type constructor label, its children are generated
and their sets of free references are unioned together. The type constructor is applied to the child subtrees.
If the state is in the set of free references, a type binding is added and it is removed from the set of free
references. Both the type and set of free references are then passed back.
All set operations except union take O(log B ) time. Union takes O( B ) time, so O( B ) is spent at each
node of . The entire process takes O(  B ) time. In terms of  , this is O(  2) time.
 will have the same in nite tree structure as  , but each in nite branch will be terminated at the rst
node equivalent to a parent. The tree structure of  is provably minimal type since earlier termination
implies di erent structure from  and later termination is not minimal. The  bindings in  are determined
by the nodes at which  is terminated: the  bindings present are exactly the bindings used.
j

j

jj

j

j

j

j

j

j

j j

4.6 Graph Properties
It is interesting to consider some properties of the graph induced by the mDFA edges. We mention some
without proof but they should be intuitive.
If a state can reach itself, the corresponding type is recursive.
If two states can reach each other, their corresponding types are mutually recursive.
If one state can not reach a second state, then the type corresponding to the rst state may be analyzed
independently of the type corresponding to the second.






5 Hash-Consing
Historically, hash consing is a technique originally used in LISP to avoid duplication of lists. In LISP, list
structures are only created by the cons operation. By modifying cons with the help of hashing techniques,
no two invocations of cons would ever return distinct copies of the same data. An early example of this
technique is presented in [2]. While limiting the ability to modify lists generated in such a manner, this
technique allows greater space eÆciency and constant time equality checking [1].
We use similar ideas to represent a set of types S . The equivalence class c( ) of a type  is the set of all
types equivalent to  ; a set C of types is an equivalence class if C = c( ) for some type  . We think of S as
a nite collection of equivalence classes, where each equivalence class is uniquely represented by a canonical
member of the class and accessed through a unique handle. Internally, S is represented by
A mapping from base type identi ers to handles (natural numbers). Given a total order of base types,
this mapping accesible in O(S ) time. Typically there will be a xed number of base types so this will
be accessible in O(1) time.


7



A mapping from tuples of type constructors and the handles of their children to handles. Given a
total order of type constructors, there is a total order of these tuples and this mapping is accessible in
O (log S ) time.
A mapping from canonically ordered mDFA's to handles. These mDFA's have a slight modi cation
from the mDFA's discussed in 4.2: base type identi ers in labels are replaced with hash-cons handles.
Since the states of the mDFA's have a canonical order, the mDFA's can be given a meaningful total
order. For an mDFA A, this mapping is accessible in O( A log S ) time.
Semantically, each mDFA maps to the handle of its rst state and successive states are mapped to
successive handles.
A mapping inverting the union of the previous mappings. This mapping can be sorted by handle and
accessed in O(log S ) time.
j



j

j



j

j

j

j

j

5.1 Hash-Consed Recursive Types
When hash-consing recursive types, two modi cations from the syntax given in 2.1 are used. First, handles
of hash-consed types are added to the allowed syntax for recursive types. Second, free type variables are
replaced with extra base types to allow uniform treatment of type variables in 5.4.
5.2 Hash-Consing Base Types

Lemma 5.1 There is an algorithm to hash-cons base types in O(log jS j) time. If the set of base types is
nite and xed, there is an algorithm to hash-cons base types in O(1) time.
Proof Sketch:

Base types are a trivial case. Since base types are unrelated to other types, the data structures of S can
be updated in O(log S ) time. If the set of base types is xed and the base types are hash-consed ahead of
time, this improves to O(1) time.
j

j

5.3 Hash-Consing Type Constructor Applications
Lemma 5.2 Let  be a recursive type without any type bindings or bound type variables. There is an
algorithm to hash-cons  in O(j j log jS j) time.
Proof Sketch:

can be hash-consed in a bottom up manner using O(log S ) time per subtree. The total time is
( log S ).


O j j

j

j

j

j

5.4 Hash-Consing Recursive Types
Lemma 5.3 Let  be a recursive type. Let h be the last hash-cons handle6 mentionned in  . Let m be the
number of states of the mDFA refered to by h or 0 if h is not recursive or there are no hash-cons handles in
 . There is an algorithm to hash-cons  in O (j j2 + j jm + j j log jS j) time.
Proof Sketch:

It is useful to separate the subtrees of  which have no free variables and consider them independently.
When calculating sets of free variables in the traditional bottom up fashion, any subtree of  with no free
variables can be considered independently of the rest of  . These subtrees can be hash-consed and replaced
with their handle leaving a smaller but equivalent version of  to examine.
6 Hash-cons

handles are ordered in the same way as natural numbers and allocated squentially in increasing order.

8

Suppose  is such a subtree of  . If 's children have no free variables, then  can be hash-consed
normally as discussed in 5.2 and 5.3. Otherwise,  is a recursive type and the subtrees of  that have not
been hash-consed represent a set of mutually recursive types. This is easily veri ed by induction over the
structure of  by noting that each such subtree has at least one free variable which must refer to an ancestor
in the tree.
 is converted into an mDFA7 A and minimized with a canonical ordering to get mDFA B . All states
in B not labeled with handles correspond to types that are mutually recursive with each other. However,
it is not known yet whether states remaining to be hash-consed are mutually recursive with any previously
hash-consed states.8
Checking all states of hash-consed mDFA's represented by handles in  is prohibitively expensive: there
are O( B ) possible handles, O( B ) states that need to be checked for each possible form of mutual recursion,
and the size m of the largest mDFA represented by a handle is arbitrarily large. In practice, these checks
should be prunable but this is still O( B 2 m) time. By noting that only the mDFA represented by the
greatest handle in  can be mutually recursive with the states of B ,9 this can be brought down to O( B m)
time.
If  is mutually recursive with a previously hash-consed state, then the certi cate (a successful traversal
mapping  to previously hash-cons handles) gives the handle of . Otherwise hash-consing  and B 's states
remaining to be hash-consed in O(  log S ) time.
Once  is separated from  , O(  m +  log S ) time is spent hash-consing . Separating  into subtrees
takes O(  2) time and the total time hash-consing  is O(  2 +  m +  log S ). When  has no hash-cons
handles, this is O(  2 +  log S ) and O(  ) space.
j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

j

Theorem 5.4 (Recursive Type Hash-Consing Algorithm) Let  be a recursive type with no references
to hash-consed types. There is an algorithm to hash-cons  in O(j j2 + j j log jS j) time and O(j j) space.
Proof Sketch:

This theorem follows from lemmas 5.1, 5.1, and 5.3.

5.5 Hashing
Hashing techniques have been ignored so far but can be used to speed up the average case of various
lookups. [7] reports great success with using hashing in their hash-consing scheme. If the base types and
type constructors are xed before hand, the mappings from hash-cons entry to handle can be split according
to base type or type constructor and the correct mapping can be chosen in constant time.

6 Conclusion
We have presented an eÆcient system for managing types. Key features of this system include asymptotically
optimal space usage and constant time type equality tests. While the component ideas of hash-consing and
representing recursive types as DFA's have both been used before, we believe that their combination to
support recursive types while hash-consing is new.
An earlier form of this system has been implemented in SML. All changes since this implementation have
been both simpli cations and algorithmic improvements. Later, we plan to adapt this system for use in the
Church Project SML compiler.
7  is converted into a variant of mDFA's using hash-cons handles in place of base type identi ers
8 A recursive type may be de ned in terms of a previous de nition of itself.
9 The manner in which handles are assigned to mDFA's embeds the partial order induced by mDFA

9

dependencies.

7 Acknowledgements
This report was written to summarize some work done in a Directed Study with my advisor, Assaf Kfoury.
The original motivation for work on this problem was given by Joe Wells. Allyn Dimock was helpful in
explaining the context of the Church compiler into which this work will be integrated.

A DFA Canonical Ordering
De nition A.1 (DFA Canonical Ordering) A function f : DF A ! DF A gives a canonical ordering to

mDFA's i forall mDFAs A and B , A and B are equivalent i
applies to all DFA variants we have discussed.

( ) and f (B ) are identical.10 This de nition

f A

Lemma A.2 (DFA Canonical Ordering Algorithm) Given an mDFA A with n states, there is an algorithm running in O(n2 ) time and O(n) space outputing B such that B is minimal and B = f (A) for some
canonical ordering f .
Proof Sketch:

We present an algorithm to minimize a DFA to produce an minimum isomorphic DFA with a canonical
state ordering. The algorithm given is for standard DFA's, but is applicable to all DFA variants we have
mentionned. The algorithm constructs a canonically ordered DFA in a manner similar to the way it is
minimized - the main distinction from DFA minimization is that the partitions approximating the equivalence
relation are ordered internally.
First, the sets of the original partition are totally ordered, placing the set of nal sets before the set of
non- nal sets. As sets are split according to the previous partitionning, the subsets are ordered according to
the sets that split them. Details of this splitting and ordering process follow.
Approximations of the equivalence relation are represented as lists of sets of states. Given such a list L,
the n states are given a partial ordering by mapping all states in the ith set in L to i. States are compared
by comparing their associated values. The mapping can be created in O(n) time and accessed in O(1) time
using an array indexed by state number.
An element S of L is split if the members of S do not have their children mapped to the same sets. That
is, each element s of S is characterized by the values its children are mapped to. The characterization of an
element s can be determined in O(1) time since there is a constant upper bound on the number of children.
Checking whether S needs to be split (i.e. checking whether some of the characterizations are di erent) can
be done in O( S ) time. If S needs to be split, it can be split in O(n) time using pigeon-hole sort. When s
is split, the subsets are ordered according to the children that split them. Splits based on di erent children
are done separately in a consistent order.
The total number of approximations is at most n since there are at most n non-isomorphic states and
each approximation must add at least one split. The total time checking an approximation for splits is O(n)
time so the total time checking for splits is O(n2). The total time performing splits is O(n2) since there are
at most n splits and each split takes O(n) time. Therefore, the total time is O(n2).
The proof of this algorithm's correctness is similar to that of the corresponding minimization which does
not order the sets. The kth approximation separates states that are k-distinguishable. It is trivial to verify
that the states are ordered by the rst string distinguishing them (note states distinguished by shorter strings
are already ordered).11
j

j

References
[1] John Allen. Anatomy of LISP. McGraw-Hill Book Company, New York, 1978.
10 Here,
11 Here,

we say two DFAs are identical if their internal representations including labeling are identical.
we order strings primarily by length and secondarily by lexicographic order.

10

[2] E. Goto. Monocopy and Associative Algorithms in an Extended Lisp. University of Tokyo, Japan, May
1974.
[3] John E. Hopcroft, Je rey D. Ullman. Introduction to Automata Theory, Languages, and Computation.
Addison-Wesley Publishing Company, Inc., Reading, Massachuesetts, 1979.
[4] Dexter Kozen, Jens Palsberg, Michael I. Schwartzbach. EÆcient Recursive Subtyping. In Mathematical
Structures in Computer Science, 5(1):113-125, 1995.
[5] Laurent Mauborgne. Representation of Sets of Trees for Abstract Interpretation.
http://www.di.ens.fr/ mauborgn/t.ps.gz, November 1999.
[6] Laurent Mauborgne. Improving the Representation of In nite Trees to Deal with Sets of Trees. to appear
in ESOP 2000.
[7] Zhong Shao, Christopher League, and Stefan Monnier. Implementing Typed Intermediate Languages. In
Proc. 1998 ACM SIGPLAN International Conference on Functional Programming (ICFP'98), Baltimore,
Maryland, pages 313-323, September 1998.
[8] Bruce W. Watson. A taxonomy of nite automata minimization algorithms. Computing Science Report
93/44, Faculty of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven,
The Netherlands, 1993.

11