Semi-Persistent Data Structures

Sylvain Conchon and Jean-Christophe Filliâtre
LRI, Univ Paris-Sud, CNRS, Orsay F-91405
INRIA Futurs, ProVal, Orsay F-91893

Abstract. A data structure is said to be persistent when any update operation returns a new structure without altering the old version. This paper introduces a new notion of persistence, called semi-persistence, where only ancestors of the most recent version can be accessed or updated. Making a data structure semi-persistent may improve its time and space complexity. This is of particular interest in backtracking algorithms manipulating persistent data structures, where this property is usually satisfied. We propose a proof system to statically check the valid use of semi-persistent data structures. It requires a few annotations from the user and then generates proof obligations that are automatically discharged by a dedicated decision procedure.

1 Introduction

A data structure is said to be persistent when any update operation returns a new structure without altering the old version. In purely applicative programming, data structures are automatically persistent [16]. Yet this notion is more general, and the exact meaning of persistent is observationally immutable. Driscoll et al. even proposed systematic techniques to make imperative data structures persistent [9]. In particular, they distinguish partial persistence, where all versions can be accessed but only the newest can be updated, from full persistence, where any version can be accessed or updated. In this paper, we study another notion of persistence, which we call semi-persistence.

One of the main interests of a persistent data structure shows up when it is used within a backtracking algorithm. Indeed, when we are back from a branch, there is no need to undo the modifications performed on the data structure: we simply use the old version, which persisted, and start a new branch. One can immediately notice that full persistence is not needed in this case, since we are reusing ancestors of the current version, but never siblings (in the sense of another version obtained from a common ancestor). We shall call semi-persistent a data structure where only ancestors of the newest version can be updated. Note that this notion is different from partial persistence, since we need to update ancestors, and not only to access them. A semi-persistent data structure can be more efficient than its fully persistent counterpart, both in time and space.

An algorithm using a semi-persistent data structure may be written as if it was operating on a fully persistent data structure, provided that we only backtrack to ancestor versions. Checking the correctness of a program involving a semi-persistent data structure amounts to showing that

– first, the data structure is correctly used;
– second, the data structure is correctly implemented.

This article only addresses the former point. Regarding the latter, we simply give examples of semi-persistent data structures. Proving the correctness of these implementations is out of the scope of this paper (see Section 5).

Our approach consists in annotating programs with user pre- and postconditions, which mainly amounts to expressing the validity of the successive versions of a semi-persistent data structure. By validity, we mean being an ancestor of the newest version. Then we compute a set of proof obligations which express the correctness of programs, using a weakest precondition-like calculus [8].
These obligations lie in a decidable logical fragment, for which we provide a sound and complete decision procedure. Thus we end up with an almost automatic way of checking the legal use of semi-persistent data structures.

Related work. To our knowledge, this notion of semi-persistence is new. However, several domains are somehow connected to our work, either because they are related to some kind of stack analysis, or because they provide decision procedures for reachability issues. First, works on escape analysis [12, 4] address the problem of stack-allocating values; one may think that semi-persistent versions that become invalid are precisely those which could be stack-allocated, but this is not the case, as illustrated by the function g of Section 3.5. Second, works on stack analysis to ensure memory safety [14, 18, 19] provide methods to check the consistent use of push and pop operations. However, these approaches are not precise enough to distinguish between two sibling versions (of a given semi-persistent data structure) which are both upper in the stack. Regarding the decidability of our proof obligations, our approach is similar to other works on reachability in linked data structures [15, 3, 17]. However, our logic is much simpler and we provide a specific decision procedure. Finally, we can mention Knuth's dancing links [13] as an example of a data structure specifically designed for backtracking algorithms; but it is still a traditional imperative solution where an explicit undo operation is performed in the main algorithm.

This paper is organized as follows. First, Section 2 gives examples of semi-persistent data structures and shows the benefits of semi-persistence with some benchmarks. Then our formalization of semi-persistence is presented in two steps: Section 3 introduces a small programming language to manipulate semi-persistent data structures, and Section 4 defines the proof system which checks the valid use of semi-persistent data structures. Section 5 concludes with possible extensions. A long version of this paper, including proofs, is available online [7].

2 Examples of Semi-Persistent Data Structures

We explain how to implement semi-persistent arrays, lists and hash tables, and we present benchmarks to show the benefits of semi-persistence.

Arrays. Semi-persistent arrays can be implemented by modifying the persistent arrays introduced by Baker [1]. The basic idea is to use an imperative array for the newest version of the persistent array and indirections for old versions. For instance, starting with an array a0 initialized with 0, and performing the successive updates a1 = set(a0, 1, 7), a2 = set(a1, 2, 8) and a3 = set(a2, 5, 3), we end up with the following situation: [figure omitted: a3 holds the imperative array, while a0, a1 and a2 are chains of indirections leading to it]. When accessing or updating an old version, e.g. a1, Baker's solution is to first perform a rerooting operation, which makes a1 point to the imperative array by reversing the linked list of indirections [figure omitted]. But if we know that we are not going to access a2 and a3 anymore, we can save this list reversal: all we need to do is to perform the assignments contained in this list [figure omitted]. Thus it is really easy to turn these persistent arrays into a semi-persistent data structure, which is more efficient since we save some pointer assignments. This example is investigated in more detail in [6].
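To make this concrete, here is a minimal OCaml sketch of semi-persistent arrays along these lines. The representation and the names (reroot, set, get) are ours, a simplified illustration of the idea rather than the implementation studied in [6].

    (* Semi-persistent arrays: a sketch, assuming the representation
       described above.  Diff (i, v, t) is a version whose contents agree
       with those of t except at index i, where its value is v; following
       the Diff chain always leads to the imperative array Arr. *)
    type 'a t = 'a data ref
    and 'a data =
      | Arr of 'a array            (* the newest version *)
      | Diff of int * 'a * 'a t    (* an older version *)

    let make n v : 'a t = ref (Arr (Array.make n v))

    (* Replay into the imperative array the assignments recorded on the
       chain starting at t; afterwards the array holds the contents of t.
       Unlike Baker's rerooting, no reversed indirection list is built:
       the versions between t and the array are simply abandoned. *)
    let rec reroot (t : 'a t) : 'a array =
      match !t with
      | Arr a -> a
      | Diff (i, v, t') ->
          let a = reroot t' in
          a.(i) <- v;
          a

    (* Update a version: backtrack to it, then branch off a successor. *)
    let set (t : 'a t) i v : 'a t =
      let a = reroot t in
      let old = a.(i) in
      a.(i) <- v;
      let n = ref (Arr a) in
      t := Diff (i, old, n);   (* t remains a valid ancestor of n *)
      n

    (* Non-destructive access to a valid (ancestor) version. *)
    let rec get (t : 'a t) i =
      match !t with
      | Arr a -> a.(i)
      | Diff (j, v, t') -> if i = j then v else get t' i

Note that after a set on an ancestor, the abandoned versions still physically point at the (now modified) imperative array; reading them would return wrong values. This is exactly the kind of illegal access that the static check of Section 4 is designed to rule out.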
Lists. As a second example, we consider an immutable data structure which we make semi-persistent. The simplest and most popular example is the list data structure. To make it semi-persistent, the idea is to reuse cons cells between successive conses to the same list. For instance, given a list l, the cons operation 1::l allocates a new memory block to store 1 and a pointer to l. A subsequent operation 2::l can then reuse the same memory block if the list is used in a semi-persistent way: we simply replace 1 by 2. To do this, we must maintain for each list the previous cons, if any; a sketch in OCaml follows.
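Here is a minimal OCaml sketch of such semi-persistent lists. The representation is our own illustration (the long version [7] contains the authors' actual code, which may differ): each list, including the empty one, records the last cons performed on top of it, so that a subsequent cons on the same list can overwrite that cell instead of allocating a new one.

    (* Semi-persistent lists: a sketch.  Each list remembers in [succ]
       the last cell built by consing onto it. *)
    type 'a t = {
      mutable node : 'a node;
      mutable succ : 'a t option;   (* last cons performed on this list *)
    }
    and 'a node = Nil | Cons of 'a * 'a t

    let nil () : 'a t = { node = Nil; succ = None }

    (* cons x l: reuse the cell of the previous cons on l, if any.  The
       list previously built on top of l becomes invalid, since its head
       is overwritten. *)
    let cons x (l : 'a t) : 'a t =
      match l.succ with
      | Some c -> c.node <- Cons (x, l); c.succ <- None; c
      | None ->
          let c = { node = Cons (x, l); succ = None } in
          l.succ <- Some c;
          c

    let head l = match l.node with Cons (x, _) -> x | Nil -> failwith "head"
    let tail l = match l.node with Cons (_, t) -> t | Nil -> failwith "tail"

With this representation, a backtracking algorithm that repeatedly conses onto the same ancestor allocates each cell only once, which is consistent with the memory figures reported in the benchmarks below.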
Hash Tables. Combining (semi-)persistent arrays with (semi-)persistent lists, one easily gets (semi-)persistent hash tables.

Benchmarks. We present some benchmarks to show the benefits of semi-persistence. Each of the previous three data structures has been implemented in Ocaml (the full code is available in the long version of this paper [7]). Each data structure is tested the same way and compared to its fully persistent counterpart. The test consists in simulating a backtracking algorithm with branching degree 4 and depth 6, operating on a single data structure. N successive update operations are performed on the data structure between two branching points. The following table gives timings for various values of N. The code was compiled with the Ocaml native-code compiler (ocamlopt -unsafe) on a dual core Pentium 2.13GHz processor running under Linux. The timings are given in seconds and correspond to CPU time obtained using the UNIX times system call.

    N                             200    1000    5000   10000
    persistent arrays             0.21   1.50   13.90    30.5
    semi-persistent arrays        0.18   1.10    7.59    17.3
    persistent lists              0.18   2.38   50.20   195.0
    semi-persistent lists         0.11   0.76    8.02    31.1
    persistent hash tables        0.24   2.15   19.30    43.1
    semi-persistent hash tables   0.22   1.51   11.20    28.2

As we can see, the speedup ratio is always greater than 1 and almost reaches 7 (for semi-persistent lists). Regarding memory consumption, we compared the total number of allocated bytes, as reported by Ocaml's garbage collector. For the tests corresponding to the last column (N = 10000), semi-persistent data structures always used much less memory than persistent ones: 3 times less for arrays, 575 times less for lists and 1.5 times less for hash tables. The dramatic ratio for lists is easily explained by the fact that our benchmark program reflects the best case regarding memory allocation (allocation in one branch is reused in other branches, which all have the same length).

3 Programming with Semi-Persistent Data Structures

This section introduces a small programming language to manipulate semi-persistent data structures. In order to keep it simple, we assume that we are operating on the successive versions of a single, statically allocated, data structure. Multiple data structures and dynamic allocation are discussed in Section 5.

3.1 Syntax

The syntax of our language is as follows:

    e ::= x | c | p | f e | let x = e in e | if e then e else e
    d ::= fun f (x : ι) = {φ} e {ψ}
    ι ::= semi | δ | bool

A program expression is either a variable (x), a constant (c), a pointer (p), a function call, a local variable introduced by a let binding, or a conditional. The set of function names f includes some primitive operations (introduced in the next section). A function definition d introduces a function f with exactly one argument x of type ι, a precondition φ, a body and a postcondition ψ. A type ι is either the type semi of the semi-persistent data structure, the type δ of the values it contains, or the type bool of booleans. The syntax of pre- and postconditions will be given later in Section 4. A program ∆ is a finite set of mutually recursive functions.

3.2 Primitive Operations

We may consider three kinds of operations on semi-persistent data structures:

– update operations, which backtrack to a given version and create a new successor, which becomes the newest version;
– destructive access operations, which backtrack to a given version, which becomes the newest version, and then access it;
– non-destructive access operations, which access an ancestor of the newest version, without modifying the data structure.

Since update and destructive access operations both need to backtrack, it is convenient to design a language based on the following three primitives:

– backtrack, which backtracks to a given version, making it the newest version;
– branch, which builds a new successor of a given version, assuming it is the newest version;
– acc, which accesses a given version, assuming it is a valid version.

Then update and destructive access operations can be rephrased in terms of the above primitives:

    upd e  = branch (backtrack e)
    dacc e = acc (backtrack e)

3.3 Operational Semantics

We equip our language with a small-step operational semantics, which is given in Figure 1. One step of reduction is written e1, S1 → e2, S2, where e1 and e2 are program expressions and S1 and S2 are states. A value v is either a constant c or a pointer p. Pointers represent versions of the semi-persistent data structure. A state S is a stack p1, . . . , pm of pointers, pm being the top of the stack. The semantics is straightforward, except for primitive operations. Primitive backtrack expects an argument pn designating a valid version of the data structure, that is, an element of the stack. Then all pointers on top of pn are popped from the stack and pn is the result of the operation. Primitive branch expects an argument pn being the top of the stack and pushes a new value p, which is also the result of the operation. Finally, primitive acc expects an argument pn designating a valid version, leaves the stack unchanged and returns some value for version pn, represented by A(pn). (We leave A uninterpreted since we are not interested in the values contained in the data structure.) Note that reduction of backtrack pn or acc pn is blocked whenever pn is not an element of S, which is precisely what we intend to prevent.

    E ::= [] | f E | let x = E in e | if E then e else e
    v ::= c | p
    S ::= p · · · p

    if true then e1 else e2, S   →  e1, S
    if false then e1 else e2, S  →  e2, S
    let x = v in e, S            →  e{x ← v}, S
    f v, S                       →  e{x ← v}, S    if fun f (x : ι) = {φ} e {ψ} ∈ ∆
    backtrack pn, p1 · · · pn pn+1 · · · pm  →  pn, p1 · · · pn
    branch pn, p1 · · · pn                   →  p, p1 · · · pn p    (p fresh)
    acc pn, p1 · · · pn pn+1 · · · pm        →  A(pn), p1 · · · pn pn+1 · · · pm
    E[e1], S1                    →  E[e2], S2    if e1, S1 → e2, S2 and E ≠ []

    Fig. 1. Operational Semantics
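As a concrete reading of Figure 1, here is a small OCaml model of the three primitives, where versions are integers and the state is the stack of Figure 1 represented as a list with the newest version first. This encoding, including the Blocked exception, is our own illustration and not part of the paper's formal development.

    (* A toy model of the primitives of Sections 3.2 and 3.3.  A blocked
       reduction is modelled by an exception. *)
    exception Blocked

    let counter = ref 0
    let fresh () = incr counter; !counter

    (* backtrack p: pop all versions above p; p becomes the newest one.
       Blocks if p is not in the stack, i.e. p is not a valid version. *)
    let backtrack p stack =
      let rec pop = function
        | q :: _ as s when q = p -> s
        | _ :: s -> pop s
        | [] -> raise Blocked
      in
      (p, pop stack)

    (* branch p: p must be the newest version; push and return a fresh
       successor. *)
    let branch p stack =
      match stack with
      | q :: _ when q = p -> let n = fresh () in (n, n :: stack)
      | _ -> raise Blocked

    (* acc p: p must be a valid version; the stack is unchanged.  We
       return p itself in place of the uninterpreted A(p). *)
    let acc p stack =
      if List.mem p stack then (p, stack) else raise Blocked

    (* The derived operations of Section 3.2. *)
    let upd p s = let (q, s) = backtrack p s in branch q s
    let dacc p s = let (q, s) = backtrack p s in acc q s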
3.4 Type System with Effect

We introduce a type system to characterize well-formed programs. Our language is simply typed and thus type-checking is immediate. Meanwhile, we infer the effect ǫ of each expression, as an element of the boolean lattice ({⊥, ⊤}, ∧, ∨). This boolean indicates whether the expression modifies the semi-persistent data structure (⊥ meaning no modification and ⊤ a modification). Effects will be used in the next section to simplify constraint generation. Each function is given a type τ, as follows:

    τ ::= (x : ι) →ǫ {φ} ι {ψ}

The argument is given a type and a name (x) since it is bound in both the precondition φ and the postcondition ψ. Type τ also indicates the latent effect ǫ of the function, which is the effect resulting from the function application. A typing environment Γ is a set of type assignments for variables (x : ι), constants (c : ι) and functions (f : τ). It is assumed to contain at least type declarations for the primitives, as follows:

    backtrack : (x : semi) →⊤ {φbacktrack} semi {ψbacktrack}
    branch    : (x : semi) →⊤ {φbranch} semi {ψbranch}
    acc       : (x : semi) →⊥ {φacc} δ {ψacc}

where the pre- and postconditions are given later. As expected, both backtrack and branch modify the semi-persistent data structure and thus have effect ⊤, while the non-destructive access acc has effect ⊥. Given a typing environment Γ, the judgment Γ ⊢ e : ι, ǫ means "e is a well-formed expression of type ι and effect ǫ" and the judgment Γ ⊢ d : τ means "d is a well-formed function definition of type τ". Typing rules are given in Figure 2. They assume judgments Γ ⊢ φ pre and Γ ⊢ ψ post ι for the well-formedness of pre- and postconditions respectively, to be defined later in Section 4.1. Note that there is no typing rule for pointers, to prevent their explicit use in programs.

    Const:  c : ι ∈ Γ
            ─────────────
            Γ ⊢ c : ι, ⊥

    Var:    x : ι ∈ Γ
            ─────────────
            Γ ⊢ x : ι, ⊥

    App:    f : (x : ι1) →ǫ2 {φ} ι2 {ψ} ∈ Γ     Γ ⊢ e : ι1, ǫ1
            ───────────────────────────────────────────────────
            Γ ⊢ f e : ι2, ǫ1 ∨ ǫ2

    Ite:    Γ ⊢ e1 : bool, ǫ1     Γ ⊢ e2 : ι, ǫ2     Γ ⊢ e3 : ι, ǫ3
            ────────────────────────────────────────────────────────
            Γ ⊢ if e1 then e2 else e3 : ι, ǫ1 ∨ ǫ2 ∨ ǫ3

    Let:    Γ ⊢ e1 : ι1, ǫ1     Γ, x : ι1 ⊢ e2 : ι2, ǫ2
            ─────────────────────────────────────────────
            Γ ⊢ let x = e1 in e2 : ι2, ǫ1 ∨ ǫ2

    Fun:    x : ι1 ⊢ φ pre     x : ι1 ⊢ ψ post ι2     Γ, x : ι1 ⊢ e : ι2, ǫ
            ────────────────────────────────────────────────────────────────
            Γ ⊢ fun f (x : ι1) = {φ} e {ψ} : (x : ι1) →ǫ {φ} ι2 {ψ}

    Fig. 2. Typing Rules

A program ∆ = d1, . . . , dn is well-typed if each function definition di can be given a type τi such that d1 : τ1, . . . , dn : τn ⊢ di : τi for each i. The types τi can easily be obtained by a fixpoint computation, starting with all latent effects set to ⊥, since effect inference is clearly a monotone function.

3.5 Examples

Let us consider the following two functions f and g:

    fun f x0 = {valid(x0)}        fun g x0 = {valid(x0)}
      let x1 = upd x0 in            let x1 = upd x0 in
      let x2 = upd x0 in            let x2 = upd x0 in
      acc x2                        acc x1

Each function expects a valid version x0 of the data structure as argument and successively builds two successors x1 and x2 of x0. Then f accesses x2, which is valid, and g accesses x1, which is illegal. Let us check this on the operational semantics. Let S be a state composed of a single pointer p. The reduction of f p in S runs as follows:

    f p, p  →  let x1 = upd p in let x2 = upd p in acc x2, p
            →  let x1 = p1 in let x2 = upd p in acc x2, p p1
            →  let x2 = upd p in acc x2, p p1
            →  let x2 = p2 in acc x2, p p2
            →  acc p2, p p2
            →  A(p2), p p2

and ends on the value A(p2). On the contrary, the reduction of g p in S blocks on g p, p → · · · → acc p1, p p2.
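These two behaviours can be replayed with the toy OCaml model given after Section 3.3 (again, our illustration only):

    (* f and g from Section 3.5, written against the toy primitives. *)
    let f x0 s = let (_x1, s) = upd x0 s in
                 let (x2, s) = upd x0 s in
                 acc x2 s

    let g x0 s = let (x1, s) = upd x0 s in
                 let (_x2, s) = upd x0 s in
                 acc x1 s

    let () =
      let p = fresh () in
      ignore (f p [p]);                  (* f reduces to a value *)
      begin
        try ignore (g p [p]); assert false
        with Blocked -> print_endline "g blocks, as expected"
      end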
4 Proof System

This section introduces a theory for semi-persistence and a proof system for this theory. First we define the syntax and semantics of logical annotations. Then we compute a set of constraints for each program expression, which is proved to express the correctness of the program with respect to semi-persistence. Finally we give a decision procedure to solve the constraints.

4.1 Theory of Semi-Persistence

The syntax of annotations is as follows:

    term           t ::= x | p | prev(t)
    atom           a ::= t = t | path(t, t)
    postcondition  ψ ::= a | ψ ∧ ψ
    precondition   φ ::= a | φ ∧ φ | ψ ⇒ φ | ∀x. φ

Terms are built from variables, pointers and a single function symbol prev. Atoms are built from equality and a single predicate symbol path. A postcondition ψ is restricted to a conjunction of atoms. A precondition is a formula φ built from atoms, conjunctions, implications and universal quantifications. A negative formula (i.e. one appearing on the left side of an implication) is restricted to a conjunction of atoms. We introduce two different syntactic categories ψ and φ for formulae, but one can notice that φ actually contains ψ. This syntactic restriction on formulae is justified later in Section 4.5, when introducing the decision procedure. In the remainder of the paper, a "formula" refers to the syntactic category φ. The substitution of a term t for a variable x in a formula φ is written φ{x ← t}. We denote by S(A) the set of all subterms of a set of atoms A. The typing of terms and formulae is straightforward, assuming that prev has signature semi → semi. Function postconditions may refer to the function result, represented by the variable ret. Formulae can only refer to variables of type semi (including the variable ret). We write Γ ⊢ φ to denote a well-formed formula φ in a typing environment Γ.

We now give the semantics of program annotations. The main idea is to express that a given version is valid if and only if it is an ancestor of the newest version. To illustrate this idea, consider the successive version trees for the sequence of declarations x1 = upd x0, x2 = upd x1, x3 = upd x1 and x4 = upd x0 [figure omitted: the newest version is pictured as a black node, other valid versions as white nodes and invalid ones as gray nodes]. The meaning of prev and path is to define the notion of ancestor: prev(x) is the immediate ancestor of x and path(x, y) holds whenever x is an ancestor of y. The corresponding theory can be axiomatized as follows:

Definition 1. The theory T is defined as the combination of the theory of equality and the following axioms:

    (A1) ∀x. path(x, x)
    (A2) ∀xy. path(x, prev(y)) ⇒ path(x, y)
    (A3) ∀xyz. path(x, y) ∧ path(y, z) ⇒ path(x, z)

We write |= φ if φ is valid in any model of T.

The three axioms (A1)–(A3) exactly define path as the reflexive transitive closure of prev⁻¹, since we consider validity in all models of T and therefore in those where path is the smallest relation satisfying axioms (A1)–(A3). Note that prev is a total function and that there is no notion of "root" in our logic. Thus a version always has an immediate ancestor, which may or may not be valid.
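Theory T is small enough to be replayed in a proof assistant. The following Lean 4 sketch is our own encoding (the paper's implementation uses the Ergo prover instead, see Section 4.6): it postulates (A1)–(A3) and derives two simple facts, the second being one of the proof obligations computed for f in Section 4.3 below (unused hypotheses omitted).

    -- A sketch of theory T (our own encoding).
    axiom V : Type                    -- versions
    axiom prev : V → V                -- immediate ancestor
    axiom path : V → V → Prop         -- "is an ancestor of"

    axiom A1 : ∀ x, path x x
    axiom A2 : ∀ x y, path x (prev y) → path x y
    axiom A3 : ∀ x y z, path x y → path y z → path x z

    -- prev(y) is an ancestor of y, by (A1) and (A2)
    example (y : V) : path (prev y) y := A2 (prev y) y (A1 (prev y))

    -- the second proof obligation of f (Section 4.3), as a lemma
    example (x0 x1 cur1 : V)
        (h1 : prev x1 = x0) (h2 : cur1 = x1) : path x0 cur1 := by
      subst h2
      apply A2
      rw [h1]
      exact A1 x0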
To account for the modification of the newest version as program execution progresses, we introduce a "mutable" variable cur to represent the newest version. This variable does not appear in programs: its scope is limited to annotations. The only way to modify its contents is to call the primitive operations backtrack and branch. We are now able to give the full type expressions for the three primitive operations:

    backtrack : (x : semi) →⊤ {path(x, cur)} semi {ret = x ∧ cur = x}
    branch    : (x : semi) →⊤ {cur = x} semi {ret = cur ∧ prev(cur) = x}
    acc       : (x : semi) →⊥ {path(x, cur)} δ {true}

As expected, effect ⊤ for the first two reflects the modification of cur. The validity of the function argument x is expressed as path(x, cur) in operations backtrack and acc. Note that acc has no postcondition (written true, which could stand for the tautology cur = cur) since we are not interested in the values contained in the data structure.

We are now able to define the judgements used in Section 3.4 for pre- and postconditions. We write Γ ⊢ φ pre as syntactic sugar for Γ, cur : semi ⊢ φ. Similarly, Γ ⊢ ψ post ι is syntactic sugar for Γ, cur : semi, ret : ι ⊢ ψ when the return type ι is semi, and for Γ, cur : semi ⊢ ψ otherwise. Note that since Γ only contains the function argument x in typing rule Fun, the function precondition may only refer to x and cur, and its postcondition to x, cur and ret.

4.2 Constraints

We now define the set of constraints associated to a given program. This is mostly a weakest precondition calculus, which is greatly simplified here since we have only one mutable variable (namely cur). For a program expression e and a formula φ, we write this weakest precondition C(e, φ). This is a formula expressing the conditions under which φ will hold after the evaluation of e. Note that ret may appear in φ, denoting the result of e, but does not appear in C(e, φ) anymore. For a function definition d, we write C(d) for the formula expressing its correctness, that is, the fact that the function precondition implies the weakest precondition obtained from the function postcondition, for any function argument and any initial value of cur. The definition of C(e, φ) is given in Figure 3.

    framef(φ) = φf{x ← ret} ∧ ∀ret′. ψf{ret ← ret′, x ← ret} ⇒ φ{ret ← ret′}
                    if f : (x : ι) →⊥ {φf} ι′ {ψf}
    framef(φ) = φf{x ← ret} ∧ ∀ret′ cur′. ψf{ret ← ret′, x ← ret, cur ← cur′}
                                           ⇒ φ{ret ← ret′, cur ← cur′}
                    if f : (x : ι) →⊤ {φf} ι′ {ψf}

    C(v, φ)                      = φ{ret ← v}
    C(if e1 then e2 else e3, φ)  = C(e1, C(e2, φ) ∧ C(e3, φ))
    C(let x = e1 in e2, φ)       = C(e1, C(e2, φ){x ← ret})
    C(f e1, φ)                   = C(e1, framef(φ))

    C(fun f (x : ι) = {φ} e {ψ}) = ∀x. ∀cur. φ ⇒ C(e, ψ)

    Fig. 3. Constraint Synthesis

This is a standard weakest precondition calculus, except for the conditional rule. Indeed, one would expect a rule such as

    C(if e1 then e2 else e3, φ) = C(e1, (ret = true ⇒ C(e2, φ)) ∧ (ret = false ⇒ C(e3, φ)))

but since φ cannot test the result of condition e1 (φ may only refer to variables of type semi), the conjunction above simplifies to C(e2, φ) ∧ C(e3, φ). The constraint synthesis for a function call, C(f e1, φ), is the only nontrivial case. It requires the precondition φf to hold and the postcondition ψf to imply the expected property φ. Universal quantification is used to introduce f's results and side effects. We use the effect in f's type to distinguish two cases: either the effect is ⊥, which means that cur is not modified, and thus we only quantify over f's result (hence we get for free the invariance of cur); or the effect is ⊤ and we quantify over an additional variable cur′ which stands for the new value of cur. To simplify this definition, we introduce a formula transformer framef(φ) which builds the appropriate postcondition for the argument e1.

4.3 Examples

Simple Example. Let us consider again the two functions f and g from Section 3.5, valid(x0) being now expressed as path(x0, cur). We compute the associated constraints for an empty postcondition true. The constraint C(f) is

    ∀x0. ∀cur. path(x0, cur) ⇒
      path(x0, cur) ∧
      ∀x1. ∀cur1. (prev(x1) = x0 ∧ cur1 = x1) ⇒
        path(x0, cur1) ∧
        ∀x2. ∀cur2. (prev(x2) = x0 ∧ cur2 = x2) ⇒
          path(x2, cur2) ∧ ∀ret. true ⇒ true

It can be split into three proof obligations, which are the following universally quantified sequents:

    path(x0, cur) ⊢ path(x0, cur)
    path(x0, cur), prev(x1) = x0, cur1 = x1 ⊢ path(x0, cur1)
    path(x0, cur), prev(x1) = x0, cur1 = x1, prev(x2) = x0, cur2 = x2 ⊢ path(x2, cur2)

All three hold in theory T, and thus f is correct.
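The constraint synthesis of Figure 3 is easy to implement directly. Below is an OCaml sketch over a deep embedding of the language of Section 3.1 and the annotation logic of Section 4.1. The AST, the spec record and all names are our own illustration; as Section 4.6 explains, the actual implementation goes through the Why tool rather than code like this.

    (* Deep embedding of terms, formulae and expressions.  Variables are
       strings; "ret" and "cur" are the distinguished variables. *)
    type term = Var of string | Prev of term

    type formula =
      | True
      | Eq of term * term
      | Path of term * term
      | And of formula * formula
      | Imp of formula * formula
      | Forall of string * formula

    type expr =
      | EVar of string
      | ELet of string * expr * expr
      | EIf of expr * expr * expr
      | EApp of string * expr               (* function call: f e *)

    (* Specification of a function: argument name, latent effect
       (true = modifies cur), precondition and postcondition. *)
    type spec = { arg : string; eff : bool; pre : formula; post : formula }

    let rec tsubst x t = function
      | Var y -> if y = x then t else Var y
      | Prev u -> Prev (tsubst x t u)

    (* Capture-naive substitution: enough for the closed specs used here. *)
    let rec subst x t = function
      | True -> True
      | Eq (u, v) -> Eq (tsubst x t u, tsubst x t v)
      | Path (u, v) -> Path (tsubst x t u, tsubst x t v)
      | And (f, g) -> And (subst x t f, subst x t g)
      | Imp (f, g) -> Imp (subst x t f, subst x t g)
      | Forall (y, f) -> if y = x then Forall (y, f) else Forall (y, subst x t f)

    (* frame_f(phi) of Figure 3.  The sequential substitutions below mimic
       the simultaneous substitutions of the paper: ret is renamed before
       the argument x is replaced by ret. *)
    let frame (sp : spec) (phi : formula) : formula =
      let pre = subst sp.arg (Var "ret") sp.pre in
      if sp.eff then
        let post = subst sp.arg (Var "ret")
                     (subst "ret" (Var "ret'") (subst "cur" (Var "cur'") sp.post)) in
        let phi' = subst "ret" (Var "ret'") (subst "cur" (Var "cur'") phi) in
        And (pre, Forall ("ret'", Forall ("cur'", Imp (post, phi'))))
      else
        let post = subst sp.arg (Var "ret") (subst "ret" (Var "ret'") sp.post) in
        let phi' = subst "ret" (Var "ret'") phi in
        And (pre, Forall ("ret'", Imp (post, phi')))

    (* C(e, phi) of Figure 3; [specs] maps function names to their specs. *)
    let rec wp specs e phi =
      match e with
      | EVar x -> subst "ret" (Var x) phi
      | EIf (e1, e2, e3) -> wp specs e1 (And (wp specs e2 phi, wp specs e3 phi))
      | ELet (x, e1, e2) -> wp specs e1 (subst x (Var "ret") (wp specs e2 phi))
      | EApp (f, e1) -> wp specs e1 (frame (List.assoc f specs) phi)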
Similarly, the constraint C(g) can be computed and split into three proof obligations. The first two are exactly the same as for f, but the third one is slightly different:

    path(x0, cur), prev(x1) = x0, cur1 = x1, prev(x2) = x0, cur2 = x2 ⊢ path(x1, cur2)

In that case, it does not hold in theory T.

Backtracking Example. As a more complex example, let us consider a backtracking algorithm. The pattern of a program performing backtracking on a persistent data structure is a recursive function bt looking like

    fun bt (x : semi) = . . . bt (upd x) . . . bt (upd x) . . .

Function bt takes a data structure x as argument and makes recursive calls on several successors of x. This is precisely a case where the data structure may be semi-persistent, as motivated in the introduction. To capture this pattern in our framework, we simply need to consider two successive calls bt (upd x), which can be written as follows:

    fun bt (x : semi) = let _ = bt (upd x) in bt (upd x)

Function bt obviously requires a precondition stating that x is a valid version of the semi-persistent data structure. This is not enough information to discharge the proof obligations: the second recursive call bt (upd x) requires x to be valid, which could possibly no longer be the case after the first recursive call. Therefore a postcondition for bt is needed to ensure the validity of x:

    fun bt (x : semi) = { path(x, cur) }
      let _ = bt (upd x) in bt (upd x)
    { path(x, cur) }

Then it is straightforward to check that the constraint C(bt) is valid in theory T.

4.4 Soundness

In the remainder of this section, we consider a program ∆ = d1, . . . , dn whose constraints are valid, that is, |= C(d1) ∧ · · · ∧ C(dn). We are going to show that the evaluation of this program will not block. For this purpose, we first introduce the notion of validity with respect to a state of the operational semantics:

Definition 2. A formula φ is valid in a state S = p1, . . . , pn, written S |= φ, if it is valid in any model M of T such that

    prev(pi+1) = pi   for all 1 ≤ i < n
    cur = pn

Then we show that this validity is preserved by the operational semantics. To do this, it is convenient to see the evaluation contexts as formula transformers, as follows:

    E                        E[φ]
    []                       φ
    let x = E1 in e2         E1[C(e2, φ){x ← ret}]
    if E1 then e2 else e3    E1[C(e2, φ) ∧ C(e3, φ)]
    f E1                     E1[framef(φ)]

There is a property of commutation between contexts for programs and contexts for formulae:

Lemma 1. S |= C(E[e], φ) if and only if S |= C(e, E[φ]).

We now want to prove preservation of validity, that is, if S |= C(e, φ) and e, S → e′, S′ then S′ |= C(e′, φ). Obviously, this does not hold for any state S, program e and formula φ. Indeed, if S ≡ p1 p2, e ≡ upd p1 and φ ≡ prev(p2) = p1, then C(e, φ) is

    path(p1, cur) ∧ ∀ret′ cur′. (prev(ret′) = p1 ∧ cur′ = ret′) ⇒ prev(p2) = p1

which holds in S. But S′ ≡ p1 p for a fresh p, e′ ≡ p, and C(e′, φ) is prev(p2) = p1, which does not hold in S′ (since p2 does not appear in S′ anymore). Fortunately, we are not interested in the preservation of C(e, φ) for any formula φ, but only for formulae which arise from function postconditions. As pointed out in Section 4.1, a function postcondition may only refer to x, cur and ret. Therefore we are only considering formulae C(e, φ) where x is the only free variable (cur and ret do not appear in formulae C(e, φ) anymore). This excludes the formula prev(p2) = p1 in the example above. We are now able to prove preservation of validity:
Lemma 2. Let S be a state, φ a formula and e a program expression. If S |= C(e, φ) and e, S → e′, S′, then S′ |= C(e′, φ).

Finally, we prove the following progress property:

Theorem 1. Let S be a state, φ a formula and e a program expression. If S |= C(e, φ) and e, S →* e′, S′ where e′, S′ cannot reduce further, then e′ is a value.

4.5 Decision Procedure

We now show that constraints are decidable and we give a decision procedure. First, we notice that any formula φ is equivalent to a conjunction of formulae of the form ∀x1. . . . ∀xn. a1 ∧ · · · ∧ am ⇒ a, where the ai's are atoms. This results from the syntactic restrictions on pre- and postconditions, together with the weakest precondition rules, which only use postconditions in negative positions. Therefore we simply need to decide whether a given atom is the consequence of other atoms. We denote by H⋆ the congruence closure of a set H of hypotheses {a1, . . . , am}. Obviously S(H⋆) = S(H) since no new term is created. H⋆ is finite and can be computed as a fixpoint.

Algorithm 1. For any atom a such that S({a}) ⊆ S(H), the following algorithm, decide(H, a), decides whether H |= a.

1. First we compute the congruence closure H⋆.
2. If a is of the form t1 = t2, we return true if t1 = t2 ∈ H⋆ and false otherwise.
3. If a is of the form path(t1, t2), we build a directed graph G whose nodes are the subterms of H⋆, as follows:
   (a) for each pair of nodes t and prev(t), we add an edge from prev(t) to t;
   (b) for each path(t1, t2) ∈ H⋆, we add an edge from t1 to t2;
   (c) for each t1 = t2 ∈ H⋆, we add two edges between t1 and t2.
4. Finally, we check whether there is a path from t1 to t2 in G.

Obviously this algorithm terminates, since H⋆ is finite and thus so is G. We now show soundness and completeness for this algorithm.

Theorem 2. decide(H, a) returns true if and only if H |= a.

Note: the restriction S({a}) ⊆ S(H) can easily be met by adding to H the equalities t = t for any subterm t of a; it was only introduced to simplify the proof above.
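Steps 3 and 4 of Algorithm 1 amount to building a graph and checking reachability. Here is an OCaml sketch of these two steps, assuming the congruence closure H⋆ has already been computed (by a standard congruence closure algorithm, omitted here) and is given as an explicit list of atoms over ground terms; the types and names are ours.

    (* decide(H, a) for path atoms, given the congruence closure hstar. *)
    type term = Const of string | Prev of term
    type atom = Eq of term * term | Path of term * term

    (* All subterms occurring in hstar: the nodes of the graph G. *)
    let subterms (hstar : atom list) : term list =
      let rec add acc t =
        let acc = if List.mem t acc then acc else t :: acc in
        match t with Const _ -> acc | Prev u -> add acc u
      in
      List.fold_left
        (fun acc (Eq (u, v) | Path (u, v)) -> add (add acc u) v)
        [] hstar

    (* Edges of G, following steps 3(a)-(c) of Algorithm 1. *)
    let edges (hstar : atom list) : (term * term) list =
      let nodes = subterms hstar in
      let prev_edges =
        List.filter_map
          (function Prev u as t when List.mem u nodes -> Some (t, u) | _ -> None)
          nodes
      and atom_edges =
        List.concat_map
          (function Path (u, v) -> [(u, v)] | Eq (u, v) -> [(u, v); (v, u)])
          hstar
      in
      prev_edges @ atom_edges

    (* Step 4: reachability in G by depth-first search.  Note that
       t1 = t2 yields true immediately, matching axiom (A1). *)
    let decide_path (hstar : atom list) (t1 : term) (t2 : term) : bool =
      let g = edges hstar in
      let rec dfs visited = function
        | [] -> false
        | t :: rest ->
            t = t2
            || (if List.mem t visited then dfs visited rest
                else
                  let succs =
                    List.filter_map (fun (a, b) -> if a = t then Some b else None) g
                  in
                  dfs (t :: visited) (succs @ rest))
      in
      dfs [] [t1]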
4.6 Implementation

We have implemented the whole framework of semi-persistence. The implementation relies on an existing proof obligation generator, Why [10]. This tool takes annotated first-order imperative programs as input and uses a traditional weakest precondition calculus to generate proof obligations. The language we use in this paper is actually a subset of Why's input language. We simply use the imperative aspect to make cur a mutable variable. Then the resulting proof obligations are exactly the same as those obtained by the constraint synthesis defined in Section 4.2. The Why tool outputs proof obligations in the native syntax of various existing provers. In particular, these formulas can be sent to Ergo [5], an automatic prover for first-order logic which combines congruence closure with various built-in decision procedures. We first simply axiomatized theory T using (A1)–(A3), which proved to be powerful enough to verify all examples from this paper and several other benchmark programs. Yet it is possibly incomplete (automatic theorem provers use heuristics to handle quantifiers in first-order logic). To achieve completeness, and to assess the results of Section 4.5, we also implemented theory T as a new built-in decision procedure in Ergo. Again we verified all the benchmark programs.

5 Conclusion

We have introduced the notion of semi-persistent data structures, where update operations are restricted to ancestors of the most recent version. Semi-persistent data structures may be more efficient than their fully persistent counterparts, and are of particular interest in the implementation of backtracking algorithms. We have proposed an almost automatic way of checking the legal use of semi-persistent data structures. It is based on light user annotations in programs, from which proof obligations are extracted and automatically discharged by a decision procedure.

There is a lot of remaining work to be done. First, the language introduced in Section 3, in which we check for legal use of semi-persistence, could be greatly enriched. Besides missing features such as polymorphism or recursive datatypes, it would be of particular interest to consider the simultaneous use of several semi-persistent data structures and the dynamic creation of semi-persistent data structures. Regarding the former, one would probably need to express disjointness of version subtrees, and thus to enrich the logical fragment used in annotations with disjunctions and negations; we may lose decidability of the logic, though. Regarding the latter, it would require expressing in the logic the freshness of allocated pointers, and maintaining the newest version of each data structure.

Another interesting direction would be to provide systematic techniques to make data structures semi-persistent, as previously done for persistence [9]. Clearly, what we did for lists could be extended to tree-based data structures. It would be even more interesting to formally verify semi-persistent data structure implementations, that is, to show that the contents of any ancestor of the version being updated is preserved. Since such implementations necessarily use imperative features (otherwise they would be fully persistent), proving their correctness requires verification techniques for imperative programs. This could be done for instance using verification tools such as Spec# [2] or Caduceus [11]. However, we would prefer verifying Ocaml code, as given in the long version of this paper [7] for instance, but unfortunately there is currently no tool to handle such code.

References

1. Henry G. Baker. Shallow binding makes functional arrays fast. SIGPLAN Notices, 26(8):145–147, 1991.
2. Mike Barnett, K. Rustan M. Leino, and Wolfram Schulte. The Spec# programming system: An overview. In CASSIS 2004, number 3362 in LNCS. Springer, 2004.
3. Michael Benedikt, Thomas W. Reps, and Shmuel Sagiv. A decidable logic for describing linked data structures. In European Symposium on Programming, pages 2–19, 1999.
4. Bruno Blanchet. Escape analysis: Correctness proof, implementation and experimental results. In Symposium on Principles of Programming Languages, pages 25–37, 1998.
5. Sylvain Conchon and Evelyne Contejean. Ergo: A decision procedure for program verification. http://ergo.lri.fr/.
6. Sylvain Conchon and Jean-Christophe Filliâtre. A persistent union-find data structure. In ACM SIGPLAN Workshop on ML, Freiburg, Germany, October 2007.
7. Sylvain Conchon and Jean-Christophe Filliâtre. Semi-persistent data structures. Research Report 1474, LRI, Université Paris-Sud, September 2007. http://www.lri.fr/~filliatr/ftp/publis/spds-rr.pdf.
8. Edsger W. Dijkstra. A Discipline of Programming. Series in Automatic Computation. Prentice Hall Int., 1976.
9. J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, 1989.
10. J.-C. Filliâtre. The Why verification tool. http://why.lri.fr/.
11. J.-C. Filliâtre and C. Marché. The Why/Krakatoa/Caduceus platform for deductive program verification (tool presentation). In Proceedings of CAV 2007, 2007. To appear.
12. John Hannan. A type-based analysis for stack allocation in functional languages. In SAS '95: Proceedings of the Second International Symposium on Static Analysis, pages 172–188, London, UK, 1995. Springer-Verlag.
13. D. E. Knuth. Dancing links. In Jim Davies, Bill Roscoe, and Jim Woodcock, editors, Millennial Perspectives in Computer Science, pages 187–214. Palgrave, 2000.
14. J. Gregory Morrisett, Karl Crary, Neal Glew, and David Walker. Stack-based typed assembly language. In Types in Compilation, pages 28–52, 1998.
15. Greg Nelson. Verifying reachability invariants of linked structures. In POPL '83: Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pages 38–47, New York, NY, USA, 1983. ACM Press.
16. Chris Okasaki. Purely Functional Data Structures. Cambridge University Press, 1998.
17. Silvio Ranise and Calogero Zarba. A theory of singly-linked lists and its extensible decision procedure. In SEFM '06: Proceedings of the Fourth IEEE International Conference on Software Engineering and Formal Methods, pages 206–215, Washington, DC, USA, 2006. IEEE Computer Society.
18. Frances Spalding and David Walker. Certifying compilation for a language with stack allocation. In LICS '05: Proceedings of the 20th Annual IEEE Symposium on Logic in Computer Science, pages 407–416, Washington, DC, USA, 2005. IEEE Computer Society.
19. Mads Tofte and Jean-Pierre Talpin. Implementation of the typed call-by-value lambda-calculus using a stack of regions. In Symposium on Principles of Programming Languages, pages 188–201, 1994.