A comprehensive study of Convergent and Commutative Replicated Data Types Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski To cite this version: Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. [Research Report] RR-7506, Inria – Centre ParisRocquencourt; INRIA. 2011, pp.50. HAL Id: inria-00555588 https://hal.inria.fr/inria-00555588 Submitted on 13 Jan 2011 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE A comprehensive study of Convergent and Commutative Replicated Data Types Marc Shapiro, INRIA & LIP6, Paris, France Nuno Preguiça, CITI, Universidade Nova de Lisboa, Portugal Carlos Baquero, Universidade do Minho, Portugal Marek Zawirski, INRIA & UPMC, Paris, France N° 7506 Janvier 2011 ISSN 0249-6399 apport de recherche ISRN INRIA/RR--7506--FR+ENG Thème COM A comprehensive study of Convergent and Commutative Replicated Data Types ∗ Marc Shapiro, INRIA & LIP6, Paris, France Nuno Preguiça, CITI, Universidade Nova de Lisboa, Portugal Carlos Baquero, Universidade do Minho, Portugal Marek Zawirski, INRIA & UPMC, Paris, France Thème COM — Systèmes communicants Projet Regal Rapport de recherche n° 7506 — Janvier 2011 — 47 pages Abstract: Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs, montonic DAGs, and sequences. It discusses some properties needed to implement non-trivial CRDTs. Key-words: Data replication, optimistic replication, commutative operations ∗ This research was supported in part by ANR project ConcoRDanT (ANR-10-BLAN 0208), and a Google Research Award 2009. Marek Zawirski is a recipient of the Google Europe Fellowship in Distributed Computing, and this research is supported in part by this Google Fellowship. Carlos Baquero is partially supported by FCT project Castor (PTDC/EIA-EIA/104022/2008). Unité de recherche INRIA Rocquencourt Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex (France) Téléphone : +33 1 39 63 55 11 — Télécopie : +33 1 39 63 53 30 Étude approfondie des types de données répliqués convergents et commutatifs Résumé : La cohérence à terme vise à assurer que les répliques d’un objet partagé modifiable convergent sans synchronisation à priori. Les approches antérieures du problème sont ad-hoc et sujettes à erreur. Nous proposons une approche basée sur des principes formels : baser la conception des types de données sur des propriétés mathématiques simples, suffisantes pour garantir la cohérence à terme. Nous appelons ces types de données des CRDT (Convergent/Commutative Replicated Data Types). Ce papier fournit formalise la réplication asynchrone, qu’elle soit basée sur l’état ou sur les opérations, et fournit une condition suffisante adaptée à chacun de ces cas. Il décrit plusieurs CRDT utiles, dont des contenants permettant les opérations add et remove avec une sémantique propre, et des types de données plus complexes comme les graphes, les graphes acycliques monotones, et les séquences. Il contient une discussion de propriétés dont on a besoin pour mettre en œuvre des CRDT non triviaux. Mots-clés : Réplication des données, réplication optimiste, opérations commutatives A comprehensive study of CRDTs 1 3 Introduction Replication is a fundamental concept of distributed systems, well studied by the distributed algorithms community. Much work focuses on maintaining a global total order of operations [24] even in the presence of faults [8]. However, the associated serialisation bottleneck negatively impacts performance and scalability, while the CAP theorem [13] imposes a tradeoff between consistency and partition-tolerance. An alternative approach, eventual consistency or optimistic replication, is attractive to practioners [37, 41]. A replica may execute an operation without synchronising a priori with other replicas. The operation is sent asynchronously to other replicas; every replica eventually applies all updates, possibly in different orders. A background consensus algorithm reconciles any conflicting updates [4, 40]. This approach ensures that data remains available despite network partitions. It performs well (as the consensus bottleneck has been moved off the critical path), and the weaker consistency is considered acceptable for some classes of applications. However, reconciliation is generally complex. There is little theoretical guidance on how to design a correct optimistic system, and ad-hoc approaches have proven brittle and error-prone.1 In this paper, we study a simple, theoretically sound approach to eventual consistency. We propose the concept of a convergent or commutative replicated data type (CRDT), for which some simple mathematical properties ensure eventual consistency. A trivial example of a CRDT is a replicated counter, which converges because the increment and decrement operations commute (assuming no overflow). Provably, replicas of any CRDT converge to a common state that is equivalent to some correct sequential execution. As a CRDT requires no synchronisation, an update executes immediately, unaffected by network latency, faults, or disconnection. It is extremely scalable and is fault-tolerant, and does not require much mechanism. Application areas may include computation in delay-tolerant networks, latency tolerance in wide-area networks, disconnected operation, churn-tolerant peer-to-peer computing, data aggregation, and partition-tolerant cloud computing. Since, by design, a CRDT does not use consensus, the approach has strong limitations; nonetheless, some interesting and non-trivial CRDTs are known to exist. For instance, we previously published Treedoc, a sequence CRDT designed for co-operative text editing [32]. Previously, only a handful of CRDTs were known. The objective of this paper is to push the envelope, studying the principles of CRDTs, and presenting a comprehensive portfolio of useful CRDT designs, including variations on registers, counters, sets, graphs, and sequences. We expect them to be of interest to practitioners and theoreticians alike. Some of our designs suffer from unbounded growth; collecting the garbage requires a weak form of synchronisation [25]. However, its liveness is not essential, as it is an optimisation, off the critical path, and not in the public interface. In the future, we plan to extend the approach to data types where common-case, time-critical operations are commutative, 1 The anomalies of the Amazon Shopping Cart are a well-known example [10]. RR n° 7506 4 Shapiro, Preguiça, Baquero, Zawirski and rare operations require synchronisation but can be delayed to periods when the network is well connected. This concurs with Brewer’s suggestion for side-stepping the CAP impossibility [6]. It is also similar to the shopping cart design of Alvaro et al. [1], where updates commute, but check-out requires coordination. However, this extension is out of the scope of the present study. In the literature, the preferred consistency criterion is linearisability [18]. However, linearisability requires consensus in general. Therefore, we settle for the much weaker quiescent consistency [17, Section 3.3]. One challenge is to minimise “anomalies,” i.e., states that would not be observed in a sequential execution. Note also that CRDTs are weaker than non-blocking constructs, which are generally based on a hardware consensus primitive [17]. Some of the ideas presented here paper are already known in the folklore. The contributions of this paper include: • In Section 2: (i) An specification language suited to asynchronous replication. (ii) A formalisation of state-based and operation-based replication. (iii) Two sufficient conditions for eventual consistency. • In Section 3, an comprehensive collection of useful data type designs, starting with counters and registers. We focus on container types (sets and maps) supporting both add and remove operations with clean semantics, and more complex derived types, such as graphs, monotonic DAGs, and sequence. • In Section 4, a study of the problem of garbage-collecting meta-data. • In Section 5, exercising some of our CRDTs in a practical example, the shopping cart. • A comparison with previous work, in Section 6. Section 7 concludes with a summary of lessons learned, and perspectives for future work. 2 Background and system model We consider a distributed system consisting of processes interconnected by an asynchronous network. The network can partition and recover, and nodes can operate in disconnected mode for some time. A process may crash and recover; its memory survives crashes. We assume non-byzantine behaviour. 2.1 Atoms and objects A process may store atoms and objects. An atom is a base immutable data type, identified by its literal content. Atoms can be copied between processes; atoms are equal if they have the same content. Atom types considered in this paper include integers, strings, sets, tuples, INRIA A comprehensive study of CRDTs 5 x x1 x2 x3 x3 123 3.14159 -99 A a b c add (a) add (b) add (c) add (b) Figure 2: Grow-only Set: G-Set A a b c R add (a) add (b) remove (a) add (c) add (b) add (a) Figure 3: 2P-Set Figure 1: Object etc., with their usual non-mutating operations. Atom types are written in lower case, e.g., “set.” An object is a mutable, replicated data type. Object types are capitalised, e.g., “Set.” An object has an identity, a content (called its payload), which may be any number of atoms or objects, an initial state, and an interface consisting of operations. Two objects having the same identity but located in different processes are called replicas of one another. As an example, Figure 1 depicts a logical object x, its replicas at processes 1, 2 and 3, and the current state of the payload of replica 3. We assume that objects are independent and do not consider transactions. Therefore, without loss of generality, we focus on a single object at a time, and use the words process and replica interchangeably. 2.2 Operations The environment consists of unspecified clients that query and modify object state by calling operations in its interface, against a replica of their choice called the source replica. A query executes locally, i.e., entirely at one replica. An update has two phases: first, the client calls the operation at the source, which may perform some initial processing. Then, the update is transmitted asynchronously to all replicas; this is the downstream part. The literature [37] distinguishes the state-based and operation-based (op-based for short) styles, explained next. RR n° 7506 6 Shapiro, Preguiça, Baquero, Zawirski Specification 1 Outline of a state-based object specification. Preconditions, arguments, return values and statements are optional. 1: payload Payload type; instantiated at all replicas 2: initial Initial value 3: query Query (arguments) : returns 4: pre Precondition 5: let Evaluate synchronously, no side effects 6: update Source-local operation (arguments) : returns 7: pre Precondition 8: let Evaluate at source, synchronously 9: Side-effects at source to execute synchronously 10: compare (value1, value2) : boolean b 11: Is value1 ≤ value2 in semilattice? 12: merge (value1, value2) : payload mergedValue 13: LUB merge of value1 and value2, at any replica source f(x1) x x1 merge S M g(x2) S x2 merge x3 merge M M Figure 4: State-based replication 4 x1 := 1 x x1 0 max G+A 4 M 4 x2 := 4 x2 0 max x3 0 4 4 G+A M 1 4 4 M 4 4 max 4 0 1 4 Figure 5: Example CvRDT: integer + max INRIA A comprehensive study of CRDTs 2.2.1 7 State-based replication In state-based (or passive) replication, an update occurs entirely at the source, then propagates by transmitting the modified payload between replicas, as illustrated in Figure 4. We specify state-based object types as shown in Specification 1. Keyword payload indicates the payload type, and initial specifies its initial value at every replica. Keyword update indicates an update operation, and query a query. Both may have (optional) arguments and return values. Non-mutating statements are marked let, and payload is mutated by assignment :=. An operation executes atomically. To capture safety, an operation is enabled only if a given source pre-condition (marked pre in a specification) holds in the source’s current state. The source pre-condition is omitted if always enabled, e.g., incrementing or decrementing a Counter. Conversely, non-null preconditions may be necessary, for instance an element can be removed from a Set only if it is in the Set at the source. The system transmits state between arbitrary pairs of replicas, in order to propagate changes. This updates the payload of the receiver with the output of operation merge, invoked with two arguments, the local payload state and the received state. Operation compare compares replica states, as will be explained shortly. We define the causal history [38] C of replicas of some object x as follows:2 Definition 2.1 (Causal History — state-based). For any replica xi of x: • Initially, C(xi ) = ∅. • After executing update operation f , C(f (xi )) = C(xi ) ∪ {f }. • After executing merge against states xi , xj , C(merge(xi , xj )) = C(xi ) ∪ C(xj ). The classical happens-before [24] relation between operations can be defined as f → g ⇔ C(f ) ⊂ C(g). Liveness requires that any update eventually reaches the causal history of every replica. To this effect, we assume an underlying system that transmits states between pairs of replicas at unspecified times, infinitely often, and that replica communication forms a connected graph. 2.2.2 Operation-based (op-based) objects In operation-based (or active) replication, the system transmits operations, as illustrated in Figure 6. This style is specified as outlined in Spec. 2. The payload and initial clauses are identical to the state-based specifications. An operation that does not mutate the state is marked query and executes entirely at a single replica. An update is specified by keyword update. Its first phase, marked atSource, is local to the source replica. It is enabled only if its (optional) source pre-condition, marked pre, is 2 C is a logical function, it is not part of the object. RR n° 7506 8 Shapiro, Preguiça, Baquero, Zawirski Specification 2 Outline of operation-based object specification. Preconditions, return values and statements are optional. 1: payload Payload type; instantiated at all replicas 2: initial Initial value 3: query Source-local operation (arguments) : returns 4: pre Precondition 5: let Execute at source, synchronously, no side effects 6: update Global update (arguments) : returns 7: atSource (arguments) : returns 8: pre Precondition at source 9: let 1st phase: synchronous, at source, no side effects 10: downstream (arguments passed downstream) 11: pre Precondition against downstream state 12: 2nd phase, asynchronous, side-effects to downstream state f(x1) x x1 x2 g(x1) D S g(x2) f(x2) S D g(x3) x3 D f(x3) D Figure 6: Operation-Based Replication INRIA A comprehensive study of CRDTs 9 true in the source state; it executes atomically. It takes its arguments from the operation invocation; it is not allowed to make side effects; it may compute results, returned to the caller, and/or prepare arguments for the second phase. The second phase, marked downstream, executes after the source-local phase; immediately at the source, and asynchronously, at all other replicas; it can not return results. It executes only if its downstream precondition is true. It updates the downstream state; its arguments are those prepared by the source-local phase. It executes atomically. As above, we define the causal history of a replica C(xi ). Definition 2.2 (Causal History — op-based). The causal history of a replica xi is defined as follows. • Initially, C(xi ) = ∅. • After executing the downstream phase of operation f at replica xi , C(f (xi )) = C(xi ) ∪ {f }. Liveness requires that every update eventually reaches the causal history of every replica. To this effect, we assume an underlying system reliable broadcast that delivers every update to every replica in an order t : (e, t) ∈ A ∧ (e, t′ ) ∈ / R). Since it is based on LWW, this data type is convergent. 3.3.4 PN-Set Yet another variation is to associate a counter to each element, initially 0. Adding an element increments the associated counter, and removing an element decrements it. The element is considered in the set if its counter is strictly positive. An actual use-case is Logoot-Undo [43], a (totally-ordered) set of elements for text editing. However, as noted earlier (Section 3.1.3), a CRDT counter can go positive or negative; adding an element whose counter is already negative has no effect. Consider the following example, illustrated in Figure 13. Initially, our PN-Set is empty. Replica 1 performs add(e); 6 Due to Hyun-Gul Roh [private communication]. INRIA A comprehensive study of CRDTs 25 Specification 14 Molli, Weiss, Skaf Set 1: payload set S = {(element, count), . . .} 2: initial E × {0} 3: query lookup (element e) : boolean b 4: let b = ((e, k) ∈ S ∧ k > 0) ⊲ set of pairs ⊲ Initialise all counts to 0 5: update add (element e) 6: atSource (e) : integer j 7: if ∃(e, k) ∈ S : k ≤ 0 then 8: let j = |k| + 1 9: else 10: let j = 1 11: 12: 13: 14: 15: 16: 17: 18: ⊲ j: increment downstream (e, j) let k′ : (e, k′ ) ∈ S S := S \ {(e, k ′ )} ∪ {(e, k′ + j)} update remove (element e) atSource (e) pre lookup(e) downstream (e) S := S \ {(e, k ′ )} ∪ {(E, k ′ − 1)} {} x x1 G+A {} rmv (a) 0 1 G+A add(a) 1 A {a} add (a) {a} rmv(a) {} A add(a) G+A 1 G+A A 0 Figure 13: PN-Set (op-based) RR n° 7506 2 G+A G+A x2 x3 add(a) -1 0 {} 26 Shapiro, Preguiça, Baquero, Zawirski Specification 15 Op-based Observed-Remove Set (OR-Set) 1: payload set S 2: initial ∅ 3: query lookup (element e) : boolean b 4: let b = (∃u : (e, u) ∈ S) ⊲ set of pairs { (element e, unique-tag u), . . . } 5: update add (element e) 6: atSource (e) 7: let α = unique() 8: 9: 10: 11: 12: 13: 14: 15: 16: ⊲ unique() returns a unique value downstream (e, α) S := S ∪ {(e, α)} update remove (element e) atSource (e) pre lookup(e) let R = {(e, u)|∃u : (e, u) ∈ S} downstream (R) pre ∀(e, u) ∈ R : add(e, u) has been delivered ⊲ U-Set precondition; causal order suffices S := S \ R ⊲ Downstream: remove pairs observed at source element e has a count of 1. The operation propagates to Replica 3. Now Replicas 1 and 3 both concurrently execute remove(e); after Replica 3 applies both operations, e has a count of −1. A subsequent add(e) has no effect: thus, after adding an element to an empty “set” it remains empty! For some applications, this may be the intended semantics. for instance, in an inventory, a negative count may account for goods in transit. In others, this may be considered a bug. Although the semantics are strange, PN-Set converges; thus if Replica 2 concurrent executes add(e) all replicas converge to state {e}. An alternative construction due to Molli, Weiss and Skaf [private communication] is presented in Specification 14. To avoid the above add anomaly, add increments a negative count of k by |k| + 1; however this presents other anomalies, for instance where remove has no effect. Both these constructs are CRDTs because they combine two CRDTS, a Set and a Counter. 3.3.5 Observed-Remove Set (OR-Set) The preceding Set constructs have practical applications, but are somewhat counter-intuitive. In 2P-Set (Section 3.3.2), a removed element can never be added again; in LWW-Set (Figure 8) the outcome of concurrent updates depends on opaque details of how timestamps are allocated. INRIA A comprehensive study of CRDTs 27 add(a) {aα} rmv (a) {} {} {} S S add(aβ) {aβ} D add(aβ) S rmv (aα) add(aα) {} D add(aβ) D D {aβ} {aβ, aα} {aβ} Figure 14: Observed-Remove Set (op-based) We present here the Observed-Removed Set (OR-Set), which supports adding and removing elements and is easily understandable. The outcome of a sequence of adds and removes depends only on its causal history and conforms to the sequential specification of a set. In the case of concurrent add and remove of the same element, add has precedence (in contrast to 2P-Set). The intuition is to tag each added element uniquely, without exposing the unique tags in the interface. When removing an element, all associated unique tags observed at the source replica are removed, and only those. Spec. 15 is op-based. The payload consists of a set of pairs (element, unique-identifier). A lookup(e) extracts element e from the pairs. Operation add(e) generates a unique identifier in the source replica, which is then propagated to downstream replicas, which insert the pair into their payload. Two add(e) generate two unique pairs, but lookup masks the duplicates. When a client calls remove(e) at some source, the set of unique tags associated with e at the source is recorded. Downstream, all such pairs are removed from the local payload. Thus, when remove(e) happens-after any number of add(e), all duplicate pairs are removed, and the element is not in the set any more, as expected intuitively. When add(e) is concurrent with remove(e), the add takes precedence, as the unique tag generated by add cannot be observed by remove. This behaviour is illustrated in Figure 14. The two add(a) operations generate unique tags α and β. The remove(a) called at the top replica translates to removing (a, α) downstream. The add called at the second replica is concurrent to the remove of the first one, therefore (a, β) remains in the final state. OR-Set is a CRDT. Concurrent adds commute since each one is unique. Concurrent removes commute because any common pairs have the same effect, and any disjoint pairs have independent effects. Concurrent add(e) and remove(f ) also commute: if e 6= f they are independent, and if e = f the remove has no effect. We leave the corresponding state-based specification as an exercise for the reader. Since every add is effectively unique, a state-based implementation could be based on U-Set. RR n° 7506 28 Shapiro, Preguiça, Baquero, Zawirski y y x replica 1 u v x w replica 1 u v w z z t t y y x replica 2 u v x w replica 2 z u v w z t t Figure 15: Maintaining strong properties in a graph (counter-example). Left: initial state and update (dashed edges removed, dotted edges added); right: final state. Specification 16 2P2P-Graph (op-based) 1: payload set VA, VR, EA, ER 2: 3: initial ∅, ∅, ∅, ∅ ⊲ V : vertices; E: edges; A: added; R: removed 4: query lookup (vertex v) : boolean b 5: let b = (v ∈ (VA \ VR)) 6: query lookup (edge (u, v)) : boolean b 7: let b = (lookup(u) ∧ lookup(v) ∧ (u, v) ∈ (EA \ ER)) 8: update addVertex (vertex w) 9: atSource (w) 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: downstream (w) VA := VA ∪ {w} update addEdge (vertex u, vertex v) atSource (u, v) pre lookup(u) ∧ lookup(v) downstream (u, v) EA := EA ∪ {(u, v)} update removeVertex (vertex w) atSource (w) pre lookup(w) pre ∀(u, v) ∈ (EA \ ER) : u 6= w ∧ v 6= w downstream (w) pre addVertex(w) delivered VR := VR ∪ {w} update removeEdge (edge (u, v)) atSource ((u, v)) pre lookup((u, v)) downstream (u, v) pre addEdge(u, v) delivered ER := ER ∪ {(u, v)} ⊲ Graph precondition: E ⊆ V × V ⊲ 2P-Set precondition ⊲ Graph precondition: E ⊆ V × V ⊲ 2P-Set precondition ⊲ 2P-Set precondition ⊲ 2P-Set precondition INRIA A comprehensive study of CRDTs 29 I α I α ⊣ ⊢ N β R γ I δ A ε ⊣ ⊢ N β I δ A ε R γ Figure 16: Monotonic DAG. Left: Greek letters indicate vertex identifiers; roman letters are characters in a text-editing application. Right: Remove OK only if paths are maintained. Dashed: removed; dotted: added. 3.4 Graphs A graph is a pair of sets (V, E) (called vertices and edges respectively) such that E ⊆ V × V . Any of the Set implementations described above can be used for to V and E. Because of the invariant E ⊆ V × V , operations on vertices and edges are not independent. An edge may be added only if the corresponding vertices exist; conversely, a vertex may be removed only if it supports no edge. What should happen upon concurrent addEdge(u, v) kd removeVertex(u)? We see three possibilities: (i) Give precedence to removeVertex(u): all edges to or from u are removed as a side effect. This it is easy to implement, by using tombstones for removed vertices. (ii) Give precedence to addEdge(u, v): if either u or v has been removed, it is restored. This semantics is more complex. (iii) removeVertex(u) is delayed until all concurrent addEdge operations have executed. This requires synchronisation. Therefore, we choose Option (i). Our Spec. 16 uses a 2P-Set for vertices (in order to have tombstones) an another for edges (since they are not unique). A 2P2P-Graph is the combination of two 2P-Sets; as we showed, the dependencies between them are resolved by causal delivery. Dependencies between addEdge and removeEdge, and between addVertex and removeVertex are resolved as in 2P-Set. Therefore, this construct is a CRDT. RR n° 7506 30 Shapiro, Preguiça, Baquero, Zawirski Specification 17 Add-only Monotonic DAG (op-based) 1: payload set V , set E 2: initial {⊢, ⊣}, {(⊢, ⊣)} 3: query lookup (vertex v) : boolean b 4: let b = (v ∈ V ) ⊲ V : vertices; E: edges ⊲ Initialised with two sentinels and single edge. 5: query lookup (edge (u, v)) : boolean b 6: let b = ((u, v) ∈ E) 7: query path (edge (u, v)) : boolean b 8: let b = (∃w1 , . . . , wm ∈ V : w1 = u ∧ wm = v ∧ (∀j : (wj , wj+1 ) ∈ E)) 9: update addEdge (vertex u, vertex v) 10: atSource (u, v) 11: pre lookup(u) ∧ lookup(v) 12: pre path(u, v) 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: ⊲ Graph precondition ⊲ Monotonic-DAG condition downstream (u, v) pre lookup(u) ∧ lookup(v) E := E ∪ {(u, v)} update addBetween (vertex u, v, w) atSource (u, v, w) pre v is unique pre lookup(u) ∧ lookup(w) pre path(u, w) downstream (u, w, v) pre lookup(u) ∧ lookup(w) V := V ∪ {v} E := E ∪ {(u, v), (v, w)} replica 1 replica 2 ⊲ Graph precondition ⊲ Graph precondition ⊲ Monotonic-DAG condition ⊲ Graph precondition ⊢ w x ⊣ ⊢ w x ⊣ Figure 17: Monotonic DAG: remove is not live. Dashed: removed; dotted: added. INRIA A comprehensive study of CRDTs 3.4.1 31 Add-only monotonic DAG In general, maintaining a particular shape, such as a tree or a DAG, cannot be done by a CRDT.7 Such a global invariant cannot be determined locally; maintaining it requires synchronisation. Figure 15 presents two counter-examples. Replicated graph u, v contains no edge. A client adds edge (u, v) at Replica 1; concurrently another client adds (v, u) at Replica 2. Each of these maintains the DAG shape, but when the changes at Replica 2 propagate to Replica 1, the graph is cyclic. Similarly, initially the graph w, x, y, z, t form a replicated tree. Clients at Replicas 1 and 2 add and remove edges as indicated in the figure, maintaining the tree shape. However, after propagation, the graph is cyclic. However, some stronger forms of acyclicity are implied by local properties, for instance a monotonic DAG, in which an edge may be added only if it oriented in the same direction as an existing path.8 That is, the new edge can only strengthen the partial order defined by the DAG; it follows that the graph remains acyclic. Specification 17 specifies an Add-Only Monotonic DAG, illustrated in Figure 16 (left). The DAG is initialised with left and right sentinels ⊢ and ⊣ and edge (⊢, ⊣). The only operation for adding a vertex is addBetween in order to maintain the DAG property. The first operation must be addBetween(⊢, ⊣). Add-only Monotonic DAG is a CRDT, because concurrent addEdge (resp. addBetween) either concern different edges (resp. vertices) in which case they are independent, or the same edge (resp. vertex), in which case the execution is idempotent. Generalising monotonic DAG to removals proves problematic. It should be OK to remove an edge (expressed as a precondition on removeEdge) as long as this does not disrupt paths between distinct vertices. Namely, if there exists a path from u to v, and w 6= u, v, then a path should remain after removing (x, w) or (w, x), whatever x ∈ V . A client could satisfy it by creating an alternative path if necessary, e.g., by calling addEdge(u, v) before removing (u, w), as illustrated in Figure 16 (right). Unfortunately, this is not live, as illustrated by the scenario of Figure 17. Here, a client adds a vertex around w, removes the edges to and from w, and finally removes w. Concurrently, another client (at another source replica) does the same with x. When the former operations propagate, the downstream precondition of addEdge is false at Replica 2, and, consequently the downstream precondition of removeVertex can never be satisfied; and vice-versa. 3.4.2 Add-Remove Partial Order data type The above issues with vertex removal do not occur if we consider a Partial Order data type rather than a DAG. Since a partial order is transitive, implicitly all alternate paths exist; 7 Unless of course the graph has the required shape for some other reason. For instance, a 2P2P-Graph could record causal dependence between events in a distributed system, which is acyclic. 8 It is inspired by WOOT, a CRDT for concurrent editing [30]. RR n° 7506 32 Shapiro, Preguiça, Baquero, Zawirski Specification 18 Add-Remove Partial Order 1: payload set VA, VR, E 2: initial {⊢, ⊣}, ∅, {(⊢, ⊣)} 3: query lookup (vertex v) : boolean b 4: let b = (v ∈ VA \ VR) ⊲ V : vertices; E: edges; A: added, R: removed ⊲ Edge between left and right sentinels 5: query before (vertex u,v) : boolean b 6: pre lookup(u) ∧ lookup(v) 7: let b = (∃w1 , . . . , wm ∈ VA : w1 = u ∧ wm = v ∧ (∀j : (wj , wj+1 ) ∈ E)) 8: ⊲ Removed vertices are considered too 9: update addBetween (vertex u, v, w) 10: atSource (u, v, w) 11: pre w is unique 12: pre before(u, w) ⊲ Monotonic-DAG precondition 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: downstream (u, v, w) pre u ∈ VA ∧ v ∈ VA VA := VA ∪ {v} E := E ∪ {(u, v), (v, w)} update remove (vertex v) atSource (v) pre lookup(v) pre v 6= ⊢ ∧ v 6= ⊣ downstream (v) VR := VR ∪ {v} ⊲ 2P-Set precondition ⊲ May not remove sentinels INRIA A comprehensive study of CRDTs 33 L ⊢ 40.3 I N 00.1 10.1 ’ A R I 31.3 30.3 ⊣ 20.2 40.2 Figure 18: Replicated Growable Array (RGA) thus the problematic precondition on vertex removal is not necessary. For the representation, we use a minimal DAG and compute transitive relations on the fly (operation before). To ensure transitivity, a removed vertex is retained as a tombstone.9 Thus, Spec. 18 uses a 2P-Set for vertices, and a G-Set for edges. We manage vertices as a 2P-Set. Concurrent addBetweens are either independent or idempotent. Any dependence between addBetween and remove is resolved by causal delivery. Thus this data type is a CRDT. 3.5 Co-operative text editing Peer-to-peer co-operative text editing is a particularly interesting use case of an add-remove order. A text document is a sequence of text elements (characters, strings, XML tags, embedded graphics, etc.). Users sharing a document repeatedly insert a text element (addBetween) or remove one (remove). Using a CRDT for this ensures that concurrent edits never conflict and converge, even for users who remain disconnected from the network for long periods, as long as they eventually reconnect. Thus, the WOOT data structure for concurrent editing corresponds directly to the Add-Remove Partial Order of Specification 18. A Partial Order presents a difficulty, as text is normally sequential, but two concurrent inserts at the same position remain unordered. A total order, or sequence, does not have this drawback, and in addition can be implemented much more efficiently. A sequence for text editing (or just sequence hereafter) is a totally-ordered set of elements, each composed of a unique identifier and an atom (e.g., a character, a string, an XML tag, or an embedded graphic), supporting operations to add an element at some position, and to remove an element.10 We now study two different sequence designs. Such a sequence is a CRDT because it a subclass of add-remove total order. 9 We do not include operations addEdge or removeEdge because it is not clear what semantics would be reasonable. 10 Note that despite the superficial similarity, a sequence cannot implement a queue or stack, as the latter support atomic pop operations. RR n° 7506 34 Shapiro, Preguiça, Baquero, Zawirski Specification 19 Replicated Growable Array (RGA). Represented as a 2P-Set of vertices in a linked list. A vertex is a pair (atom, timestamp). Timestamps are unique, positive, and increase consistently with causality. 1: payload set VA, VR, E 2: 3: let ⊢ = (⊥, −1) 4: let ⊣ = (⊥, 0) 5: initial {⊢, ⊣}, ∅, {(⊢, ⊣)} ⊲ VA, VR: 2P-set of vertices; E: edges ⊲ Vertex = (atom, timestamp) ⊲ Initially, a single edge (⊢, ⊣) 6: query lookup (vertex v) : boolean b 7: let b = (v ∈ VA \ VR) 8: query before (vertex u, vertex v) : boolean b 9: pre lookup(u) ∧ lookup(v) 10: let b = (∃w1 , . . . , wm ∈ VA : w1 = u ∧ wm = v ∧ ∀j : (wj , wj+1 ) ∈ E) 11: query successor (vertex u) : vertex v 12: pre lookup(u) 13: let v ∈ VA : (u, v) ∈ E 14: query decompose (vertex u) : atom a, timestamp t 15: let a, t : u = (a, t) 16: update addRight (vertex u, atom a) : vertex w 17: atSource (u, a) : w 18: pre u ∈ VA \ (VR ∪ {⊣}) 19: let t = now() 20: let w = (a, t) 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: downstream (u, w) pre u ∈ VA let a, t = decompose(w) l, r := u, successor(u) b := true while b do let a′ , t′ = decompose(r) if t < t′ then l, r := r, successor(r) else E := E \ (l, r) ∪ {(l, w), (w, r)} b := false update remove (vertex w) atSource (w) pre lookup(w) downstream (w) pre addRight(_, w) delivered VR := VR ∪ {w} ⊲ Decompose u into atom, timestamp ⊲ Graph precondition ⊲ Unique timestamp ⊲ Graph precondition ⊲p=u ⊲ Find an edge (l, r) within which to splice w ⊲ Right position, wrong order ⊲ Iterate ⊲ r = ⊣ ∨ t > t′ ⊲ 2P-Set precondition ⊲ 2P-Set precondition INRIA A comprehensive study of CRDTs 3.5.1 35 Replicated Growable Array (RGA) The Replicated Growing Array (RGA), due to Roh et al. [35] implements a sequence as a linked list (a linear graph), as illustrated in Figure 18. It supports operations addRight(v, a), to add an element containing atom a immediately after element v. An element’s identifier is a timestamp, assumed unique and ordered consistently with causality, i.e., if two calls to now return t and t′ , then if the former happened-before the latter, then t < t′ [24]. If a client inserts twice at the same position, as in “addRight(v, a); addRight(v, b)” the latter insert occurs to the left of the former, and has a higher timestamp. Accordingly, two downstream inserts at the same position are ordered in opposite order of their timestamps. As in AddRemove Partial Order, removing a vertex leaves a tombstone, in order to accommodate a concurrent add operation. For example, in Figure 18, timestamps are represented as a pair (local-clock.client-UID). Client 3 added character I at time 30, then R at time 31, to the right of N. Clients 2 and 3 concurrently (at time 40) inserted an L and an apostrophe to the right of the beginning-oftext marker ⊢. As noted above, RGA is a CRDT because it is a subclass of Add-Remove Partial Order. 3.5.2 Continuous sequence An alternative approach to maintaining a mutable sequence is to place its elements in the continuum. Spec. 20 specifies a sequence based on identifying elements in a dense identifier space such as R, i.e., where a unique identifier can always be allocated between any two given identifiers. Adding an element assigns it an appropriate identifier; identifiers are unique and totally ordered (and unrelated by causality). As noted above, this data structure is a CRDT because it is a subclass of Add-Remove Partial Order. More directly, concurrent adds commute because they occur at different positions in the continuum. Adding and deleting different elements commute because they are independent operations. Adding an element precedes removing it, and they will be applied downstream in that order, by the U-Set assumption of causal delivery. Its performance depends crucially on the implementation of identifiers and of allocateIdentifierBetween. Using real numbers would certainly be possible but costly. Identifier tree Instead, we represent the continuum using a tree. The first element is allocated at the root. Thereafter, it is always possible to create a new leaf e between any two nodes n and m, either to the right of n or to the left of m. To allocate a node e to the right of a node n: (i) If n has a right sibling m′ ≤ m and there exists a free unique tag m′′ such that m < m′′ < m′ , allocate e as m′′ .11 11 As tags are integers, there is not an infinite supply of unique free tags between two given tags. RR n° 7506 36 Shapiro, Preguiça, Baquero, Zawirski Specification 20 Mutable sequence based on the continuum 1: payload set S 2: initial ∅ 3: query lookup (element e) : boolean b 4: let b = (e ∈ S) ⊲ U-Set of (X, identifier) pairs; X: some type 5: query decompose (element e) : X x, identifier i 6: let x, i : e = (x, i) 7: query before (element e, element e′ ) : boolean b 8: pre lookup(e) ∧ lookup(e′ ) 9: let x, i = decompose(e) 10: let x′ , i′ = decompose(e′ ) 11: let b = (i < i′ ) 12: query allocateIdentifierBetween (identifier i, j) : identifier k 13: pre i < j 14: let k : i < k < j and k unique 15: update addBetween (element e, X b, element e′ ) : element f 16: atSource (e, b, e′ ) : f 17: pre lookup(e) ∧ lookup(e′ ) 18: pre before(e, e′ ) 19: let x, i = decompose(e) 20: let x′ , i′ = decompose(e′ ) 21: let f = (b, allocateIdentifierBetween(i, i′ )) 22: 23: 24: 25: 26: 27: 28: 29: downstream (f ) S := S ∪ {f } update remove (element e) atSource (e) pre e ∈ S downstream (e) pre add(e) delivered S := S \ {e} ⊲ U-Set precondition ⊲ U-Set precondition INRIA A comprehensive study of CRDTs 37 (ii) Otherwise, if n has no right child, allocate e as the right child of n. (iii) Otherwise, let n′ be the leftmost descendant of n’s right child; clearly, n < n′ . Recursively, allocate e to the left of n′ . Allocating to the left of m is symmetric, substituting left for right and vice-versa. Identifiers A node identifier is a (possibly empty) sequence of pairs (d1 , u1 )•. . . •(dm , um ), one per level in the tree. At each level, dj indicates the direction (0 for left child, 1 for right child), and uj is a unique integer tag. The root node has the empty identifier. A child of some node n has identifier m = n • (d, u). Siblings are ordered by their relative identifiers; thus siblings m = n • (d, u) and m′ = n • (d′ , u′ ) compare as m < m′ ⇔ d < d′ ∨ (d = d′ ∧ u < u′ )). As the tree is traversed in in-order, a parent n is greater than its left children and less than its right children; i.e., n compares with its child m = n • (d, u) thus: n < m ⇔ d = 0. In summary, two identifiers n and n′ compare as follows. Let j ≥ 0 be the length of their longest common prefix: n = (d1 , u1 ) • . . . • (dj , uj ) • (dj+1 , uj+1 ) • . . . • (dj+k , uj+k ) and n′ = (d1 , u1 ) • . . . • (dj , uj ) • (d′j+1 , u′j+1 ), . . . • (d′j+k′ , u′j+k′ ). Then: (i) If k = 0 and k ′ = 0, the two identifiers are identical. (ii) If k = 0 and k ′ > 0, then n′ is a descendant of n. It is a right descendant iff d′j+1 = 1, i.e., n < n′ ⇔ d′j+1 = 1. (iii) Symmetrically, if k > 0 and k ′ = 0 then n < n′ ⇔ dj+1 = 0. (iv) If k > 0 and k ′ > 0, then either n and n′ are siblings, or they descend from siblings. In both cases, they are ordered by the siblings’ relative identifiers: n < n′ ⇔ dj+1 < d′j+1 ∨ (dj+1 = d′j+1 ∧ uj+1 < u′j+1 ). Experience Two tree-based CRDTs designed for concurrent editing are Logoot and Treedoc, differing in the details. Logoot [43] always allocates to the right, thus does not require d. Treedoc [25, 32] groups sequential adds from the same source into a compact binary tree with tombstones (no u part), and uses a sparse, unique tag for concurrent adds only. If the tree is well balanced, the identifier size adjusts to the size of the sequence, and operations have logarithmic complexity. Experiments with text editing show that over time the tree becomes unbalanced. Rebalancing the tree is a kind of garbage collection, which we discuss in the next section. RR n° 7506 38 4 Shapiro, Preguiça, Baquero, Zawirski Garbage collection Our practical experience with CRDTs shows that they tend to become inefficient over time, as tombstones accumulate and internal data structures become unbalanced [25, 32]. To avoid these issues, we investigate garbage collection (GC) mechanisms. Solving distributed GC would be difficult without synchronisation. We distinguish two kinds of GC problems, which differ by their liveness requirements. When these requirements are not met, GC may block. We consider this to be acceptable, as GC does not impact correctness (only performance), and the normal operations in the object’s interface remain live. GC issues concern both state- and op-based CRDTs. However, as CmRDTs hide some complexity by requiring stronger channels, this also affects GC. Indeed, reliable broadcast channels often implement GC mechanisms of their own. 4.1 Stability problems An update f will sometimes add some information r(f ) to the payload in order to deal cleanly with operations concurrent with f . As an example, in the Add-Remove Partial Order of Section 3.4.2, remove leaves a tombstone in order to allow addBetweens to proceed. Once f is stable, i.e., all operations concurrent with f have been delivered, r(f ) serves no useful purpose. A GC opportunity exists to detect this condition and discard r(f ). Definition 4.1 (Stability). Update f is stable at replica xi (noted Φi (f )) if all updates concurrent to f according to delivery order